Your Camera Doesn’t Just See — It Explains

Author(s): Lindo St. Angel

Originally published on Towards AI.

Your Camera Doesn’t Just See — It Explains — Image by the author (unknown person was me out of the scene)

Smart-home cameras are noisy storytellers. A leaf moves, a light flickers, and your phone buzzes again. What you get is motion — not meaning.

The Video Analyzer inside the Home Generative Agent integration for Home Assistant changes that.

It transforms a stream of motion snapshots into one clean, factual sentence — the kind of line you can rely on for automation without false alarms.

The bigger picture: part of Home Generative Agent

The Video Analyzer isn’t a standalone add-on. It’s a core subsystem of Home Generative Agent (HGA) — a custom Home Assistant integration that gives your home an intelligent memory and language interface.

At the architectural level:

— HGA connects multiple large models — vision, language, and embeddings under one orchestration layer.

— Each category (chat, VLM, summarization, embedding) uses pluggable providers — OpenAI, Ollama, or Gemini.

— The integration exposes structured outputs as Home Assistant entities and events, allowing automations, dashboards, and vector-based “memories” to interoperate.

In that ecosystem, the Video Analyzer is the visual limb. It speaks to cameras, invokes the vision and summarization models, and publishes the semantic result back to the system.

If the agent were a brain, the Video Analyzer is its occipital lobe — constantly turning pixels into concise, automation-ready meaning.

Why build it at all?

Motion sensors are interrupts, not meaning. They trigger on shadows, noise, or sunlight, but can’t say what happened.

What you want is a reliable abstraction — a textual state that describes the scene:

“Two packages on the porch; no person present.”
“Alice is at the front door.”
“A person picks up a box and leaves.”

That’s the promise: semantic compression — turning gigabytes of video into a single, precise sentence.

The functional flow

Trigger → Capture → Describe → Publish → Notify → Remember

Trigger: Cameras or motion sensors start a short capture loop.

Capture: snapshots enter per-camera asyncio queues, deduped via perceptual hash + heartbeat logic.

Describe:

— Heuristic path for single frames (fast, deterministic).

— Vision-Language Model (VLM) + summarizer for multi-frame context.

Publish: updates “latest” image + “recognized people” entities; fires HA events.

Notify: optional mobile push (always or on anomaly).

Remember: the vector DB stores text and metadata for long-term recall.

A quick mental model

[Camera] --> [Queue] --> [Describe (VLM + heuristic)] --> frame texts + identities
 |
 v
 [Summarizer LLM + rules] --> {summary, recognized_people}
 |
 +--> Publish entities
 +--> Notify user
 +--> Store vectors

Where the models come in

The Video Analyzer uses three model tiers, each tuned by configuration constants in const.py.

1. Vision-Language Model (VLM) — describes a single frame

Default: Ollama qwen3-vl:8b (temperature 0.2)

System prompt (core clauses):

“Produce a short, factual description (1–3 sentences).
Do NOT speculate, infer identity, or describe unseen content.
Use consistent phrasing across similar frames.
Never assume gender if unclear.”

Why: It enforces neutral, deterministic phrasing so similar images yield similar sentences — crucial for reliable embeddings and anomaly detection.

When previous frame text is available, the VLM may describe motion only if multiple cues agree; otherwise, it defaults to “stands” or “walks nearby.” This approach minimizes false motion narratives that plague consumer AI cameras.

User prompt:

“Describe this image clearly and factually in 1–3 sentences. Do not add names, timestamps, or speculation.”

A focused variant accepts keywords (Primary attention: {key_words}) when you want to bias attention toward, say, packages or doors.

The VLM only describes what’s visually present — it doesn’t recognize faces or assign identities. Person recognition happens later in the pipeline through a separate face-recognition service, which tags frames with known identities before the summarization step. This separation keeps the VLM neutral and privacy-safe while still allowing the system to link actions to recognized individuals when appropriate.

2. Summarization model — weave multiple frames into one line

Default: Ollama qwen3:8b (temperature 0.2)

System rules excerpt:

“≤150 characters, ≤2 sentences.
Narrate events in order (t+Xs).
Include up to two known names; otherwise say ‘a person’.
Treat ‘Unknown Person’ as human.
Describe visible actions only; no speculation.”

These constraints force concise, automation-safe summaries.
When multiple frames exist, the summarizer fuses them chronologically:

“Lindo St. Angel steps onto the porch, then leans on the railing.”

Suppose there’s only one frame, a heuristic tries first — cleaning text, inserting the subject if recognized, and truncating length. Only when that path fails is the LLM called.

User prompt reminder:

“Write ≤150 characters (≤2 sentences). Obey all rules and narrate in order.”

3. Embedding & chat layers — memory and reasoning

Embedding prompt:

“Represent this sentence for searching relevant passages: {query}”

This ensures every caption becomes a stable vector entry for later retrieval or novelty scoring.

Chat summarizer system message:

“You are a bot that summarizes messages from a smart home AI.”

Together, these make the HGA memory continuous: video events, text logs, and automations all share the same language space.

Prompts as contracts

Every model prompt is a contract between you and the language model. Break it, and your automations will drift. Keep it tight and your system stays explainable.

These contracts enforce:

Neutral, observable language
Short, deterministic phrasing
Stable embeddings for anomaly detection
Privacy-safe content (no gender or intent guessing)

These prompts live as versioned, editable constants — not buried in code — so you can evolve them responsibly.

How it integrates with the rest of HGA

The Video Analyzer operates as part of a coordinated visual stack inside the Home Generative Agent (HGA). It doesn’t work alone — it collaborates with a dedicated Face Recognition service and the HGA runtime to turn raw camera input into structured, searchable memory.

Video Analyzer → Face Recognition → Summarization → Runtime:
When a new frame arrives, the Video Analyzer first sends it to the Vision-Language Model (VLM) for description. The text describes the visible scene — lighting, motion, posture, and objects — but never guesses who is present. The Face Recognition module processes the same frame in parallel, which compares faces against the local database of known identities. The recognized names (if any) are attached as metadata and passed back to the Video Analyzer. The Summarization Model then fuses the frame text and the identity metadata into a single, human-readable caption, such as:
“Lindo St. Angel steps onto the porch, then leans on the railing.”
Structured output and signals:
Each summarized event is posted to the HGA runtime store — a Postgres-backed vector database — indexed under ("video_analysis", camera_name) for semantic search. The analyzer also fires structured Home Assistant signals (hga_new_latest, hga_recognized_people), which update entities and trigger automations in real time.
Notifications and image protection:
When the analyzer detects anomalies or unfamiliar faces, it can send mobile push notifications via the configured notify_service. The relevant snapshot is marked for retention to prevent it from being deleted by cleanup jobs.
Unified memory through embeddings:
Every generated caption — whether from video, chat summaries, or sensor logs — is embedded into the same vector space. This lets HGA recall or reason across modalities:
“Who came to the porch after sunset?”
“Was anyone seen before the door opened?”

By separating visual perception (VLM), identity recognition, and semantic synthesis (Summarization), the system stays modular, explainable, and privacy-aware. Each component does one job well — and together they let your home not only see, but understand.

Why this works

Heuristic before model — prioritize speed and determinism; use the LLM second.
Short summaries — 150-char max keeps embeddings and notifications fast.
Chronology— events unfold logically, preventing misordered narratives.
Neutral voice — automation triggers stay predictable.
Tight async loops — each camera runs independently; queues and semaphores to avoid overload.

The result is a camera subsystem that behaves like a colleague, not a parrot — brief, factual, reliable.

Real-world results: the Video Analyzer in action

Note: The face-recognition module inserts named individuals into the captions; the VLM itself produces person-neutral descriptions.

Figure 1 — Front porch activity detection (image by the author)

“Lindo St. Angel walks on the pathway with bags, then stands on the porch holding one.”

(A concise narrative that joins multiple frames — movement and arrival — into a single 17-word sentence suitable for automation triggers.)

Figure 2 — Maintenance activity recognition (image by the author)

“A person walks near the house, then climbs the porch steps. They carry a backpack, later hold a leaf blower, and eventually walk away, leaving the porch empty.”

(Multiple frames stitched chronologically; the summarizer keeps the verbs concrete and factual.)

Figure 3 — Groundskeeping action tracking (image by the author)

“A person in an orange shirt walks on a grassy area near a house, carrying a trash can. They bend over, pick up items, and later sweep leaves near the porch.”

(Here, the LLM combines posture, object, and sequential verbs into one compact summary — no excess detail, no guesses.)

Figure 4 — Front gate movement (image by the author)

“Lindo St. Angel stands near the gate, then walks on the pathway.”

(A short, multi-frame progression reflecting direct visual cues.)

Figure 5 — Multi-entity scene (image by the author)

“A person stands on the porch next to a dark door, holding a white object. A black cat is on the porch near a wicker chair.”

(The analyzer correctly identifies both entities — a person and a cat — without over-describing or inferring relationships.)

The bigger picture

Home Assistant handles the sensors and cameras; the Video Analyzer interprets what they see. Each event becomes a single, factual sentence — concise enough for automation, expressive enough for understanding. Those sentences are fed into the Home Generative Agent core, where they are embedded, indexed, and retained. Other agents can then search, summarize, and act on that knowledge.

The result is a closed loop: see → describe → remember → respond. Every frame that passes through becomes part of a living memory — one that can reason about what’s happening, not just record it.

A step toward meaningful perception

The Video Analyzer is more than an engineering module; it’s a small demonstration of what grounded, explainable AI can look like in a real home. By enforcing neutral language and deterministic prompts, it replaces noise with narrative and makes machine vision accountable.

Pixels become sentences. Sentences become context. Context becomes understanding. Sometimes the model drifts — but the constraints pull it back to truth and meaning.

System map — how the Video Analyzer fits inside Home Generative Agent

The Video Analyzer sits between Home Assistant’s cameras and HGA’s memory core, converting raw motion into structured language.

┌───────────────────────────────────────────────────────────────┐
│ Home Generative Agent (HGA) │
│ ┌──────────────────────────┐ ┌───────────────────────┐ │
│ │ Language + Memory Core │◄────►│ Vector Store (Postgres │
│ │ - Chat / Summarization │ │ + Embeddings Index) │
│ │ - Context Manager │ └───────────────────────┘ │
│ └──────────▲───────────────┘ │ 
│ │ │
│ │ (semantic summaries, embeddings) │
│ │ │
│ ┌──────────┴──────────────┐ │
│ │ Video Analyzer │ │
│ │ - Captures frames │ │
│ │ - Runs VLM + Summary │ │
│ │ - Detects anomalies │ │
│ │ - Publishes entities │ │
│ └──────────▲──────────────┘ │
│ │ │
│ │ (events, images, people, summaries) │
│ │ │
│ ┌──────────┴──────────────┐ │
│ │ Home Assistant Core │ │
│ │ - Cameras │ │
│ │ - Sensors │ │
│ │ - Automations │ │
│ └─────────────────────────┘ │
│ │
│ Notifications → Mobile App (via notify.mobile_app_*) │
│ Dashboards → Lovelace or custom UI │
│ Memory Search → “What happened on the porch today?” │
└───────────────────────────────────────────────────────────────┘

Home Assistant handles the sensors and cameras; the Video Analyzer interprets what they see. It turns images into language — one factual sentence per event — and feeds that text to the Home Generative Agent core. HGA then embeds and stores those sentences in its vector memory, where other agents and automations can search, summarize, or act on them. The same data powers mobile notifications, dashboards, and contextual reasoning. Together, these modules form a closed loop: see → describe → remember → respond.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Your Camera Doesn’t Just See — It Explains

Author(s): Lindo St. Angel

The bigger picture: part of Home Generative Agent

Why build it at all?

The functional flow

A quick mental model

Where the models come in

Prompts as contracts

How it integrates with the rest of HGA

Why this works

Real-world results: the Video Analyzer in action

The bigger picture

A step toward meaningful perception

System map — how the Video Analyzer fits inside Home Generative Agent

Towards AI Academy

We Build Enterprise-Grade AI. We'll Teach You to Master It Too.

Recent Posts

Crack ML Interviews with Confidence: K-Nearest Neighbors (KNN 20 Q&A)

The Event-Driven Blueprint: How I Scaled a Spring Boot System to 10 Million Kafka Messages/Day

Building Vector Search? Why FAISS Alone Isn’t Enough

TAI #202: GPT-5.5 Moves Codex Into Real Work

Machine Learning System Design -The Model Serving Triangle, With One Forward Pass Flowing Through Every Trade-off (Part3)

AI Orchestration in Action: How MuleSoft and LLMs Fuel the Future of Enterprise AI

GPT-4 Has 1.8 Trillion Parameters. It Uses 2% of Them Per Token.

Part 20: Data Manipulation in Multi-Dimensional Aggregation

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Your Camera Doesn’t Just See — It Explains

Author(s): Lindo St. Angel

The bigger picture: part of Home Generative Agent

Why build it at all?

The functional flow

A quick mental model

Where the models come in

Prompts as contracts

How it integrates with the rest of HGA

Why this works

Real-world results: the Video Analyzer in action

The bigger picture

A step toward meaningful perception

System map — how the Video Analyzer fits inside Home Generative Agent

Towards AI Academy

We Build Enterprise-Grade AI. We'll Teach You to Master It Too.

Related posts

Recent Posts

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

GDPR CCPA Statement