My mini weekend project: A POC of a system that simulates real-time video understanding using models that don't natively support video or streaming.
Instead of analyzing a video all at once, this system processes it frame-by-frame and emits events only when something changes — mimicking live video understanding.
- Sample frames from video (e.g., every 10th frame)
- Describe each frame using a vision-to-text model
- Extract structured state (subject, attributes, motion) via a tool call
- Diff against previous state deterministically
- Stream narration of changes only (no repeated descriptions)
Example narration:

```
The LEGO block changes color from red to pink.
```

(Silence means nothing changed.)
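The steps above can be sketched as a simple loop. `describe_frame` and `extract_state` below are hypothetical stand-ins for the LLM calls, not the repo's actual API; they return canned values so the event-emission logic is visible:

```python
def describe_frame(frame):
    # Stand-in for the vision-to-text call (hypothetical, returns canned text).
    return f"A {frame['color']} LEGO block on a table."

def extract_state(description):
    # Stand-in for the structured-extraction tool call (hypothetical parser).
    return {"color": description.split()[1], "motion": "stationary"}

frames = [{"color": "red"}, {"color": "red"}, {"color": "pink"}]  # pretend sampled frames
prev = None
events = []
for frame in frames:
    state = extract_state(describe_frame(frame))
    if prev is not None and state != prev:
        # The narrator call would go here; emit only when the state actually changed.
        events.append(f"The LEGO block changes color from {prev['color']} to {state['color']}.")
    prev = state

print(events)  # a single event for the red -> pink transition
```

The identical second frame produces no event, which is exactly the "silence means nothing changed" behavior.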
The system uses three separate LLM calls with distinct responsibilities:
| Component | Input | Output | Purpose |
|---|---|---|---|
| Perception | Frame image | Descriptive text | Vision-to-text, no temporal reasoning |
| Extractor | Description | Structured JSON state | Canonical labels (temperature=0) |
| Narrator | State diff | Streamed sentence | Natural language change description |
A deterministic processor (not an LLM) handles:
- State persistence across frames
- Diffing logic (ignores noise like phrasing variations)
- Event emission decisions
Example extracted state:

```json
{
  "subject": "lego_block",
  "attributes": {
    "color": "red"
  },
  "state": {
    "motion": "stationary"
  }
}
```

Only the relevant fields (color, motion) participate in diffing; shape, orientation, and size jitter are ignored.
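A minimal sketch of the deterministic diff over that state shape, assuming the field names above (the repo's actual names may differ):

```python
# Only these (section, field) pairs trigger events; everything else is jitter.
RELEVANT_FIELDS = {("attributes", "color"), ("state", "motion")}

def diff_states(prev, curr):
    """Return {field: (old, new)} for relevant fields that changed."""
    changes = {}
    for section, field in RELEVANT_FIELDS:
        old = prev.get(section, {}).get(field)
        new = curr.get(section, {}).get(field)
        if old != new:
            changes[f"{section}.{field}"] = (old, new)
    return changes

prev = {"subject": "lego_block", "attributes": {"color": "red"}, "state": {"motion": "stationary"}}
curr = {"subject": "lego_block", "attributes": {"color": "pink"}, "state": {"motion": "stationary"}}
print(diff_states(prev, curr))  # {'attributes.color': ('red', 'pink')}
```

Because the comparison runs over a fixed allowlist of fields rather than raw description text, phrasing variations from the perception model can't produce spurious events.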
- Python 3.12+
- A Mistral API key
1. Clone the repository and navigate to the project directory.

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Note: this includes PyTorch, transformers, and OpenCV, so installation may take a few minutes.

4. Create a `.env` file in the project root:

   ```
   MISTRAL_API_KEY=your_api_key_here
   ```

5. Add video files to the `input/` directory:

   ```bash
   mkdir -p input
   # Place your .mp4 files in input/
   ```
Process a video and get streaming narration:

```bash
python process.py
```

By default, this processes `input/sample2.mp4` with a frame step of 10. After processing, you'll enter an interactive Q&A mode where you can ask questions about what happened in the video.
To process a different video or adjust the frame sampling, modify the `__main__` block in `process.py`:

```python
video_context = get_stream("./input/your_video.mp4", frame_step=10)
```
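For intuition about `frame_step`, here is a quick way to see which timestamps end up being described (pure arithmetic, not part of the repo):

```python
def sampled_timestamps(total_frames, fps, frame_step):
    # Every frame_step-th frame index, converted to seconds.
    return [i / fps for i in range(0, total_frames, frame_step)]

# A 3-second clip at 30 fps with frame_step=10 gives roughly one
# description every 0.33 s; raise frame_step to cut API calls at the
# cost of missing brief changes.
print(sampled_timestamps(90, 30, 10))
```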