My mini weekend project: A POC of a system that simulates real-time video understanding using models that don't natively support video or streaming.
Instead of analyzing a video all at once, this system processes it frame-by-frame and emits events only when something changes — mimicking live video understanding.
- Sample frames from video (e.g., every 10th frame)
- Describe each frame using a vision-to-text model
- Extract structured state (subject, attributes, motion) via a tool call
- Diff against previous state deterministically
- Stream narration of changes only (no repeated descriptions)
Example narration:

```
The LEGO block changes color from red to pink.
```

(Silence means nothing changed.)
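The steps above can be sketched as a simple loop. `describe_frame` and `extract_state` below are hypothetical stand-ins for the LLM calls, not the repo's actual API; they return canned values so the event-emission logic is visible:

```python
def describe_frame(frame):
    # Stand-in for the vision-to-text call (hypothetical, returns canned text).
    return f"A {frame['color']} LEGO block on a table."

def extract_state(description):
    # Stand-in for the structured-extraction tool call (hypothetical parser).
    return {"color": description.split()[1], "motion": "stationary"}

frames = [{"color": "red"}, {"color": "red"}, {"color": "pink"}]  # pretend sampled frames
prev = None
events = []
for frame in frames:
    state = extract_state(describe_frame(frame))
    if prev is not None and state != prev:
        # The narrator call would go here; emit only when the state actually changed.
        events.append(f"The LEGO block changes color from {prev['color']} to {state['color']}.")
    prev = state

print(events)  # a single event for the red -> pink transition
```

The identical second frame produces no event, which is exactly the "silence means nothing changed" behavior.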
The system uses three separate LLM calls with distinct responsibilities:
| Component | Input | Output | Purpose |
|---|---|---|---|
| Perception | Frame image | Descriptive text | Vision-to-text, no temporal reasoning |
| Extractor | Description | Structured JSON state | Canonical labels (temperature=0) |
| Narrator | State diff | Streamed sentence | Natural language change description |
A deterministic processor (not an LLM) handles:
- State persistence across frames
- Diffing logic (ignores noise like phrasing variations)
- Event emission decisions
Example extracted state:

```json
{
  "subject": "lego_block",
  "attributes": {
    "color": "red"
  },
  "state": {
    "motion": "stationary"
  }
}
```

Only the relevant fields (color, motion) participate in diffing; shape, orientation, and size jitter are ignored.
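A minimal sketch of the deterministic diff over that state shape, assuming the field names above (the repo's actual names may differ):

```python
# Only these (section, field) pairs trigger events; everything else is jitter.
RELEVANT_FIELDS = {("attributes", "color"), ("state", "motion")}

def diff_states(prev, curr):
    """Return {field: (old, new)} for relevant fields that changed."""
    changes = {}
    for section, field in RELEVANT_FIELDS:
        old = prev.get(section, {}).get(field)
        new = curr.get(section, {}).get(field)
        if old != new:
            changes[f"{section}.{field}"] = (old, new)
    return changes

prev = {"subject": "lego_block", "attributes": {"color": "red"}, "state": {"motion": "stationary"}}
curr = {"subject": "lego_block", "attributes": {"color": "pink"}, "state": {"motion": "stationary"}}
print(diff_states(prev, curr))  # {'attributes.color': ('red', 'pink')}
```

Because the comparison runs over a fixed allowlist of fields rather than raw description text, phrasing variations from the perception model can't produce spurious events.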
- Python 3.12+
- A Mistral API key
1. Clone the repository and navigate to the project directory.

2. Create and activate a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Note: this includes PyTorch, transformers, and OpenCV, so installation may take a few minutes.

4. Create a `.env` file in the project root:

   ```
   MISTRAL_API_KEY=your_api_key_here
   ```

5. Add video files to the `input/` directory:

   ```bash
   mkdir -p input
   # Place your .mp4 files in input/
   ```
Process a video and get streaming narration:

```bash
python process.py
```

By default, this processes `input/sample2.mp4` with a frame step of 10. After processing, you'll enter an interactive Q&A mode where you can ask questions about what happened in the video.
To process a different video or adjust the frame sampling, modify the `__main__` block in `process.py`:

```python
video_context = get_stream("./input/your_video.mp4", frame_step=10)
```
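For intuition about `frame_step`, here is a quick way to see which timestamps end up being described (pure arithmetic, not part of the repo):

```python
def sampled_timestamps(total_frames, fps, frame_step):
    # Every frame_step-th frame index, converted to seconds.
    return [i / fps for i in range(0, total_frames, frame_step)]

# A 3-second clip at 30 fps with frame_step=10 gives roughly one
# description every 0.33 s; raise frame_step to cut API calls at the
# cost of missing brief changes.
print(sampled_timestamps(90, 30, 10))
```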