- Clip scanning (2s windows) over a directory of videos → `clips.csv`
- CLIP embeddings (ViT-B/32 via open-clip), averaged across frames per clip → `embeddings.npy`
- FAISS HNSW index (L2 over L2-normalized features) → `faiss.index`
- Forward-only kNN temporal-semantic graph → `edges.csv`
- A* planning from a current-view image to a language goal → `storyboard.png`
- Create a venv and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
- Prepare your data:
  - Put videos (`mp4`, `mov`, `avi`, `mkv`) under `data/videos/`.
  - Put a current-viewpoint image at `data/current.jpg` (or pass a path).
- Build the storyboard graph:

```bash
# 3.1 Scan videos into 2s clips
python -m storyboard_rl.cli scan \
  --video_dir data/videos \
  --out_manifest outputs/clips.csv \
  --clip_seconds 2.0 --stride_seconds 2.0
```
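The scan step slides a fixed-length window over each video. A minimal sketch of that windowing logic (the function name and return shape are illustrative, not the CLI's internals — the real step also records video paths and frame indices in `clips.csv`):

```python
def clip_windows(duration_s, clip_seconds=2.0, stride_seconds=2.0):
    """Yield (start, end) times for fixed-length clips over one video.

    Illustrative sketch: only full-length windows are kept, so a
    trailing partial segment shorter than clip_seconds is dropped.
    """
    windows = []
    t = 0.0
    while t + clip_seconds <= duration_s + 1e-9:
        windows.append((t, t + clip_seconds))
        t += stride_seconds
    return windows

# A 7-second video with 2s clips at a 2s stride yields 3 full clips.
print(clip_windows(7.0))  # [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0)]
```

With `stride_seconds` smaller than `clip_seconds`, the same helper produces overlapping windows.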
```bash
# 3.2 Embed clips with CLIP (averaged frames per clip)
python -m storyboard_rl.cli embed \
  --manifest outputs/clips.csv \
  --out_embeddings outputs/embeddings.npy \
  --out_clip_index outputs/clip_index.json \
  --frames_per_clip 8
```
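Per-clip embedding mean-pools the sampled frames' CLIP vectors into one vector and L2-normalizes it, so later inner products behave as cosine similarity. A pure-Python sketch of that reduction (the lists stand in for real CLIP frame embeddings):

```python
import math

def average_and_normalize(frame_embeddings):
    """Mean-pool per-frame embeddings into one clip vector, then L2-normalize.

    frame_embeddings: list of equal-length float lists, one per sampled frame.
    """
    dim = len(frame_embeddings[0])
    mean = [sum(f[i] for f in frame_embeddings) / len(frame_embeddings)
            for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard zero vector
    return [x / norm for x in mean]

# Two orthogonal "frame" vectors average to [0.5, 0.5], then scale to unit length.
clip_vec = average_and_normalize([[1.0, 0.0], [0.0, 1.0]])
print(sum(x * x for x in clip_vec))  # squared norm ≈ 1.0
```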
```bash
# 3.3 Build FAISS HNSW index
python -m storyboard_rl.cli index \
  --embeddings outputs/embeddings.npy \
  --out_faiss outputs/faiss.index
```
```bash
# 3.4 Build forward-only kNN graph edges
python -m storyboard_rl.cli graph \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --out_edges outputs/edges.csv \
  --k 10 --candidates 100
```
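"Forward-only" means each clip is linked only to clips that start later, which keeps planned paths temporally coherent. A toy sketch of that filter (brute force over all pairs for clarity; the real step retrieves `--candidates` neighbors from the FAISS index first and keeps the top `--k`):

```python
def forward_knn_edges(embeddings, timestamps, k=2):
    """Connect each clip to its k most similar clips that start later.

    embeddings: L2-normalized vectors, so the inner product is cosine
    similarity. timestamps: clip start times, aligned with embeddings.
    Returns (src, dst, similarity) edge tuples.
    """
    edges = []
    for i, (ei, ti) in enumerate(zip(embeddings, timestamps)):
        cands = []
        for j, (ej, tj) in enumerate(zip(embeddings, timestamps)):
            if tj > ti:  # forward-only: candidate must start after clip i
                sim = sum(a * b for a, b in zip(ei, ej))
                cands.append((sim, j))
        cands.sort(reverse=True)  # most similar first
        edges.extend((i, j, sim) for sim, j in cands[:k])
    return edges

edges = forward_knn_edges([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], [0, 2, 4], k=2)
print(edges)  # clip 0 links forward to clips 1 and 2; clip 1 links to clip 2
```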
- Plan from the current view to a text goal and export the storyboard:

```bash
python -m storyboard_rl.cli plan \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --edges outputs/edges.csv \
  --current_image data/current.jpg \
  --instruction "put the red mug in the sink" \
  --out_storyboard outputs/storyboard.png
```
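Planning runs A* over the clip graph: start at the clip nearest the current-view embedding, pay edge costs derived from dissimilarity, and guide the search with a heuristic such as estimated distance to the instruction's text embedding. A compact A* sketch over a toy graph (the specific costs and heuristic values below are illustrative assumptions, not the CLI's):

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """A* over a directed weighted graph {node: [(neighbor, cost), ...]}.

    heuristic: {node: estimated remaining cost to goal}. Returns the
    node path from start to goal, or None if the goal is unreachable.
    """
    open_heap = [(heuristic[start], 0.0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0.0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        for nbr, cost in graph.get(node, []):
            ng = g + cost
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(open_heap,
                               (ng + heuristic[nbr], ng, nbr, path + [nbr]))
    return None

# Toy clip graph: route 0 → 1 → 3 is cheaper overall than 0 → 2 → 3.
graph = {0: [(1, 0.2), (2, 0.1)], 1: [(3, 0.2)], 2: [(3, 0.9)]}
heuristic = {0: 0.4, 1: 0.2, 2: 0.8, 3: 0.0}  # estimated distance to goal clip
print(a_star(graph, 0, 3, heuristic))  # [0, 1, 3]
```

The storyboard export then renders one keyframe per clip along the returned path.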
- Runs on CPU by default; uses MPS or CUDA automatically if available.
- Cosine similarity is computed as L2 normalization followed by an inner product.
- HNSW uses L2 distance, which yields the same ranking as cosine similarity on L2-normalized vectors.
- This is a minimal scaffold; to scale it up, add sharding, caching, and PQ compression.
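The L2/cosine note follows from the identity ‖a − b‖² = 2 − 2⟨a, b⟩ for unit vectors: sorting by ascending L2 distance is exactly sorting by descending cosine similarity. A quick pure-Python check:

```python
import math

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

query = unit([1.0, 2.0])
db = [unit(v) for v in ([3.0, 1.0], [1.0, 2.1], [-1.0, 0.5])]

# Rank the database by ascending squared L2 distance and by descending
# inner product (cosine, since all vectors are unit length).
by_l2 = sorted(range(len(db)), key=lambda i: l2_sq(query, db[i]))
by_cos = sorted(range(len(db)), key=lambda i: -dot(query, db[i]))
print(by_l2 == by_cos)  # True: identical rankings on unit vectors
```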