- Clip scanning (2s windows) over a directory of videos → `clips.csv`
- CLIP embeddings (ViT-B/32 via open-clip), averaged across frames per clip → `embeddings.npy`
- FAISS HNSW index (L2 over L2-normalized features) → `faiss.index`
- Forward-only kNN temporal-semantic graph → `edges.csv`
- A* planning from a current-view image to a language goal → `storyboard.png`
- Create a venv and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
- Prepare your data:
  - Put videos (`mp4`, `mov`, `avi`, `mkv`) under `data/videos/`.
  - Put a current-viewpoint image at `data/current.jpg` (or pass a path).
- Build the storyboard graph:

```bash
# 3.1 Scan videos into 2s clips
python -m storyboard_rl.cli scan \
  --video_dir data/videos \
  --out_manifest outputs/clips.csv \
  --clip_seconds 2.0 --stride_seconds 2.0
```
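The scan step slides a fixed-length window over each video. A minimal sketch of that windowing logic (the function name and return shape are illustrative, not the CLI's internals — the real step also records video paths and frame indices in `clips.csv`):

```python
def clip_windows(duration_s, clip_seconds=2.0, stride_seconds=2.0):
    """Yield (start, end) times for fixed-length clips over one video.

    Illustrative sketch: only full-length windows are kept, so a
    trailing partial segment shorter than clip_seconds is dropped.
    """
    windows = []
    t = 0.0
    while t + clip_seconds <= duration_s + 1e-9:
        windows.append((t, t + clip_seconds))
        t += stride_seconds
    return windows

# A 7-second video with 2s clips at a 2s stride yields 3 full clips.
print(clip_windows(7.0))  # [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0)]
```

With `stride_seconds` smaller than `clip_seconds`, the same helper produces overlapping windows.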
```bash
# 3.2 Embed clips with CLIP (averaged frames per clip)
python -m storyboard_rl.cli embed \
  --manifest outputs/clips.csv \
  --out_embeddings outputs/embeddings.npy \
  --out_clip_index outputs/clip_index.json \
  --frames_per_clip 8
```
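Per-clip embedding mean-pools the sampled frames' CLIP vectors into one vector and L2-normalizes it, so later inner products behave as cosine similarity. A pure-Python sketch of that reduction (the lists stand in for real CLIP frame embeddings):

```python
import math

def average_and_normalize(frame_embeddings):
    """Mean-pool per-frame embeddings into one clip vector, then L2-normalize.

    frame_embeddings: list of equal-length float lists, one per sampled frame.
    """
    dim = len(frame_embeddings[0])
    mean = [sum(f[i] for f in frame_embeddings) / len(frame_embeddings)
            for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # guard zero vector
    return [x / norm for x in mean]

# Two orthogonal "frame" vectors average to [0.5, 0.5], then scale to unit length.
clip_vec = average_and_normalize([[1.0, 0.0], [0.0, 1.0]])
print(sum(x * x for x in clip_vec))  # squared norm ≈ 1.0
```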
```bash
# 3.3 Build FAISS HNSW index
python -m storyboard_rl.cli index \
  --embeddings outputs/embeddings.npy \
  --out_faiss outputs/faiss.index
```
```bash
# 3.4 Build forward-only kNN graph edges
python -m storyboard_rl.cli graph \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --out_edges outputs/edges.csv \
  --k 10 --candidates 100
```
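"Forward-only" means each clip is linked only to clips that start later, which keeps planned paths temporally coherent. A toy sketch of that filter (brute force over all pairs for clarity; the real step retrieves `--candidates` neighbors from the FAISS index first and keeps the top `--k`):

```python
def forward_knn_edges(embeddings, timestamps, k=2):
    """Connect each clip to its k most similar clips that start later.

    embeddings: L2-normalized vectors, so the inner product is cosine
    similarity. timestamps: clip start times, aligned with embeddings.
    Returns (src, dst, similarity) edge tuples.
    """
    edges = []
    for i, (ei, ti) in enumerate(zip(embeddings, timestamps)):
        cands = []
        for j, (ej, tj) in enumerate(zip(embeddings, timestamps)):
            if tj > ti:  # forward-only: candidate must start after clip i
                sim = sum(a * b for a, b in zip(ei, ej))
                cands.append((sim, j))
        cands.sort(reverse=True)  # most similar first
        edges.extend((i, j, sim) for sim, j in cands[:k])
    return edges

edges = forward_knn_edges([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]], [0, 2, 4], k=2)
print(edges)  # clip 0 links forward to clips 1 and 2; clip 1 links to clip 2
```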
- Plan from the current view to a text goal and export the storyboard:

```bash
python -m storyboard_rl.cli plan \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --edges outputs/edges.csv \
  --current_image data/current.jpg \
  --instruction "put the red mug in the sink" \
  --out_storyboard outputs/storyboard.png
```
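Planning runs A* over the clip graph: start at the clip nearest the current-view embedding, pay edge costs derived from dissimilarity, and guide the search with a heuristic such as estimated distance to the instruction's text embedding. A compact A* sketch over a toy graph (the specific costs and heuristic values below are illustrative assumptions, not the CLI's):

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """A* over a directed weighted graph {node: [(neighbor, cost), ...]}.

    heuristic: {node: estimated remaining cost to goal}. Returns the
    node path from start to goal, or None if the goal is unreachable.
    """
    open_heap = [(heuristic[start], 0.0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0.0}
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        for nbr, cost in graph.get(node, []):
            ng = g + cost
            if ng < best_g.get(nbr, float("inf")):
                best_g[nbr] = ng
                heapq.heappush(open_heap,
                               (ng + heuristic[nbr], ng, nbr, path + [nbr]))
    return None

# Toy clip graph: route 0 → 1 → 3 is cheaper overall than 0 → 2 → 3.
graph = {0: [(1, 0.2), (2, 0.1)], 1: [(3, 0.2)], 2: [(3, 0.9)]}
heuristic = {0: 0.4, 1: 0.2, 2: 0.8, 3: 0.0}  # estimated distance to goal clip
print(a_star(graph, 0, 3, heuristic))  # [0, 1, 3]
```

The storyboard export then renders one keyframe per clip along the returned path.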
- Runs on CPU by default; uses MPS or CUDA automatically if available.
- Cosine similarity is computed as L2 normalization followed by an inner product.
- HNSW uses L2 distance, which yields the same ranking as cosine similarity on L2-normalized vectors.
- This is a minimal scaffold; to scale it up, add sharding, caching, and PQ compression.
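The L2/cosine note follows from the identity ‖a − b‖² = 2 − 2⟨a, b⟩ for unit vectors: sorting by ascending L2 distance is exactly sorting by descending cosine similarity. A quick pure-Python check:

```python
import math

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

query = unit([1.0, 2.0])
db = [unit(v) for v in ([3.0, 1.0], [1.0, 2.1], [-1.0, 0.5])]

# Rank the database by ascending squared L2 distance and by descending
# inner product (cosine, since all vectors are unit length).
by_l2 = sorted(range(len(db)), key=lambda i: l2_sq(query, db[i]))
by_cos = sorted(range(len(db)), key=lambda i: -dot(query, db[i]))
print(by_l2 == by_cos)  # True: identical rankings on unit vectors
```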