gowravthota/Clip-based-navigation

Features

  • Clip scanning (2s windows) over a directory of videos → clips.csv
  • CLIP embeddings (ViT-B/32 via open-clip) averaged across frames per clip → embeddings.npy
  • FAISS HNSW (L2 over L2-normalized features) → faiss.index
  • Forward-only kNN temporal-semantic graph → edges.csv
  • A* planning from a current-view image to a language goal → storyboard.png
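The first two features can be illustrated with a small, dependency-free sketch: cutting a video into fixed-length windows and averaging per-frame feature vectors into one unit-length clip vector. The function names `clip_windows` and `mean_unit_vector` are hypothetical, and dropping a trailing partial window is an assumption; the repository's own code may handle both steps differently.

```python
import math
from typing import List, Sequence, Tuple


def clip_windows(duration_s: float, clip_seconds: float = 2.0,
                 stride_seconds: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of the full-length windows that fit in a video.

    Assumption: a trailing window shorter than clip_seconds is dropped.
    """
    if clip_seconds <= 0 or stride_seconds <= 0:
        raise ValueError("clip_seconds and stride_seconds must be positive")
    windows = []
    i = 0
    # Small tolerance so that a window ending exactly at the video end is kept.
    while i * stride_seconds + clip_seconds <= duration_s + 1e-9:
        start = i * stride_seconds
        windows.append((start, start + clip_seconds))
        i += 1
    return windows


def mean_unit_vector(frame_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors, then L2-normalize the mean."""
    if not frame_vectors:
        raise ValueError("need at least one frame vector")
    dim = len(frame_vectors[0])
    mean = [sum(v[d] for v in frame_vectors) / len(frame_vectors) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("mean vector has zero norm and cannot be normalized")
    return [x / norm for x in mean]
```

In the actual pipeline the frame vectors would come from the CLIP image encoder; here they are plain lists of floats so the sketch runs without any model.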

Quickstart

  1. Create venv and install deps
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
  2. Prepare your data
  • Put videos (mp4, mov, avi, mkv) under data/videos/.
  • Put a current viewpoint image at data/current.jpg (or pass a path).
  3. Build storyboard graph
# 3.1 Scan videos into 2s clips
python -m storyboard_rl.cli scan \
  --video_dir data/videos \
  --out_manifest outputs/clips.csv \
  --clip_seconds 2.0 --stride_seconds 2.0

# 3.2 Embed clips with CLIP (averaged frames per clip)
python -m storyboard_rl.cli embed \
  --manifest outputs/clips.csv \
  --out_embeddings outputs/embeddings.npy \
  --out_clip_index outputs/clip_index.json \
  --frames_per_clip 8

# 3.3 Build FAISS HNSW index
python -m storyboard_rl.cli index \
  --embeddings outputs/embeddings.npy \
  --out_faiss outputs/faiss.index

# 3.4 Build forward-only kNN graph edges
python -m storyboard_rl.cli graph \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --out_edges outputs/edges.csv \
  --k 10 --candidates 100
  4. Plan from current view to text goal and export storyboard
python -m storyboard_rl.cli plan \
  --manifest outputs/clips.csv \
  --embeddings outputs/embeddings.npy \
  --faiss_index outputs/faiss.index \
  --edges outputs/edges.csv \
  --current_image data/current.jpg \
  --instruction "put the red mug in the sink" \
  --out_storyboard outputs/storyboard.png
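The graph and plan steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the repository's implementation: "forward-only" is read here as "an edge i → j exists only if clip j comes later in the manifest than clip i", edge costs and the A* heuristic are both Euclidean distances between embeddings, and the names `forward_knn_edges` and `astar` are hypothetical. In the real pipeline the start node would presumably be the clip nearest to the current-view image embedding and the goal the clip nearest to the instruction's text embedding.

```python
import heapq
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vec = Sequence[float]


def l2(a: Vec, b: Vec) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def forward_knn_edges(embs: List[Vec], k: int) -> Dict[int, List[Tuple[int, float]]]:
    """Link each clip i to its k nearest clips j with j > i (assumed meaning of forward-only)."""
    edges: Dict[int, List[Tuple[int, float]]] = {}
    for i, e in enumerate(embs):
        later = sorted((l2(e, embs[j]), j) for j in range(i + 1, len(embs)))
        edges[i] = [(j, d) for d, j in later[:k]]
    return edges


def astar(edges: Dict[int, List[Tuple[int, float]]], embs: List[Vec],
          start: int, goal: int) -> Optional[List[int]]:
    """A* search; the heuristic is the straight-line embedding distance to the goal."""
    open_heap = [(l2(embs[start], embs[goal]), 0.0, start)]
    best_cost = {start: 0.0}
    parent: Dict[int, int] = {}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if g > best_cost.get(node, math.inf):
            continue  # stale heap entry; a cheaper route to this node was found
        if node == goal:
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        for nxt, weight in edges.get(node, []):
            new_g = g + weight
            if new_g < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_g
                parent[nxt] = node
                f = new_g + l2(embs[nxt], embs[goal])
                heapq.heappush(open_heap, (f, new_g, nxt))
    return None  # goal not reachable along forward edges
```

Using Euclidean distance for both the edge cost and the heuristic keeps the heuristic admissible, because a straight line is never longer than any path through intermediate clips. A heuristic based on raw cosine distance would not carry that guarantee.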

Notes

  • Runs on CPU by default; uses MPS/CUDA if available.
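  • Cosine similarity is computed as the inner product of L2-normalized features.
  • The HNSW index uses L2 distance, which gives the same ranking as cosine similarity on normalized vectors, since ‖a − b‖² = 2 − 2·(a·b) for unit vectors.
  • This is a minimal scaffold; to scale it up, add sharding, caching, and PQ compression.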
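The equivalence between L2 ranking and inner-product ranking on normalized vectors can be checked numerically. The sketch below uses random unit vectors in place of CLIP embeddings; the dimension, the number of vectors, and the helper names are arbitrary choices for illustration.

```python
import math
import random


def unit(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
db = [unit([rng.gauss(0, 1) for _ in range(8)]) for _ in range(50)]
query = unit([rng.gauss(0, 1) for _ in range(8)])

# Nearest first by squared L2 distance, and by descending inner product.
by_l2 = sorted(range(len(db)), key=lambda i: sq_l2(query, db[i]))
by_ip = sorted(range(len(db)), key=lambda i: -dot(query, db[i]))

# For unit vectors, ||a - b||^2 = 2 - 2 * (a . b), so both orderings agree.
print(by_l2 == by_ip)
```

Because the squared distance is a decreasing function of the inner product, either metric can back the index as long as every stored and queried vector is normalized first.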

About

Zero-shot task execution by CLIP-FAISS sub-goal retrieval
