EvManip: A Foundation Model for Robot Manipulation using Event Cameras


The first self-supervised foundation model that learns manipulation representations from event cameras and transfers across multiple contact-rich robot manipulation tasks.

Paper · Project Page · Dataset · Pretrained Models


Overview

Existing manipulation foundation models (R3M, MVP, MCR) rely entirely on RGB or RGB-D inputs — they fail under fast motion and low-light conditions common in real manipulation scenarios. Event cameras offer microsecond temporal resolution and high dynamic range, but all prior event-camera work trains task-specific models from scratch with no transferable representations.

EvManip fills this gap. We pretrain a cross-modal encoder on large-scale event + RGB-D manipulation data using self-supervised objectives that require zero human labels. The pretrained encoder transfers to multiple downstream manipulation tasks, outperforming RGB-only baselines — especially in low-data, fast-motion, and low-light regimes.

RGB-D Frame   →  ViT Encoder  ──────────────┐
                                             ├──► Cross-Modal Fusion → Manipulation Representation
Event Stream  →  Spatiotemporal             │                                    ↓
                 Event Encoder  ────────────┘                    ┌───────────────┼───────────────┐
                                                                 ▼               ▼               ▼
                                                           Grasp Success   Slip Detection   Contact Timing
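The fusion block above can be sketched as single-head cross-attention, with RGB-D tokens as queries attending to event tokens. The following is a minimal NumPy illustration only; the actual module, token dimensions, and head count are defined in `models/fusion.py` and may differ:

```python
import numpy as np

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: RGB-D tokens attend to event tokens."""
    Q = queries @ Wq          # (Nq, d) query projections
    K = keys_values @ Wk      # (Nk, d) key projections
    V = keys_values @ Wv      # (Nk, d) value projections
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (Nq, Nk) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over event tokens
    return weights @ V        # (Nq, d) fused representation per query token

rng = np.random.default_rng(0)
d = 8
rgb_tokens = rng.standard_normal((4, d))    # 4 RGB-D patch tokens
event_tokens = rng.standard_normal((6, d))  # 6 spatiotemporal event tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_attention(rgb_tokens, event_tokens, Wq, Wk, Wv)
print(fused.shape)  # (4, 8): one fused vector per RGB-D query token
```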

Key Contributions

  • First foundation model combining event cameras and RGB-D for manipulation representation learning
  • Three self-supervised pretraining objectives requiring zero human annotation:
    • Contact Prediction — learn contact dynamics from event spikes at touch moments
    • Cross-Modal Consistency — align event and RGB-D latent spaces contrastively
    • Motion Forecasting — predict future event distributions from current stream
  • Transfers across tasks — one pretrained encoder, multiple downstream manipulation tasks
  • Data pipeline — v2e-based synthetic event generation from any RGB-D manipulation dataset
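The cross-modal consistency objective aligns the two latent spaces with a symmetric contrastive (InfoNCE-style) loss: paired event/RGB-D embeddings are positives, all other pairs in the batch are negatives. A minimal NumPy sketch of that idea (the real loss lives in `losses/cross_modal.py`; temperature and normalization details here are illustrative):

```python
import numpy as np

def info_nce(event_emb, rgbd_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired event/RGB-D embeddings."""
    # L2-normalize so similarity is cosine
    e = event_emb / np.linalg.norm(event_emb, axis=1, keepdims=True)
    r = rgbd_emb / np.linalg.norm(rgbd_emb, axis=1, keepdims=True)
    logits = e @ r.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(e))              # diagonal = matching pairs

    def ce(lg):                             # cross-entropy with diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # symmetrize: events→RGB-D and RGB-D→events
    return 0.5 * (ce(logits) + ce(logits.T))

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16))
aligned = info_nce(emb, emb)                            # perfectly aligned pairs
mismatched = info_nce(emb, rng.standard_normal((8, 16)))
print(aligned, mismatched)  # aligned loss is near zero
```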

Why Event Cameras for Manipulation?

| Challenge | RGB-D | Event Camera |
|---|---|---|
| Fast gripper motion | Motion blur | Microsecond resolution, no blur |
| Low-light workspace | Noisy / fails | High dynamic range (120 dB) |
| Contact detection latency | Next frame (~33 ms) | Sub-millisecond response |
| Slip detection | Misses incipient slip | Fires instantly on object movement |

Results (Preliminary)

Evaluating the frozen pretrained encoder with task heads fine-tuned on downstream tasks using limited labels (placeholder numbers):

| Task | Train from Scratch | DINOv2 (RGB) | MEM (Events only) | EvManip (Ours) |
|---|---|---|---|---|
| Grasp Success Prediction | 61.2% | 74.8% | 68.3% | 81.4% |
| Slip Detection | 54.7% | 63.1% | 71.2% | 79.6% |
| Contact Moment Detection (ms error, lower is better) | 18.4 ms | 14.2 ms | 9.8 ms | 6.1 ms |

Results with 10% label supervision. Full results in paper.


Installation

git clone https://github.com/yourusername/evmanip.git
cd evmanip
pip install -r requirements.txt

Requirements:

  • Python 3.8+
  • PyTorch 2.0+
  • v2e (for synthetic event generation)
  • RoboMimic

pip install torch torchvision
pip install v2e
pip install robomimic

Data Pipeline

We use v2e to convert RGB-D manipulation videos to synthetic event streams. No real event camera hardware required for pretraining.

# Step 1: Download RoboMimic dataset
python scripts/download_robomimic.py --task lift square transport

# Step 2: Convert RGB videos to synthetic events
python scripts/generate_events.py \
    --input_dir data/robomimic/videos \
    --output_dir data/events \
    --threshold 0.2 \
    --noise_rate 0.01

# Step 3: Verify event quality
python scripts/visualize_events.py --sequence data/events/lift_000
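Downstream of Step 2, the raw event stream (x, y, timestamp, polarity tuples) has to be rasterized before it can enter the event encoder. A common choice, and a plausible guess at an intermediate representation here, is a voxel grid that bins signed event polarities into a few temporal slices (sketch only; the repo's actual representation may differ):

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate signed event polarities into a (num_bins, H, W) grid."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2].astype(float)
    p = events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)     # normalize to [0, 1]
    bins = np.clip((t_norm * num_bins).astype(int), 0, num_bins - 1)
    # +1 for ON events, -1 for OFF events; scatter-add handles repeats per pixel
    np.add.at(grid, (bins, y, x), np.where(p > 0, 1.0, -1.0))
    return grid

# toy stream: (x, y, t_us, polarity)
events = np.array([
    [2, 3,   0,  1],
    [2, 3, 500,  1],
    [5, 1, 900, -1],
])
grid = events_to_voxel_grid(events, num_bins=2, height=4, width=8)
print(grid.shape)     # (2, 4, 8)
print(grid[0, 3, 2])  # 1.0: the first ON event lands in the first temporal bin
```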

Pretraining

python train_pretrain.py \
    --data_dir data/events \
    --batch_size 64 \
    --epochs 200 \
    --lr 1e-4 \
    --loss contact_pred cross_modal motion_forecast \
    --output_dir checkpoints/evmanip_pretrained

Key pretraining arguments:

| Argument | Default | Description |
|---|---|---|
| `--event_window_ms` | 10 | Temporal window (ms) for event tokenization |
| `--loss_weights` | 0.4 0.4 0.2 | Weights for the 3 pretraining losses |
| `--fusion` | cross_attn | Fusion module type |
| `--freeze_rgb` | False | Freeze RGB encoder during pretraining |
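`--event_window_ms` controls how the continuous event stream is chopped into fixed-duration temporal windows before tokenization. A hypothetical sketch of that slicing step (the actual tokenizer is in `models/event_encoder.py` and may group events differently):

```python
import numpy as np

def slice_event_windows(timestamps_us, window_ms=10):
    """Group event indices into consecutive fixed-duration temporal windows."""
    window_us = window_ms * 1000
    t0 = timestamps_us.min()
    window_ids = ((timestamps_us - t0) // window_us).astype(int)
    return [np.flatnonzero(window_ids == w) for w in range(window_ids.max() + 1)]

t = np.array([0, 2_000, 9_999, 10_000, 25_000])   # event timestamps in microseconds
windows = slice_event_windows(t, window_ms=10)
print(len(windows))                 # 3 windows: [0,10ms), [10,20ms), [20,30ms)
print([len(w) for w in windows])    # [3, 1, 1]
```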

Fine-tuning on Downstream Tasks

# Grasp success prediction
python train_finetune.py \
    --task grasp_success \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Slip detection
python train_finetune.py \
    --task slip_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Contact moment detection
python train_finetune.py \
    --task contact_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50
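`--label_fraction 0.1` fine-tunes with only 10% of the downstream labels while the pretrained encoder stays frozen, so only a small task head is trained. Conceptually this reduces to a probe on frozen features; a minimal NumPy sketch with synthetic stand-in data (the real training loop in `train_finetune.py` is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen-encoder features and binary grasp-success labels
features = rng.standard_normal((200, 32))
true_w = rng.standard_normal(32)
labels = (features @ true_w > 0).astype(float)

# Keep only a fraction of the labels, as --label_fraction does
fraction = 0.1
idx = rng.choice(len(features), size=int(fraction * len(features)), replace=False)
X, y = features[idx], labels[idx]

# Logistic-regression head trained by gradient descent; the encoder is untouched
w = np.zeros(32)
for _ in range(500):
    z = np.clip(X @ w, -30, 30)         # clip logits for numerical stability
    p = 1 / (1 + np.exp(-z))            # predicted success probability
    w -= 0.5 * X.T @ (p - y) / len(y)   # gradient step on cross-entropy

acc = ((features @ w > 0) == (labels > 0.5)).mean()
print(f"accuracy with {int(fraction * 100)}% of labels: {acc:.2f}")
```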

Repository Structure

evmanip/
├── configs/                  # Training configs
│   ├── pretrain.yaml
│   └── finetune/
│       ├── grasp_success.yaml
│       ├── slip_detection.yaml
│       └── contact_detection.yaml
├── data/
│   ├── robomimic/            # RoboMimic RGB-D data
│   └── events/               # Synthetic event streams (generated)
├── models/
│   ├── event_encoder.py      # Spatiotemporal event ViT
│   ├── rgb_encoder.py        # RGB-D ViT encoder
│   ├── fusion.py             # Cross-modal attention fusion
│   └── evmanip.py            # Full EvManip model
├── losses/
│   ├── contact_prediction.py
│   ├── cross_modal.py
│   └── motion_forecast.py
├── scripts/
│   ├── download_robomimic.py
│   ├── generate_events.py
│   └── visualize_events.py
├── train_pretrain.py
├── train_finetune.py
├── evaluate.py
├── requirements.txt
└── README.md

Datasets

| Dataset | Usage | Source |
|---|---|---|
| RoboMimic | Pretraining (RGB-D → synthetic events) | Link |
| MimicGen | Pretraining augmentation | Link |
| E-Grasp | Real-event fine-tuning validation | Link |
| Neuro-Grasp | Real-event evaluation | Link |

Related Work

Event Camera Pretraining:

  • MEM: Masked Event Modeling (WACV 2024) — SSL pretraining for classification, not manipulation
  • TESPEC (ICCV 2025) — Temporal event pretraining, no manipulation tasks

Manipulation Representations:

  • R3M — RGB video pretraining for manipulation
  • MCR (ICLR 2025) — Manipulation-centric representations, RGB only

Event Cameras for Manipulation: prior work is task-specific, with no pretraining.

EvManip is the first work at the intersection of all three.


Citation

@article{evmanip2025,
  title     = {EvManip: A Foundation Model for Robot Manipulation using Event Cameras},
  author    = {Your Name},
  journal   = {arXiv preprint},
  year      = {2025}
}

License

MIT License. See LICENSE for details.
