The first self-supervised foundation model that learns manipulation representations from event cameras, transferring across multiple contact-rich robot manipulation tasks.
Existing manipulation foundation models (R3M, MVP, MCR) rely entirely on RGB or RGB-D inputs — they fail under fast motion and low-light conditions common in real manipulation scenarios. Event cameras offer microsecond temporal resolution and high dynamic range, but all prior event-camera work trains task-specific models from scratch with no transferable representations.
EvManip fills this gap. We pretrain a cross-modal encoder on large-scale event + RGB-D manipulation data using self-supervised objectives that require zero human labels. The pretrained encoder transfers to multiple downstream manipulation tasks, outperforming RGB-only baselines — especially in low-data, fast-motion, and low-light regimes.
```
RGB-D Frame  ─► ViT Encoder ─────────────┐
                                         ├─► Cross-Modal Fusion ─► Manipulation Representation
Event Stream ─► Spatiotemporal           │                                     │
                Event Encoder ───────────┘           ┌─────────────────────────┼─────────────────────────┐
                                                     ▼                         ▼                         ▼
                                              Grasp Success             Slip Detection             Contact Timing
```
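The fusion stage in the diagram above can be illustrated with a minimal PyTorch sketch. This is not the repository's actual module; the class name, token shapes, and single-direction attention are illustrative assumptions.

```python
# Minimal sketch of cross-modal fusion via cross-attention.
# Shapes and module layout are assumptions, not the repo's actual API.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Event tokens attend to RGB-D tokens; pooled output is the
    manipulation representation."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_tokens, rgbd_tokens):
        # event_tokens: (B, N_ev, dim), rgbd_tokens: (B, N_rgb, dim)
        fused, _ = self.attn(event_tokens, rgbd_tokens, rgbd_tokens)
        fused = self.norm(event_tokens + fused)  # residual + norm
        return fused.mean(dim=1)                 # pool to (B, dim)

fusion = CrossModalFusion()
rep = fusion(torch.randn(2, 32, 256), torch.randn(2, 64, 256))
print(rep.shape)  # torch.Size([2, 256])
```

A bidirectional variant (RGB-D tokens also attending to event tokens) is a natural extension of the same pattern.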
- First foundation model combining event cameras and RGB-D for manipulation representation learning
- Three self-supervised pretraining objectives requiring zero human annotation:
- Contact Prediction — learn contact dynamics from event spikes at touch moments
- Cross-Modal Consistency — align event and RGB-D latent spaces contrastively
- Motion Forecasting — predict future event distributions from current stream
- Transfers across tasks — one pretrained encoder, multiple downstream manipulation tasks
- Data pipeline — v2e-based synthetic event generation from any RGB-D manipulation dataset
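Of the three objectives, Cross-Modal Consistency is a standard symmetric InfoNCE contrastive loss; a minimal sketch (function name and temperature are assumptions, not the repo's implementation):

```python
# Sketch of a symmetric InfoNCE loss aligning event and RGB-D embeddings.
# Matching pairs in the batch are positives; all others are negatives.
import torch
import torch.nn.functional as F

def cross_modal_consistency(event_z, rgbd_z, temperature=0.07):
    event_z = F.normalize(event_z, dim=-1)
    rgbd_z = F.normalize(rgbd_z, dim=-1)
    logits = event_z @ rgbd_z.t() / temperature  # (B, B) similarities
    targets = torch.arange(event_z.size(0))      # i-th pairs match
    # Average the event->rgbd and rgbd->event directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_consistency(torch.randn(8, 128), torch.randn(8, 128))
```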
| Challenge | RGB-D | Event Camera |
|---|---|---|
| Fast gripper motion | Motion blur | Microsecond resolution, no blur |
| Low-light workspace | Noisy / fails | High dynamic range (120dB) |
| Contact detection latency | Next frame (~33ms) | Sub-millisecond response |
| Slip detection | Misses incipient slip | Fires instantly on object movement |
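The latency rows in the table come from the fact that events carry per-event timestamps, so a contact moment can be localized to within a sub-millisecond window rather than a frame interval. A toy detector (thresholds and the event-rate heuristic are hypothetical, not the EvManip contact head):

```python
# Toy contact-moment detector: find the first 1 ms window whose event
# count spikes above a threshold. Parameters are illustrative.
import numpy as np

def detect_contact_time(timestamps_us, window_us=1000, rate_thresh=50):
    timestamps_us = np.sort(np.asarray(timestamps_us))
    edges = np.arange(timestamps_us[0], timestamps_us[-1] + window_us,
                      window_us)
    counts, _ = np.histogram(timestamps_us, bins=edges)
    hits = np.nonzero(counts > rate_thresh)[0]
    return edges[hits[0]] if hits.size else None

# Sparse background noise, then a dense burst at t ~ 50,000 us (contact).
rng = np.random.default_rng(0)
noise = rng.uniform(0, 100_000, 200)
burst = rng.uniform(50_000, 51_000, 500)
t = detect_contact_time(np.concatenate([noise, burst]))
print(t)  # first over-threshold window, near 50,000 us
```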
Evaluating the pretrained encoder (frozen) with task heads fine-tuned on downstream tasks using limited labels:
| Task | Train from Scratch | DINOv2 (RGB) | MEM (Events only) | EvManip (Ours) |
|---|---|---|---|---|
| Grasp Success Prediction | 61.2% | 74.8% | 68.3% | 81.4% |
| Slip Detection | 54.7% | 63.1% | 71.2% | 79.6% |
| Contact Moment Detection (ms error) | 18.4ms | 14.2ms | 9.8ms | 6.1ms |
Results with 10% label supervision. Full results in paper.
```bash
git clone https://github.com/yourusername/evmanip.git
cd evmanip
pip install -r requirements.txt
```

Requirements:
- Python 3.8+
- PyTorch 2.0+
- v2e (for synthetic event generation)
- RoboMimic
```bash
pip install torch torchvision
pip install v2e
pip install robomimic
```

We use v2e to convert RGB-D manipulation videos to synthetic event streams. No real event camera hardware is required for pretraining.
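v2e implements a detailed DVS pixel model (noise, per-pixel thresholds, temporal interpolation); the core principle it builds on is a log-intensity threshold, sketched below. This toy function is not v2e's API, just the underlying idea.

```python
# Toy DVS event model: emit an ON/OFF event wherever log intensity
# changes by more than `threshold` between two frames. v2e layers noise,
# per-pixel thresholds, and frame interpolation on top of this idea.
import numpy as np

def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-6):
    diff = np.log(next_frame + eps) - np.log(prev_frame + eps)
    on = np.argwhere(diff > threshold)    # brightness increased
    off = np.argwhere(diff < -threshold)  # brightness decreased
    return on, off

prev = np.full((4, 4), 0.5)
nxt = prev.copy()
nxt[1, 2] = 1.0   # one pixel brightens -> one ON event
on, off = frames_to_events(prev, nxt)
print(on)  # [[1 2]]
```

The `--threshold 0.2` flag in the pipeline below plays the same role as `threshold` here.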
```bash
# Step 1: Download RoboMimic dataset
python scripts/download_robomimic.py --task lift square transport

# Step 2: Convert RGB videos to synthetic events
python scripts/generate_events.py \
    --input_dir data/robomimic/videos \
    --output_dir data/events \
    --threshold 0.2 \
    --noise_rate 0.01

# Step 3: Verify event quality
python scripts/visualize_events.py --sequence data/events/lift_000
```

```bash
python train_pretrain.py \
    --data_dir data/events \
    --batch_size 64 \
    --epochs 200 \
    --lr 1e-4 \
    --loss contact_pred cross_modal motion_forecast \
    --output_dir checkpoints/evmanip_pretrained
```

Key pretraining arguments:
| Argument | Default | Description |
|---|---|---|
| `--event_window_ms` | 10 | Temporal window for event tokenization |
| `--loss_weights` | 0.4 0.4 0.2 | Weights for the 3 pretraining losses |
| `--fusion` | cross_attn | Fusion module type |
| `--freeze_rgb` | False | Freeze RGB encoder during pretraining |
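The `--event_window_ms` argument controls how the raw event stream is sliced into tokens. A minimal sketch of window-based binning (the real tokenizer's layout is an assumption; this shows only the time-slicing step):

```python
# Sketch of event tokenization: bin a stream into per-window
# polarity-sum frames, one slice per `window_ms` (cf. --event_window_ms).
import numpy as np

def tokenize_events(xs, ys, ps, ts_us, H=64, W=64, window_ms=10):
    window_us = window_ms * 1000
    n_windows = int(ts_us.max() // window_us) + 1
    frames = np.zeros((n_windows, H, W), dtype=np.float32)
    idx = (ts_us // window_us).astype(int)
    # Signed accumulation: ON events add +1, OFF events add -1.
    np.add.at(frames, (idx, ys, xs), np.where(ps > 0, 1.0, -1.0))
    return frames

rng = np.random.default_rng(1)
n = 1000
frames = tokenize_events(rng.integers(0, 64, n), rng.integers(0, 64, n),
                         rng.integers(0, 2, n), rng.uniform(0, 30_000, n))
print(frames.shape)  # 30 ms of events in 10 ms windows -> (3, 64, 64)
```

Each frame would then be patchified and fed to the spatiotemporal event encoder.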
```bash
# Grasp success prediction
python train_finetune.py \
    --task grasp_success \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Slip detection
python train_finetune.py \
    --task slip_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Contact moment detection
python train_finetune.py \
    --task contact_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50
```

```
evmanip/
├── configs/                  # Training configs
│   ├── pretrain.yaml
│   └── finetune/
│       ├── grasp_success.yaml
│       ├── slip_detection.yaml
│       └── contact_detection.yaml
├── data/
│   ├── robomimic/            # RoboMimic RGB-D data
│   └── events/               # Synthetic event streams (generated)
├── models/
│   ├── event_encoder.py      # Spatiotemporal event ViT
│   ├── rgb_encoder.py        # RGB-D ViT encoder
│   ├── fusion.py             # Cross-modal attention fusion
│   └── evmanip.py            # Full EvManip model
├── losses/
│   ├── contact_prediction.py
│   ├── cross_modal.py
│   └── motion_forecast.py
├── scripts/
│   ├── download_robomimic.py
│   ├── generate_events.py
│   └── visualize_events.py
├── train_pretrain.py
├── train_finetune.py
├── evaluate.py
├── requirements.txt
└── README.md
```
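The fine-tuning recipe (frozen encoder, `--label_fraction 0.1`) can be sketched as follows. Function and class names here are illustrative, not the repo's actual API.

```python
# Sketch of the fine-tuning setup: freeze the pretrained encoder,
# subsample a fraction of the labels, train only a small task head.
# Names and the 256-d representation size are assumptions.
import torch
import torch.nn as nn

def make_finetune_model(encoder, num_classes=2):
    for p in encoder.parameters():      # frozen encoder (linear probing)
        p.requires_grad = False
    head = nn.Linear(256, num_classes)  # task head, e.g. grasp success
    return nn.Sequential(encoder, head)

def subsample_labels(n_total, label_fraction=0.1, seed=0):
    """Pick the training indices kept when --label_fraction 0.1 is used."""
    g = torch.Generator().manual_seed(seed)
    n_keep = max(1, int(n_total * label_fraction))
    return torch.randperm(n_total, generator=g)[:n_keep]

idx = subsample_labels(1000, 0.1)
print(len(idx))  # 100
```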
| Dataset | Usage | Source |
|---|---|---|
| RoboMimic | Pretraining (RGB-D → synthetic events) | Link |
| MimicGen | Pretraining augmentation | Link |
| E-Grasp | Real-event fine-tuning validation | Link |
| Neuro-Grasp | Real-event evaluation | Link |
Event Camera Pretraining:
- MEM: Masked Event Modeling (WACV 2024) — SSL pretraining for classification, not manipulation
- TESPEC (ICCV 2025) — Temporal event pretraining, no manipulation tasks
Manipulation Representations:
- R3M — RGB video pretraining for manipulation
- MCR (ICLR 2025) — Manipulation-centric representations, RGB only
Event Cameras for Manipulation (task-specific, no pretraining):
- Event-Grasping Dataset (2020)
- Neuromorphic Slip Detection (2020)
EvManip is the first work at the intersection of all three.
```bibtex
@article{evmanip2025,
  title   = {EvManip: A Foundation Model for Robot Manipulation using Event Cameras},
  author  = {Your Name},
  journal = {arXiv preprint},
  year    = {2025}
}
```

MIT License. See LICENSE for details.