The first self-supervised foundation model that learns manipulation representations from event cameras, transferring across multiple contact-rich robot manipulation tasks.
Existing manipulation foundation models (R3M, MVP, MCR) rely entirely on RGB or RGB-D inputs — they fail under fast motion and low-light conditions common in real manipulation scenarios. Event cameras offer microsecond temporal resolution and high dynamic range, but all prior event-camera work trains task-specific models from scratch with no transferable representations.
EvManip fills this gap. We pretrain a cross-modal encoder on large-scale event + RGB-D manipulation data using self-supervised objectives that require zero human labels. The pretrained encoder transfers to multiple downstream manipulation tasks, outperforming RGB-only baselines — especially in low-data, fast-motion, and low-light regimes.
```
RGB-D Frame  ─► ViT Encoder ─────────────┐
                                         ├─► Cross-Modal Fusion ─► Manipulation Representation
Event Stream ─► Spatiotemporal           │                                     │
                Event Encoder ───────────┘           ┌─────────────────────────┼─────────────────────────┐
                                                     ▼                         ▼                         ▼
                                              Grasp Success             Slip Detection             Contact Timing
```
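The fusion stage in the diagram above can be illustrated with a minimal PyTorch sketch. This is not the repository's actual module; the class name, token shapes, and single-direction attention are illustrative assumptions.

```python
# Minimal sketch of cross-modal fusion via cross-attention.
# Shapes and module layout are assumptions, not the repo's actual API.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Event tokens attend to RGB-D tokens; pooled output is the
    manipulation representation."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_tokens, rgbd_tokens):
        # event_tokens: (B, N_ev, dim), rgbd_tokens: (B, N_rgb, dim)
        fused, _ = self.attn(event_tokens, rgbd_tokens, rgbd_tokens)
        fused = self.norm(event_tokens + fused)  # residual + norm
        return fused.mean(dim=1)                 # pool to (B, dim)

fusion = CrossModalFusion()
rep = fusion(torch.randn(2, 32, 256), torch.randn(2, 64, 256))
print(rep.shape)  # torch.Size([2, 256])
```

A bidirectional variant (RGB-D tokens also attending to event tokens) is a natural extension of the same pattern.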
- First foundation model combining event cameras and RGB-D for manipulation representation learning
- Three self-supervised pretraining objectives requiring zero human annotation:
- Contact Prediction — learn contact dynamics from event spikes at touch moments
- Cross-Modal Consistency — align event and RGB-D latent spaces contrastively
- Motion Forecasting — predict future event distributions from current stream
- Transfers across tasks — one pretrained encoder, multiple downstream manipulation tasks
- Data pipeline — v2e-based synthetic event generation from any RGB-D manipulation dataset
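Of the three objectives, Cross-Modal Consistency is a standard symmetric InfoNCE contrastive loss; a minimal sketch (function name and temperature are assumptions, not the repo's implementation):

```python
# Sketch of a symmetric InfoNCE loss aligning event and RGB-D embeddings.
# Matching pairs in the batch are positives; all others are negatives.
import torch
import torch.nn.functional as F

def cross_modal_consistency(event_z, rgbd_z, temperature=0.07):
    event_z = F.normalize(event_z, dim=-1)
    rgbd_z = F.normalize(rgbd_z, dim=-1)
    logits = event_z @ rgbd_z.t() / temperature  # (B, B) similarities
    targets = torch.arange(event_z.size(0))      # i-th pairs match
    # Average the event->rgbd and rgbd->event directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_consistency(torch.randn(8, 128), torch.randn(8, 128))
```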
| Challenge | RGB-D | Event Camera |
|---|---|---|
| Fast gripper motion | Motion blur | Microsecond resolution, no blur |
| Low-light workspace | Noisy / fails | High dynamic range (120dB) |
| Contact detection latency | Next frame (~33ms) | Sub-millisecond response |
| Slip detection | Misses incipient slip | Fires instantly on object movement |
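The latency rows in the table come from the fact that events carry per-event timestamps, so a contact moment can be localized to within a sub-millisecond window rather than a frame interval. A toy detector (thresholds and the event-rate heuristic are hypothetical, not the EvManip contact head):

```python
# Toy contact-moment detector: find the first 1 ms window whose event
# count spikes above a threshold. Parameters are illustrative.
import numpy as np

def detect_contact_time(timestamps_us, window_us=1000, rate_thresh=50):
    timestamps_us = np.sort(np.asarray(timestamps_us))
    edges = np.arange(timestamps_us[0], timestamps_us[-1] + window_us,
                      window_us)
    counts, _ = np.histogram(timestamps_us, bins=edges)
    hits = np.nonzero(counts > rate_thresh)[0]
    return edges[hits[0]] if hits.size else None

# Sparse background noise, then a dense burst at t ~ 50,000 us (contact).
rng = np.random.default_rng(0)
noise = rng.uniform(0, 100_000, 200)
burst = rng.uniform(50_000, 51_000, 500)
t = detect_contact_time(np.concatenate([noise, burst]))
print(t)  # first over-threshold window, near 50,000 us
```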
Evaluating the pretrained encoder (frozen) with task heads fine-tuned on downstream tasks using limited labels:
| Task | Train from Scratch | DINOv2 (RGB) | MEM (Events only) | EvManip (Ours) |
|---|---|---|---|---|
| Grasp Success Prediction | 61.2% | 74.8% | 68.3% | 81.4% |
| Slip Detection | 54.7% | 63.1% | 71.2% | 79.6% |
| Contact Moment Detection (ms error) | 18.4ms | 14.2ms | 9.8ms | 6.1ms |
Results with 10% label supervision. Full results in paper.
```bash
git clone https://github.com/yourusername/evmanip.git
cd evmanip
pip install -r requirements.txt
```

Requirements:
- Python 3.8+
- PyTorch 2.0+
- v2e (for synthetic event generation)
- RoboMimic
```bash
pip install torch torchvision
pip install v2e
pip install robomimic
```

We use v2e to convert RGB-D manipulation videos to synthetic event streams. No real event camera hardware is required for pretraining.
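v2e implements a detailed DVS pixel model (noise, per-pixel thresholds, temporal interpolation); the core principle it builds on is a log-intensity threshold, sketched below. This toy function is not v2e's API, just the underlying idea.

```python
# Toy DVS event model: emit an ON/OFF event wherever log intensity
# changes by more than `threshold` between two frames. v2e layers noise,
# per-pixel thresholds, and frame interpolation on top of this idea.
import numpy as np

def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-6):
    diff = np.log(next_frame + eps) - np.log(prev_frame + eps)
    on = np.argwhere(diff > threshold)    # brightness increased
    off = np.argwhere(diff < -threshold)  # brightness decreased
    return on, off

prev = np.full((4, 4), 0.5)
nxt = prev.copy()
nxt[1, 2] = 1.0   # one pixel brightens -> one ON event
on, off = frames_to_events(prev, nxt)
print(on)  # [[1 2]]
```

The `--threshold 0.2` flag in the pipeline below plays the same role as `threshold` here.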
```bash
# Step 1: Download RoboMimic dataset
python scripts/download_robomimic.py --task lift square transport

# Step 2: Convert RGB videos to synthetic events
python scripts/generate_events.py \
    --input_dir data/robomimic/videos \
    --output_dir data/events \
    --threshold 0.2 \
    --noise_rate 0.01

# Step 3: Verify event quality
python scripts/visualize_events.py --sequence data/events/lift_000
```

```bash
python train_pretrain.py \
    --data_dir data/events \
    --batch_size 64 \
    --epochs 200 \
    --lr 1e-4 \
    --loss contact_pred cross_modal motion_forecast \
    --output_dir checkpoints/evmanip_pretrained
```

Key pretraining arguments:
| Argument | Default | Description |
|---|---|---|
| `--event_window_ms` | 10 | Temporal window for event tokenization |
| `--loss_weights` | 0.4 0.4 0.2 | Weights for the 3 pretraining losses |
| `--fusion` | cross_attn | Fusion module type |
| `--freeze_rgb` | False | Freeze RGB encoder during pretraining |
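The `--event_window_ms` argument controls how the raw event stream is sliced into tokens. A minimal sketch of window-based binning (the real tokenizer's layout is an assumption; this shows only the time-slicing step):

```python
# Sketch of event tokenization: bin a stream into per-window
# polarity-sum frames, one slice per `window_ms` (cf. --event_window_ms).
import numpy as np

def tokenize_events(xs, ys, ps, ts_us, H=64, W=64, window_ms=10):
    window_us = window_ms * 1000
    n_windows = int(ts_us.max() // window_us) + 1
    frames = np.zeros((n_windows, H, W), dtype=np.float32)
    idx = (ts_us // window_us).astype(int)
    # Signed accumulation: ON events add +1, OFF events add -1.
    np.add.at(frames, (idx, ys, xs), np.where(ps > 0, 1.0, -1.0))
    return frames

rng = np.random.default_rng(1)
n = 1000
frames = tokenize_events(rng.integers(0, 64, n), rng.integers(0, 64, n),
                         rng.integers(0, 2, n), rng.uniform(0, 30_000, n))
print(frames.shape)  # 30 ms of events in 10 ms windows -> (3, 64, 64)
```

Each frame would then be patchified and fed to the spatiotemporal event encoder.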
```bash
# Grasp success prediction
python train_finetune.py \
    --task grasp_success \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Slip detection
python train_finetune.py \
    --task slip_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50

# Contact moment detection
python train_finetune.py \
    --task contact_detection \
    --pretrained checkpoints/evmanip_pretrained \
    --label_fraction 0.1 \
    --epochs 50
```

```
evmanip/
├── configs/                  # Training configs
│   ├── pretrain.yaml
│   └── finetune/
│       ├── grasp_success.yaml
│       ├── slip_detection.yaml
│       └── contact_detection.yaml
├── data/
│   ├── robomimic/            # RoboMimic RGB-D data
│   └── events/               # Synthetic event streams (generated)
├── models/
│   ├── event_encoder.py      # Spatiotemporal event ViT
│   ├── rgb_encoder.py        # RGB-D ViT encoder
│   ├── fusion.py             # Cross-modal attention fusion
│   └── evmanip.py            # Full EvManip model
├── losses/
│   ├── contact_prediction.py
│   ├── cross_modal.py
│   └── motion_forecast.py
├── scripts/
│   ├── download_robomimic.py
│   ├── generate_events.py
│   └── visualize_events.py
├── train_pretrain.py
├── train_finetune.py
├── evaluate.py
├── requirements.txt
└── README.md
```
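The fine-tuning recipe (frozen encoder, `--label_fraction 0.1`) can be sketched as follows. Function and class names here are illustrative, not the repo's actual API.

```python
# Sketch of the fine-tuning setup: freeze the pretrained encoder,
# subsample a fraction of the labels, train only a small task head.
# Names and the 256-d representation size are assumptions.
import torch
import torch.nn as nn

def make_finetune_model(encoder, num_classes=2):
    for p in encoder.parameters():      # frozen encoder (linear probing)
        p.requires_grad = False
    head = nn.Linear(256, num_classes)  # task head, e.g. grasp success
    return nn.Sequential(encoder, head)

def subsample_labels(n_total, label_fraction=0.1, seed=0):
    """Pick the training indices kept when --label_fraction 0.1 is used."""
    g = torch.Generator().manual_seed(seed)
    n_keep = max(1, int(n_total * label_fraction))
    return torch.randperm(n_total, generator=g)[:n_keep]

idx = subsample_labels(1000, 0.1)
print(len(idx))  # 100
```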
| Dataset | Usage | Source |
|---|---|---|
| RoboMimic | Pretraining (RGB-D → synthetic events) | Link |
| MimicGen | Pretraining augmentation | Link |
| E-Grasp | Real-event fine-tuning validation | Link |
| Neuro-Grasp | Real-event evaluation | Link |
Event Camera Pretraining:
- MEM: Masked Event Modeling (WACV 2024) — SSL pretraining for classification, not manipulation
- TESPEC (ICCV 2025) — Temporal event pretraining, no manipulation tasks
Manipulation Representations:
- R3M — RGB video pretraining for manipulation
- MCR (ICLR 2025) — Manipulation-centric representations, RGB only
Event Cameras for Manipulation (task-specific, no pretraining):
- Event-Grasping Dataset (2020)
- Neuromorphic Slip Detection (2020)
EvManip is the first work at the intersection of all three.
```bibtex
@article{evmanip2025,
  title   = {EvManip: A Foundation Model for Robot Manipulation using Event Cameras},
  author  = {Your Name},
  journal = {arXiv preprint},
  year    = {2025}
}
```

MIT License. See LICENSE for details.