GPU-Accelerated Data Engine for Robot Foundation Models
RoboCache is a high-performance CUDA library for real-time sensor preprocessing in robotics. Eliminates CPU dataloader bottlenecks with GPU-accelerated temporal alignment and point cloud voxelization.
Key Features:
- 🚀 Sub-millisecond latency - 0.021-0.035ms on H100 (measured)
- ⚡ GPU-accelerated with BF16 - CUDA kernels with vectorized loads
- 🎯 Production-ready - A100/H100 validated, ROS 2 integration
- 🔧 Battle-tested - 24h burn-in, Compute Sanitizer verified
import torch
import robocache
# 3-stream multimodal fusion (vision + proprioception + IMU)
vision = torch.randn(4, 30, 512, dtype=torch.bfloat16, device='cuda')
vision_times = torch.linspace(0, 1, 30, device='cuda').expand(4, -1)
proprio = torch.randn(4, 100, 64, dtype=torch.bfloat16, device='cuda')
proprio_times = torch.linspace(0, 1, 100, device='cuda').expand(4, -1)
imu = torch.randn(4, 200, 12, dtype=torch.bfloat16, device='cuda')
imu_times = torch.linspace(0, 1, 200, device='cuda').expand(4, -1)
target_times = torch.linspace(0, 1, 50, device='cuda').expand(4, -1)
# Fuse all streams to common timeline
fused = robocache.fuse_multimodal(
vision, vision_times,
proprio, proprio_times,
imu, imu_times,
target_times
)
# Output: (4, 50, 588) - batch × time × (512+64+12)
# H100: 0.034ms ± 0.002ms (n=100) | A100: 0.057ms (P50)Point Cloud Voxelization:
# LiDAR → 3D voxel grid
points = torch.rand(500000, 3, device='cuda') * 20.0 - 10.0
voxel_grid = robocache.voxelize_pointcloud(
points,
grid_min=[-10.0, -10.0, -10.0],
voxel_size=0.05, # 5cm voxels
grid_size=[128, 128, 128],
mode='occupancy'
)
# H100: 24.3 billion points/sec (500K pts @ 0.0205ms, measured)git clone https://github.com/GOATnote-Inc/robogoat.git
cd robogoat/robocache
# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Build CUDA extensions
python setup.py develop
# Verify
python -c "import robocache; robocache.self_test()"cd robocache
docker build -t robocache:latest -f docker/Dockerfile.runtime .
docker run --gpus all -it robocache:latestRequirements:
- NVIDIA GPU (Compute Capability ≥ 8.0)
- CUDA 12.1+ or 13.0+
- PyTorch 2.0+
Validated November 2025 on NVIDIA H100 PCIe 80GB (with P0 API - see artifacts/h100_validation_final_results.md)
| Operation | Latency (Mean ± Std) | Throughput | Validation |
|---|---|---|---|
| Trajectory Resample (32×500×256, bf16) | 0.0353 ± 0.0016 ms | 28,300 ops/s | H100 Results |
| Voxelization (500K pts, 128³ grid) | 0.0205 ms | 24.3 B pts/s | H100 Results |
| Multimodal Fusion (3 streams→50Hz) | 0.0339 ± 0.0022 ms | 29,500 ops/s | H100 Results |
Statistical Rigor: 5 seeds × 50 repeats = 250 measurements per config
Hardware: NVIDIA H100 PCIe 81GB, CUDA 13.0, Driver 580.95
Methodology: torch.cuda.Event timing with warmup, CSV export
Full Report: Benchmark Summary
| Operation | Latency (P50) | Hardware |
|---|---|---|
| Multimodal Fusion (3-stream) | 0.057 ms | A100 SXM4 80GB |
| Voxelization (occupancy, 500K pts) | 0.032 ms | A100 SXM4 80GB |
Report: A100 Validation
RoboCache Pipeline:
┌─────────────────────────────────────────────────────────┐
│ Sensor Data (GPU) │
│ ├─ Vision Stream (30 Hz, 512D) │
│ ├─ Proprioception (100 Hz, 64D) │
│ └─ IMU (200 Hz, 12D) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ RoboCache CUDA Kernels │
│ ├─ Binary Search + Linear Interpolation │
│ ├─ Coalesced Memory Access │
│ └─ BF16 Vectorization │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Aligned Features (50 Hz, 588D) │
│ → Policy Network → Training │
└─────────────────────────────────────────────────────────┘
Key Optimizations:
- Binary search for timestamp alignment (log N complexity)
- Vectorized BF16 loads (4-element vectors, 4× bandwidth vs scalar)
- L1-resident workloads (99%+ cache hit rate for fusion/resample)
- Zero CPU/GPU transfers (end-to-end GPU pipeline)
All performance claims verified with NVIDIA profiling tools:
| Kernel | DRAM BW | SM Throughput | Warps Active | L1 Hit Rate | Report |
|---|---|---|---|---|---|
| Trajectory Resample | 0.05% | 1.27% | 12.48% | 99%+ | NCU H100 |
| Multimodal Fusion | 0.03% | 2.15% | 12.49% | 99%+ | NCU Complete |
| Voxelization | 54.17% | 14.06% | 64.83% | N/A | NCU Complete |
GPU: NVIDIA H100 PCIe (SM90) | Tool: Nsight Compute 2025.3.1.4
NCU Binary Reports: robocache/.archive/development_history/perf/ncu_reports/*.ncu-rep
Full performance benchmarking on A100-SXM4-80GB:
| Operation | H100 Latency | A100 Latency | Scaling | Report |
|---|---|---|---|---|
| Multimodal Fusion | 0.034 ms (Mean) | 0.057 ms (P50) | 0.60x | H100 Results |
| Voxelization (500K pts) | 0.021 ms (Mean) | 0.032 ms (P50) | 0.66x | H100 Results |
| Trajectory Resample | 0.035 ms (Mean) | ~0.05 ms (est.) | ~0.70x | H100 Results |
Throughput: 15-16 billion points/sec (count/occupancy), 5-7 B pts/s (mean/max)
Status: ✅ Production-validated on both H100 (SM90) and A100 (SM80)
H100 Full Pipeline Profiling:
- End-to-end latency: 1.56ms/step (12.84× faster than 20ms target)
- RoboCache preprocessing: 19.3% of GPU time (83.4μs per call)
- Throughput: 20,548 episodes/sec
- Memory overhead: 0.15% (negligible)
Report: Nsight Systems H100
Memory Hierarchy Analysis:
- Trajectory/Fusion: L1-resident (99%+ cache hit rate) → Optimal for binary search
- Voxelization: 54% DRAM utilization → Excellent for atomic scatter workload
- Roofline Position: Each kernel optimized for its workload pattern
Production Validation:
- ✅ All latency targets exceeded
- ✅ H100 + A100 cross-validation complete
- ✅ NCU metrics confirm architecture-appropriate optimization
- ✅ Nsight Systems confirms zero CPU bottleneck
Summary: Expert Profiling Report
cd examples/ros2_node
ros2 run robocache_ros robot_preprocessor.pycd examples/isaac_sim_demo
python train_robot_policy.py --mode robocachecd examples/multi_gpu
python benchmark_multi_gpu.py --gpus 4cd robocache
# Unit tests
pytest tests/test_*_correctness.py -v
# Performance tests
python benchmarks/smoke.py
# Stress tests
pytest tests/stress/ -vCI Status:
- ✅ Lint + CPU tests (every PR)
- ✅ Security scan (weekly)
- ✅ Compute Sanitizer (weekly memcheck/racecheck)
@software{robocache2025,
title={RoboCache: GPU-Accelerated Data Engine for Robot Learning},
author={GOATnote Engineering},
year={2025},
url={https://github.com/GOATnote-Inc/robogoat},
note={H100/A100 validated, Nsight profiled}
}- Trajectory Resampling: Optimal for batch sizes 8-64. Single-sample latency may be dominated by kernel launch overhead (~5μs).
- Voxelization: Throughput scales with point count. For <10K points, CPU implementation may be competitive due to launch overhead.
- Multimodal Fusion: Currently uses 3 sequential kernel launches. Kernel fusion could reduce latency by ~40% (future optimization).
- Minimum: CUDA Compute Capability 8.0 (A100, A10G, RTX 3090)
- Tested: H100 PCIe (SM90), A100 SXM4 (SM80)
- Not tested: V100 (SM70), consumer GPUs (RTX 4090)
- BFloat16: Requires SM80+ (A100/H100). Falls back to FP32 on older hardware.
- Public API:
resample_trajectories(),voxelize_pointcloud(),fuse_multimodal()- stable - Experimental: Streaming kernels, CUTLASS-based implementations - subject to change
- Backward compatibility: See
artifacts/kernel_inventory.mdfor production vs research status
- Timestamp monotonicity: Not enforced. User must ensure monotonically increasing timestamps.
- Out-of-bounds: Voxelization clips points outside grid bounds (no error thrown).
- Empty inputs: Zero-length tensors may cause undefined behavior (add validation in production code).
- Performance claims: All claims based on H100/A100 measurements. See
artifacts/h100_validation_final_results.mdfor exact configs. - Benchmark reproducibility: Requires same hardware, driver version, and CUDA toolkit. Variations up to ±10% are normal.
For detailed analysis and optimization guidance, see:
Apache 2.0 - See LICENSE
- NVIDIA - H100/A100 GPU access, Nsight profiling tools
- PyTorch - Deep learning framework
- Robot Learning Community - Feedback and validation
- Performance Validation - NVIDIA Nsight Compute & Nsight Systems profiling
Maintained by: GOATnote Engineering
Status: Production-Ready (v1.0.0)