RoboCache

GPU-Accelerated Data Engine for Robot Foundation Models

Quick Start | Installation | Performance | Documentation

Overview

RoboCache is a high-performance CUDA library for real-time sensor preprocessing in robotics. It eliminates CPU dataloader bottlenecks with GPU-accelerated temporal alignment and point cloud voxelization.

Key Features:

🚀 Sub-millisecond latency - 0.021-0.035ms on H100 (measured)
⚡ GPU-accelerated with BF16 - CUDA kernels with vectorized loads
🎯 Production-ready - A100/H100 validated, ROS 2 integration
🔧 Battle-tested - 24h burn-in, Compute Sanitizer verified

Quick Start

import torch
import robocache

# 3-stream multimodal fusion (vision + proprioception + IMU)
vision = torch.randn(4, 30, 512, dtype=torch.bfloat16, device='cuda')
vision_times = torch.linspace(0, 1, 30, device='cuda').expand(4, -1)

proprio = torch.randn(4, 100, 64, dtype=torch.bfloat16, device='cuda')
proprio_times = torch.linspace(0, 1, 100, device='cuda').expand(4, -1)

imu = torch.randn(4, 200, 12, dtype=torch.bfloat16, device='cuda')
imu_times = torch.linspace(0, 1, 200, device='cuda').expand(4, -1)

target_times = torch.linspace(0, 1, 50, device='cuda').expand(4, -1)

# Fuse all streams to common timeline
fused = robocache.fuse_multimodal(
    vision, vision_times,
    proprio, proprio_times,
    imu, imu_times,
    target_times
)
# Output: (4, 50, 588) - batch × time × (512+64+12)
# H100: 0.034ms ± 0.002ms (n=100) | A100: 0.057ms (P50)

Point Cloud Voxelization:

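# LiDAR → 3D voxel grid
# Note: a 128³ grid of 0.05 m voxels spans 6.4 m per axis from grid_min, so of the
# points sampled here in [-10, 10) m, only those in [-10, -3.6) m fall inside the grid.
points = torch.rand(500000, 3, device='cuda') * 20.0 - 10.0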

voxel_grid = robocache.voxelize_pointcloud(
    points,
    grid_min=[-10.0, -10.0, -10.0],
    voxel_size=0.05,  # 5cm voxels
    grid_size=[128, 128, 128],
    mode='occupancy'
)
# H100: 24.3 billion points/sec (500K pts @ 0.0205ms, measured)
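
The mapping from a point to a voxel can be sketched in plain Python. This is a minimal CPU reference for reasoning about grid extent, not the library's kernel: `voxel_index` and `occupancy` are hypothetical helper names, and dropping out-of-grid points is an assumption about how out-of-bounds points are handled.

```python
import math

def voxel_index(point, grid_min, voxel_size, grid_size):
    """Return the (ix, iy, iz) voxel containing `point`, or None if it is outside the grid."""
    idx = []
    for p, lo, n in zip(point, grid_min, grid_size):
        i = math.floor((p - lo) / voxel_size)
        if i < 0 or i >= n:
            return None  # assumption: out-of-grid points are dropped
        idx.append(i)
    return tuple(idx)

def occupancy(points, grid_min, voxel_size, grid_size):
    """Set of occupied voxel indices (a sparse stand-in for a dense 0/1 grid)."""
    occupied = set()
    for pt in points:
        idx = voxel_index(pt, grid_min, voxel_size, grid_size)
        if idx is not None:
            occupied.add(idx)
    return occupied

grid_min = [-10.0, -10.0, -10.0]
grid_size = [128, 128, 128]
voxel_size = 0.05  # metres

pts = [
    (-9.99, -9.99, -9.99),  # inside, voxel (0, 0, 0)
    (-9.98, -9.99, -9.99),  # same voxel as the first point
    (-4.99, -4.99, -4.99),  # inside, voxel (100, 100, 100)
    (0.0, 0.0, 0.0),        # outside: index 200 exceeds the 128-voxel extent
]
occ = occupancy(pts, grid_min, voxel_size, grid_size)
```

With these settings the grid covers `grid_size * voxel_size` = 6.4 m per axis, so choosing `grid_size` or `voxel_size` to match the sensor range determines how many points contribute to the output.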

Installation

From Source

git clone https://github.com/GOATnote-Inc/robogoat.git
cd robogoat/robocache

# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Build CUDA extensions
python setup.py develop

# Verify
python -c "import robocache; robocache.self_test()"

Docker

cd robocache
docker build -t robocache:latest -f docker/Dockerfile.runtime .
docker run --gpus all -it robocache:latest

Requirements:

NVIDIA GPU (Compute Capability ≥ 8.0)
CUDA 12.1 or later (including 13.x)
PyTorch 2.0+

Performance

H100 Benchmarks

Validated November 2025 on NVIDIA H100 PCIe 80GB (with P0 API - see artifacts/h100_validation_final_results.md)

Operation	Latency (Mean ± Std)	Throughput	Validation
Trajectory Resample (32×500×256, bf16)	0.0353 ± 0.0016 ms	28,300 ops/s	H100 Results
Voxelization (500K pts, 128³ grid)	0.0205 ms	24.3 B pts/s	H100 Results
Multimodal Fusion (3 streams→50Hz)	0.0339 ± 0.0022 ms	29,500 ops/s	H100 Results

Statistical Rigor: 5 seeds × 50 repeats = 250 measurements per config
Hardware: NVIDIA H100 PCIe 80GB, CUDA 13.0, Driver 580.95
Methodology: torch.cuda.Event timing with warmup, CSV export
Full Report: Benchmark Summary
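
The relation between the latency and throughput columns can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the project's benchmark harness: `summarize` and `points_per_second` are hypothetical helpers, and the sample list is synthetic.

```python
import statistics

def summarize(latencies_ms):
    """Mean, sample standard deviation, and ops/s for a list of per-call latencies in ms."""
    mean = statistics.mean(latencies_ms)
    return {
        "n": len(latencies_ms),
        "mean_ms": mean,
        "std_ms": statistics.stdev(latencies_ms),
        "ops_per_s": 1000.0 / mean,  # one operation per call
    }

def points_per_second(num_points, latency_ms):
    """Voxelization throughput for one call that processes `num_points` points."""
    return num_points / (latency_ms / 1000.0)

# Synthetic samples for illustration; real runs would pool seeds x repeats measurements.
samples_ms = [0.0340, 0.0360, 0.0350, 0.0370, 0.0345]
report = summarize(samples_ms)

# A mean of 0.0353 ms corresponds to about 28,300 ops/s.
resample_ops = 1000.0 / 0.0353

# 500K points in 0.032 ms (the A100 figure) is about 15.6 billion points/s.
a100_pts = points_per_second(500_000, 0.032)
```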

A100 Benchmarks

Operation	Latency (P50)	Hardware
Multimodal Fusion (3-stream)	0.057 ms	A100 SXM4 80GB
Voxelization (occupancy, 500K pts)	0.032 ms	A100 SXM4 80GB

Report: A100 Validation

Architecture

RoboCache Pipeline:
┌─────────────────────────────────────────────────────────┐
│  Sensor Data (GPU)                                       │
│    ├─ Vision Stream     (30 Hz, 512D)                   │
│    ├─ Proprioception    (100 Hz, 64D)                   │
│    └─ IMU               (200 Hz, 12D)                   │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  RoboCache CUDA Kernels                                  │
│    ├─ Binary Search + Linear Interpolation             │
│    ├─ Coalesced Memory Access                          │
│    └─ BF16 Vectorization                               │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Aligned Features (50 Hz, 588D)                         │
│    → Policy Network → Training                          │
└─────────────────────────────────────────────────────────┘

Key Optimizations:

Binary search for timestamp alignment (log N complexity)
Vectorized BF16 loads (4-element vectors, 4× bandwidth vs scalar)
L1-resident workloads (99%+ cache hit rate for fusion/resample)
Zero CPU/GPU transfers (end-to-end GPU pipeline)
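
The alignment step named above (binary search followed by linear interpolation) can be sketched as a CPU reference in plain Python. This is an illustration under stated assumptions, not the CUDA kernel: `resample` and `fuse` are hypothetical names, and clamping targets outside the source time range to the endpoint values is an assumption about boundary handling.

```python
from bisect import bisect_right

def resample(times, values, target_times):
    """Interpolate `values` (one feature list per timestamp) at each target time.

    `times` must be strictly increasing. Targets outside [times[0], times[-1]]
    are clamped to the endpoint values (assumption).
    """
    out = []
    last = len(times) - 1
    for t in target_times:
        if t <= times[0]:
            out.append(list(values[0]))
            continue
        if t >= times[last]:
            out.append(list(values[last]))
            continue
        hi = bisect_right(times, t)  # first index with times[hi] > t, O(log N)
        lo = hi - 1
        w = (t - times[lo]) / (times[hi] - times[lo])
        out.append([(1 - w) * a + w * b for a, b in zip(values[lo], values[hi])])
    return out

def fuse(streams, target_times):
    """Resample each (times, values) stream and concatenate features per target time."""
    resampled = [resample(t, v, target_times) for t, v in streams]
    return [sum(rows, []) for rows in zip(*resampled)]

times = [0.0, 0.5, 1.0]
values = [[0.0, 10.0], [1.0, 20.0], [3.0, 40.0]]
aligned = resample(times, values, [0.25, 0.5, 0.75, 2.0])
# aligned == [[0.5, 15.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]]

fused = fuse([(times, values), ([0.0, 1.0], [[0.0], [2.0]])], [0.25])
# fused == [[0.5, 15.0, 0.5]]  (feature widths add: 2 + 1 = 3)
```

A reference like this is useful mainly for correctness tests on small inputs, where its output can be compared against the GPU result within BF16 tolerance.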

Expert Validation (NCU & Nsight)

All performance claims verified with NVIDIA profiling tools:

Nsight Compute (NCU) - H100 SM90 Kernel Metrics

Kernel	DRAM BW	SM Throughput	Warps Active	L1 Hit Rate	Report
Trajectory Resample	0.05%	1.27%	12.48%	99%+	NCU H100
Multimodal Fusion	0.03%	2.15%	12.49%	99%+	NCU Complete
Voxelization	54.17%	14.06%	64.83%	N/A	NCU Complete

GPU: NVIDIA H100 PCIe (SM90) | Tool: Nsight Compute 2025.3.1.4
NCU Binary Reports: robocache/.archive/development_history/perf/ncu_reports/*.ncu-rep

A100 SM80 Performance Validation

Full performance benchmarking on A100-SXM4-80GB:

Operation	H100 Latency	A100 Latency	H100/A100 Latency Ratio	Report
Multimodal Fusion	0.034 ms (Mean)	0.057 ms (P50)	0.60x	H100 Results
Voxelization (500K pts)	0.021 ms (Mean)	0.032 ms (P50)	0.66x	H100 Results
Trajectory Resample	0.035 ms (Mean)	~0.05 ms (est.)	~0.70x	H100 Results

A100 voxelization throughput: 15-16 billion points/sec (count/occupancy modes), 5-7 billion points/sec (mean/max modes)
Status: ✅ Production-validated on both H100 (SM90) and A100 (SM80)

Nsight Systems - End-to-End Timeline

H100 Full Pipeline Profiling:

End-to-end latency: 1.56ms/step (12.84× faster than 20ms target)
RoboCache preprocessing: 19.3% of GPU time (83.4μs per call)
Throughput: 20,548 episodes/sec
Memory overhead: 0.15% (negligible)

Report: Nsight Systems H100

Expert Assessment

Memory Hierarchy Analysis:

Trajectory/Fusion: L1-resident (99%+ cache hit rate) → Optimal for binary search
Voxelization: 54% DRAM utilization → Excellent for atomic scatter workload
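Roofline Position: Trajectory/fusion kernels use under 0.1% of DRAM bandwidth and about 1-2% of SM throughput at these sizes, so they are latency-bound rather than bandwidth-bound; voxelization is memory-bound at 54% DRAM bandwidth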

Production Validation:

✅ All latency targets exceeded
✅ H100 + A100 cross-validation complete
✅ NCU metrics confirm architecture-appropriate optimization
✅ Nsight Systems confirms zero CPU bottleneck

Summary: Expert Profiling Report

Examples

ROS 2 Integration

cd examples/ros2_node
ros2 run robocache_ros robot_preprocessor.py

Full Tutorial

Isaac Sim Demo

cd examples/isaac_sim_demo
python train_robot_policy.py --mode robocache

Demo Guide

Multi-GPU Training

cd examples/multi_gpu
python benchmark_multi_gpu.py --gpus 4

Scaling Guide

Documentation

Testing

cd robocache

# Unit tests
pytest tests/test_*_correctness.py -v

# Performance tests
python benchmarks/smoke.py

# Stress tests
pytest tests/stress/ -v

CI Status:

✅ Lint + CPU tests (every PR)
✅ Security scan (weekly)
✅ Compute Sanitizer (weekly memcheck/racecheck)

Citation

@software{robocache2025,
  title={RoboCache: GPU-Accelerated Data Engine for Robot Learning},
  author={GOATnote Engineering},
  year={2025},
  url={https://github.com/GOATnote-Inc/robogoat},
  note={H100/A100 validated, Nsight profiled}
}

Known Limitations

Performance Considerations

Trajectory Resampling: Optimal for batch sizes 8-64. Single-sample latency may be dominated by kernel launch overhead (~5μs).
Voxelization: Throughput scales with point count. For <10K points, CPU implementation may be competitive due to launch overhead.
Multimodal Fusion: Currently uses 3 sequential kernel launches. Kernel fusion could reduce latency by ~40% (future optimization).

Hardware Compatibility

Minimum: CUDA Compute Capability 8.0 (e.g., A100 at SM80; A10G and RTX 3090 at SM86)
Tested: H100 PCIe (SM90), A100 SXM4 (SM80)
Not tested: V100 (SM70, below the stated minimum), consumer GPUs such as RTX 4090 (SM89)
BFloat16: Requires SM80+ (A100/H100). Falls back to FP32 on older hardware.
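
A caller can apply these rules before choosing a dtype. The sketch below is a hypothetical helper, not part of the RoboCache API; it takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns and works on plain tuples so it can be tested without a GPU.

```python
def meets_minimum(capability, minimum=(8, 0)):
    """True if a (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)

def preferred_dtype(capability):
    """Name of the dtype to use: bfloat16 on SM80+, float32 otherwise."""
    return "bfloat16" if meets_minimum(capability, (8, 0)) else "float32"

# Usage with PyTorch (not executed here):
#   import torch
#   cap = torch.cuda.get_device_capability()    # e.g. (9, 0) on H100
#   dtype = getattr(torch, preferred_dtype(cap))
h100 = preferred_dtype((9, 0))
v100 = preferred_dtype((7, 0))
```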

API Stability

Public API: resample_trajectories(), voxelize_pointcloud(), fuse_multimodal() - stable
Experimental: Streaming kernels, CUTLASS-based implementations - subject to change
Backward compatibility: See artifacts/kernel_inventory.md for production vs research status

Functional Limitations

Timestamp monotonicity: Not enforced. User must ensure monotonically increasing timestamps.
Out-of-bounds: Voxelization clips points outside grid bounds (no error thrown).
Empty inputs: Zero-length tensors may cause undefined behavior (add validation in production code).
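
Since these conditions are not checked by the library, a caller-side guard is one option. The following is a minimal sketch with a hypothetical `validate_stream` helper; it accepts nested lists or any object with `.tolist()` (which covers PyTorch tensors, at the cost of a device-to-host copy), and it assumes timestamps should be strictly increasing so that interpolation never divides by zero.

```python
def validate_stream(values, times):
    """Raise ValueError for empty input, length mismatch, or non-increasing timestamps.

    values: (batch, time, features) nested lists or tensor-like
    times:  (batch, time) nested lists or tensor-like
    """
    rows_v = values.tolist() if hasattr(values, "tolist") else values
    rows_t = times.tolist() if hasattr(times, "tolist") else times
    if len(rows_v) == 0 or len(rows_t) == 0:
        raise ValueError("empty batch")
    if len(rows_v) != len(rows_t):
        raise ValueError("values and times have different batch sizes")
    for b, (v, t) in enumerate(zip(rows_v, rows_t)):
        if len(t) == 0:
            raise ValueError(f"batch {b}: zero-length sequence")
        if len(v) != len(t):
            raise ValueError(f"batch {b}: {len(v)} samples but {len(t)} timestamps")
        for i in range(1, len(t)):
            # `not a > b` also rejects NaN timestamps
            if not t[i] > t[i - 1]:
                raise ValueError(f"batch {b}: timestamps not increasing at index {i}")

# Passes silently for a well-formed single-sample batch.
validate_stream([[[1.0], [2.0], [3.0]]], [[0.0, 0.5, 1.0]])
```

Running the check on the host adds a synchronization point, so it is better suited to dataset loading or debugging than to the per-step hot path.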

Documentation

Performance claims: All claims based on H100/A100 measurements. See artifacts/h100_validation_final_results.md for exact configs.
Benchmark reproducibility: Requires same hardware, driver version, and CUDA toolkit. Variations up to ±10% are normal.

For detailed analysis and optimization guidance, see:

License

Apache 2.0 - See LICENSE

Acknowledgments

NVIDIA - H100/A100 GPU access, Nsight profiling tools
PyTorch - Deep learning framework
Robot Learning Community - Feedback and validation
Performance Validation - NVIDIA Nsight Compute & Nsight Systems profiling

Maintained by: GOATnote Engineering
Status: Production-Ready (v1.0.0)
