Skip to content

GOATnote-Inc/robogoat

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

RoboCache

GPU-Accelerated Data Engine for Robot Foundation Models

License CUDA Python PyTorch

Quick Start | Installation | Performance | Documentation


Overview

RoboCache is a high-performance CUDA library for real-time sensor preprocessing in robotics. Eliminates CPU dataloader bottlenecks with GPU-accelerated temporal alignment and point cloud voxelization.

Key Features:

  • 🚀 Sub-millisecond latency - 0.021-0.035ms on H100 (measured)
  • GPU-accelerated with BF16 - CUDA kernels with vectorized loads
  • 🎯 Production-ready - A100/H100 validated, ROS 2 integration
  • 🔧 Battle-tested - 24h burn-in, Compute Sanitizer verified

Quick Start

import torch
import robocache

# 3-stream multimodal fusion (vision + proprioception + IMU)
vision = torch.randn(4, 30, 512, dtype=torch.bfloat16, device='cuda')
vision_times = torch.linspace(0, 1, 30, device='cuda').expand(4, -1)

proprio = torch.randn(4, 100, 64, dtype=torch.bfloat16, device='cuda')
proprio_times = torch.linspace(0, 1, 100, device='cuda').expand(4, -1)

imu = torch.randn(4, 200, 12, dtype=torch.bfloat16, device='cuda')
imu_times = torch.linspace(0, 1, 200, device='cuda').expand(4, -1)

target_times = torch.linspace(0, 1, 50, device='cuda').expand(4, -1)

# Fuse all streams to common timeline
fused = robocache.fuse_multimodal(
    vision, vision_times,
    proprio, proprio_times,
    imu, imu_times,
    target_times
)
# Output: (4, 50, 588) - batch × time × (512+64+12)
# H100: 0.034ms ± 0.002ms (n=100) | A100: 0.057ms (P50)

Point Cloud Voxelization:

# LiDAR → 3D voxel grid
points = torch.rand(500000, 3, device='cuda') * 20.0 - 10.0

voxel_grid = robocache.voxelize_pointcloud(
    points,
    grid_min=[-10.0, -10.0, -10.0],
    voxel_size=0.05,  # 5cm voxels
    grid_size=[128, 128, 128],
    mode='occupancy'
)
# H100: 24.3 billion points/sec (500K pts @ 0.0205ms, measured)

Installation

From Source

git clone https://github.com/GOATnote-Inc/robogoat.git
cd robogoat/robocache

# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Build CUDA extensions
python setup.py develop

# Verify
python -c "import robocache; robocache.self_test()"

Docker

cd robocache
docker build -t robocache:latest -f docker/Dockerfile.runtime .
docker run --gpus all -it robocache:latest

Requirements:

  • NVIDIA GPU (Compute Capability ≥ 8.0)
  • CUDA 12.1+ or 13.0+
  • PyTorch 2.0+

Performance

H100 Benchmarks

Validated November 2025 on NVIDIA H100 PCIe 80GB (with P0 API - see artifacts/h100_validation_final_results.md)

Operation Latency (Mean ± Std) Throughput Validation
Trajectory Resample (32×500×256, bf16) 0.0353 ± 0.0016 ms 28,300 ops/s H100 Results
Voxelization (500K pts, 128³ grid) 0.0205 ms 24.3 B pts/s H100 Results
Multimodal Fusion (3 streams→50Hz) 0.0339 ± 0.0022 ms 29,500 ops/s H100 Results

Statistical Rigor: 5 seeds × 50 repeats = 250 measurements per config
Hardware: NVIDIA H100 PCIe 81GB, CUDA 13.0, Driver 580.95
Methodology: torch.cuda.Event timing with warmup, CSV export
Full Report: Benchmark Summary

A100 Benchmarks

Operation Latency (P50) Hardware
Multimodal Fusion (3-stream) 0.057 ms A100 SXM4 80GB
Voxelization (occupancy, 500K pts) 0.032 ms A100 SXM4 80GB

Report: A100 Validation

Architecture

RoboCache Pipeline:
┌─────────────────────────────────────────────────────────┐
│  Sensor Data (GPU)                                       │
│    ├─ Vision Stream     (30 Hz, 512D)                   │
│    ├─ Proprioception    (100 Hz, 64D)                   │
│    └─ IMU               (200 Hz, 12D)                   │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  RoboCache CUDA Kernels                                  │
│    ├─ Binary Search + Linear Interpolation             │
│    ├─ Coalesced Memory Access                          │
│    └─ BF16 Vectorization                               │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────┐
│  Aligned Features (50 Hz, 588D)                         │
│    → Policy Network → Training                          │
└─────────────────────────────────────────────────────────┘

Key Optimizations:

  • Binary search for timestamp alignment (log N complexity)
  • Vectorized BF16 loads (4-element vectors, 4× bandwidth vs scalar)
  • L1-resident workloads (99%+ cache hit rate for fusion/resample)
  • Zero CPU/GPU transfers (end-to-end GPU pipeline)

Expert Validation (NCU & Nsight)

All performance claims verified with NVIDIA profiling tools:

Nsight Compute (NCU) - H100 SM90 Kernel Metrics

Kernel DRAM BW SM Throughput Warps Active L1 Hit Rate Report
Trajectory Resample 0.05% 1.27% 12.48% 99%+ NCU H100
Multimodal Fusion 0.03% 2.15% 12.49% 99%+ NCU Complete
Voxelization 54.17% 14.06% 64.83% N/A NCU Complete

GPU: NVIDIA H100 PCIe (SM90) | Tool: Nsight Compute 2025.3.1.4
NCU Binary Reports: robocache/.archive/development_history/perf/ncu_reports/*.ncu-rep

A100 SM80 Performance Validation

Full performance benchmarking on A100-SXM4-80GB:

Operation H100 Latency A100 Latency Scaling Report
Multimodal Fusion 0.034 ms (Mean) 0.057 ms (P50) 0.60x H100 Results
Voxelization (500K pts) 0.021 ms (Mean) 0.032 ms (P50) 0.66x H100 Results
Trajectory Resample 0.035 ms (Mean) ~0.05 ms (est.) ~0.70x H100 Results

Throughput: 15-16 billion points/sec (count/occupancy), 5-7 B pts/s (mean/max)
Status: ✅ Production-validated on both H100 (SM90) and A100 (SM80)

Nsight Systems - End-to-End Timeline

H100 Full Pipeline Profiling:

  • End-to-end latency: 1.56ms/step (12.84× faster than 20ms target)
  • RoboCache preprocessing: 19.3% of GPU time (83.4μs per call)
  • Throughput: 20,548 episodes/sec
  • Memory overhead: 0.15% (negligible)

Report: Nsight Systems H100

Expert Assessment

Memory Hierarchy Analysis:

  • Trajectory/Fusion: L1-resident (99%+ cache hit rate) → Optimal for binary search
  • Voxelization: 54% DRAM utilization → Excellent for atomic scatter workload
  • Roofline Position: Each kernel optimized for its workload pattern

Production Validation:

  • ✅ All latency targets exceeded
  • ✅ H100 + A100 cross-validation complete
  • ✅ NCU metrics confirm architecture-appropriate optimization
  • ✅ Nsight Systems confirms zero CPU bottleneck

Summary: Expert Profiling Report


Examples

ROS 2 Integration

cd examples/ros2_node
ros2 run robocache_ros robot_preprocessor.py

Full Tutorial

Isaac Sim Demo

cd examples/isaac_sim_demo
python train_robot_policy.py --mode robocache

Demo Guide

Multi-GPU Training

cd examples/multi_gpu
python benchmark_multi_gpu.py --gpus 4

Scaling Guide


Documentation


Testing

cd robocache

# Unit tests
pytest tests/test_*_correctness.py -v

# Performance tests
python benchmarks/smoke.py

# Stress tests
pytest tests/stress/ -v

CI Status:

  • ✅ Lint + CPU tests (every PR)
  • ✅ Security scan (weekly)
  • ✅ Compute Sanitizer (weekly memcheck/racecheck)

Citation

@software{robocache2025,
  title={RoboCache: GPU-Accelerated Data Engine for Robot Learning},
  author={GOATnote Engineering},
  year={2025},
  url={https://github.com/GOATnote-Inc/robogoat},
  note={H100/A100 validated, Nsight profiled}
}

Known Limitations

Performance Considerations

  • Trajectory Resampling: Optimal for batch sizes 8-64. Single-sample latency may be dominated by kernel launch overhead (~5μs).
  • Voxelization: Throughput scales with point count. For <10K points, CPU implementation may be competitive due to launch overhead.
  • Multimodal Fusion: Currently uses 3 sequential kernel launches. Kernel fusion could reduce latency by ~40% (future optimization).

Hardware Compatibility

  • Minimum: CUDA Compute Capability 8.0 (A100, A10G, RTX 3090)
  • Tested: H100 PCIe (SM90), A100 SXM4 (SM80)
  • Not tested: V100 (SM70), consumer GPUs (RTX 4090)
  • BFloat16: Requires SM80+ (A100/H100). Falls back to FP32 on older hardware.

API Stability

  • Public API: resample_trajectories(), voxelize_pointcloud(), fuse_multimodal() - stable
  • Experimental: Streaming kernels, CUTLASS-based implementations - subject to change
  • Backward compatibility: See artifacts/kernel_inventory.md for production vs research status

Functional Limitations

  • Timestamp monotonicity: Not enforced. User must ensure monotonically increasing timestamps.
  • Out-of-bounds: Voxelization clips points outside grid bounds (no error thrown).
  • Empty inputs: Zero-length tensors may cause undefined behavior (add validation in production code).

Documentation

  • Performance claims: All claims based on H100/A100 measurements. See artifacts/h100_validation_final_results.md for exact configs.
  • Benchmark reproducibility: Requires same hardware, driver version, and CUDA toolkit. Variations up to ±10% are normal.

For detailed analysis and optimization guidance, see:


License

Apache 2.0 - See LICENSE


Acknowledgments

  • NVIDIA - H100/A100 GPU access, Nsight profiling tools
  • PyTorch - Deep learning framework
  • Robot Learning Community - Feedback and validation
  • Performance Validation - NVIDIA Nsight Compute & Nsight Systems profiling

Maintained by: GOATnote Engineering
Status: Production-Ready (v1.0.0)

About

Production-grade GPU acceleration for robot learning. 10-20× faster training on NVIDIA H100/A100. Nsight validated.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

No packages published