Add Apple Metal support for macOS/Apple Silicon #8

Draft
robtaylor wants to merge 1 commit into NVlabs:staged-aig-release from ChipFlow:feature/metal-support
@robtaylor

Summary

This PR adds Apple Metal GPU compute support to GEM, enabling simulation on macOS with Apple Silicon (M1/M2/M3/M4) devices.

Changes

GEM Repository:

  • Add metal feature flag in Cargo.toml
  • Add Metal shader compilation in build.rs using ucc::cl_metal()
  • Add csrc/kernel_v1.metal - Metal compute shader (translated from CUDA)
  • Add src/bin/metal_test.rs - Metal test binary with CPU validation

eda-infra-rs Submodule:

  • Add Metal feature and dependencies to ulib
  • Extend Device enum with Metal(u8) variant
  • Add Metal buffer support to UVec using unified memory
  • Add MetalBuild and cl_metal() to ucc for shader compilation
  • Add _metal suffix handling in bindgen for dispatch generation

Technical Approach

CUDA                                      Metal
__shared__                                threadgroup
__syncthreads()                           threadgroup_barrier(mem_flags::mem_threadgroup)
__shfl_down_sync()                        simd_shuffle_down()
cooperative_groups::this_grid().sync()    Multiple kernel dispatches
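
To illustrate the shuffle-down row of the mapping, here is a minimal CPU model in Rust of the tree reduction that shuffle-down is commonly used for. The PR does not say what the kernel uses the shuffle for, so the reduction itself, the names `shuffle_down` and `simd_sum`, and the 32-lane width are assumptions made for illustration; this is not code from kernel_v1.metal.

```rust
// Assumed SIMD-group width for the sketch (32 lanes).
const SIMD_WIDTH: usize = 32;

/// CPU model of a shuffle-down: lane `i` reads the value held by lane
/// `i + delta`. Lanes whose source would fall outside the group keep
/// their own value (no wrap-around).
fn shuffle_down(lanes: &[u32; SIMD_WIDTH], delta: usize) -> [u32; SIMD_WIDTH] {
    let mut out = *lanes;
    for i in 0..SIMD_WIDTH {
        if i + delta < SIMD_WIDTH {
            out[i] = lanes[i + delta];
        }
    }
    out
}

/// Tree reduction: halve the shuffle distance each step until lane 0
/// holds the (wrapping) sum of all lanes.
pub fn simd_sum(mut lanes: [u32; SIMD_WIDTH]) -> u32 {
    let mut delta = SIMD_WIDTH / 2;
    while delta > 0 {
        let shifted = shuffle_down(&lanes, delta);
        for i in 0..SIMD_WIDTH {
            lanes[i] = lanes[i].wrapping_add(shifted[i]);
        }
        delta /= 2;
    }
    lanes[0]
}
```

Only lane 0 is guaranteed to hold the full sum at the end; the other lanes hold partial results, which is why such reductions read the result from the first lane.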

Grid-wide synchronization is achieved by splitting the kernel at sync points into multiple dispatchThreadgroups calls with explicit completion waits between stages.
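
The splitting idea can be sketched on the CPU with plain Rust threads standing in for threadgroups. This is a hedged illustration of the control flow only: the function name `run_two_stage`, the two-stage workload (partial sums, then a value that depends on every partial sum), and the use of `std::thread::scope` as the "completion wait" are assumptions, not the PR's Metal host code.

```rust
use std::thread;

/// Runs a computation that needs a grid-wide sync as two separate
/// "dispatches". Each spawned thread models one threadgroup.
pub fn run_two_stage(input: &[u32], num_groups: usize) -> Vec<u32> {
    assert!(num_groups > 0, "need at least one group");
    // Elements handled per group, rounded up; at least 1 so chunks() is valid.
    let chunk = ((input.len() + num_groups - 1) / num_groups).max(1);

    // Dispatch 1: every group reduces its own slice to a partial sum.
    let partials: Vec<u32> = thread::scope(|s| {
        let handles: Vec<_> = input
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().copied().fold(0u32, u32::wrapping_add)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Leaving the scope is the completion wait: all groups have finished,
    // so every partial sum is visible before the next stage starts.

    // Dispatch 2: each group needs a value derived from *all* partials,
    // which is what the grid-wide sync guaranteed in the single-kernel form.
    let total = partials.iter().copied().fold(0u32, u32::wrapping_add);
    thread::scope(|s| {
        let handles: Vec<_> = partials
            .iter()
            .map(|&p| s.spawn(move || total.wrapping_sub(p)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

For example, `run_two_stage(&[1, 2, 3, 4, 5, 6, 7, 8], 4)` produces the partial sums 3, 7, 11, 15 in the first stage and returns, for each group, the sum of the other groups' partials. The cost of this design is one launch and one wait per sync point instead of a single long-running kernel.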

Testing

Tested on Apple M4 Pro with NVDLA design (10,624 cycles):

  • Metal simulation: 1.51 seconds
  • CPU validation: ✅ PASSED (bit-identical to CPU reference)
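
A bit-identical check of this kind can be sketched as a word-by-word comparison of the two result buffers. The helper name `first_mismatch` and the `u32` word type are assumptions for illustration; the PR does not show how metal_test performs its comparison.

```rust
/// Returns the index of the first word where the GPU result differs from
/// the CPU reference, or `None` if the two buffers are bit-identical.
/// A length difference counts as a mismatch at the end of the shorter buffer.
pub fn first_mismatch(gpu: &[u32], cpu: &[u32]) -> Option<usize> {
    gpu.iter()
        .zip(cpu.iter())
        .position(|(g, c)| g != c)
        .or_else(|| (gpu.len() != cpu.len()).then(|| gpu.len().min(cpu.len())))
}
```

Reporting the first differing index, rather than only pass or fail, makes it easier to map a failure back to a specific cycle or signal.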

Usage

# Build with Metal support
cargo build --release --features metal --bin metal_test

# Run simulation
cargo run --release --features metal --bin metal_test -- \
    path/to/design.gv \
    path/to/design.gemparts \
    path/to/input.vcd \
    path/to/output.vcd \
    NUM_BLOCKS \
    --check-with-cpu

Known Limitations

  • Output VCD timing is offset by one clock cycle relative to Verilator (the CUDA backend shows the same offset); investigation is ongoing

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 noreply@anthropic.com

Port GEM RTL simulator from CUDA to Metal for Apple Silicon (M1/M2/M3/M4).

Changes:
- Add metal feature flag to Cargo.toml with metal-rs dependencies
- Update build.rs to compile Metal shaders via xcrun metal/metallib
- Create kernel_v1.metal - Metal compute shader translation of CUDA kernel
  - Translate CUDA __shared__ to Metal threadgroup memory
  - Replace __shfl_down_sync with simd_shuffle_down
  - Replace cooperative_groups grid sync with multiple kernel dispatches
- Add metal_test.rs binary for running Metal simulation with CPU validation

The Metal implementation uses:
- Apple Silicon Unified Memory (MTLResourceStorageModeShared)
- 256 threads per threadgroup (matching CUDA blocks)
- Multiple dispatch calls to replace CUDA grid-wide sync

Build with: cargo build --features metal --bin metal_test
Test with: cargo run --features metal --bin metal_test -- [args] --check-with-cpu

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>