Fix Blackwell (sm_120) illegal memory access via CUDA stream handling #29
Open

cuzelac wants to merge 1 commit into JeffreyXiang:main
Conversation
…andling

PyTorch uses `cudaStreamNonBlocking` streams, but CuMesh launched all kernels on the default stream (stream 0) via bare `<<<blocks, threads>>>` syntax. On Blackwell GPUs (RTX 5090, CUDA 13.0), this stream mismatch causes "illegal memory access" errors.

Changes:

- Add a `current_stream()` helper wrapping `at::cuda::getCurrentCUDAStream()`
- Pass PyTorch's current CUDA stream to all kernel launches (`<<<..., 0, stream>>>`), CUB calls, `cudaMemcpyAsync`, and `cudaMemsetAsync` across all 10 source files
- Fix `cudaFree` race conditions: on non-blocking streams, `cudaFree` may return memory to the pool before async kernels finish using it. Added `cudaStreamSynchronize(stream)` before `cudaFree` where the freed memory was recently used by async work on the stream
- Convert `cudaMemcpy` to `cudaMemcpyAsync` with the stream, adding `cudaStreamSynchronize` where the CPU needs to read the result
- Replace timing-related `cudaDeviceSynchronize` calls in `simplify.cu` with `cudaStreamSynchronize(stream)`

Tested on RTX 5090 32GB, PyTorch 2.10.0+cu130, CUDA 13.0, Windows. The full Trellis2 reconstruction pipeline completes without errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
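The `current_stream()` helper mentioned in the commit message can be pictured as a thin wrapper like the following. This is a hedged sketch, not a verbatim excerpt from the patch; the actual name, placement, and return handling may differ.

```cuda
// Sketch of a current_stream() helper in the spirit of the change above.
#include <ATen/cuda/CUDAContext.h>
#include <cuda_runtime.h>

// Returns the CUDA stream PyTorch is currently enqueuing work on, so that
// CuMesh kernels are ordered with PyTorch's allocator and kernels.
inline cudaStream_t current_stream() {
    return at::cuda::getCurrentCUDAStream().stream();
}
```

Kernel launches then pass it as the fourth launch-configuration argument, e.g. `my_kernel<<<blocks, threads, 0, current_stream()>>>(...)`.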
Author
Also filed as visualbruno/CuMesh#2 for the visualbruno fork, which is used by ComfyUI-Trellis2.
Author
This may also fix #27 (fragmented mesh output on Blackwell) — same root cause (stream 0 race condition), different symptom. |
Summary
CuMesh causes `CUDA error: illegal memory access` on NVIDIA Blackwell GPUs (RTX 5090, sm_120) when used with PyTorch. Setting `CUDA_LAUNCH_BLOCKING=1` masks the issue (by serializing all GPU work), confirming a stream synchronization problem.

Root cause
All CUDA kernel launches, CUB operations, and `cudaMemcpy` calls in CuMesh use the default CUDA stream (stream 0). PyTorch allocates memory on `cudaStreamNonBlocking` streams, which have no implicit synchronization with stream 0. This means CuMesh kernels can execute before PyTorch's memory operations complete, or PyTorch can free/reuse memory while CuMesh kernels are still running on stream 0.

This has always been undefined behavior per the CUDA programming model, but it only manifests reliably on Blackwell GPUs, likely due to changes in the memory subsystem or stream scheduling.
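As an illustration of the hazard (illustrative only, not actual CuMesh code): a buffer produced by work on PyTorch's non-blocking stream may still be in flight when a default-stream kernel reads it, because the two streams carry no ordering guarantee.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel standing in for any CuMesh kernel.
__global__ void touch(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

void racy(float* d_ptr, int n) {
    // d_ptr was filled by work enqueued on PyTorch's cudaStreamNonBlocking
    // stream. Launching without a stream argument targets stream 0, which
    // has no implicit ordering against that stream: the kernel may run
    // before the producing work finishes, or after PyTorch's caching
    // allocator has already reused the memory.
    touch<<<(n + 255) / 256, 256>>>(d_ptr, n);  // default stream: race
}
```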
Fix
- Get the current PyTorch stream via `at::cuda::getCurrentCUDAStream()` in every function
- Pass the stream to every kernel launch (`<<<blocks, threads, 0, stream>>>`)
- Pass the stream to CUB calls (`cub::DeviceScan::ExclusiveSum(..., stream)`)
- Use `cudaMemcpyAsync` with the stream instead of synchronous `cudaMemcpy`
- Add `cudaStreamSynchronize(stream)` before all `cudaFree` calls, since `cudaFree` on non-blocking streams may free memory before async kernels finish using it
- Initialize temporary buffers to `nullptr` and conditionally free them at the end of functions (after the sync)

This follows the same pattern used by nvdiffrast and cubvh.
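Putting the points above together, a stream-correct function body looks roughly like this. It is a sketch of the pattern with a hypothetical workload (an exclusive prefix sum whose last offset is read back on the host), not a verbatim excerpt from the PR.

```cuda
#include <ATen/cuda/CUDAContext.h>
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

void example(const int* d_counts, int* d_offsets, int n, int* h_last_offset) {
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();

    // CUB takes the stream as its trailing argument; first call sizes the
    // temporary storage, second call runs the scan on PyTorch's stream.
    void* d_temp = nullptr;  // nullptr-initialized so cleanup is always safe
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(nullptr, temp_bytes, d_counts, d_offsets, n, stream);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_counts, d_offsets, n, stream);

    // Async copy on the same stream; synchronize before the CPU reads it.
    cudaMemcpyAsync(h_last_offset, d_offsets + n - 1, sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // Free only after the stream has drained: cudaFree while async work may
    // still touch the allocation is exactly the race described above.
    if (d_temp) cudaFree(d_temp);
}
```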
Testing
Tested on an RTX 5090 32GB (PyTorch 2.10.0+cu130, CUDA 13.0, Windows): the full Trellis2 reconstruction pipeline completes without errors and without needing `CUDA_LAUNCH_BLOCKING=1`.

Files changed
- `src/remesh/svox2vert.cu`: sparse voxel grid vertex extraction
- `src/connectivity.cu`: mesh connectivity computation
- `src/clean_up.cu`: mesh cleanup operations
- `src/atlas.cu`: texture atlas generation
- `src/shared.h`: template `compress_ids` used throughout
- `src/utils.h`: `Buffer::free()` stream safety
- `src/geometry.cu`, `src/hash/hash.cu`, `src/remesh/simple_dual_contour.cu`, `src/simplify.cu`: stream passing for kernel launches and CUB calls
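The `Buffer::free()` stream-safety change can be pictured as something like the following. This is a guess at the shape of the helper; the real `Buffer` in `src/utils.h` may differ in fields and naming.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of a stream-safe free, in the spirit of the
// Buffer::free() change listed above.
struct Buffer {
    void* data = nullptr;  // nullptr-initialized so free() is always safe

    void free(cudaStream_t stream) {
        if (data) {
            // Drain async work that may still touch this allocation
            // before returning it to the driver.
            cudaStreamSynchronize(stream);
            cudaFree(data);
            data = nullptr;
        }
    }
};
```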