
Various improvements to the docs #3030

Open

giordano wants to merge 4 commits into JuliaGPU:master from giordano:mg/docs

Conversation

@giordano
Contributor

I had some...uhm...fun in the last couple of days trying to port some C++ CUDA code to CUDA.jl and profile it. I dumped my experience into this PR, hoping to make the lives of the people who come after me a little bit easier 🙂

@github-actions
Contributor

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index a42b003cd..36cde4ab9 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -66,32 +66,32 @@ end
 """
     gridDim()::NamedTuple
 
-Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
+    Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
 """
 @inline gridDim() =   (x=gridDim_x(),   y=gridDim_y(),   z=gridDim_z())
 
 """
     blockIdx()::NamedTuple
 
-Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
 """
 @inline blockIdx() =  (x=blockIdx_x(),  y=blockIdx_y(),  z=blockIdx_z())
 
 """
     blockDim()::NamedTuple
 
-Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
+    Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
 """
 @inline blockDim() =  (x=blockDim_x(),  y=blockDim_y(),  z=blockDim_z())
 
 """
     threadIdx()::NamedTuple
 
-Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
 """
 @inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
 
@@ -99,7 +99,7 @@ These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++
     warpsize()::Int32
 
 Returns the warp size (in threads).
-This corresponds to the `warpSize` built-in variable in the C/C++ extension.
+    This corresponds to the `warpSize` built-in variable in the C/C++ extension.
 """
 @inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())
 
@@ -107,7 +107,7 @@ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
     laneid()::Int32
 
 Returns the thread's lane within the warp.
-This ID is 1-based.
+    This ID is 1-based.
 """
 @inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
 

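As a rough illustration of the 1-based indexing convention these docstrings describe, here is a minimal sketch of a kernel and its launch (the kernel name, array sizes, and launch configuration are made up for illustration, not taken from this PR):

```julia
using CUDA

# Hypothetical element-wise kernel: blockIdx()/threadIdx() return 1-based
# NamedTuples, unlike the 0-based built-in variables in CUDA C/C++.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)
# One thread per element, rounding the block count up.
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
```

The explicit `- 1` when combining block and thread indices recovers a 0-based offset before scaling, mirroring the familiar C/C++ pattern `blockIdx.x * blockDim.x + threadIdx.x`.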

github-actions bot left a comment


CUDA.jl Benchmarks

| Benchmark suite | Current: 337b7a7 | Previous: 7a27d77 | Ratio |
| --- | --- | --- | --- |
| latency/precompile | 44169430501.5 ns | 44455759835 ns | 0.99 |
| latency/ttfp | 13133988338 ns | 13140153243 ns | 1.00 |
| latency/import | 3764428041 ns | 3755312424 ns | 1.00 |
| integration/volumerhs | 9435111.5 ns | 9442840 ns | 1.00 |
| integration/byval/slices=1 | 145882 ns | 145598 ns | 1.00 |
| integration/byval/slices=3 | 423338 ns | 422554 ns | 1.00 |
| integration/byval/reference | 144101 ns | 143811 ns | 1.00 |
| integration/byval/slices=2 | 284586 ns | 284011 ns | 1.00 |
| integration/cudadevrt | 102648 ns | 102397 ns | 1.00 |
| kernel/indexing | 13579 ns | 13434 ns | 1.01 |
| kernel/indexing_checked | 14218 ns | 13908 ns | 1.02 |
| kernel/occupancy | 647.9156626506024 ns | 644.5636363636364 ns | 1.01 |
| kernel/launch | 2059.4 ns | 2090.3 ns | 0.99 |
| kernel/rand | 14467 ns | 14479 ns | 1.00 |
| array/reverse/1d | 19050 ns | 18661 ns | 1.02 |
| array/reverse/2dL_inplace | 66409 ns | 66252 ns | 1.00 |
| array/reverse/1dL | 69247 ns | 68893 ns | 1.01 |
| array/reverse/2d | 21375 ns | 21087 ns | 1.01 |
| array/reverse/1d_inplace | 10869.333333333334 ns | 10503.833333333332 ns | 1.03 |
| array/reverse/2d_inplace | 11034 ns | 11399.5 ns | 0.97 |
| array/reverse/2dL | 73430.5 ns | 73163 ns | 1.00 |
| array/reverse/1dL_inplace | 66552 ns | 66146 ns | 1.01 |
| array/copy | 18566 ns | 18502.5 ns | 1.00 |
| array/iteration/findall/int | 146514 ns | 146476.5 ns | 1.00 |
| array/iteration/findall/bool | 130975 ns | 130795 ns | 1.00 |
| array/iteration/findfirst/int | 84094.5 ns | 84133 ns | 1.00 |
| array/iteration/findfirst/bool | 81344 ns | 81624.5 ns | 1.00 |
| array/iteration/scalar | 65699 ns | 65804 ns | 1.00 |
| array/iteration/logical | 199014 ns | 198187.5 ns | 1.00 |
| array/iteration/findmin/1d | 86127.5 ns | 86504 ns | 1.00 |
| array/iteration/findmin/2d | 117201 ns | 117154 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38955.5 ns | 41088.5 ns | 0.95 |
| array/reductions/reduce/Int64/dims=1 | 51296 ns | 52190.5 ns | 0.98 |
| array/reductions/reduce/Int64/dims=2 | 59074 ns | 59179 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 87341 ns | 87126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84657.5 ns | 84418.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34039.5 ns | 34001 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 49262 ns | 39890 ns | 1.23 |
| array/reductions/reduce/Float32/dims=2 | 56573 ns | 55899 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 51645 ns | 51535 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69784.5 ns | 69798 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 38808 ns | 40980.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1 | 51392.5 ns | 41741 ns | 1.23 |
| array/reductions/mapreduce/Int64/dims=2 | 58970 ns | 59036 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87347 ns | 87134 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84631.5 ns | 84427 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33862 ns | 33457 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 49049 ns | 48711 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2 | 56551 ns | 55941 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 51676 ns | 51352 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 69531 ns | 68956 ns | 1.01 |
| array/broadcast | 20650.5 ns | 20251 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10746.333333333334 ns | 10684.333333333334 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 216903 ns | 214898 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 283112 ns | 281876 ns | 1.00 |
| array/accumulate/Int64/1d | 118907 ns | 118336 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79884 ns | 79780 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157192 ns | 155968.5 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1707094 ns | 1694089 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961377 ns | 960949 ns | 1.00 |
| array/accumulate/Float32/1d | 101446.5 ns | 100823 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76486 ns | 76350 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 144456.5 ns | 144365 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1585237 ns | 1584729 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657765 ns | 656302 ns | 1.00 |
| array/construct | 1277.9 ns | 1283.1 ns | 1.00 |
| array/random/randn/Float32 | 43362 ns | 36610 ns | 1.18 |
| array/random/randn!/Float32 | 30384 ns | 30335 ns | 1.00 |
| array/random/rand!/Int64 | 29612 ns | 26934 ns | 1.10 |
| array/random/rand!/Float32 | 8249.666666666666 ns | 8186.666666666667 ns | 1.01 |
| array/random/rand/Int64 | 35367 ns | 30201.5 ns | 1.17 |
| array/random/rand/Float32 | 12585 ns | 12396 ns | 1.02 |
| array/permutedims/4d | 51060.5 ns | 52729 ns | 0.97 |
| array/permutedims/2d | 52784.5 ns | 52645 ns | 1.00 |
| array/permutedims/3d | 53039 ns | 53080 ns | 1.00 |
| array/sorting/1d | 2735543 ns | 2736443 ns | 1.00 |
| array/sorting/by | 3305108.5 ns | 3305811 ns | 1.00 |
| array/sorting/2d | 1068212 ns | 1071655.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 993.375 ns | 1034.5263157894738 ns | 0.96 |
| cuda/synchronization/stream/nonblocking | 7660.700000000001 ns | 7705.9 ns | 0.99 |
| cuda/synchronization/stream/blocking | 816.2019230769231 ns | 784.4516129032259 ns | 1.04 |
| cuda/synchronization/context/auto | 1146 ns | 1133.5 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 7203.1 ns | 7594.6 ns | 0.95 |
| cuda/synchronization/context/blocking | 892.9795918367347 ns | 885.6792452830189 ns | 1.01 |

This comment was automatically generated by a workflow using github-action-benchmark.

@codecov

codecov bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.48%. Comparing base (7a27d77) to head (337b7a7).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3030      +/-   ##
==========================================
+ Coverage   89.46%   89.48%   +0.01%     
==========================================
  Files         148      148              
  Lines       13047    13047              
==========================================
+ Hits        11673    11675       +2     
+ Misses       1374     1372       -2     

☔ View full report in Codecov by Sentry.
