
Various improvements to the docs #3030

Open

giordano wants to merge 4 commits into JuliaGPU:master from giordano:mg/docs

Conversation

@giordano
Contributor

I had some...uhm...fun in the last couple of days trying to port some C++ CUDA code to CUDA.jl and profile it. I dumped my experience into this PR, hoping to make the lives of the people who come after me a little bit easier 🙂

@github-actions
Contributor

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic master) to apply these changes.

Suggested changes:
diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index a42b003cd..36cde4ab9 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -66,32 +66,32 @@ end
 """
     gridDim()::NamedTuple
 
-Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
+    Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
 """
 @inline gridDim() =   (x=gridDim_x(),   y=gridDim_y(),   z=gridDim_z())
 
 """
     blockIdx()::NamedTuple
 
-Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
 """
 @inline blockIdx() =  (x=blockIdx_x(),  y=blockIdx_y(),  z=blockIdx_z())
 
 """
     blockDim()::NamedTuple
 
-Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
+    Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
 """
 @inline blockDim() =  (x=blockDim_x(),  y=blockDim_y(),  z=blockDim_z())
 
 """
     threadIdx()::NamedTuple
 
-Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
+    Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+    These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
 """
 @inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
 
@@ -99,7 +99,7 @@ These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++
     warpsize()::Int32
 
 Returns the warp size (in threads).
-This corresponds to the `warpSize` built-in variable in the C/C++ extension.
+    This corresponds to the `warpSize` built-in variable in the C/C++ extension.
 """
 @inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())
 
@@ -107,7 +107,7 @@ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
     laneid()::Int32
 
 Returns the thread's lane within the warp.
-This ID is 1-based.
+    This ID is 1-based.
 """
 @inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
 

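As a rough illustration of the 1-based indexing convention these docstrings describe, here is a minimal sketch of a kernel and its launch (the kernel name, array sizes, and launch configuration are made up for illustration, not taken from this PR):

```julia
using CUDA

# Hypothetical element-wise kernel: blockIdx()/threadIdx() return 1-based
# NamedTuples, unlike the 0-based built-in variables in CUDA C/C++.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)
# One thread per element, rounding the block count up.
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
```

The explicit `- 1` when combining block and thread indices recovers a 0-based offset before scaling, mirroring the familiar C/C++ pattern `blockIdx.x * blockDim.x + threadIdx.x`.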

github-actions bot left a comment


CUDA.jl Benchmarks

| Benchmark suite | Current: 337b7a7 | Previous: 7a27d77 | Ratio |
| --- | --- | --- | --- |
| latency/precompile | 44169430501.5 ns | 44455759835 ns | 0.99 |
| latency/ttfp | 13133988338 ns | 13140153243 ns | 1.00 |
| latency/import | 3764428041 ns | 3755312424 ns | 1.00 |
| integration/volumerhs | 9435111.5 ns | 9442840 ns | 1.00 |
| integration/byval/slices=1 | 145882 ns | 145598 ns | 1.00 |
| integration/byval/slices=3 | 423338 ns | 422554 ns | 1.00 |
| integration/byval/reference | 144101 ns | 143811 ns | 1.00 |
| integration/byval/slices=2 | 284586 ns | 284011 ns | 1.00 |
| integration/cudadevrt | 102648 ns | 102397 ns | 1.00 |
| kernel/indexing | 13579 ns | 13434 ns | 1.01 |
| kernel/indexing_checked | 14218 ns | 13908 ns | 1.02 |
| kernel/occupancy | 647.9156626506024 ns | 644.5636363636364 ns | 1.01 |
| kernel/launch | 2059.4 ns | 2090.3 ns | 0.99 |
| kernel/rand | 14467 ns | 14479 ns | 1.00 |
| array/reverse/1d | 19050 ns | 18661 ns | 1.02 |
| array/reverse/2dL_inplace | 66409 ns | 66252 ns | 1.00 |
| array/reverse/1dL | 69247 ns | 68893 ns | 1.01 |
| array/reverse/2d | 21375 ns | 21087 ns | 1.01 |
| array/reverse/1d_inplace | 10869.333333333334 ns | 10503.833333333332 ns | 1.03 |
| array/reverse/2d_inplace | 11034 ns | 11399.5 ns | 0.97 |
| array/reverse/2dL | 73430.5 ns | 73163 ns | 1.00 |
| array/reverse/1dL_inplace | 66552 ns | 66146 ns | 1.01 |
| array/copy | 18566 ns | 18502.5 ns | 1.00 |
| array/iteration/findall/int | 146514 ns | 146476.5 ns | 1.00 |
| array/iteration/findall/bool | 130975 ns | 130795 ns | 1.00 |
| array/iteration/findfirst/int | 84094.5 ns | 84133 ns | 1.00 |
| array/iteration/findfirst/bool | 81344 ns | 81624.5 ns | 1.00 |
| array/iteration/scalar | 65699 ns | 65804 ns | 1.00 |
| array/iteration/logical | 199014 ns | 198187.5 ns | 1.00 |
| array/iteration/findmin/1d | 86127.5 ns | 86504 ns | 1.00 |
| array/iteration/findmin/2d | 117201 ns | 117154 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38955.5 ns | 41088.5 ns | 0.95 |
| array/reductions/reduce/Int64/dims=1 | 51296 ns | 52190.5 ns | 0.98 |
| array/reductions/reduce/Int64/dims=2 | 59074 ns | 59179 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 87341 ns | 87126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84657.5 ns | 84418.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34039.5 ns | 34001 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 49262 ns | 39890 ns | 1.23 |
| array/reductions/reduce/Float32/dims=2 | 56573 ns | 55899 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 51645 ns | 51535 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69784.5 ns | 69798 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 38808 ns | 40980.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1 | 51392.5 ns | 41741 ns | 1.23 |
| array/reductions/mapreduce/Int64/dims=2 | 58970 ns | 59036 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87347 ns | 87134 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84631.5 ns | 84427 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33862 ns | 33457 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 49049 ns | 48711 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2 | 56551 ns | 55941 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 51676 ns | 51352 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 69531 ns | 68956 ns | 1.01 |
| array/broadcast | 20650.5 ns | 20251 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10746.333333333334 ns | 10684.333333333334 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 216903 ns | 214898 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 283112 ns | 281876 ns | 1.00 |
| array/accumulate/Int64/1d | 118907 ns | 118336 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79884 ns | 79780 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157192 ns | 155968.5 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1707094 ns | 1694089 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961377 ns | 960949 ns | 1.00 |
| array/accumulate/Float32/1d | 101446.5 ns | 100823 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76486 ns | 76350 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 144456.5 ns | 144365 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1585237 ns | 1584729 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657765 ns | 656302 ns | 1.00 |
| array/construct | 1277.9 ns | 1283.1 ns | 1.00 |
| array/random/randn/Float32 | 43362 ns | 36610 ns | 1.18 |
| array/random/randn!/Float32 | 30384 ns | 30335 ns | 1.00 |
| array/random/rand!/Int64 | 29612 ns | 26934 ns | 1.10 |
| array/random/rand!/Float32 | 8249.666666666666 ns | 8186.666666666667 ns | 1.01 |
| array/random/rand/Int64 | 35367 ns | 30201.5 ns | 1.17 |
| array/random/rand/Float32 | 12585 ns | 12396 ns | 1.02 |
| array/permutedims/4d | 51060.5 ns | 52729 ns | 0.97 |
| array/permutedims/2d | 52784.5 ns | 52645 ns | 1.00 |
| array/permutedims/3d | 53039 ns | 53080 ns | 1.00 |
| array/sorting/1d | 2735543 ns | 2736443 ns | 1.00 |
| array/sorting/by | 3305108.5 ns | 3305811 ns | 1.00 |
| array/sorting/2d | 1068212 ns | 1071655.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 993.375 ns | 1034.5263157894738 ns | 0.96 |
| cuda/synchronization/stream/nonblocking | 7660.700000000001 ns | 7705.9 ns | 0.99 |
| cuda/synchronization/stream/blocking | 816.2019230769231 ns | 784.4516129032259 ns | 1.04 |
| cuda/synchronization/context/auto | 1146 ns | 1133.5 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 7203.1 ns | 7594.6 ns | 0.95 |
| cuda/synchronization/context/blocking | 892.9795918367347 ns | 885.6792452830189 ns | 1.01 |

This comment was automatically generated by a workflow using github-action-benchmark.

@codecov

codecov bot commented Feb 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.48%. Comparing base (7a27d77) to head (337b7a7).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3030      +/-   ##
==========================================
+ Coverage   89.46%   89.48%   +0.01%     
==========================================
  Files         148      148              
  Lines       13047    13047              
==========================================
+ Hits        11673    11675       +2     
+ Misses       1374     1372       -2     

☔ View full report in Codecov by Sentry.
