Your PR requires formatting changes to meet the project's style guidelines. The suggested changes:

```diff
diff --git a/src/device/intrinsics/indexing.jl b/src/device/intrinsics/indexing.jl
index a42b003cd..36cde4ab9 100644
--- a/src/device/intrinsics/indexing.jl
+++ b/src/device/intrinsics/indexing.jl
@@ -66,32 +66,32 @@ end
"""
gridDim()::NamedTuple
-Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
+ Returns the dimensions of the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These dimensions have the same starting index as the `gridDim` built-in variable in the C/C++ extension.
"""
@inline gridDim() = (x=gridDim_x(), y=gridDim_y(), z=gridDim_z())
"""
blockIdx()::NamedTuple
-Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
+ Returns the block index within the grid as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based, unlike the `blockIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline blockIdx() = (x=blockIdx_x(), y=blockIdx_y(), z=blockIdx_z())
"""
blockDim()::NamedTuple
-Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
+ Returns the dimensions of the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These dimensions have the same starting index as the `blockDim` built-in variable in the C/C++ extension.
"""
@inline blockDim() = (x=blockDim_x(), y=blockDim_y(), z=blockDim_z())
"""
threadIdx()::NamedTuple
-Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
-These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
+ Returns the thread index within the block as a `NamedTuple` with keys `x`, `y`, and `z`.
+ These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++ extension which is 0-based.
"""
@inline threadIdx() = (x=threadIdx_x(), y=threadIdx_y(), z=threadIdx_z())
@@ -99,7 +99,7 @@ These indices are 1-based, unlike the `threadIdx` built-in variable in the C/C++
warpsize()::Int32
Returns the warp size (in threads).
-This corresponds to the `warpSize` built-in variable in the C/C++ extension.
+ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
"""
@inline warpsize() = ccall("llvm.nvvm.read.ptx.sreg.warpsize", llvmcall, Int32, ())
@@ -107,7 +107,7 @@ This corresponds to the `warpSize` built-in variable in the C/C++ extension.
laneid()::Int32
Returns the thread's lane within the warp.
-This ID is 1-based.
+ This ID is 1-based.
"""
@inline laneid() = ccall("llvm.nvvm.read.ptx.sreg.laneid", llvmcall, Int32, ()) + 1i32
```
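To illustrate the 1-based indexing convention these docstrings describe, here is a minimal sketch (not part of this PR; the kernel and variable names are hypothetical) of a CUDA.jl kernel that computes a global thread index from `blockIdx()`, `blockDim()`, and `threadIdx()`:

```julia
using CUDA

# Minimal sketch, not from this PR: a kernel that scales a vector, using the
# 1-based blockIdx()/threadIdx() described in the docstrings above. Subtracting
# 1 from blockIdx().x before multiplying keeps the global index `i` 1-based.
function scale_kernel!(y, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)          # guard against out-of-bounds threads
        @inbounds y[i] = a * x[i]
    end
    return nothing
end

x = CUDA.rand(Float32, 10_000)
y = similar(x)
threads = 256
blocks = cld(length(x), threads)   # enough blocks to cover the whole array
@cuda threads=threads blocks=blocks scale_kernel!(y, x, 2f0)
```

For comparison, the equivalent C/C++ CUDA index would be `blockIdx.x * blockDim.x + threadIdx.x`, which is 0-based.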
CUDA.jl Benchmarks
| Benchmark suite | Current: 337b7a7 | Previous: 7a27d77 | Ratio |
|---|---|---|---|
| latency/precompile | 44169430501.5 ns | 44455759835 ns | 0.99 |
| latency/ttfp | 13133988338 ns | 13140153243 ns | 1.00 |
| latency/import | 3764428041 ns | 3755312424 ns | 1.00 |
| integration/volumerhs | 9435111.5 ns | 9442840 ns | 1.00 |
| integration/byval/slices=1 | 145882 ns | 145598 ns | 1.00 |
| integration/byval/slices=3 | 423338 ns | 422554 ns | 1.00 |
| integration/byval/reference | 144101 ns | 143811 ns | 1.00 |
| integration/byval/slices=2 | 284586 ns | 284011 ns | 1.00 |
| integration/cudadevrt | 102648 ns | 102397 ns | 1.00 |
| kernel/indexing | 13579 ns | 13434 ns | 1.01 |
| kernel/indexing_checked | 14218 ns | 13908 ns | 1.02 |
| kernel/occupancy | 647.9156626506024 ns | 644.5636363636364 ns | 1.01 |
| kernel/launch | 2059.4 ns | 2090.3 ns | 0.99 |
| kernel/rand | 14467 ns | 14479 ns | 1.00 |
| array/reverse/1d | 19050 ns | 18661 ns | 1.02 |
| array/reverse/2dL_inplace | 66409 ns | 66252 ns | 1.00 |
| array/reverse/1dL | 69247 ns | 68893 ns | 1.01 |
| array/reverse/2d | 21375 ns | 21087 ns | 1.01 |
| array/reverse/1d_inplace | 10869.333333333334 ns | 10503.833333333332 ns | 1.03 |
| array/reverse/2d_inplace | 11034 ns | 11399.5 ns | 0.97 |
| array/reverse/2dL | 73430.5 ns | 73163 ns | 1.00 |
| array/reverse/1dL_inplace | 66552 ns | 66146 ns | 1.01 |
| array/copy | 18566 ns | 18502.5 ns | 1.00 |
| array/iteration/findall/int | 146514 ns | 146476.5 ns | 1.00 |
| array/iteration/findall/bool | 130975 ns | 130795 ns | 1.00 |
| array/iteration/findfirst/int | 84094.5 ns | 84133 ns | 1.00 |
| array/iteration/findfirst/bool | 81344 ns | 81624.5 ns | 1.00 |
| array/iteration/scalar | 65699 ns | 65804 ns | 1.00 |
| array/iteration/logical | 199014 ns | 198187.5 ns | 1.00 |
| array/iteration/findmin/1d | 86127.5 ns | 86504 ns | 1.00 |
| array/iteration/findmin/2d | 117201 ns | 117154 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 38955.5 ns | 41088.5 ns | 0.95 |
| array/reductions/reduce/Int64/dims=1 | 51296 ns | 52190.5 ns | 0.98 |
| array/reductions/reduce/Int64/dims=2 | 59074 ns | 59179 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 87341 ns | 87126 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84657.5 ns | 84418.5 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 34039.5 ns | 34001 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1 | 49262 ns | 39890 ns | 1.23 |
| array/reductions/reduce/Float32/dims=2 | 56573 ns | 55899 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 51645 ns | 51535 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69784.5 ns | 69798 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 38808 ns | 40980.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1 | 51392.5 ns | 41741 ns | 1.23 |
| array/reductions/mapreduce/Int64/dims=2 | 58970 ns | 59036 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87347 ns | 87134 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 84631.5 ns | 84427 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 33862 ns | 33457 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1 | 49049 ns | 48711 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2 | 56551 ns | 55941 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 51676 ns | 51352 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=2L | 69531 ns | 68956 ns | 1.01 |
| array/broadcast | 20650.5 ns | 20251 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10746.333333333334 ns | 10684.333333333334 ns | 1.01 |
| array/copyto!/cpu_to_gpu | 216903 ns | 214898 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 283112 ns | 281876 ns | 1.00 |
| array/accumulate/Int64/1d | 118907 ns | 118336 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 79884 ns | 79780 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157192 ns | 155968.5 ns | 1.01 |
| array/accumulate/Int64/dims=1L | 1707094 ns | 1694089 ns | 1.01 |
| array/accumulate/Int64/dims=2L | 961377 ns | 960949 ns | 1.00 |
| array/accumulate/Float32/1d | 101446.5 ns | 100823 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 76486 ns | 76350 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 144456.5 ns | 144365 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1585237 ns | 1584729 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657765 ns | 656302 ns | 1.00 |
| array/construct | 1277.9 ns | 1283.1 ns | 1.00 |
| array/random/randn/Float32 | 43362 ns | 36610 ns | 1.18 |
| array/random/randn!/Float32 | 30384 ns | 30335 ns | 1.00 |
| array/random/rand!/Int64 | 29612 ns | 26934 ns | 1.10 |
| array/random/rand!/Float32 | 8249.666666666666 ns | 8186.666666666667 ns | 1.01 |
| array/random/rand/Int64 | 35367 ns | 30201.5 ns | 1.17 |
| array/random/rand/Float32 | 12585 ns | 12396 ns | 1.02 |
| array/permutedims/4d | 51060.5 ns | 52729 ns | 0.97 |
| array/permutedims/2d | 52784.5 ns | 52645 ns | 1.00 |
| array/permutedims/3d | 53039 ns | 53080 ns | 1.00 |
| array/sorting/1d | 2735543 ns | 2736443 ns | 1.00 |
| array/sorting/by | 3305108.5 ns | 3305811 ns | 1.00 |
| array/sorting/2d | 1068212 ns | 1071655.5 ns | 1.00 |
| cuda/synchronization/stream/auto | 993.375 ns | 1034.5263157894738 ns | 0.96 |
| cuda/synchronization/stream/nonblocking | 7660.700000000001 ns | 7705.9 ns | 0.99 |
| cuda/synchronization/stream/blocking | 816.2019230769231 ns | 784.4516129032259 ns | 1.04 |
| cuda/synchronization/context/auto | 1146 ns | 1133.5 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 7203.1 ns | 7594.6 ns | 0.95 |
| cuda/synchronization/context/blocking | 892.9795918367347 ns | 885.6792452830189 ns | 1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #3030      +/-   ##
==========================================
+ Coverage   89.46%   89.48%   +0.01%
==========================================
  Files         148      148
  Lines       13047    13047
==========================================
+ Hits        11673    11675       +2
+ Misses       1374     1372       -2
```

☔ View full report in Codecov by Sentry.
I had some...uhm...fun in the last couple of days trying to port some C++ CUDA code to CUDA.jl and profile it. I dumped my experience into this PR, hoping to make the lives of people who come after me a little bit easier 🙂