make training kernels faster by using shared memory by mahdizaferanchi · Pull Request #15 · Felix-Petersen/difflogic

mahdizaferanchi · 2024-01-20T16:55:20Z

Used CUDA block-level shared memory to optimize training kernels.

Find tests and results here.
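For readers unfamiliar with the technique, here is a minimal sketch of block-level shared memory. It is not the code from this PR. It assumes a layout where each block handles one neuron and every thread needs the same small set of per-neuron weights. The names `mix_ops_shared`, `relaxed_op`, `max_error_demo`, `NUM_OPS`, and `THREADS` are hypothetical, and `relaxed_op` is only a placeholder for the real per-operation math.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_OPS = 16;   // assumed number of weights per neuron
constexpr int THREADS = 128;  // assumed block size; the best value may differ per GPU

// Hypothetical placeholder for "operation k applied to inputs a and b".
__host__ __device__ inline float relaxed_op(int k, float a, float b) {
    const float t = static_cast<float>(k) / (NUM_OPS - 1);
    return (1.f - t) * (a * b) + t * (a + b - a * b);
}

// One block per neuron: stage that neuron's weights in shared memory once,
// then let the block's threads stride over the batch and reuse them.
__global__ void mix_ops_shared(const float* a, const float* b, const float* w,
                               float* out, int batch) {
    __shared__ float w_s[NUM_OPS];
    const int neuron = blockIdx.x;
    for (int k = threadIdx.x; k < NUM_OPS; k += blockDim.x)
        w_s[k] = w[neuron * NUM_OPS + k];
    __syncthreads();  // all weights must be staged before any thread reads them

    for (int r = threadIdx.x; r < batch; r += blockDim.x) {
        const float av = a[neuron * batch + r];
        const float bv = b[neuron * batch + r];
        float acc = 0.f;
        for (int k = 0; k < NUM_OPS; ++k) acc += w_s[k] * relaxed_op(k, av, bv);
        out[neuron * batch + r] = acc;
    }
}

static void check(cudaError_t e) { assert(e == cudaSuccess); (void)e; }

// Runs the kernel and returns the largest absolute deviation from a CPU reference.
float max_error_demo(int neurons, int batch) {
    const size_t n = static_cast<size_t>(neurons) * batch;
    std::vector<float> a(n), b(n), w(static_cast<size_t>(neurons) * NUM_OPS), out(n);
    for (size_t i = 0; i < n; ++i) { a[i] = (i % 7) / 7.f; b[i] = (i % 5) / 5.f; }
    for (size_t i = 0; i < w.size(); ++i) w[i] = ((i % 3) + 1) / 48.f;

    float *da, *db, *dw, *dout;
    check(cudaMalloc(&da, n * sizeof(float)));
    check(cudaMalloc(&db, n * sizeof(float)));
    check(cudaMalloc(&dw, w.size() * sizeof(float)));
    check(cudaMalloc(&dout, n * sizeof(float)));
    check(cudaMemcpy(da, a.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    check(cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice));
    check(cudaMemcpy(dw, w.data(), w.size() * sizeof(float), cudaMemcpyHostToDevice));

    mix_ops_shared<<<neurons, THREADS>>>(da, db, dw, dout, batch);
    check(cudaGetLastError());
    check(cudaMemcpy(out.data(), dout, n * sizeof(float), cudaMemcpyDeviceToHost));
    cudaFree(da); cudaFree(db); cudaFree(dw); cudaFree(dout);

    float max_err = 0.f;
    for (int nrn = 0; nrn < neurons; ++nrn) {
        for (int r = 0; r < batch; ++r) {
            const size_t i = static_cast<size_t>(nrn) * batch + r;
            float ref = 0.f;
            for (int k = 0; k < NUM_OPS; ++k)
                ref += w[nrn * NUM_OPS + k] * relaxed_op(k, a[i], b[i]);
            max_err = std::fmax(max_err, std::fabs(ref - out[i]));
        }
    }
    return max_err;
}
```

Calling `max_error_demo(8, 1000)` compares the kernel output against the CPU loop. The point of the pattern is that each weight is read from global memory once per block rather than once per thread.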


Felix-Petersen · 2024-03-19T01:16:15Z

Thanks for the additions! These optimizations restructure the code quite a bit and require a constant that could depend on the GPU model. I saw that you showed improvements on a P100, but before merging, I want to test the changes myself on other, more recent and more diverse GPU models.

That being said, I very much appreciate the improvements, and I intend to run these tests and integrate the block-level shared memory optimization in the next version release (probably later this year).
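The thread does not say which constant is meant. If it is a size that must fit into per-block shared memory, one way to avoid hard-coding it per GPU model is to query the device limit at runtime. The sketch below uses the CUDA runtime call `cudaGetDeviceProperties`; the helper names `shared_bytes_per_block` and `clamp_tile_floats` are hypothetical and not part of difflogic.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Shared memory available per block on `device`, in bytes.
size_t shared_bytes_per_block(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    assert(err == cudaSuccess);
    (void)err;  // silence unused-variable warnings when asserts are disabled
    return prop.sharedMemPerBlock;
}

// Hypothetical helper: clamp a requested tile size (counted in floats)
// to what fits into one block's shared memory on `device`.
size_t clamp_tile_floats(size_t requested, int device) {
    const size_t capacity = shared_bytes_per_block(device) / sizeof(float);
    return requested < capacity ? requested : capacity;
}
```

A launcher could call `clamp_tile_floats(wanted, 0)` once at startup and pass the result to kernels that use dynamically sized shared memory. Whether a runtime query suits the kernels in this PR would still need to be checked on the GPUs in question.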

mahdizaferanchi · 2024-03-19T08:08:55Z

That sounds great! Yes, I understand the decision to merge these changes isn't trivial. I'm glad my work could be helpful.

mahdizaferanchi added 3 commits January 20, 2024 19:39

make training kernels faster by using shared memory

863182b

fix tensor_packbits_cuda_kernel for when batch size is not a multiple…

47c6954

… of bit count add __repr__ to PackBitsTensor

Merge branch 'fix_pack_bits' into main

d6c55e7

mahdizaferanchi wants to merge 3 commits into Felix-Petersen:main from mahdizaferanchi:main
