Open
Description
CPU impl
1. TODO
- SHA256-SIMD version (Lei Hao)
- Benchmark:
  - Correctness (compared with std lib)
  - Performance: SHA256 vs SHA256-SIMD vs BLAKE3 vs BLAKE3-threading vs BLAKE3-threading-SIMD
2. WIP
- BLAKE3 SIMD version (AVX2 instructions)
- Threading & SIMD
  - Tip: the work is compute-bound, so use the same thread count as CPU cores
3. Done
- SHA256 basic impl
- BLAKE3 basic impl
- BLAKE3 multithreading
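
The thread-count tip and the correctness/performance benchmark items above can be sketched together as follows. This is a minimal illustration, not the project's code: `toy_chunk_hash` (FNV-1a) is a stand-in for the real SHA256/BLAKE3 chunk function, and `hash_chunks`, `core_count`, and `time_ms` are hypothetical helper names. The idea is to size the worker pool from `std::thread::hardware_concurrency()`, compare the threaded output against a single-threaded reference, and time both.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in per-chunk hash (FNV-1a, 64-bit). NOT SHA256 or BLAKE3.
std::uint64_t toy_chunk_hash(const std::uint8_t* data, std::size_t len) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Compute-bound work: one worker per core, falling back to 1 if unknown.
unsigned core_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Hash every chunk; each worker handles one contiguous range of chunks.
std::vector<std::uint64_t> hash_chunks(const std::vector<std::uint8_t>& input,
                                       std::size_t chunk_len,
                                       unsigned num_threads) {
    const std::size_t num_chunks = (input.size() + chunk_len - 1) / chunk_len;
    std::vector<std::uint64_t> out(num_chunks);
    if (num_threads == 0) num_threads = 1;
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t c = begin; c < end; ++c) {
            const std::size_t off = c * chunk_len;
            const std::size_t len = std::min(chunk_len, input.size() - off);
            out[c] = toy_chunk_hash(input.data() + off, len);
        }
    };
    const std::size_t per_thread = (num_chunks + num_threads - 1) / num_threads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min<std::size_t>(t * per_thread, num_chunks);
        const std::size_t end = std::min(begin + per_thread, num_chunks);
        if (begin < end) pool.emplace_back(worker, begin, end);
    }
    for (auto& th : pool) th.join();
    return out;
}

// Wall-clock time of a callable, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A benchmark run would call `hash_chunks(input, chunk_len, 1)` as the reference, call it again with `core_count()` threads, check that both vectors are equal, and wrap each call in `time_ms` to compare timings.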
GPU impl
1. TODO
- SM80's `cp.async` to reduce pipeline bubbles
- Support SM90 arch
- Performance benchmarking
  - Different kernel versions on different archs: (SM70, SM80, SM90) x (v1, v2, v3)
  - Latest kernel performance across archs: (SM70, SM80, SM90) x (latest_version)
2. WIP
- SM80's `cp.async` to reduce pipeline bubbles
3. Done
- Basic kernel impl
- Coalesced memory access + staging pipeline
  - Stage 1: Coalesced loading from `gmem`
  - Stage 2: Compress chunks to roots, and merge to one warp-level `cv`
  - Stage 3: Block reduce, yielding one block-level `cv`
- Parallel computing logic
  - `16-lane sub-warp for chunk compressing` instead of multiple inactive lanes
  - Improved computation throughput from 67% to 70%
- Use `CuTe` with layouts for `gmem` and `smem`, to help solve data loading (Stage 1)
- Debug the basic GPU computing logic, make sure no wrong output
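
The warp-level and block-level reduction in Stages 2 and 3 above can be modeled on the host as a two-level pairwise tree reduction. This is only a sketch under loud assumptions: it runs on the CPU rather than as a CUDA kernel, `Cv` is a 64-bit stand-in for a chaining value, `toy_parent` is a hypothetical order-sensitive combine function (not the BLAKE3 compression function), and counts are restricted to powers of two for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Cv = std::uint64_t;  // stand-in for a chaining value (assumption)

// Hypothetical order-sensitive parent function. NOT the BLAKE3 compression.
Cv toy_parent(Cv left, Cv right) {
    return (left * 0x9E3779B97F4A7C15ULL) ^ (right + 0x7F4A7C15ULL);
}

// Pairwise tree reduction over a power-of-two number of values.
Cv tree_reduce(std::vector<Cv> cvs) {
    assert(!cvs.empty() && (cvs.size() & (cvs.size() - 1)) == 0);
    while (cvs.size() > 1) {
        std::vector<Cv> next(cvs.size() / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            next[i] = toy_parent(cvs[2 * i], cvs[2 * i + 1]);
        }
        cvs.swap(next);
    }
    return cvs[0];
}

// Stage 2: reduce each warp's chunk cvs to one warp-level cv.
// Stage 3: reduce the warp-level cvs to one block-level cv.
Cv block_reduce(const std::vector<Cv>& chunk_cvs, std::size_t chunks_per_warp) {
    assert(chunks_per_warp > 0 && chunk_cvs.size() % chunks_per_warp == 0);
    std::vector<Cv> warp_cvs;
    for (std::size_t w = 0; w < chunk_cvs.size(); w += chunks_per_warp) {
        std::vector<Cv> slice(chunk_cvs.begin() + w,
                              chunk_cvs.begin() + w + chunks_per_warp);
        warp_cvs.push_back(tree_reduce(slice));
    }
    return tree_reduce(warp_cvs);
}
```

With power-of-two sizes, the two-level result equals a single flat `tree_reduce` over all chunk values, which is a convenient property to check when debugging kernel output against a CPU reference.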
Other TODO
- Demo-video
- Report (Overleaf via email)