Roadmap #2

@l1cacheDell

CPU impl

1. TODO

  1. SHA256-SIMD version (Lei Hao)
  2. Benchmark:
    1. Correctness (compared against the standard library)
    2. Performance: SHA256 vs SHA256-SIMD vs BLAKE3 vs BLAKE3-threading vs BLAKE3-threading-SIMD
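
A minimal sketch of what the correctness and performance benchmark could look like. Everything here is an assumption for illustration: the `HashFn` signature, the helper names, and the two FNV-1a functions, which are only placeholders standing in for a reference hash and an optimized variant (they are not SHA256 or BLAKE3). The tested lengths sit around 64 bytes (the SHA-256 block size) and 1024 bytes (the BLAKE3 chunk size), where off-by-one bugs tend to show up.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical common signature for every implementation under test.
using HashFn = std::function<std::uint64_t(const std::uint8_t*, std::size_t)>;

constexpr std::uint64_t kOffset = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
constexpr std::uint64_t kPrime = 0x100000001b3ULL;        // FNV-1a 64-bit prime

// Placeholder "reference": byte-at-a-time FNV-1a.
inline std::uint64_t fnv1a_reference(const std::uint8_t* p, std::size_t n) {
    std::uint64_t h = kOffset;
    for (std::size_t i = 0; i < n; ++i) h = (h ^ p[i]) * kPrime;
    return h;
}

// Placeholder "optimized" variant: same function, loop unrolled by four.
inline std::uint64_t fnv1a_unrolled(const std::uint8_t* p, std::size_t n) {
    std::uint64_t h = kOffset;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        h = (h ^ p[i]) * kPrime;
        h = (h ^ p[i + 1]) * kPrime;
        h = (h ^ p[i + 2]) * kPrime;
        h = (h ^ p[i + 3]) * kPrime;
    }
    for (; i < n; ++i) h = (h ^ p[i]) * kPrime;  // tail bytes
    return h;
}

// Deterministic test input so runs are reproducible.
inline std::vector<std::uint8_t> make_input(std::size_t n) {
    std::vector<std::uint8_t> data(n);
    for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<std::uint8_t>(i * 31 + 7);
    return data;
}

// Correctness: the candidate must agree with the reference on every tested length.
inline bool outputs_match(const HashFn& reference, const HashFn& candidate) {
    const std::vector<std::size_t> lengths = {0, 1, 63, 64, 65, 1023, 1024, 1025};
    for (std::size_t n : lengths) {
        const std::vector<std::uint8_t> data = make_input(n);
        if (reference(data.data(), n) != candidate(data.data(), n)) return false;
    }
    return true;
}

// Performance: hash the same buffer `iters` times and report MiB/s.
inline double throughput_mib_per_s(const HashFn& fn,
                                   const std::vector<std::uint8_t>& data, int iters) {
    assert(iters > 0);
    volatile std::uint64_t sink = 0;  // keeps the compiler from discarding the work
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) sink = sink ^ fn(data.data(), data.size());
    const auto t1 = std::chrono::steady_clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    const double mib = static_cast<double>(data.size()) * iters / (1024.0 * 1024.0);
    return mib / secs;
}
```

Wrapping each real implementation behind one shared signature would let the same two functions cover the whole comparison list above.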

2. WIP

  1. BLAKE3 SIMD version (AVX2 instructions)
    1. Threading & SIMD
      1. Tip: the workload is compute-bound, so use as many threads as there are CPU cores
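
The thread-count tip can be sketched as follows. This is an illustrative sketch, not the project's code: `worker_count`, `process_range`, and `parallel_process` are hypothetical names, and `process_range` just sums bytes as a stand-in for per-slice hashing. Note that `std::thread::hardware_concurrency()` reports logical hardware threads, which may exceed the physical core count on CPUs with SMT, and may return 0 when the value is unknown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// One worker per hardware thread; fall back to 1 if the count is unknown (0).
inline unsigned worker_count() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Hypothetical stand-in for the per-slice hashing work: sums the bytes.
inline std::uint64_t process_range(const std::uint8_t* data, std::size_t len) {
    return std::accumulate(data, data + len, std::uint64_t{0});
}

// Split the input into one contiguous slice per worker and process them in parallel.
inline std::uint64_t parallel_process(const std::vector<std::uint8_t>& input,
                                      unsigned workers) {
    assert(workers > 0);
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> pool;
    const std::size_t slice = (input.size() + workers - 1) / workers;  // ceiling division
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(input.size(), w * slice);
        const std::size_t end = std::min(input.size(), begin + slice);
        // Each worker writes only its own slot, so no locking is needed.
        pool.emplace_back([&, w, begin, end] {
            partial[w] = process_range(input.data() + begin, end - begin);
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Spawning more threads than hardware threads for compute-bound work generally adds scheduling overhead without adding throughput, which is the reasoning behind the tip.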

3. Done

  1. SHA256 basic impl
  2. BLAKE3 basic impl
  3. BLAKE3 multithreading

GPU impl

1. TODO

  1. SM80's cp.async to reduce pipeline bubbles
  2. Support SM90 arch
  3. Performance benchmarking
    1. Each kernel version on each arch: (SM70, SM80, SM90) x (v1, v2, v3)
    2. Latest kernel across archs: (SM70, SM80, SM90) x (latest_version)

2. WIP

  1. SM80's cp.async to reduce pipeline bubbles

3. Done

  1. Basic kernel impl
  2. Coalesced memory access + staging pipeline
    1. Stage 1: Coalesced loading from gmem
    2. Stage 2: Compress chunks to roots, and merge them into one warp-level cv
    3. Stage 3: Block reduce, yielding one block-level cv
  3. Parallel computing logic - 16-lane sub-warps for chunk compression instead of leaving multiple lanes inactive
    1. Improved computation throughput from 67% to 70%
  4. Adopt CuTe layouts for gmem and smem to help with data loading (Stage 1)
  5. Debug the basic GPU computing logic to make sure the output is correct
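
The two reduction stages listed above (chunk cvs to a warp-level cv, then warp-level cvs to a block-level cv) can be modeled on the host as a two-level pairwise reduction. This is a simplified CPU sketch under stated assumptions, not the kernel itself: `CV` is a single 64-bit stand-in for a chaining value, `merge` is a placeholder rather than BLAKE3's parent compression, and `cvs_per_warp` is a hypothetical parameter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for a chaining value; a real BLAKE3 cv is eight 32-bit words.
using CV = std::uint64_t;

// Placeholder merge. It is deliberately order-sensitive, like a parent
// compression, so swapping left and right changes the result.
inline CV merge(CV left, CV right) { return left * 31 + right; }

// Pairwise reduction, level by level. An unpaired last element is carried
// up to the next level unchanged.
inline CV tree_reduce(std::vector<CV> level) {
    assert(!level.empty());
    while (level.size() > 1) {
        std::vector<CV> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(merge(level[i], level[i + 1]));
        if (level.size() % 2 == 1) next.push_back(level.back());
        level = std::move(next);
    }
    return level[0];
}

// Stage 2: reduce each group of chunk cvs to one warp-level cv.
// Stage 3: reduce the warp-level cvs to one block-level cv.
inline CV staged_reduce(const std::vector<CV>& chunk_cvs, std::size_t cvs_per_warp) {
    assert(cvs_per_warp > 0 && !chunk_cvs.empty());
    std::vector<CV> warp_cvs;
    for (std::size_t begin = 0; begin < chunk_cvs.size(); begin += cvs_per_warp) {
        const std::size_t end = std::min(chunk_cvs.size(), begin + cvs_per_warp);
        warp_cvs.push_back(tree_reduce(
            std::vector<CV>(chunk_cvs.begin() + begin, chunk_cvs.begin() + end)));
    }
    return tree_reduce(warp_cvs);
}
```

A host-side model like this could serve as a reference when checking kernel output, for example by comparing the staged result against a flat `tree_reduce` over the same cvs with a power-of-two group size.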

Other TODO

  1. Demo video
  2. Report (Overleaf via email)
