Optimize allocation, lockless accumulation, and tape flushing #351

Aethdv · 2026-01-09T20:24:56Z

Greatly improve the evaltune system by redesigning part of the process.

- **Startup QoL and speedup**: Pre-scan the FEN files to count lines, then reserve the `positions` and `results` vectors at their exact size upfront to prevent reallocations.
- **Huge Pages (neutral)**: Apply `MADV_HUGEPAGE` to the reserved capacity. No speedup was measured, but including it does no harm.
- **Concurrency improvement**: Replace mutex-based gradient accumulation with per-thread buffers.
- **Optimized Reduction**: Use `std::barrier` to sum the per-thread gradients and apply the optimizer step, removing lock contention.
- **Reintroduce Microbatching**: Process positions in chunks of 160 so the autograd tape is flushed (`cleanup`/`zero_grad`) frequently. The chunk size was found by trial and error to maximize speed.

Measured per-epoch time on a Ryzen 7 7735HS: 10s (baseline) -> 8.5s -> 5s.
This also fixes memory usage exceeding the expected estimate of ~5.2G: in my case it reached 7.7G (stabilized), and on giovok's machine it could reach 20GB or more while loading before stabilizing at around 11GB.
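The startup side of the change (pre-scan, exact reserve, huge-page hint) could look roughly like the sketch below. `Position`, `count_lines`, `load`, `load_text`, `advise_huge_pages`, and the `<fen>;<result>` line layout are all assumptions made for illustration, not the tuner's actual types or file format. The `madvise` call is advisory and Linux-only, so its result is deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>
#if defined(__linux__)
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical record type; the real position type is not shown in this PR.
struct Position { std::string fen; };

// Hint that the page-aligned interior of [p, p + bytes) may use transparent
// huge pages. No-op on platforms without MADV_HUGEPAGE.
inline void advise_huge_pages(void* p, std::size_t bytes) {
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    const auto page = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    const auto begin = (addr + page - 1) & ~(page - 1);  // round up
    const auto end = (addr + bytes) & ~(page - 1);       // round down
    if (end > begin)
        (void)madvise(reinterpret_cast<void*>(begin), end - begin, MADV_HUGEPAGE);
#else
    (void)p; (void)bytes;
#endif
}

// Pass 1: count lines so the element count is known before allocating.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}

// Pass 2: reserve once, then fill. Malformed lines are skipped.
void load(std::istream& in, std::size_t n,
          std::vector<Position>& positions, std::vector<double>& results) {
    positions.reserve(n);
    results.reserve(n);
    advise_huge_pages(positions.data(), positions.capacity() * sizeof(Position));
    const Position* base = positions.data();
    std::string line;
    while (std::getline(in, line)) {
        const auto sep = line.rfind(';');  // assumed "<fen>;<result>" layout
        if (sep == std::string::npos) continue;
        positions.push_back({line.substr(0, sep)});
        results.push_back(std::stod(line.substr(sep + 1)));
    }
    assert(positions.data() == base);      // the buffer never moved
}

// In-memory convenience wrapper; a file loader would reopen or rewind
// its std::ifstream between the two passes instead.
void load_text(const std::string& text,
               std::vector<Position>& positions, std::vector<double>& results) {
    std::istringstream pass1(text), pass2(text);
    load(pass2, count_lines(pass1), positions, results);
}
```

Reserving the exact count means the vectors never go through geometric growth, so no temporary second buffer is alive during a reallocation and no unused capacity is left over afterwards.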


Bench: 13922193

Aethdv and others added 5 commits January 9, 2026 15:31

Merge branch 'official-clockwork:main' into evaltune-optm

87daf47

nits & warning fixes

6c8828f

Huge speedup by tuning the microbatch size

ec52fd1

Tuned values

7bc17ef

Bench: 13922193


Aethdv commented Jan 9, 2026 • edited by TheRealGioviok
