
Conversation


@Aethdv Aethdv commented Jan 9, 2026

Greatly improve the evaltune system by redesigning part of the data-loading and training process.

  • Startup QoL and speedup: Pre-scan the FEN files to count lines, then reserve the positions and results vectors at their exact size upfront to prevent reallocations (first sketch below).
  • Huge Pages (neutral): Apply MADV_HUGEPAGE to the reserved capacity (also in the first sketch below). Measured no speedup, but including it doesn't hurt.
  • Concurrency improvement: Replace mutex-based gradient accumulation with per-thread buffers.
  • Optimized Reduction: Use std::barrier to sum the thread gradients and apply the optimizer steps, removing lock contention (second sketch below).
  • Reintroduce Microbatching: Process positions in chunks of 160 to flush the autograd tape (cleanup/zero_grad) frequently (third sketch below). This value was found by trial and error to maximize speed.
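
The following is a minimal sketch of the startup changes, not the actual evaltune code: `count_lines`, `advise_huge_pages`, `load_fens`, and the element types are illustrative placeholders. It shows the pre-scan, the single up-front `reserve`, and a best-effort `MADV_HUGEPAGE` hint on the reserved capacity.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

#ifdef __linux__
    #include <sys/mman.h>
    #include <unistd.h>
#endif

// Count the lines of a FEN file so the containers can be sized in one shot.
static std::size_t count_lines(const std::string& path) {
    std::ifstream in(path);
    std::string   line;
    std::size_t   n = 0;
    while (std::getline(in, line))
        ++n;
    return n;
}

// Advise the kernel to back the reserved capacity with transparent huge pages.
// madvise needs a page-aligned start, so the range is aligned inward; the PR
// measured this as speed-neutral, so any failure is simply ignored.
template<typename T>
static void advise_huge_pages(std::vector<T>& v) {
#ifdef __linux__
    const std::uintptr_t page  = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(v.data());
    const std::uintptr_t end   = begin + v.capacity() * sizeof(T);
    const std::uintptr_t start = (begin + page - 1) & ~(page - 1);
    if (start < end)
        madvise(reinterpret_cast<void*>(start), end - start, MADV_HUGEPAGE);
#else
    (void)v;
#endif
}

// Pre-scan, reserve exactly once, then parse. Element types are placeholders.
void load_fens(const std::string& path,
               std::vector<std::string>& positions,
               std::vector<double>&      results) {
    const std::size_t n = count_lines(path);
    positions.reserve(n);  // single allocation instead of growth-driven reallocations
    results.reserve(n);
    advise_huge_pages(positions);
    advise_huge_pages(results);

    std::ifstream in(path);
    std::string   line;
    while (std::getline(in, line)) {
        positions.push_back(line);  // the real loader parses the FEN and its result here
        results.push_back(0.0);
    }
}
```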
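The second sketch illustrates the lock-free reduction under stated assumptions: each thread accumulates into its own buffer, and a `std::barrier` completion function sums the buffers and applies the optimizer step. `NUM_PARAMS`, `NUM_THREADS`, `LR`, `params`, and `apply_optimizer_step` (a plain SGD stand-in) are hypothetical names, not the evaltune symbols.

```cpp
#include <algorithm>
#include <barrier>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_PARAMS  = 1024;   // placeholder parameter count
constexpr std::size_t NUM_THREADS = 8;      // placeholder thread count
constexpr double      LR          = 0.001;  // placeholder learning rate

std::vector<double>              params(NUM_PARAMS, 0.0);
std::vector<double>              global_grad(NUM_PARAMS, 0.0);
std::vector<std::vector<double>> thread_grads(NUM_THREADS,
                                              std::vector<double>(NUM_PARAMS, 0.0));

// Stand-in for the real optimizer update (plain SGD here).
void apply_optimizer_step(const std::vector<double>& grad) noexcept {
    for (std::size_t i = 0; i < NUM_PARAMS; ++i)
        params[i] -= LR * grad[i];
}

// The completion function runs exactly once per phase, after every thread has arrived:
// it sums the per-thread buffers, resets them, and applies the optimizer step.
// No mutex is needed because only one thread executes this block.
std::barrier sync_point(NUM_THREADS, []() noexcept {
    std::fill(global_grad.begin(), global_grad.end(), 0.0);
    for (auto& local : thread_grads) {
        for (std::size_t i = 0; i < NUM_PARAMS; ++i)
            global_grad[i] += local[i];
        std::fill(local.begin(), local.end(), 0.0);  // ready for the next batch
    }
    apply_optimizer_step(global_grad);
});

void worker(std::size_t tid) {
    // ... forward/backward over this thread's slice, accumulating into thread_grads[tid] ...
    (void)tid;
    sync_point.arrive_and_wait();  // reduction and optimizer step run in the completion step
}
```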
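The third sketch shows the microbatching shape only; `Tape`, `Position`, `forward`, and `train_epoch` are minimal stand-ins for the real autograd types, assuming the tape exposes cleanup and zero_grad as described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t BATCH_SIZE = 160;  // chunk size chosen by trial and error in the PR

// Minimal stand-ins for the real autograd tape and position types.
struct Tape {
    void backward()  {}  // propagate gradients for everything recorded so far
    void cleanup()   {}  // free the nodes recorded on the tape
    void zero_grad() {}  // clear accumulated parameter gradients
};
struct Position {};

double forward(Tape&, const Position&) { return 0.0; }  // records nodes on the tape

// Process positions in microbatches so the tape is flushed frequently instead of
// growing for a whole epoch, which is what keeps memory bounded.
void train_epoch(const std::vector<Position>& positions, Tape& tape) {
    for (std::size_t start = 0; start < positions.size(); start += BATCH_SIZE) {
        const std::size_t end = std::min(start + BATCH_SIZE, positions.size());
        for (std::size_t i = start; i < end; ++i)
            forward(tape, positions[i]);
        tape.backward();
        tape.cleanup();    // flush the autograd tape every microbatch
        tape.zero_grad();  // the real loop applies the optimizer step before zeroing
    }
}
```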

Measured on a Ryzen 7 7735HS: epoch time improved from the 10 s baseline to 8.5 s and then to 5 s.
This also fixes memory usage exceeding the expected estimate of ~5.2 GB: in my case it grew to 7.7 GB (stabilized), and on giovok's machine it could peak above 20 GB during loading before stabilizing at around 11 GB.

Aethdv and others added 5 commits January 9, 2026 15:31
- **Startup**: Pre-scan FEN files to count lines. Reserve the positions and results vectors at their exact size upfront to prevent reallocations.
- **Huge Pages**: Apply MADV_HUGEPAGE to the reserved capacity (measured no speedup).
- **Concurrency**: Replace mutex-based gradient accumulation with per-thread buffers.
- **Reduction**: Use std::barrier to sum thread gradients and apply optimizer steps, removing lock contention.
- **Tape Management**: Process positions in chunks of 1024 to flush the autograd tape (cleanup/zero_grad) frequently.
Bench: 13922193