Optimize allocation, lockless accumulation, and tape flushing #351
+187
−115
Greatly improve the evaltune system by redesigning part of the process.
- Reserve the `positions` and `results` vectors at their exact size upfront to prevent reallocations.
- Apply `MADV_HUGEPAGE` to the reserved capacity (measured no speedup, but including this doesn't hurt).
- Use `std::barrier` to sum thread gradients and apply optimizer steps, removing lock contention.
- … (`cleanup`/`zero_grad`) frequently. This value was obtained by trial and error to maximize speed.

Measured per-epoch time on a Ryzen 7 7735HS: 10s (baseline) -> 8.5s -> 5s.

Also fixed memory usage exceeding the expected estimate of ~5.2 GB: in my case it reached 7.7 GB (stabilized), and on giovok's machine it could reach 20+ GB while loading before stabilizing at around 11 GB.