Optimize chunker hot loop for ~1.5-2x throughput #307
Merged
Replace the two main bottlenecks in the chunker's byte-processing loop: the expensive DIVL instruction for boundary detection (~26 cycles/byte) and the modulo-48 window index (~14 instructions/byte).

Key changes:
- Replace modulo boundary check with Lemire's fast divisibility test (multiply-and-compare, ~5 cycles vs ~26 for hardware division)
- Convert hashTable from slice to [256]uint32 array (eliminates bounds checks)
- Add precomputed hashTableRotated table (removes one RotateLeft32 per byte)
- Hoist struct fields into local variables in the hot loop
- Replace modulo-48 window index with branch (predicted once per 48 iterations)
- Reuse backing buffer in fillBuffer() to reduce allocations to O(1)
- Add b.SetBytes() to benchmarks for direct MB/s reporting
- Modernize benchmarks to use b.Loop() and bytes.NewReader

All optimizations preserve identical chunk boundaries, verified by TestChunkerLargeFile which checks exact SHA512/256 hashes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Replace the `DIVL` instruction in boundary detection (`hValue % disc == disc-1`) with Lemire's fast divisibility test, a multiply-and-compare that costs ~5 cycles vs ~26 for hardware division
- Convert `hashTable` from slice to `[256]uint32` fixed-size array, eliminating bounds checks on all accesses
- Add precomputed `hashTableRotated` table, removing one `RotateLeft32` call per byte in the hot loop
- Hoist struct fields (`hValue`, `hIdx`, `buf`, window pointer, Lemire constants) into local variables to avoid repeated pointer dereferences
- Replace `(hIdx + 1) % 48` with a branch (`if hIdx >= 48`) that is perfectly predicted and taken once per 48 iterations
- Reuse backing buffer in `fillBuffer()` to reduce allocations from O(filesize/buffersize) to O(1)
- Add `b.SetBytes()` to benchmarks for direct MB/s throughput reporting; modernize to use `b.Loop()`

All optimizations preserve identical chunk boundaries, verified by `TestChunkerLargeFile`, which checks exact SHA512/256 hashes for every chunk.

Test plan
- `go test -run TestChunkerLargeFile`: exact chunk boundary regression (SHA512/256 hashes)
- `go test -run TestChunker`: all chunker tests pass
- `go test -bench=BenchmarkChunk -benchmem`: verify throughput improvement

🤖 Generated with Claude Code
Before:
After:
Closes #244