Optimize chunker hot loop for ~1.5-2x throughput#307

Merged
folbricht merged 1 commit into master from chunker-benchmarks
Feb 8, 2026

Conversation

folbricht (Owner) commented Feb 8, 2026

Summary

  • Replace expensive DIVL instruction in boundary detection (hValue % disc == disc-1) with Lemire's fast divisibility test — a multiply-and-compare that costs ~5 cycles vs ~26 for hardware division
  • Convert hashTable from slice to [256]uint32 fixed-size array, eliminating bounds checks on all accesses
  • Add precomputed hashTableRotated table, removing one RotateLeft32 call per byte in the hot loop
  • Hoist struct fields (hValue, hIdx, buf, window pointer, Lemire constants) into local variables to avoid repeated pointer dereferences
  • Replace (hIdx + 1) % 48 with a branch (if hIdx >= 48) — perfectly predicted, taken once per 48 iterations
  • Reuse backing buffer in fillBuffer() to reduce allocations from O(filesize/buffersize) to O(1)
  • Add b.SetBytes() to benchmarks for direct MB/s throughput reporting; modernize to use b.Loop()

All optimizations preserve identical chunk boundaries, verified by TestChunkerLargeFile which checks exact SHA512/256 hashes for every chunk.

Test plan

  • go test -run TestChunkerLargeFile — exact chunk boundary regression (SHA512/256 hashes)
  • go test -run TestChunker — all chunker tests pass
  • go test -bench=BenchmarkChunk -benchmem — verify throughput improvement
  • CI passes on Linux, macOS, Windows

🤖 Generated with Claude Code

Before:

goos: linux
goarch: amd64
pkg: github.com/folbricht/desync
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkChunker-4         	     272	   4304447 ns/op	 2621489 B/op	       2 allocs/op
BenchmarkChunkNull1M-4     	     205	   5907772 ns/op	 2621489 B/op	       2 allocs/op
BenchmarkChunkNull10M-4    	      15	  69146171 ns/op	13107254 B/op	       6 allocs/op
BenchmarkChunkNull50M-4    	       4	 294544244 ns/op	55050312 B/op	      23 allocs/op
BenchmarkChunkNull100M-4   	       2	 593006637 ns/op	107479112 B/op	      43 allocs/op

After:

goos: linux
goarch: amd64
pkg: github.com/folbricht/desync
cpu: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
BenchmarkChunker-4         	     472	   2371186 ns/op	 442.22 MB/s	 2621489 B/op	       2 allocs/op
BenchmarkChunkNull1M-4     	     376	   3125929 ns/op	 335.44 MB/s	 2621490 B/op	       2 allocs/op
BenchmarkChunkNull10M-4    	      38	  31382746 ns/op	 334.12 MB/s	 2621489 B/op	       2 allocs/op
BenchmarkChunkNull50M-4    	       7	 153479795 ns/op	 341.60 MB/s	 2621488 B/op	       2 allocs/op
BenchmarkChunkNull100M-4   	       4	 307347136 ns/op	 341.17 MB/s	 2621488 B/op	       2 allocs/op

Closes #244

Replace the two main bottlenecks in the chunker's byte-processing loop:
the expensive DIVL instruction for boundary detection (~26 cycles/byte)
and the modulo-48 window index (~14 instructions/byte).

Key changes:
- Replace modulo boundary check with Lemire's fast divisibility test
  (multiply-and-compare, ~5 cycles vs ~26 for hardware division)
- Convert hashTable from slice to [256]uint32 array (eliminates bounds checks)
- Add precomputed hashTableRotated table (removes one RotateLeft32 per byte)
- Hoist struct fields into local variables in the hot loop
- Replace modulo-48 window index with branch (predicted once per 48 iterations)
- Reuse backing buffer in fillBuffer() to reduce allocations to O(1)
- Add b.SetBytes() to benchmarks for direct MB/s reporting
- Modernize benchmarks to use b.Loop() and bytes.NewReader

All optimizations preserve identical chunk boundaries, verified by
TestChunkerLargeFile which checks exact SHA512/256 hashes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
folbricht merged commit f7c780b into master Feb 8, 2026
6 checks passed
Successfully merging this pull request may close these issues.

borg is faster than "desync make" despite borg is single-threaded
