A high-performance Rust solution for the One Billion Row Challenge, designed to process 1 billion temperature measurements as fast as possible.
- Current: ~1.5 seconds on Apple M4 Max (10P + 4E cores)
- File size: ~15.9 GB (1 billion rows)
- Throughput: ~10.6 GB/s
```mermaid
flowchart TB
    subgraph Input
        F[measurements.txt<br/>15.9 GB]
    end
    subgraph Mapping["Memory Mapping"]
        M[Memory-Mapped File<br/>mmap + Sequential Advice]
    end
    subgraph Parallel["Parallel Processing"]
        C1[Chunk 1]
        C2[Chunk 2]
        C3[Chunk ...]
        CN[Chunk N]
    end
    subgraph PerThread["Per-Thread Processing"]
        P1[FastTable 1]
        P2[FastTable 2]
        P3[FastTable ...]
        PN[FastTable N]
    end
    subgraph Merge["Merge Phase"]
        R[Reduce/Merge<br/>All FastTables]
    end
    subgraph Output
        S[Sort by Station Name]
        O[Formatted Output<br/>stdout]
    end
    F --> M
    M --> C1 & C2 & C3 & CN
    C1 --> P1
    C2 --> P2
    C3 --> P3
    CN --> PN
    P1 & P2 & P3 & PN --> R
    R --> S --> O
```
```mermaid
sequenceDiagram
    participant Main
    participant Rayon as Rayon ThreadPool
    participant Chunk as Chunk Processor
    participant Table as FastTable
    Main->>Main: Open file & create mmap
    Main->>Main: Set Sequential read advice
    Main->>Main: Calculate chunk boundaries
    Main->>Rayon: Spawn parallel iterators
    par For each chunk
        Rayon->>Chunk: process_chunk(data)
        loop For each line
            Chunk->>Chunk: Find semicolon (memchr SIMD)
            Chunk->>Chunk: Extract station name (zero-copy)
            Chunk->>Chunk: Hash name (xxHash3)
            Chunk->>Chunk: Parse temperature (inline unsafe)
            Chunk->>Table: update(name, hash, value)
        end
        Chunk-->>Rayon: Return FastTable
    end
    Rayon->>Main: Reduce (merge all tables)
    Main->>Main: Sort results
    Main->>Main: Write to stdout
```
Stores aggregated statistics for each weather station:

```rust
struct StationStats<'a> {
    name: &'a [u8], // Reference to mmap (zero-copy)
    hash: u64,      // Cached hash for lookup
    min: i32,       // Minimum temp (scaled by 10)
    max: i32,       // Maximum temp (scaled by 10)
    sum: i64,       // Sum for average calculation
    count: u32,     // Number of readings
}
```

Custom open-addressed hash table optimized for this workload:
```mermaid
flowchart LR
    subgraph FastTable["FastTable (65,536 slots)"]
        direction TB
        S0["Slot 0: None"]
        S1["Slot 1: Some(Stats)"]
        S2["Slot 2: None"]
        S3["Slot 3: Some(Stats)"]
        SN["..."]
    end
    H["hash & MASK"] -->|index| FastTable
    subgraph Probing["Linear Probing"]
        P1["1. Check slot[idx]"]
        P2["2. Compare hash only (fast)"]
        P3["3. If collision: idx = (idx+1) & MASK"]
    end
```
Key features:
- Size: 65,536 slots (2^16, power of 2 for fast modulo via bitmask)
- Collision resolution: Linear probing
- Lookup: Hash-only comparison (no string comparison for speed)
- Load factor: ~15% (10K stations / 65K slots) - minimal collisions
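The features above can be sketched in safe Rust. This is an illustrative model, not the actual implementation: the `Entry` layout and method bodies here are assumptions (the real table stores full `StationStats` including the name slice).

```rust
// Illustrative sketch of FastTable: open addressing, linear probing,
// and hash-only key comparison (no string compare on lookup).
const SLOTS: usize = 1 << 16; // 65,536 slots, power of two
const MASK: usize = SLOTS - 1;

#[derive(Clone, Copy)]
struct Entry {
    hash: u64,
    min: i32,
    max: i32,
    sum: i64,
    count: u32,
}

struct FastTable {
    slots: Vec<Option<Entry>>,
}

impl FastTable {
    fn new() -> Self {
        FastTable { slots: vec![None; SLOTS] }
    }

    fn update(&mut self, hash: u64, value: i32) {
        let mut idx = hash as usize & MASK; // fast "modulo" via bitmask
        loop {
            match self.slots[idx] {
                // Hash-only comparison: equal hashes are treated as equal keys
                Some(ref mut e) if e.hash == hash => {
                    e.min = e.min.min(value);
                    e.max = e.max.max(value);
                    e.sum += value as i64;
                    e.count += 1;
                    return;
                }
                // Collision with a different key: linear probe to the next slot
                Some(_) => idx = (idx + 1) & MASK,
                None => {
                    self.slots[idx] = Some(Entry {
                        hash,
                        min: value,
                        max: value,
                        sum: value as i64,
                        count: 1,
                    });
                    return;
                }
            }
        }
    }
}

fn main() {
    let mut t = FastTable::new();
    t.update(42, 152);
    t.update(42, -34);
    let e = t.slots[42].unwrap();
    assert_eq!((e.min, e.max, e.sum, e.count), (-34, 152, 118, 2));
}
```

With ~10K stations in 65,536 slots, probes almost always terminate on the first slot, which is what makes the hash-only comparison pay off.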
```mermaid
flowchart LR
    subgraph Input["Raw Bytes"]
        D["London;15.2\nParis;-3.4\nTokyo;22.1\n..."]
    end
    subgraph Parse["Parsing Steps"]
        S1["1. memchr(';') - NEON SIMD"]
        S2["2. Extract name slice"]
        S3["3. xxh3_64(name)"]
        S4["4. Parse temp (unsafe)"]
    end
    subgraph Update["Hash Table Update"]
        U["FastTable.update()<br/>Hash-only lookup"]
    end
    D --> S1 --> S2 --> S3 --> S4 --> U
```
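The pipeline above as a self-contained sketch; plain `std` byte searches stand in for the `memchr` and xxHash3 calls so the snippet runs without dependencies, and the temperature parse is simplified:

```rust
// Sketch of the per-chunk parse loop. The real code uses memchr (SIMD)
// to find ';' and '\n', and hashes the name with xxh3_64.
fn process_chunk(chunk: &[u8], mut on_record: impl FnMut(&[u8], i32)) {
    let mut pos = 0;
    while pos < chunk.len() {
        // 1. Find the ';' separator (memchr in the real code)
        let semi = pos + chunk[pos..].iter().position(|&b| b == b';').unwrap();
        // 2. Zero-copy station-name slice into the chunk
        let name = &chunk[pos..semi];
        // 3. Find end of line, then parse the temperature as a scaled integer
        let nl = semi + 1 + chunk[semi + 1..].iter().position(|&b| b == b'\n').unwrap();
        let mut val: i32 = 0;
        let mut neg = false;
        for &b in &chunk[semi + 1..nl] {
            match b {
                b'-' => neg = true,
                b'.' => {} // skip the decimal point: 15.2 -> 152
                d => val = val * 10 + (d - b'0') as i32,
            }
        }
        // 4. Hand off to the aggregation table (a callback in this sketch)
        on_record(name, if neg { -val } else { val });
        pos = nl + 1;
    }
}

fn main() {
    let mut out = Vec::new();
    process_chunk(b"London;15.2\nParis;-3.4\n", |n, v| out.push((n.to_vec(), v)));
    assert_eq!(out, vec![(b"London".to_vec(), 152), (b"Paris".to_vec(), -34)]);
}
```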
The optimized inline parser handles four formats: X.X, -X.X, XX.X, -XX.X

```mermaid
flowchart TD
    I[Input bytes after ';'] --> N{First byte == '-'?}
    N -->|Yes| NEG[negative = true<br/>skip 1 byte]
    N -->|No| POS[negative = false]
    NEG --> F
    POS --> F
    F{Second byte == '.'?}
    F -->|Yes| X1["X.X format<br/>d0*10 + d1"]
    F -->|No| X2["XX.X format<br/>d0*100 + d1*10 + d2"]
    X1 --> BR["if neg { val = -val }"]
    X2 --> BR
```
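The flowchart maps onto a small branch. Here is a safe-Rust sketch (the real parser uses unsafe raw-pointer arithmetic); it returns the temperature scaled by 10 plus the number of bytes consumed:

```rust
// Safe sketch of the branch structure above. Input is the byte slice
// starting right after ';'. Returns (value_scaled_by_10, bytes_consumed).
fn parse_temp(b: &[u8]) -> (i32, usize) {
    let neg = b[0] == b'-';
    let b = if neg { &b[1..] } else { b };
    let (val, len) = if b[1] == b'.' {
        // X.X format: d0*10 + d1
        (((b[0] - b'0') as i32) * 10 + (b[2] - b'0') as i32, 3)
    } else {
        // XX.X format: d0*100 + d1*10 + d2
        (
            ((b[0] - b'0') as i32) * 100 + ((b[1] - b'0') as i32) * 10 + (b[3] - b'0') as i32,
            4,
        )
    };
    (if neg { -val } else { val }, len + neg as usize)
}

fn main() {
    assert_eq!(parse_temp(b"15.2\n").0, 152);
    assert_eq!(parse_temp(b"-3.4\n").0, -34);
    assert_eq!(parse_temp(b"-12.3\n").0, -123);
}
```

Because only these four shapes can occur, the parser needs exactly two data-dependent branches (sign, then decimal position), both highly predictable.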
Each thread must process complete lines. Boundary alignment ensures no line is split:
```mermaid
flowchart LR
    subgraph File["Memory-Mapped File"]
        direction LR
        B1["Chunk 1 boundary"]
        B2["Chunk 2 boundary"]
        B3["Chunk 3 boundary"]
    end
    subgraph Alignment["Alignment Strategy"]
        A1["Thread 0: Start at 0"]
        A2["Thread N: Find next '\\n' after boundary"]
        A3["End: Find next '\\n' after chunk end"]
    end
    B1 --> A1
    B2 --> A2
    B3 --> A3
```
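A sketch of that strategy, using an illustrative `chunk_ranges` helper rather than the real `find_next_newline()`:

```rust
// Split `data` into n_chunks byte ranges whose boundaries are snapped
// forward to just past the next '\n', so no line straddles two chunks.
fn chunk_ranges(data: &[u8], n_chunks: usize) -> Vec<(usize, usize)> {
    let step = data.len() / n_chunks;
    let mut ranges = Vec::with_capacity(n_chunks);
    let mut start = 0;
    for i in 1..=n_chunks {
        let end = if i == n_chunks {
            data.len() // last chunk runs to EOF
        } else {
            // Advance the nominal boundary to just past the next newline
            let mut e = i * step;
            while e < data.len() && data[e] != b'\n' {
                e += 1;
            }
            (e + 1).min(data.len())
        };
        ranges.push((start, end));
        start = end;
    }
    ranges
}

fn main() {
    let data = b"London;15.2\nParis;-3.4\nTokyo;22.1\n";
    for &(_, end) in &chunk_ranges(data, 2) {
        // every chunk ends at EOF or just past a newline
        assert!(end == data.len() || data[end - 1] == b'\n');
    }
}
```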
| Crate | Purpose |
|---|---|
| `memmap2` | Memory-mapped file I/O with advisory hints |
| `rayon` | Data parallelism via work-stealing |
| `xxhash-rust` | xxHash3, a fast non-cryptographic hash |
| `memchr` | SIMD-optimized byte search (NEON on ARM) |
```bash
# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Run (the first run warms the file cache)
time ./target/release/onebrc > /dev/null
time ./target/release/onebrc > /dev/null  # use this timing
```

Input format:

```
StationName;Temperature
London;15.2
Paris;-3.4
Tokyo;22.1
...
```
Output:

```
{Aadorf=-99.9/0.3/99.9, Aalen=-99.9/-0.4/99.9, ...}
```

Format: `StationName=min/avg/max`, sorted alphabetically by station name.
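Since temperatures are kept as integers scaled by 10, output formatting converts back to one decimal at the end. A sketch; the rounding convention for the mean is an assumption, not taken from the real code:

```rust
// Convert scaled-by-10 integer stats into the "name=min/avg/max" form.
fn format_station(name: &str, min: i32, max: i32, sum: i64, count: u32) -> String {
    // Round the mean of the scaled values, then shift the decimal point back
    let avg = (sum as f64 / count as f64).round() / 10.0;
    format!(
        "{}={:.1}/{:.1}/{:.1}",
        name,
        min as f64 / 10.0,
        avg,
        max as f64 / 10.0
    )
}

fn main() {
    // readings 15.2 and -3.4: min -3.4, mean (152 - 34) / 2 = 59 -> 5.9, max 15.2
    assert_eq!(format_station("Demo", -34, 152, 118, 2), "Demo=-3.4/5.9/15.2");
}
```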
- Memory Mapping - Zero-copy file access via `mmap`
- Sequential Advice - Kernel read-ahead optimization
- Parallel Processing - Work-stealing via Rayon (14 threads on M4 Max)
- Custom Hash Table - 65K slots, hash-only comparison
- Zero-Copy Strings - Station names reference the mmap directly
- SIMD Search - `memchr` uses NEON for the semicolon search
- xxHash3 - Fast hashing optimized for short strings
- Inline Unsafe Parsing - Temperature parsed with raw pointer arithmetic
- Integer Arithmetic - Temperatures scaled by 10 to avoid floats
- Minimal Branching - Predictable branches in the hot path
| Component | % Time | Notes |
|---|---|---|
| xxHash3 hashing | ~57% | Station name hashing |
| memchr search | ~22% | Finding semicolons |
| FastTable update | ~18% | Hash table operations |
| Other | ~3% | Merge, sort, output |
During development, multiple hash functions were tested:
| Hasher | Time | Notes |
|---|---|---|
| xxHash3 | ~1.5s | Winner - optimized for short strings |
| ahash | ~1.5s | Similar performance, more complex API |
| Highway | ~3.3s | Too much setup overhead for short strings |
| FxHash | ~3.7s | Poor distribution caused collisions |
```
src/main.rs
├── StationStats        - Per-station aggregated data
├── FastTable           - Open-addressed hash table
│   ├── new()           - Initialize with 65K slots
│   ├── update()        - Insert/update station (hash-only lookup)
│   ├── merge()         - Combine tables from parallel workers
│   └── merge_entry()   - Merge single entry
├── main()              - Entry point, orchestrates parallel processing
├── find_next_newline() - Chunk boundary alignment
└── process_chunk()     - Core parsing loop (memchr + xxHash3 + parse)
```
- SWAR (SIMD Within A Register) for semicolon search
- Compute hash while scanning for semicolon (overlap work)
- Prefetching hints for hash table access
- Custom allocator for hash table entries
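The SWAR idea in the first bullet can be sketched with the classic zero-byte detection trick, scanning 8 bytes per 64-bit word using only integer ops:

```rust
// SWAR (SIMD Within A Register) search for ';', 8 bytes at a time.
// Uses the standard "has-zero-byte" trick: (x - 0x01..) & !x & 0x80..
fn swar_find_semicolon(chunk: &[u8]) -> Option<usize> {
    const ONES: u64 = 0x0101_0101_0101_0101;
    const HIGH: u64 = 0x8080_8080_8080_8080;
    let semis = ONES * b';' as u64; // broadcast ';' into every byte lane
    let mut i = 0;
    while i + 8 <= chunk.len() {
        let word = u64::from_le_bytes(chunk[i..i + 8].try_into().unwrap());
        let x = word ^ semis; // lanes equal to ';' become 0x00
        let found = x.wrapping_sub(ONES) & !x & HIGH; // high bit set per zero lane
        if found != 0 {
            // lowest set bit identifies the first matching lane
            return Some(i + (found.trailing_zeros() / 8) as usize);
        }
        i += 8;
    }
    // scalar tail for the last < 8 bytes
    chunk[i..].iter().position(|&b| b == b';').map(|p| i + p)
}

fn main() {
    assert_eq!(swar_find_semicolon(b"London;15.2"), Some(6));
    assert_eq!(swar_find_semicolon(b"no semi here"), None);
}
```

Whether this beats `memchr`'s NEON path on the M4 would need measuring; the appeal is that it could be fused with hashing in the same scan, per the second bullet.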