bsayed/1brc

# One Billion Row Challenge (1BRC) - Rust Implementation

A high-performance Rust solution for the One Billion Row Challenge, designed to process 1 billion temperature measurements as fast as possible.

## Performance

- Current: ~1.5 seconds on Apple M4 Max (10P + 4E cores)
- File size: ~15.9 GB (1 billion rows)
- Throughput: ~10.6 GB/s

## Architecture Overview

```mermaid
flowchart TB
    subgraph Input
        F["measurements.txt<br/>15.9 GB"]
    end

    subgraph MemMap["Memory Mapping"]
        M["Memory-Mapped File<br/>mmap + Sequential Advice"]
    end

    subgraph Parallel["Parallel Processing"]
        C1[Chunk 1]
        C2[Chunk 2]
        C3[Chunk ...]
        CN[Chunk N]
    end

    subgraph PerThread["Per-Thread Processing"]
        P1[FastTable 1]
        P2[FastTable 2]
        P3[FastTable ...]
        PN[FastTable N]
    end

    subgraph Merge["Merge Phase"]
        R["Reduce/Merge<br/>All FastTables"]
    end

    subgraph Output
        S[Sort by Station Name]
        O["Formatted Output<br/>stdout"]
    end

    F --> M
    M --> C1 & C2 & C3 & CN
    C1 --> P1
    C2 --> P2
    C3 --> P3
    CN --> PN
    P1 & P2 & P3 & PN --> R
    R --> S --> O
```

## Data Flow

```mermaid
sequenceDiagram
    participant Main
    participant Rayon as Rayon ThreadPool
    participant Chunk as Chunk Processor
    participant Table as FastTable

    Main->>Main: Open file & create mmap
    Main->>Main: Set Sequential read advice
    Main->>Main: Calculate chunk boundaries
    Main->>Rayon: Spawn parallel iterators

    par For each chunk
        Rayon->>Chunk: process_chunk(data)
        loop For each line
            Chunk->>Chunk: Find semicolon (memchr SIMD)
            Chunk->>Chunk: Extract station name (zero-copy)
            Chunk->>Chunk: Hash name (xxHash3)
            Chunk->>Chunk: Parse temperature (inline unsafe)
            Chunk->>Table: update(name, hash, value)
        end
        Chunk-->>Rayon: Return FastTable
    end

    Rayon->>Main: Reduce (merge all tables)
    Main->>Main: Sort results
    Main->>Main: Write to stdout
```

## Core Components

### 1. StationStats

Stores aggregated statistics for each weather station:

```rust
struct StationStats<'a> {
    name: &'a [u8], // Reference to mmap (zero-copy)
    hash: u64,      // Cached hash for lookup
    min: i32,       // Minimum temp (scaled by 10)
    max: i32,       // Maximum temp (scaled by 10)
    sum: i64,       // Sum for average calculation
    count: u32,     // Number of readings
}
```
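As a sketch of how these fields are used, the fragment below (hypothetical names, lifetime and name/hash fields elided) records scaled readings and derives the average; it is not the repo's exact code:

```rust
// Hypothetical sketch of the per-station aggregation; the real
// StationStats also carries the name slice and cached hash.
#[derive(Debug)]
struct Stats {
    min: i32, // scaled by 10
    max: i32, // scaled by 10
    sum: i64,
    count: u32,
}

impl Stats {
    fn new() -> Self {
        Stats { min: i32::MAX, max: i32::MIN, sum: 0, count: 0 }
    }

    // Record one temperature already scaled by 10 (e.g. 15.2 -> 152).
    fn record(&mut self, v: i32) {
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v as i64;
        self.count += 1;
    }

    // Average in degrees, undoing the x10 scaling only at the end.
    fn avg(&self) -> f64 {
        self.sum as f64 / (self.count as f64 * 10.0)
    }
}

fn main() {
    let mut s = Stats::new();
    s.record(152); // 15.2
    s.record(-34); // -3.4
    println!("{:.1}/{:.1}/{:.1}", s.min as f64 / 10.0, s.avg(), s.max as f64 / 10.0);
}
```

Keeping everything in integers until the final division is what lets the hot loop avoid floating-point work entirely.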

### 2. FastTable

Custom open-addressed hash table optimized for this workload:

```mermaid
flowchart LR
    subgraph FastTable["FastTable (65,536 slots)"]
        direction TB
        S0["Slot 0: None"]
        S1["Slot 1: Some(Stats)"]
        S2["Slot 2: None"]
        S3["Slot 3: Some(Stats)"]
        SN["..."]
    end

    H["hash & MASK"] --> |index| FastTable

    subgraph Probing["Linear Probing"]
        P1["1. Check slot[idx]"]
        P2["2. Compare hash only (fast)"]
        P3["3. If collision: idx = (idx+1) & MASK"]
    end
```

Key features:

- Size: 65,536 slots (2^16, a power of two, so a bitmask replaces modulo)
- Collision resolution: linear probing
- Lookup: hash-only comparison (no string comparison, for speed)
- Load factor: ~15% (10K stations / 65K slots), so collisions are rare
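A minimal sketch of this probing scheme, with a bare `i64` running sum as the payload instead of full `StationStats` (names and layout here are illustrative, not the repo's code):

```rust
// Power-of-two open-addressed table with hash-only comparison
// and linear probing, as described above.
const SLOTS: usize = 1 << 16;

struct Table {
    slots: Vec<Option<(u64, i64)>>, // (cached hash, running sum)
}

impl Table {
    fn new() -> Self {
        Table { slots: vec![None; SLOTS] }
    }

    fn update(&mut self, hash: u64, value: i64) {
        let mask = SLOTS - 1; // power of two: `& mask` replaces `% SLOTS`
        let mut idx = (hash as usize) & mask;
        loop {
            match &mut self.slots[idx] {
                // Hash-only comparison: no byte-wise name check.
                Some((h, sum)) if *h == hash => { *sum += value; return; }
                slot @ None => { *slot = Some((hash, value)); return; }
                _ => idx = (idx + 1) & mask, // collision: probe next slot
            }
        }
    }

    fn get(&self, hash: u64) -> Option<i64> {
        let mask = SLOTS - 1;
        let mut idx = (hash as usize) & mask;
        while let Some((h, sum)) = self.slots[idx] {
            if h == hash { return Some(sum); }
            idx = (idx + 1) & mask;
        }
        None
    }
}

fn main() {
    let mut t = Table::new();
    t.update(42, 152);
    t.update(42, -34);
    t.update(42 + SLOTS as u64, 7); // same initial slot: forces a probe
    println!("{:?} {:?}", t.get(42), t.get(42 + SLOTS as u64));
}
```

Trusting the 64-bit hash alone is a deliberate gamble: with ~10K keys the chance of an xxHash3 collision is negligible, and skipping the byte-wise name compare removes a memory access from the hottest path.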

### 3. Chunk Processing Pipeline

```mermaid
flowchart LR
    subgraph Input["Raw Bytes"]
        D["London;15.2\nParis;-3.4\nTokyo;22.1\n..."]
    end

    subgraph Parse["Parsing Steps"]
        S1["1. memchr(';') - NEON SIMD"]
        S2["2. Extract name slice"]
        S3["3. xxh3_64(name)"]
        S4["4. Parse temp (unsafe)"]
    end

    subgraph Update["Hash Table Update"]
        U["FastTable.update()<br/>Hash-only lookup"]
    end

    D --> S1 --> S2 --> S3 --> S4 --> U
```
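The pipeline can be sketched with stdlib stand-ins: `position` plays the role of memchr's SIMD search and `DefaultHasher` that of `xxh3_64` (the real crates are drop-in replacements for these calls):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Returns (name hash, temperature bytes) per line; the real loop
// feeds FastTable.update() directly instead of collecting.
fn scan_chunk(data: &[u8]) -> Vec<(u64, Vec<u8>)> {
    let mut out = Vec::new();
    for line in data.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
        // memchr(b';', line) in the real code (NEON SIMD on ARM)
        let semi = line.iter().position(|&b| b == b';').expect("missing ';'");
        let name = &line[..semi]; // zero-copy slice into the chunk
        let mut h = DefaultHasher::new(); // xxh3_64(name) in the real code
        name.hash(&mut h);
        out.push((h.finish(), line[semi + 1..].to_vec()));
    }
    out
}

fn main() {
    let rows = scan_chunk(b"London;15.2\nParis;-3.4\n");
    println!("{} rows, first temp = {:?}", rows.len(),
             String::from_utf8_lossy(&rows[0].1));
}
```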

### 4. Temperature Parsing

An optimized inline parser handles the four possible formats: `X.X`, `-X.X`, `XX.X`, `-XX.X`.

```mermaid
flowchart TD
    I["Input bytes after ';'"] --> N{"First byte == '-'?"}
    N -->|Yes| NEG["negative = true<br/>skip 1 byte"]
    N -->|No| POS["negative = false"]
    NEG --> F
    POS --> F
    F{"Second byte == '.'?"}
    F -->|Yes| X1["X.X format<br/>d0*10 + d1"]
    F -->|No| X2["XX.X format<br/>d0*100 + d1*10 + d2"]
    X1 --> BR["if neg: val = -val"]
    X2 --> BR
```
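A safe-Rust rendering of this decision tree (the real parser does the same with raw pointer arithmetic, which is where the speedup comes from):

```rust
// Returns the temperature scaled by 10, e.g. "15.2" -> 152.
fn parse_temp(b: &[u8]) -> i32 {
    // Branch 1: optional leading '-'
    let (neg, d) = if b[0] == b'-' { (true, &b[1..]) } else { (false, b) };
    // Branch 2: one or two integer digits, decided by the '.' position
    let v = if d[1] == b'.' {
        // X.X: d0*10 + d1
        (d[0] - b'0') as i32 * 10 + (d[2] - b'0') as i32
    } else {
        // XX.X: d0*100 + d1*10 + d2 (d[2] is the '.')
        (d[0] - b'0') as i32 * 100 + (d[1] - b'0') as i32 * 10 + (d[3] - b'0') as i32
    };
    if neg { -v } else { v }
}

fn main() {
    for s in ["1.5", "-3.4", "15.2", "-99.9"] {
        println!("{s} -> {}", parse_temp(s.as_bytes()));
    }
}
```

Because the challenge input guarantees exactly one decimal digit and at most two integer digits, two branches suffice and no general-purpose float parsing is needed.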

## Chunk Boundary Alignment

Each thread must process only complete lines, so chunk boundaries are aligned to newlines and no line is ever split:

```mermaid
flowchart LR
    subgraph File["Memory-Mapped File"]
        direction LR
        B1["Chunk 1 boundary"]
        B2["Chunk 2 boundary"]
        B3["Chunk 3 boundary"]
    end

    subgraph Alignment["Alignment Strategy"]
        A1["Thread 0: Start at 0"]
        A2["Thread N: Find next '\\n' after boundary"]
        A3["End: Find next '\\n' after chunk end"]
    end

    B1 --> A1
    B2 --> A2
    B3 --> A3
```
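The strategy above can be sketched as follows; the helper mirrors what the repo's `find_next_newline()` does, but the function names and signatures here are assumptions, not the actual code:

```rust
// Every boundary except the first is bumped to just past the next
// '\n', so each (start, end) range contains only whole lines.
fn align_chunks(data: &[u8], n_chunks: usize) -> Vec<(usize, usize)> {
    // Index one past the next '\n' at or after `from`, or EOF.
    let past_newline = |from: usize| {
        data[from..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(data.len(), |i| from + i + 1)
    };
    let approx = data.len() / n_chunks;
    let mut bounds = Vec::with_capacity(n_chunks);
    let mut start = 0;
    for i in 1..=n_chunks {
        let end = if i == n_chunks { data.len() } else { past_newline(i * approx) };
        bounds.push((start, end));
        start = end;
    }
    bounds
}

fn main() {
    let data = b"London;15.2\nParis;-3.4\nTokyo;22.1\n";
    println!("{:?}", align_chunks(data, 2)); // each end lands just past a '\n'
}
```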

## Dependencies

| Crate | Purpose |
| --- | --- |
| `memmap2` | Memory-mapped file I/O with advisory hints |
| `rayon` | Data parallelism via work-stealing |
| `xxhash-rust` | xxHash3, a fast non-cryptographic hash |
| `memchr` | SIMD-optimized byte search (NEON on ARM) |

## Build & Run

```sh
# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Run twice; the first run warms the file cache
time ./target/release/onebrc > /dev/null
time ./target/release/onebrc > /dev/null  # use this second timing
```

## Input Format

```text
StationName;Temperature
London;15.2
Paris;-3.4
Tokyo;22.1
...
```

## Output Format

```text
{Aadorf=-99.9/0.3/99.9, Aalen=-99.9/-0.4/99.9, ...}
```

Each entry is `StationName=min/avg/max`, sorted alphabetically by station name.
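A sketch of formatting one entry from the scaled-by-10 integers (this uses plain `{:.1}` rounding; the official challenge defines its own rounding rule for the average, which a submission would need to match):

```rust
// Builds "name=min/avg/max" from integer stats; only here, at output
// time, do the scaled integers become decimal strings.
fn format_station(name: &str, min: i32, sum: i64, count: u32, max: i32) -> String {
    let avg = sum as f64 / (count as f64 * 10.0);
    format!("{name}={:.1}/{avg:.1}/{:.1}", min as f64 / 10.0, max as f64 / 10.0)
}

fn main() {
    // 15.2 and -3.4 recorded: sum = 152 + (-34) = 118, count = 2
    println!("{}", format_station("London", -34, 118, 2, 152));
}
```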

## Optimization Techniques Used

  1. Memory Mapping - Zero-copy file access via mmap
  2. Sequential Advice - Kernel read-ahead optimization
  3. Parallel Processing - Work-stealing via Rayon (14 threads on M4 Max)
  4. Custom Hash Table - 65K slots, hash-only comparison
  5. Zero-Copy Strings - Station names reference mmap directly
  6. SIMD Search - memchr uses NEON for semicolon search
  7. xxHash3 - Fast hashing optimized for short strings
  8. Inline Unsafe Parsing - Temperature parsed with raw pointer arithmetic
  9. Integer Arithmetic - Temperatures scaled by 10 to avoid floats
  10. Minimal Branching - Predictable branches in hot path

## Performance Breakdown (from profiling)

| Component | % Time | Notes |
| --- | --- | --- |
| xxHash3 hashing | ~57% | Station name hashing |
| memchr search | ~22% | Finding semicolons |
| FastTable update | ~18% | Hash table operations |
| Other | ~3% | Merge, sort, output |

## Hasher Comparison

During development, multiple hash functions were tested:

| Hasher | Time | Notes |
| --- | --- | --- |
| xxHash3 | ~1.5s | Winner, optimized for short strings |
| ahash | ~1.5s | Similar performance, more complex API |
| Highway | ~3.3s | Too much setup overhead for short strings |
| FxHash | ~3.7s | Poor distribution caused collisions |

## Code Structure

```text
src/main.rs
├── StationStats        - Per-station aggregated data
├── FastTable           - Open-addressed hash table
│   ├── new()           - Initialize with 65K slots
│   ├── update()        - Insert/update station (hash-only lookup)
│   ├── merge()         - Combine tables from parallel workers
│   └── merge_entry()   - Merge single entry
├── main()              - Entry point, orchestrates parallel processing
├── find_next_newline() - Chunk boundary alignment
└── process_chunk()     - Core parsing loop (memchr + xxHash3 + parse)
```

## Future Optimizations

- SWAR (SIMD Within A Register) for semicolon search
- Compute the hash while scanning for the semicolon (overlap the work)
- Prefetching hints for hash table access
- Custom allocator for hash table entries
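The first idea can be sketched with the classic "has-zero-byte" bit trick: XOR a word against a semicolon-filled pattern, then test all eight bytes for zero at once (a sketch of the technique, not code from this repo):

```rust
// Returns the byte index (little-endian) of the first ';' in an
// 8-byte word, or None, without examining bytes one at a time.
fn swar_find_semicolon(word: u64) -> Option<u32> {
    const ONES: u64 = 0x0101_0101_0101_0101;
    const HIGH: u64 = 0x8080_8080_8080_8080;
    let x = word ^ (ONES * b';' as u64);        // bytes equal to ';' become 0
    let hit = x.wrapping_sub(ONES) & !x & HIGH; // high bit set on zero bytes
    if hit == 0 {
        None
    } else {
        Some(hit.trailing_zeros() / 8) // lowest flagged byte = first match
    }
}

fn main() {
    let w = u64::from_le_bytes(*b"abc;defg");
    println!("{:?}", swar_find_semicolon(w));
    println!("{:?}", swar_find_semicolon(u64::from_le_bytes(*b"abcdefgh")));
}
```

On ARM this is unlikely to beat memchr's NEON path for long names, but it avoids function-call overhead for the short-name common case.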
