bsayed/1brc

# One Billion Row Challenge (1BRC) - Rust Implementation

A high-performance Rust solution for the One Billion Row Challenge, designed to process 1 billion temperature measurements as fast as possible.

## Performance

- Current: ~1.5 seconds on Apple M4 Max (10P + 4E cores)
- File size: ~15.9 GB (1 billion rows)
- Throughput: ~10.6 GB/s

## Architecture Overview

```mermaid
flowchart TB
    subgraph Input
        F["measurements.txt<br/>15.9 GB"]
    end

    subgraph MemMap["Memory Mapping"]
        M["Memory-Mapped File<br/>mmap + Sequential Advice"]
    end

    subgraph Parallel["Parallel Processing"]
        C1[Chunk 1]
        C2[Chunk 2]
        C3[Chunk ...]
        CN[Chunk N]
    end

    subgraph PerThread["Per-Thread Processing"]
        P1[FastTable 1]
        P2[FastTable 2]
        P3[FastTable ...]
        PN[FastTable N]
    end

    subgraph Merge["Merge Phase"]
        R["Reduce/Merge<br/>All FastTables"]
    end

    subgraph Output
        S[Sort by Station Name]
        O["Formatted Output<br/>stdout"]
    end

    F --> M
    M --> C1 & C2 & C3 & CN
    C1 --> P1
    C2 --> P2
    C3 --> P3
    CN --> PN
    P1 & P2 & P3 & PN --> R
    R --> S --> O
```

## Data Flow

```mermaid
sequenceDiagram
    participant Main
    participant Rayon as Rayon ThreadPool
    participant Chunk as Chunk Processor
    participant Table as FastTable

    Main->>Main: Open file & create mmap
    Main->>Main: Set Sequential read advice
    Main->>Main: Calculate chunk boundaries
    Main->>Rayon: Spawn parallel iterators

    par For each chunk
        Rayon->>Chunk: process_chunk(data)
        loop For each line
            Chunk->>Chunk: Find semicolon (memchr SIMD)
            Chunk->>Chunk: Extract station name (zero-copy)
            Chunk->>Chunk: Hash name (xxHash3)
            Chunk->>Chunk: Parse temperature (inline unsafe)
            Chunk->>Table: update(name, hash, value)
        end
        Chunk-->>Rayon: Return FastTable
    end

    Rayon->>Main: Reduce (merge all tables)
    Main->>Main: Sort results
    Main->>Main: Write to stdout
```

## Core Components

### 1. StationStats

Stores aggregated statistics for each weather station:

```rust
struct StationStats<'a> {
    name: &'a [u8], // Reference to mmap (zero-copy)
    hash: u64,      // Cached hash for lookup
    min: i32,       // Minimum temp (scaled by 10)
    max: i32,       // Maximum temp (scaled by 10)
    sum: i64,       // Sum for average calculation
    count: u32,     // Number of readings
}
```
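As a sketch of how these fields are used, the fragment below (hypothetical names, lifetime and name/hash fields elided) records scaled readings and derives the average; it is not the repo's exact code:

```rust
// Hypothetical sketch of the per-station aggregation; the real
// StationStats also carries the name slice and cached hash.
#[derive(Debug)]
struct Stats {
    min: i32, // scaled by 10
    max: i32, // scaled by 10
    sum: i64,
    count: u32,
}

impl Stats {
    fn new() -> Self {
        Stats { min: i32::MAX, max: i32::MIN, sum: 0, count: 0 }
    }

    // Record one temperature already scaled by 10 (e.g. 15.2 -> 152).
    fn record(&mut self, v: i32) {
        self.min = self.min.min(v);
        self.max = self.max.max(v);
        self.sum += v as i64;
        self.count += 1;
    }

    // Average in degrees, undoing the x10 scaling only at the end.
    fn avg(&self) -> f64 {
        self.sum as f64 / (self.count as f64 * 10.0)
    }
}

fn main() {
    let mut s = Stats::new();
    s.record(152); // 15.2
    s.record(-34); // -3.4
    println!("{:.1}/{:.1}/{:.1}", s.min as f64 / 10.0, s.avg(), s.max as f64 / 10.0);
}
```

Keeping everything in integers until the final division is what lets the hot loop avoid floating-point work entirely.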

### 2. FastTable

Custom open-addressed hash table optimized for this workload:

```mermaid
flowchart LR
    subgraph FastTable["FastTable (65,536 slots)"]
        direction TB
        S0["Slot 0: None"]
        S1["Slot 1: Some(Stats)"]
        S2["Slot 2: None"]
        S3["Slot 3: Some(Stats)"]
        SN["..."]
    end

    H["hash & MASK"] --> |index| FastTable

    subgraph Probing["Linear Probing"]
        P1["1. Check slot[idx]"]
        P2["2. Compare hash only (fast)"]
        P3["3. If collision: idx = (idx+1) & MASK"]
    end
```

Key features:

- Size: 65,536 slots (2^16, a power of two, so a bitmask replaces modulo)
- Collision resolution: linear probing
- Lookup: hash-only comparison (no string comparison, for speed)
- Load factor: ~15% (10K stations / 65K slots), so collisions are rare
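A minimal sketch of this probing scheme, with a bare `i64` running sum as the payload instead of full `StationStats` (names and layout here are illustrative, not the repo's code):

```rust
// Power-of-two open-addressed table with hash-only comparison
// and linear probing, as described above.
const SLOTS: usize = 1 << 16;

struct Table {
    slots: Vec<Option<(u64, i64)>>, // (cached hash, running sum)
}

impl Table {
    fn new() -> Self {
        Table { slots: vec![None; SLOTS] }
    }

    fn update(&mut self, hash: u64, value: i64) {
        let mask = SLOTS - 1; // power of two: `& mask` replaces `% SLOTS`
        let mut idx = (hash as usize) & mask;
        loop {
            match &mut self.slots[idx] {
                // Hash-only comparison: no byte-wise name check.
                Some((h, sum)) if *h == hash => { *sum += value; return; }
                slot @ None => { *slot = Some((hash, value)); return; }
                _ => idx = (idx + 1) & mask, // collision: probe next slot
            }
        }
    }

    fn get(&self, hash: u64) -> Option<i64> {
        let mask = SLOTS - 1;
        let mut idx = (hash as usize) & mask;
        while let Some((h, sum)) = self.slots[idx] {
            if h == hash { return Some(sum); }
            idx = (idx + 1) & mask;
        }
        None
    }
}

fn main() {
    let mut t = Table::new();
    t.update(42, 152);
    t.update(42, -34);
    t.update(42 + SLOTS as u64, 7); // same initial slot: forces a probe
    println!("{:?} {:?}", t.get(42), t.get(42 + SLOTS as u64));
}
```

Trusting the 64-bit hash alone is a deliberate gamble: with ~10K keys the chance of an xxHash3 collision is negligible, and skipping the byte-wise name compare removes a memory access from the hottest path.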

### 3. Chunk Processing Pipeline

```mermaid
flowchart LR
    subgraph Input["Raw Bytes"]
        D["London;15.2\nParis;-3.4\nTokyo;22.1\n..."]
    end

    subgraph Parse["Parsing Steps"]
        S1["1. memchr(';') - NEON SIMD"]
        S2["2. Extract name slice"]
        S3["3. xxh3_64(name)"]
        S4["4. Parse temp (unsafe)"]
    end

    subgraph Update["Hash Table Update"]
        U["FastTable.update()<br/>Hash-only lookup"]
    end

    D --> S1 --> S2 --> S3 --> S4 --> U
```
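The pipeline can be sketched with stdlib stand-ins: `position` plays the role of memchr's SIMD search and `DefaultHasher` that of `xxh3_64` (the real crates are drop-in replacements for these calls):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Returns (name hash, temperature bytes) per line; the real loop
// feeds FastTable.update() directly instead of collecting.
fn scan_chunk(data: &[u8]) -> Vec<(u64, Vec<u8>)> {
    let mut out = Vec::new();
    for line in data.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
        // memchr(b';', line) in the real code (NEON SIMD on ARM)
        let semi = line.iter().position(|&b| b == b';').expect("missing ';'");
        let name = &line[..semi]; // zero-copy slice into the chunk
        let mut h = DefaultHasher::new(); // xxh3_64(name) in the real code
        name.hash(&mut h);
        out.push((h.finish(), line[semi + 1..].to_vec()));
    }
    out
}

fn main() {
    let rows = scan_chunk(b"London;15.2\nParis;-3.4\n");
    println!("{} rows, first temp = {:?}", rows.len(),
             String::from_utf8_lossy(&rows[0].1));
}
```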

### 4. Temperature Parsing

An optimized inline parser handles the four possible formats: `X.X`, `-X.X`, `XX.X`, `-XX.X`.

```mermaid
flowchart TD
    I["Input bytes after ';'"] --> N{"First byte == '-'?"}
    N -->|Yes| NEG["negative = true<br/>skip 1 byte"]
    N -->|No| POS["negative = false"]
    NEG --> F
    POS --> F
    F{"Second byte == '.'?"}
    F -->|Yes| X1["X.X format<br/>d0*10 + d1"]
    F -->|No| X2["XX.X format<br/>d0*100 + d1*10 + d2"]
    X1 --> BR["if neg: val = -val"]
    X2 --> BR
```
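A safe-Rust rendering of this decision tree (the real parser does the same with raw pointer arithmetic, which is where the speedup comes from):

```rust
// Returns the temperature scaled by 10, e.g. "15.2" -> 152.
fn parse_temp(b: &[u8]) -> i32 {
    // Branch 1: optional leading '-'
    let (neg, d) = if b[0] == b'-' { (true, &b[1..]) } else { (false, b) };
    // Branch 2: one or two integer digits, decided by the '.' position
    let v = if d[1] == b'.' {
        // X.X: d0*10 + d1
        (d[0] - b'0') as i32 * 10 + (d[2] - b'0') as i32
    } else {
        // XX.X: d0*100 + d1*10 + d2 (d[2] is the '.')
        (d[0] - b'0') as i32 * 100 + (d[1] - b'0') as i32 * 10 + (d[3] - b'0') as i32
    };
    if neg { -v } else { v }
}

fn main() {
    for s in ["1.5", "-3.4", "15.2", "-99.9"] {
        println!("{s} -> {}", parse_temp(s.as_bytes()));
    }
}
```

Because the challenge input guarantees exactly one decimal digit and at most two integer digits, two branches suffice and no general-purpose float parsing is needed.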

## Chunk Boundary Alignment

Each thread must process only complete lines, so chunk boundaries are aligned to newlines and no line is ever split:

```mermaid
flowchart LR
    subgraph File["Memory-Mapped File"]
        direction LR
        B1["Chunk 1 boundary"]
        B2["Chunk 2 boundary"]
        B3["Chunk 3 boundary"]
    end

    subgraph Alignment["Alignment Strategy"]
        A1["Thread 0: Start at 0"]
        A2["Thread N: Find next '\\n' after boundary"]
        A3["End: Find next '\\n' after chunk end"]
    end

    B1 --> A1
    B2 --> A2
    B3 --> A3
```
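The strategy above can be sketched as follows; the helper mirrors what the repo's `find_next_newline()` does, but the function names and signatures here are assumptions, not the actual code:

```rust
// Every boundary except the first is bumped to just past the next
// '\n', so each (start, end) range contains only whole lines.
fn align_chunks(data: &[u8], n_chunks: usize) -> Vec<(usize, usize)> {
    // Index one past the next '\n' at or after `from`, or EOF.
    let past_newline = |from: usize| {
        data[from..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(data.len(), |i| from + i + 1)
    };
    let approx = data.len() / n_chunks;
    let mut bounds = Vec::with_capacity(n_chunks);
    let mut start = 0;
    for i in 1..=n_chunks {
        let end = if i == n_chunks { data.len() } else { past_newline(i * approx) };
        bounds.push((start, end));
        start = end;
    }
    bounds
}

fn main() {
    let data = b"London;15.2\nParis;-3.4\nTokyo;22.1\n";
    println!("{:?}", align_chunks(data, 2)); // each end lands just past a '\n'
}
```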

## Dependencies

| Crate | Purpose |
| --- | --- |
| `memmap2` | Memory-mapped file I/O with advisory hints |
| `rayon` | Data parallelism via work-stealing |
| `xxhash-rust` | xxHash3, a fast non-cryptographic hash |
| `memchr` | SIMD-optimized byte search (NEON on ARM) |

## Build & Run

```sh
# Build with native CPU optimizations
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Run twice; the first run warms the file cache
time ./target/release/onebrc > /dev/null
time ./target/release/onebrc > /dev/null  # use this second timing
```

## Input Format

```text
StationName;Temperature
London;15.2
Paris;-3.4
Tokyo;22.1
...
```

## Output Format

```text
{Aadorf=-99.9/0.3/99.9, Aalen=-99.9/-0.4/99.9, ...}
```

Each entry is `StationName=min/avg/max`, sorted alphabetically by station name.
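A sketch of formatting one entry from the scaled-by-10 integers (this uses plain `{:.1}` rounding; the official challenge defines its own rounding rule for the average, which a submission would need to match):

```rust
// Builds "name=min/avg/max" from integer stats; only here, at output
// time, do the scaled integers become decimal strings.
fn format_station(name: &str, min: i32, sum: i64, count: u32, max: i32) -> String {
    let avg = sum as f64 / (count as f64 * 10.0);
    format!("{name}={:.1}/{avg:.1}/{:.1}", min as f64 / 10.0, max as f64 / 10.0)
}

fn main() {
    // 15.2 and -3.4 recorded: sum = 152 + (-34) = 118, count = 2
    println!("{}", format_station("London", -34, 118, 2, 152));
}
```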

## Optimization Techniques Used

  1. Memory Mapping - Zero-copy file access via mmap
  2. Sequential Advice - Kernel read-ahead optimization
  3. Parallel Processing - Work-stealing via Rayon (14 threads on M4 Max)
  4. Custom Hash Table - 65K slots, hash-only comparison
  5. Zero-Copy Strings - Station names reference mmap directly
  6. SIMD Search - memchr uses NEON for semicolon search
  7. xxHash3 - Fast hashing optimized for short strings
  8. Inline Unsafe Parsing - Temperature parsed with raw pointer arithmetic
  9. Integer Arithmetic - Temperatures scaled by 10 to avoid floats
  10. Minimal Branching - Predictable branches in hot path

## Performance Breakdown (from profiling)

| Component | % Time | Notes |
| --- | --- | --- |
| xxHash3 hashing | ~57% | Station name hashing |
| memchr search | ~22% | Finding semicolons |
| FastTable update | ~18% | Hash table operations |
| Other | ~3% | Merge, sort, output |

## Hasher Comparison

During development, multiple hash functions were tested:

| Hasher | Time | Notes |
| --- | --- | --- |
| xxHash3 | ~1.5s | Winner, optimized for short strings |
| ahash | ~1.5s | Similar performance, more complex API |
| Highway | ~3.3s | Too much setup overhead for short strings |
| FxHash | ~3.7s | Poor distribution caused collisions |

## Code Structure

```text
src/main.rs
├── StationStats        - Per-station aggregated data
├── FastTable           - Open-addressed hash table
│   ├── new()           - Initialize with 65K slots
│   ├── update()        - Insert/update station (hash-only lookup)
│   ├── merge()         - Combine tables from parallel workers
│   └── merge_entry()   - Merge single entry
├── main()              - Entry point, orchestrates parallel processing
├── find_next_newline() - Chunk boundary alignment
└── process_chunk()     - Core parsing loop (memchr + xxHash3 + parse)
```

## Future Optimizations

- SWAR (SIMD Within A Register) for semicolon search
- Compute the hash while scanning for the semicolon (overlap the work)
- Prefetching hints for hash table access
- Custom allocator for hash table entries
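The first idea can be sketched with the classic "has-zero-byte" bit trick: XOR a word against a semicolon-filled pattern, then test all eight bytes for zero at once (a sketch of the technique, not code from this repo):

```rust
// Returns the byte index (little-endian) of the first ';' in an
// 8-byte word, or None, without examining bytes one at a time.
fn swar_find_semicolon(word: u64) -> Option<u32> {
    const ONES: u64 = 0x0101_0101_0101_0101;
    const HIGH: u64 = 0x8080_8080_8080_8080;
    let x = word ^ (ONES * b';' as u64);        // bytes equal to ';' become 0
    let hit = x.wrapping_sub(ONES) & !x & HIGH; // high bit set on zero bytes
    if hit == 0 {
        None
    } else {
        Some(hit.trailing_zeros() / 8) // lowest flagged byte = first match
    }
}

fn main() {
    let w = u64::from_le_bytes(*b"abc;defg");
    println!("{:?}", swar_find_semicolon(w));
    println!("{:?}", swar_find_semicolon(u64::from_le_bytes(*b"abcdefgh")));
}
```

On ARM this is unlikely to beat memchr's NEON path for long names, but it avoids function-call overhead for the short-name common case.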
