Rust inference engine for 1-bit BitNet large language models — memory-safe, cross-validated against the C++ reference, with SIMD/CUDA acceleration.
```bash
# 1. Download a model
cargo run -p xtask -- download-model --id microsoft/bitnet-b1.58-2B-4T-gguf

# 2. Run inference
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
  --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json \
  --prompt "What is 2+2?" \
  --max-tokens 8

# 3. Deterministic benchmark + receipt verification
BITNET_DETERMINISTIC=1 BITNET_SEED=42 RAYON_NUM_THREADS=1 \
  cargo run -p xtask -- benchmark \
  --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokens 128
cargo run -p xtask -- verify-receipt

# 4. Interactive chat
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- chat \
  --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json
```

Always specify `--no-default-features --features cpu|gpu` — default features are empty by design.
```text
┌────────────────────────────────────────────────────────────┐
│                bitnet-cli / bitnet-server                  │
└────────────────────┬───────────────────────────────────────┘
                     │
          ┌──────────▼──────────┐
          │  bitnet-inference   │  autoregressive engine
          │ ┌────────────────┐  │
          │ │ bitnet-sampling│  │  temperature / top-k / top-p
          │ │ bitnet-prompt- │  │  chat templates (raw/instruct/llama3)
          │ │   templates    │  │
          │ │ bitnet-receipts│  │  honest-compute receipts
          │ │ bitnet-logits  │  │  logit transforms / penalties
          │ │ bitnet-        │  │  decode loop / stop criteria
          │ │   generation   │  │
          │ └────────────────┘  │
          └──────────┬──────────┘
                     │
     ┌───────────────▼─────────────────┐
     │  bitnet-models                  │  GGUF loading, transformer
     │  ┌──────────────────────────┐   │
     │  │ bitnet-quantization      │   │  I2_S / TL1 / TL2 / IQ2_S
     │  │ bitnet-kernels (SIMD)    │   │  AVX2 / AVX-512 / NEON / CUDA
     │  │ bitnet-gguf              │   │  GGUF parser (fuzz-tested)
     │  └──────────────────────────┘   │
     └───────────────┬─────────────────┘
                     │
     ┌───────────────▼──────────────────┐
     │  bitnet-tokenizers   │  universal tokenizer + auto-discovery
     │  bitnet-device-probe │  OS/GPU probing + capability snapshot
     │  bitnet-engine-core  │  session / orchestration contracts
     └──────────────────────────────────┘
```
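The quantization crates pack ternary weights into 2-bit codes. A minimal sketch of the idea (illustrative only; the real I2_S/TL1/TL2 layouts also carry per-block scales and use their own code assignments):

```rust
/// Pack ternary weights {-1, 0, +1} into 2-bit codes, four per byte.
/// The code assignment here is arbitrary: -1 -> 0b00, 0 -> 0b01, +1 -> 0b10.
fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    weights
        .chunks(4)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u8, |byte, (i, &w)| {
                let code = (w + 1) as u8; // -1/0/+1 -> 0/1/2
                byte | (code << (2 * i))
            })
        })
        .collect()
}

/// Recover the first `n` ternary weights from the packed buffer.
fn unpack_ternary(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| ((packed[i / 4] >> (2 * (i % 4))) & 0b11) as i8 - 1)
        .collect()
}
```

At 2 bits per weight, a 2B-parameter model needs roughly 0.5 GB of weight storage, which is what makes CPU inference practical.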
| Feature | Status | Notes |
|---|---|---|
| CPU inference — I2_S QK256 | ✅ | Scalar kernels (~0.1 tok/s on 2B); AVX2 foundation merged |
| CPU inference — I2_S BitNet32 | ✅ | Production path, 10-20× faster than QK256 scalar |
| GPU inference — CUDA | 🚧 | Implemented; receipt validation pending |
| Interactive chat (REPL) | ✅ | /help, /clear, /metrics, auto-template detection |
| Cross-validation vs C++ | ✅ | Cosine similarity > 0.99, per-token comparison |
| Receipt / honest-compute | ✅ | Schema v1.0.0, 8 validation gates |
| Strict mode | ✅ | Runtime guards prevent mock fallback |
| SafeTensors → GGUF export | ✅ | bitnet-st2gguf with F16 LayerNorm preservation |
| Backend selection + reporting | ✅ | requested=X detected=[…] selected=Y at startup |
| CPU golden path E2E tests | ✅ | 5 deterministic tests, always-on in PR CI |
| Server / HTTP API | 🚧 | Health endpoints wired; serving endpoints have TODOs |
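The startup report `requested=X detected=[…] selected=Y` suggests a simple resolution rule. A hypothetical sketch (the names `Backend` and `select_backend` are illustrative, not the crate's actual API):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Cpu,
    Cuda,
}

/// Resolve the backend to run on: honor an explicit request only if the
/// probe detected it; otherwise fall back to CPU, which is always available.
fn select_backend(requested: Option<Backend>, detected: &[Backend]) -> Backend {
    match requested {
        Some(b) if detected.contains(&b) => b,
        _ => Backend::Cpu,
    }
}
```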
```bash
# CPU (recommended for development)
cargo build --no-default-features --features cpu

# CPU — release + native SIMD
RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
  cargo build --release --no-default-features --features cpu,full-cli

# GPU (requires CUDA 12.x)
cargo build --no-default-features --features gpu

# Nix (reproducible, identical to CI)
nix develop
nix build .#bitnet-cli
nix flake check
```

```bash
# All tests (nextest recommended — 5 min timeout)
cargo nextest run --workspace --no-default-features --features cpu

# CI profile (4 threads, no retries)
cargo nextest run --profile ci

# GGUF fixture tests
cargo test -p bitnet-models --test qk256_dual_flavor_tests --no-default-features --features fixtures

# Skip slow QK256 scalar tests
BITNET_SKIP_SLOW_TESTS=1 cargo nextest run \
  --workspace --no-default-features --features cpu
```

Organised by Diátaxis:
| Section | Contents |
|---|---|
| Tutorials | Getting started, first inference, tokenizer discovery |
| How-to | Install, run inference, export GGUF, cross-validate, validate models |
| Explanation | Architecture, quantization formats, dual-backend, features |
| Reference | CLI flags, environment variables, API, quantization support |
- Quickstart
- Environment variables
- GPU setup
- C++ cross-validation setup
- Quantization support
- Validation gates
- QK256 Usage Guide — GGML I2_S QK256 format with 256-element blocks and `--strict-loader` validation
- Dual I2_S Flavor Architecture — how bitnet-rs differentiates between I2_S format variants
bitnet-rs uses "honest-compute" receipts to verify real inference (no mock fallback).
```bash
# Run benchmark and write receipt
cargo run -p xtask -- benchmark \
  --model models/model.gguf --tokens 128

# Verify receipt against quality gates
cargo run -p xtask -- verify-receipt

# Strict mode — fail on suspicious LN weights (exit code 8)
BITNET_STRICT_MODE=1 cargo run -p xtask -- verify-receipt
```

Receipt JSON schema (v1.0.0):

```json
{
  "version": "1.0.0",
  "compute_path": "real",
  "kernels": ["i2s_cpu_avx2"],
  "tokens_per_sec": 0.1,
  "success": true
}
```

Key environment variables:
| Variable | Purpose |
|---|---|
| `BITNET_DETERMINISTIC` | Enable deterministic inference |
| `BITNET_SEED` | Random seed for reproducibility |
| `RAYON_NUM_THREADS` | Worker thread count (1 = single-threaded) |
| `BITNET_STRICT_MODE` | Fail on validation warnings |
Kernel ID hygiene: all kernel IDs must be non-empty strings ≤ 128 chars. See baselines/ for reference receipts.
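The kernel-ID rule above takes only a few lines to check. A sketch (assuming the 128-char limit is counted in bytes; `kernel_ids_valid` is not the actual verifier's name):

```rust
/// Check receipt kernel-ID hygiene: every ID must be a non-empty
/// string of at most 128 bytes.
fn kernel_ids_valid(ids: &[&str]) -> bool {
    ids.iter().all(|id| !id.is_empty() && id.len() <= 128)
}
```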
See CONTRIBUTING.md. Issues and pull requests welcome.
```bash
# Format + lint
cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings

# Run tests before pushing
cargo nextest run --workspace --no-default-features --features cpu
```

Dual-licensed under MIT and Apache 2.0.