A high-performance C++ tokenizer engine that uses Branching Entropy to outperform BPE in morphological segmentation speed and accuracy.
RSPA is a departure from standard frequency-based tokenization methods such as BPE and WordPiece. Instead of greedily merging frequent character pairs, RSPA analyzes the information entropy of character transitions to discover statistically meaningful cut points in text.
| Metric | GPT-4 (TikToken) | RSPA (C++ Engine) | Difference |
|---|---|---|---|
| Initialization (Cold Start) | ~14.6 ms | ~0.15 ms | 97x Faster |
| Inference Time (per word) | ~0.02 ms | ~0.008 ms | 2.5x Faster |
| Training Throughput | ~20 MB/s | ~400 MB/s | 20x Faster |
| Memory Architecture | HashMap / Trees | Flat Arrays (LCP) | Low RAM Usage |
Most modern language models rely on Byte Pair Encoding (BPE). BPE operates by merging the most frequent adjacent character pairs. While effective, this approach is blind to linguistic structure and often breaks morphological roots.
Example:
unremarkable → unr + emark + able (BPE-style)
un + remark + able (Linguistically correct)
RSPA (Recursive Sub-word Pruning Algorithm) treats tokenization as a signal processing problem, using Shannon Entropy to identify semantic boundaries.
The corpus is streamed into a specialized C++ trie built on flat memory buffers. This enables processing of gigabytes of text with near-constant RAM usage.
At each trie node, the entropy of outgoing character transitions is computed.
- Low Entropy: If a prefix is almost always followed by the same continuation, uncertainty is minimal. The prefix should be merged.
- High Entropy: If a prefix is followed by many different continuations, uncertainty spikes. This indicates a semantic boundary and triggers a cut.
Example:
semi → conductor(low entropy → merge)semi → automatic, final, circle, colon(high entropy → cut)
Unlike greedy tokenizers, RSPA uses a Viterbi pathfinding algorithm during inference.
Each candidate token is assigned a cost:
Cost = -log(P(token))
The tokenizer selects the sequence of tokens that minimizes the total cost across the sentence.
Result:
unhappiness → un + happiness (optimal path)
unh + app + iness (greedy failure)
The engine is implemented in pure C++17 with pybind11 bindings for seamless Python integration.
- Python 3.x
- C++ compiler (GCC, Clang, or MSVC)
- pybind11
RSPA compiles directly into a Python shared library (.so or .pyd).
# 1. Install dependencies
pip install pybind11
# 2. Compile the high-performance extensions
python3 setup.py build_ext --inplaceThis produces two binaries:
rspa_builder— the training engine with streaming I/Orspa_inference— the Viterbi tokenizer
Train a vocabulary on large datasets (e.g., WikiText-103) without exhausting system memory.
import rspa_builder
trainer = rspa_builder.RPSA_Builder()
# Stream text directly from disk
trainer.train_from_file("wikitext-103/wiki.train.tokens")
# Export the optimized vocabulary
trainer.saveVocabulary("vocab_wiki.json")Load the trained vocabulary and tokenize text using the Viterbi engine.
import rspa_inference
tokenizer = rspa_inference.RPSA_Inference("vocab_wiki.json")
text = "Unremarkable disagreements in the semiconductor industry."
tokens = tokenizer.tokenize(text)
print(tokens)Output:
['Un', 'remarkable', ' ', 'disagreement', 's', ' ', 'in', ' ', 'the', ' ', 'semiconductor', ' ', 'industry', '.']All experiments were conducted on the WikiText-103 dataset using a single CPU core. Training and inference were executed through the C++ engine with Python bindings, ensuring no Python-side loops during critical paths.
- Dataset Size: 515.59 MB (
train.csv) - Total Lines Processed: ~1.8 million
- Training Time: 6.56 seconds
- Throughput: 78.64 MB/s
- Vocabulary Size: 701,006 trie nodes
- Export Format: JSON (
vocab_wiki.json)
The training pipeline streams data directly from disk into a flat-memory trie, avoiding full corpus loading and maintaining stable memory usage throughout.
Comparison between RSPA (Viterbi-based inference) and GPT-4 tokenizer (TikToken-style BPE) using a high-vocabulary setting.
Input: The semiconductor industry relies on photolithography.
-
RSPA (13 tokens):
['The', ' ', 'semiconductor', ' ', 'industry', ' ', 'relies', ' ', 'on', ' ', 'photo', 'lithography', '.'] -
GPT-4 (9 tokens):
['The', ' semiconductor', ' industry', ' relies', ' on', ' phot', 'olith', 'ography', '.']
Input: Hyperactive hypervisors virtualize the infrastructure.
-
RSPA (13 tokens):
['Hyper', 'active', ' ', 'hyper', 'visors', ' ', 'virtualiz', 'e', ' ', 'the', ' ', 'infrastructure', '.'] -
GPT-4 (10 tokens):
['Hyper', 'active', ' hyp', 'erv', 'isors', ' virtual', 'ize', ' the', ' infrastructure', '.']
Input: Artificial intelligence revolutionizes data processing.
-
RSPA (11 tokens):
['Artificial', ' ', 'intelligence', ' ', 'revolutionize', 's', ' ', 'data', ' ', 'processing', '.'] -
GPT-4 (8 tokens):
['Art', 'ificial', ' intelligence', ' revolution', 'izes', ' data', ' processing', '.']
Input: The electromagnetic spectrum contains ultraviolet light.
-
RSPA (12 tokens):
['The', ' ', 'electromagnetic', ' ', 'spectrum', ' ', 'contains', ' ', 'ultraviolet', ' ', 'light', '.'] -
GPT-4 (8 tokens):
['The', ' electromagnetic', ' spectrum', ' contains', ' ultr', 'aviolet', ' light', '.']
Input: Microservices architecture improves scalability.
-
RSPA (9 tokens):
['Micro', 'services', ' ', 'architecture', ' ', 'improves', ' ', 'scalability', '.'] -
GPT-4 (6 tokens):
['Micro', 'services', ' architecture', ' improves', ' scalability', '.']
- Total Tokens (RSPA): 58
- Total Tokens (GPT-4): 41
- Token Difference: +17 tokens (RSPA produces finer-grained morphological splits)
- RSPA: 1.0055 ms
- GPT-4: 8.9405 ms
Despite producing more linguistically meaningful tokens, RSPA achieves significantly lower total inference time due to flat-memory traversal and optimized Viterbi decoding.
RSPA is an open research project. Contributions are welcome in the following areas:
- Rust Port: Porting the inference engine to Rust for safety and parallelism
- GPU Acceleration: CUDA kernels for batched tokenization
- HuggingFace Integration: Wrapping RSPA as a custom
PreTrainedTokenizer
MIT Open Source