A high-performance C++ tokenizer using Recursive Sub-word Pruning (RSPA). 38x faster than OpenAI's tiktoken (the tokenizer used by GPT-4), with better morphological segmentation.
nlp tokenizer machine-learning-algorithms trie cpp17 nlp-machine-learning tokenization pybind11 rspa
Updated Jan 17, 2026 - Python