Add tokenizer training CLI and vocabulary persistence by MASSIVEMAGNETICS · Pull Request #12 · MASSIVEMAGNETICS/victor_llm

MASSIVEMAGNETICS · 2025-12-28T11:34:03Z

Provide a simple way to build and persist the tokenizer vocabulary from user text corpora for reproducible tokenization.
Allow the FractalTokenKernel_v1_1_0 to save and restore learned vocabularies across runs.
Make it easy to train the tokenizer from a single file or a directory of .txt files via a small CLI.
Document the tokenizer training workflow in the repository README for discoverability.

Add save_vocabulary and load_vocabulary methods to FractalTokenKernel_v1_1_0 to persist vocabulary, token_counts, and next_token_id as JSON.
Introduce a new CLI script victor_core/train_tokenizer.py that implements _load_corpus, trains the tokenizer, and writes the vocabulary JSON using tokenizer.save_vocabulary.
Update README.md with a usage example showing python -m victor_core.train_tokenizer --input path/to/corpus --output models/tokenizer_vocab.json.
Use pathlib.Path for file handling and write the vocabulary as JSON so the files stay human-readable.

Add tokenizer training CLI

88b4e72

MASSIVEMAGNETICS added the codex label Dec 28, 2025 — with ChatGPT Codex Connector