Add tokenizer training CLI and vocabulary persistence #12

Open
MASSIVEMAGNETICS wants to merge 1 commit into main from codex/train-model-for-victor
@MASSIVEMAGNETICS (Owner)

Motivation

  • Provide a simple way to build and persist the tokenizer vocabulary from user text corpora for reproducible tokenization.
  • Allow the FractalTokenKernel_v1_1_0 to save and restore learned vocabularies across runs.
  • Make it easy to train the tokenizer from a single file or a directory of .txt files via a small CLI.
  • Document the tokenizer training workflow in the repository README for discoverability.

Description

  • Add save_vocabulary and load_vocabulary methods to FractalTokenKernel_v1_1_0 to persist vocabulary, token_counts, and next_token_id as JSON.
  • Introduce a new CLI script victor_core/train_tokenizer.py that implements _load_corpus, trains the tokenizer, and writes the vocabulary JSON using tokenizer.save_vocabulary.
  • Update README.md with a usage example showing python -m victor_core.train_tokenizer --input path/to/corpus --output models/tokenizer_vocab.json.
  • Use pathlib.Path and JSON formatting for robust file handling and readable vocabulary files.
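
The persistence and corpus-loading steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SketchTokenizer` is a hypothetical stand-in for `FractalTokenKernel_v1_1_0`, its whitespace-based `train` method is an assumption, and `load_corpus` only approximates what `_load_corpus` is described as doing. The only details taken from the description are the three persisted fields (`vocabulary`, `token_counts`, `next_token_id`), the use of `pathlib.Path`, and JSON as the on-disk format.

```python
import json
from pathlib import Path


class SketchTokenizer:
    """Hypothetical stand-in for FractalTokenKernel_v1_1_0 (not the real class)."""

    def __init__(self):
        self.vocabulary = {}    # token -> integer id
        self.token_counts = {}  # token -> occurrences seen during training
        self.next_token_id = 0

    def train(self, text):
        # Assumption: plain whitespace splitting; the real kernel may differ.
        for token in text.split():
            if token not in self.vocabulary:
                self.vocabulary[token] = self.next_token_id
                self.next_token_id += 1
            self.token_counts[token] = self.token_counts.get(token, 0) + 1

    def save_vocabulary(self, path):
        path = Path(path)
        # Create the output directory (e.g. models/) if it does not exist yet.
        path.parent.mkdir(parents=True, exist_ok=True)
        state = {
            "vocabulary": self.vocabulary,
            "token_counts": self.token_counts,
            "next_token_id": self.next_token_id,
        }
        # Indented JSON keeps the file readable and diff-friendly.
        path.write_text(
            json.dumps(state, indent=2, ensure_ascii=False), encoding="utf-8"
        )

    def load_vocabulary(self, path):
        state = json.loads(Path(path).read_text(encoding="utf-8"))
        self.vocabulary = state["vocabulary"]
        self.token_counts = state["token_counts"]
        self.next_token_id = state["next_token_id"]


def load_corpus(input_path):
    """Read one file, or every .txt file directly inside a directory."""
    input_path = Path(input_path)
    if input_path.is_dir():
        # Sorting makes the token id assignment reproducible across runs.
        files = sorted(input_path.glob("*.txt"))
    else:
        files = [input_path]
    return "\n".join(f.read_text(encoding="utf-8") for f in files)
```

One design point worth checking in the real implementation: JSON object keys are always strings, so a mapping keyed by integer ids would come back with string keys after `load_vocabulary` unless it is converted explicitly. The sketch avoids this by keying every mapping on the token text.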

Testing

  • No automated tests were run for this change.
