
Provide canonical information for llama tokenizer #1380

Open
kheiss-uwzoo wants to merge 10 commits into NVIDIA:main from kheiss-uwzoo:kheiss/canon
@kheiss-uwzoo
Collaborator

Summary

Consolidates all Llama tokenizer documentation into one canonical section and replaces duplicate explanations with links, so future changes (model ID, Hugging Face token, pre‑download behavior) only need to be updated in a single place.

Changes

  • Added a dedicated “Llama tokenizer (default)” section in chunking.md that documents:

      • The default tokenizer ID (meta-llama/Llama-3.2-1B) and why it’s recommended.

      • The requirement for a Hugging Face access token.

      • Pre‑download behavior and build‑time configuration (DOWNLOAD_LLAMA_TOKENIZER, HF_ACCESS_TOKEN), with a link to Environment Variables for details.

  • Updated the token‑based splitting intro to point to this section instead of repeating the same content.

  • Merged the former “Pre-download the Tokenizer” subsection into the new canonical section.

  • Converted other mentions into link‑only references (no duplicated text):

      • environment-config.md now links to “Llama tokenizer (default)” for DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN.

      • releasenotes-nv-ingest.md release note about the tokenizer being pre‑downloaded now links to the canonical section.

      • nv-ingest-python-api.md “Extract Audio” example notes that it uses the default tokenizer and links to the canonical section; code and API usage are unchanged.
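
For readers who want to see how the pieces named above fit together, here is a minimal sketch, not the nv-ingest implementation. It collects the default tokenizer ID and the two variables mentioned in this PR, then loads the tokenizer with Hugging Face `transformers`. The helper names `tokenizer_settings` and `load_tokenizer` are hypothetical, and the set of values accepted for `DOWNLOAD_LLAMA_TOKENIZER` is an assumption; the Environment Variables page remains the authority on actual behavior.

```python
import os

# Default tokenizer ID as stated in this PR's description.
DEFAULT_TOKENIZER_ID = "meta-llama/Llama-3.2-1B"


def tokenizer_settings(env=None):
    """Collect tokenizer settings from environment-style variables.

    Assumption: DOWNLOAD_LLAMA_TOKENIZER is a boolean-like string. The exact
    values nv-ingest accepts may differ from the ones checked here.
    """
    env = os.environ if env is None else env
    flag = env.get("DOWNLOAD_LLAMA_TOKENIZER", "").strip().lower()
    token = env.get("HF_ACCESS_TOKEN") or None
    return {
        "model_id": DEFAULT_TOKENIZER_ID,
        "pre_download": flag in {"1", "true", "yes"},
        "token": token,
    }


def load_tokenizer(settings):
    """Load the tokenizer with transformers (requires network access)."""
    # Imported lazily so the settings helper works without transformers installed.
    from transformers import AutoTokenizer

    if settings["token"] is None:
        raise RuntimeError("HF_ACCESS_TOKEN is required to fetch the Llama tokenizer")
    return AutoTokenizer.from_pretrained(settings["model_id"], token=settings["token"])
```

Keeping the settings lookup separate from the download step means the configuration can be checked without network access or a valid token.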

Files changed

  • extraction/chunking.md

  • extraction/environment-config.md

  • extraction/releasenotes-nv-ingest.md

  • extraction/nv-ingest-python-api.md

Rationale

Reduces the risk of the docs getting out of sync by having a single authoritative description of the Llama tokenizer, with other pages linking to it instead of duplicating the same information.

@kheiss-uwzoo kheiss-uwzoo requested a review from a team as a code owner February 5, 2026 21:06
@kheiss-uwzoo kheiss-uwzoo added the doc Improvements or additions to documentation label Feb 5, 2026

    )
    ```

    ### Llama tokenizer (default) {#llama-tokenizer-default}

Collaborator

was this "{#llama-tokenizer-default}" added intentionally?

Collaborator Author

yes

Collaborator

If you intend that to be a section anchor, I don't think it will work across both GitHub and docs.nvidia.com

@kheiss-uwzoo
Collaborator Author

kheiss-uwzoo commented Feb 5, 2026 via email
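
The anchor concern in the review thread can be illustrated with a rough sketch of how GitHub derives heading anchors. The function `github_slug` below is hypothetical and only approximates GitHub's behavior (lowercase, drop punctuation other than hyphens, turn spaces into hyphens); it ignores duplicate-heading suffixes and other edge cases. It also assumes GitHub renders the `{#...}` suffix as literal heading text, which is the situation the reviewer appears to be describing. How docs.nvidia.com handles the suffix is not stated in this PR.

```python
import re


def github_slug(heading: str) -> str:
    """Approximate the anchor GitHub generates for a Markdown heading."""
    text = heading.strip().lower()
    # Drop everything except word characters, hyphens, and spaces.
    text = re.sub(r"[^\w\- ]", "", text)
    # Each space becomes a hyphen.
    return text.replace(" ", "-")


plain = github_slug("Llama tokenizer (default)")
with_attr = github_slug("Llama tokenizer (default) {#llama-tokenizer-default}")
print(plain)
print(with_attr)
```

Under this approximation the plain heading already yields `llama-tokenizer-default`, while the heading with the explicit `{#...}` suffix yields a longer, different anchor on any renderer that treats the suffix as text. Links written against one form would therefore not resolve on the other.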

kheiss-uwzoo and others added 8 commits February 5, 2026 15:08
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Co-authored-by: nkmcalli <nkmcalli@yahoo.com>
Merge remote-tracking branch 'upstream/main' into kheiss/canon
Merge branch 'kheiss/canon' of github.com:kheiss-uwzoo/nv-ingest into kheiss/canon
Labels

doc Improvements or additions to documentation

3 participants