Provide canonical information for llama tokenizer by kheiss-uwzoo · Pull Request #1380 · NVIDIA/nv-ingest

kheiss-uwzoo · 2026-02-05T21:06:38Z

Summary

Consolidates all Llama tokenizer documentation into one canonical section and replaces duplicate explanations with links, so future changes (model ID, Hugging Face token, pre‑download behavior) only need to be made in one place.

Changes

- Added a dedicated “Llama tokenizer (default)” section in chunking.md that documents:
  - The default tokenizer ID (meta-llama/Llama-3.2-1B) and why it’s recommended.
  - The requirement for a Hugging Face access token.
  - Pre‑download behavior and build‑time configuration (DOWNLOAD_LLAMA_TOKENIZER, HF_ACCESS_TOKEN), with a link to Environment Variables for details.
- Updated the token‑based splitting intro to point to this section instead of repeating the same content.
- Merged the former “Pre-download the Tokenizer” subsection into the new canonical section.
- Converted other mentions into link‑only references (no duplicated text):
  - environment-config.md now links to “Llama tokenizer (default)” for DOWNLOAD_LLAMA_TOKENIZER and HF_ACCESS_TOKEN.
  - releasenotes-nv-ingest.md release note about the tokenizer being pre‑downloaded now links to the canonical section.
  - nv-ingest-python-api.md “Extract Audio” example notes that it uses the default tokenizer and links to the canonical section; code and API usage are unchanged.
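
For readers unfamiliar with the two build-time variables named above, the following is a minimal sketch of how such settings could be read and sanity-checked. It is not taken from nv-ingest: the helper name `tokenizer_build_settings`, the assumption that the flag is enabled by a case-insensitive `true`, and the rule that pre-download needs a non-empty token are all illustrative assumptions. Only the variable names and the default tokenizer ID come from the PR description.

```python
import os

# Default tokenizer ID as stated in the PR description.
DEFAULT_TOKENIZER_ID = "meta-llama/Llama-3.2-1B"


def tokenizer_build_settings(env=None):
    """Collect tokenizer-related settings from an environment mapping.

    Hypothetical helper for illustration only; not part of nv-ingest.
    """
    env = os.environ if env is None else env
    token = env.get("HF_ACCESS_TOKEN", "").strip()
    # Assumption: the flag counts as enabled when it equals "true" (any case).
    predownload = env.get("DOWNLOAD_LLAMA_TOKENIZER", "").strip().lower() == "true"
    # Assumption: pre-downloading the tokenizer requires an access token.
    if predownload and not token:
        raise ValueError(
            "DOWNLOAD_LLAMA_TOKENIZER is enabled but HF_ACCESS_TOKEN is empty"
        )
    return {
        "tokenizer_id": DEFAULT_TOKENIZER_ID,
        "predownload": predownload,
        "has_token": bool(token),
    }


# Usage with an explicit mapping instead of the real process environment.
example_env = {"DOWNLOAD_LLAMA_TOKENIZER": "True", "HF_ACCESS_TOKEN": "hf_example"}
print(tokenizer_build_settings(example_env))
```

Passing a mapping rather than reading `os.environ` directly keeps the sketch easy to exercise without touching the real environment.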

Files changed

extraction/chunking.md
extraction/environment-config.md
extraction/releasenotes-nv-ingest.md
extraction/nv-ingest-python-api.md

Rationale

Reduces the risk of the docs getting out of sync by having a single authoritative description of the Llama tokenizer, with other pages linking to it instead of duplicating the same information.

sosahi · 2026-02-05T21:11:56Z

docs/docs/extraction/chunking.md

 )
 ```

+### Llama tokenizer (default) {#llama-tokenizer-default}


Was this "{#llama-tokenizer-default}" added intentionally?

If you intend that to be a section anchor, I don't think it will work across both GitHub and docs.nvidia.com

kheiss-uwzoo · 2026-02-05T21:13:19Z

Yes - should it be removed?


nkmcalli · 2026-02-05T23:03:51Z

docs/docs/extraction/chunking.md

 )
 ```

+### Llama tokenizer (default) {#llama-tokenizer-default}


If you intend that to be a section anchor, I don't think it will work across both GitHub and docs.nvidia.com
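
To make the anchor concern concrete, here is a rough sketch of how a heading-to-anchor slug is commonly derived: lowercase the text, drop punctuation, and replace whitespace with hyphens. This only approximates the kind of rule that renderers apply and was not verified against GitHub or docs.nvidia.com; the function name `heading_slug` is hypothetical. Under this approximation, the plain heading already yields `llama-tokenizer-default`, so an explicit `{#...}` attribute may be unnecessary.

```python
import re


def heading_slug(heading: str) -> str:
    """Approximate an auto-generated heading anchor (illustrative only)."""
    text = heading.strip().lower()
    # Drop everything except word characters, whitespace, and hyphens.
    text = re.sub(r"[^\w\s-]", "", text)
    # Replace each whitespace character with a hyphen.
    return re.sub(r"\s", "-", text)


print(heading_slug("Llama tokenizer (default)"))  # prints llama-tokenizer-default
```

If a renderer treats the `{#llama-tokenizer-default}` suffix as literal heading text, the same rule would fold it into the slug, which is one way links could diverge between the two sites.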

provide canonical information for llama tokenizer

7516353

kheiss-uwzoo requested a review from a team as a code owner February 5, 2026 21:06

kheiss-uwzoo requested a review from jioffe502 February 5, 2026 21:06

kheiss-uwzoo added the doc label (Improvements or additions to documentation) Feb 5, 2026

kheiss-uwzoo requested review from nkmcalli and sosahi February 5, 2026 21:07

Merge branch 'main' into kheiss/canon

23fcf72

sosahi reviewed Feb 5, 2026

sosahi approved these changes Feb 5, 2026

nkmcalli requested changes Feb 5, 2026

kheiss-uwzoo and others added 8 commits February 5, 2026 15:08

Update docs/docs/extraction/chunking.md

91a67ae

Co-authored-by: nkmcalli <nkmcalli@yahoo.com>

Update docs/docs/extraction/chunking.md

bc3fc6c

Co-authored-by: nkmcalli <nkmcalli@yahoo.com>

Update docs/docs/extraction/chunking.md

f4ecf8b

Co-authored-by: nkmcalli <nkmcalli@yahoo.com>

Update docs/docs/extraction/chunking.md

392ea2d

Co-authored-by: nkmcalli <nkmcalli@yahoo.com>

getting ready to stage build

fe97433

Merge remote-tracking branch 'upstream/main' into kheiss/canon

latest files

86afe53

Merge branch 'kheiss/canon' of github.com:kheiss-uwzoo/nv-ingest into kheiss/canon

updating files

89a91ac

Merge remote-tracking branch 'upstream/main' into kheiss/canon

bfafaea

kheiss-uwzoo wants to merge 10 commits into NVIDIA:main from kheiss-uwzoo:kheiss/canon


3 participants
