Misc packing improvements #5189

Open
mariosasko wants to merge 5 commits into huggingface:main from mariosasko:vectorized-bfd-chunking

@mariosasko mariosasko commented Feb 26, 2026

What does this PR do?

This PR improves the packing logic to make it faster, less error-prone, and easier to read.

The main changes are:

  • Replacing the AI-generated BFD splitting (a.k.a. "requeuing") logic from "Preserve truncated tokens in BFD packing" (#4632)
    with a vectorized implementation that is significantly shorter and 30% faster.

  • Updating pack_examples to restore the input dataset’s format and perform proper input validation.

  • Applying a minor optimization to the wrapped packing implementation by reusing the offsets across all packed columns.

  • Aligning the naming with the literature (e.g., requeue → split) while preserving backward compatibility.
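
For reviewers less familiar with the two strategies, here is a rough plain-Python sketch of the ideas behind the first and third bullets. It is not the vectorized code in this PR: `split_lengths`, `best_fit_decreasing`, and `wrapped_pack` are hypothetical names made up for illustration, and the sketch assumes that "splitting" means cutting over-long sequences into `seq_length`-sized fragments that are then packed like any other sequence.

```python
from itertools import chain


def split_lengths(lengths, seq_length):
    """Cut every sequence longer than seq_length into fragments (illustrative only)."""
    fragments = []
    for n in lengths:
        full, rest = divmod(n, seq_length)
        fragments.extend([seq_length] * full)  # full-size chunks
        if rest:
            fragments.append(rest)  # leftover tokens are kept, not dropped
    return fragments


def best_fit_decreasing(lengths, seq_length):
    """Place each fragment into the fullest bin that still has room for it."""
    bins = []  # each entry is [remaining_capacity, [fragment lengths]]
    for n in sorted(lengths, reverse=True):
        candidates = [b for b in bins if b[0] >= n]
        if candidates:
            best = min(candidates, key=lambda b: b[0])
        else:
            best = [seq_length, []]
            bins.append(best)
        best[0] -= n
        best[1].append(n)
    return [b[1] for b in bins]


def wrapped_pack(columns, seq_length):
    """Concatenate each column and cut it at shared offsets (illustrative only)."""
    names = list(columns)
    # Assumption: all columns have matching per-example lengths, so offsets
    # derived from the first column are valid for every other column.
    total = sum(len(seq) for seq in columns[names[0]])
    offsets = list(range(0, total, seq_length))  # computed once, reused below
    packed = {}
    for name in names:
        flat = list(chain.from_iterable(columns[name]))
        packed[name] = [flat[start:start + seq_length] for start in offsets]
    return packed


fragments = split_lengths([10, 3, 5], seq_length=4)
bfd_bins = best_fit_decreasing(fragments, seq_length=4)

batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
packed = wrapped_pack(batch, seq_length=4)
```

In this toy run, `fragments` is `[4, 4, 2, 3, 4, 1]`, `bfd_bins` is `[[4], [4], [4], [3, 1], [2]]`, and `packed["input_ids"]` is `[[1, 2, 3, 4], [5, 6, 7, 8], [9]]`. The wrapped variant can ignore example boundaries, which is why one set of offsets is enough for every column.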

P.S. The recent Qwen3-Coder-Next technical report includes a nice comparison of packing techniques, which would be a good addition to the docs. However, the report is not yet available on arXiv, so it cannot be cited as an HF paper.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
