fix: run real ingestion during token estimation#199

Merged
sumukshashidhar merged 3 commits into main from fix/estimation-logic
Dec 30, 2025

Conversation

@sumukshashidhar
Collaborator

Summary

Fixes the token estimation logic to use actual document content instead of hardcoded estimates.

Changes

  • Replace the hardcoded 1K token estimate for PDFs with actual document extraction
  • Use MarkItDown (without an LLM) to extract content from PDFs, DOCX, and other formats
  • Simulate chunking to get accurate chunk counts
  • Base token estimates on real tiktoken encoding of the extracted content
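
The estimation flow above can be sketched as follows. This is a minimal, dependency-free illustration, not the actual yourbench internals: `simulate_chunking` and `estimate_tokens` are hypothetical names, and the token counter is injected as a callable (in the PR, the counter would be a real tiktoken encoding of text extracted by MarkItDown).

```python
from typing import Callable

def simulate_chunking(text: str, max_chunk_tokens: int,
                      count_tokens: Callable[[str], int]) -> list[str]:
    """Greedily pack words into chunks under a token budget.

    Hypothetical sketch: the real pipeline would use tiktoken, e.g.
    count_tokens = lambda s: len(enc.encode(s)) for a chosen encoding.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for word in text.split():
        t = count_tokens(word)
        # Flush the current chunk when the next word would exceed the budget.
        if current and current_tokens + t > max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(word)
        current_tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks

def estimate_tokens(document_text: str,
                    count_tokens: Callable[[str], int]) -> dict:
    """Estimate from real extracted content instead of a hardcoded guess."""
    chunks = simulate_chunking(document_text, max_chunk_tokens=512,
                               count_tokens=count_tokens)
    return {
        "total_tokens": count_tokens(document_text),
        "chunks": len(chunks),
    }
```

Because the counter is injected, the same chunk-simulation logic works with any tokenizer; swapping in a real tiktoken encoding yields the accurate per-stage estimates shown below.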

Before

Source Documents:
  Files: 1
  Estimated tokens: 1K  # hardcoded guess

After

Source Documents:
  Files: 1
  Estimated tokens: 33.8K  # actual extracted content

Single Hop Question Generation:
  Input: 38.8K, Output: 7.5K, Calls: 5  # based on real chunk count

Testing

  • All 91 tests pass
  • Verified with yourbench estimate example/default_example/config.yaml
  • Verified with yourbench estimate config.yaml in the harry_potter_quizz directory

sumukshashidhar merged commit d549807 into main on Dec 30, 2025
6 checks passed
