fix: run real ingestion during token estimation#199

Merged
sumukshashidhar merged 3 commits into main from fix/estimation-logic
Dec 30, 2025

Conversation

@sumukshashidhar
Collaborator

Summary

Fixes the token estimation logic to use actual document content instead of hardcoded estimates.

Changes

  • Replace the hardcoded 1K token estimate for PDFs with actual document extraction
  • Use MarkItDown (without an LLM) to extract content from PDFs, DOCX, and other formats
  • Simulate chunking to get accurate chunk counts
  • Base token estimates on real tiktoken encoding of the extracted content
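
The estimation flow above can be sketched as follows. This is a minimal, dependency-free illustration, not the actual yourbench internals: `simulate_chunking` and `estimate_tokens` are hypothetical names, and the token counter is injected as a callable (in the PR, the counter would be a real tiktoken encoding of text extracted by MarkItDown).

```python
from typing import Callable

def simulate_chunking(text: str, max_chunk_tokens: int,
                      count_tokens: Callable[[str], int]) -> list[str]:
    """Greedily pack words into chunks under a token budget.

    Hypothetical sketch: the real pipeline would use tiktoken, e.g.
    count_tokens = lambda s: len(enc.encode(s)) for a chosen encoding.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for word in text.split():
        t = count_tokens(word)
        # Flush the current chunk when the next word would exceed the budget.
        if current and current_tokens + t > max_chunk_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(word)
        current_tokens += t
    if current:
        chunks.append(" ".join(current))
    return chunks

def estimate_tokens(document_text: str,
                    count_tokens: Callable[[str], int]) -> dict:
    """Estimate from real extracted content instead of a hardcoded guess."""
    chunks = simulate_chunking(document_text, max_chunk_tokens=512,
                               count_tokens=count_tokens)
    return {
        "total_tokens": count_tokens(document_text),
        "chunks": len(chunks),
    }
```

Because the counter is injected, the same chunk-simulation logic works with any tokenizer; swapping in a real tiktoken encoding yields the accurate per-stage estimates shown below.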

Before

Source Documents:
  Files: 1
  Estimated tokens: 1K  # hardcoded guess

After

Source Documents:
  Files: 1
  Estimated tokens: 33.8K  # actual extracted content

Single Hop Question Generation:
  Input: 38.8K, Output: 7.5K, Calls: 5  # based on real chunk count

Testing

  • All 91 tests pass
  • Verified with yourbench estimate example/default_example/config.yaml
  • Verified with yourbench estimate config.yaml in the harry_potter_quizz directory

sumukshashidhar merged commit d549807 into main on Dec 30, 2025
6 checks passed
