fix(chat): prevent prompt overflow when referencing knowledge documents#697

Open
2561056571 wants to merge 1 commit into wecode-ai:main from 2561056571:wegent/fix-knowledge-prompt-token-overflow

Conversation

@2561056571 (Collaborator) commented Mar 9, 2026

Summary

  • Fix "prompt is too long: 244418 tokens > 200000 maximum" error when referencing knowledge documents in notebook chat mode
  • Add content-aware token estimation that properly handles CJK (Chinese/Japanese/Korean) text instead of naive len // 4 heuristic
  • Reduce document injection threshold from 50% to 30% of context window to reserve room for system prompt, history, and dynamic context
  • Add <selected_documents> recognition to compression strategies so the compressor can truncate notebook document content when needed

Root Cause

Two bugs working together caused the prompt overflow:

  1. Inaccurate token estimation: The total_chars // 4 heuristic dramatically underestimates tokens for CJK content (Chinese characters use ~1.5 tokens/char, not 0.25). A 400k-char Chinese document is estimated at 100k tokens but actually contains 200k+ tokens, so it incorrectly passes the injection threshold.

  2. Compression blind spot: The AttachmentTruncationStrategy only recognizes <attachment>, [Attachment, etc. markers but NOT <selected_documents> tags used by notebook mode. This means injected document content is invisible to compression strategies and cannot be truncated as a safety net.
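To make the first bug concrete, here is a minimal illustration of how far the naive heuristic drifts for CJK text (the numbers mirror the description above; this is illustrative arithmetic, not project code):

```python
# Illustrative only: the naive heuristic vs. a CJK-aware rate.
doc = "文" * 400_000                  # hypothetical 400k-char Chinese document
naive_estimate = len(doc) // 4        # 100_000 -> looks safely under a 200k limit
cjk_aware = int(len(doc) * 1.5)       # 600_000 -> actually far over the limit
```

The naive estimate clears a 200k-token budget check by a factor of two while the real token count blows past it.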

Changes

  1. backend/app/services/chat/preprocessing/selected_documents.py:

    • Add _estimate_tokens() with CJK-aware token estimation (~1.5 tokens/CJK char, ~0.25 tokens/Latin char, 10% safety margin)
    • Reduce MAX_INJECTION_RATIO from 0.5 to 0.3 to leave room for other prompt components
    • Add CJK character detection regex pattern
  2. chat_shell/chat_shell/compression/strategies.py:

    • Add <selected_documents> to DOCUMENT_MARKERS list
    • Update ATTACHMENT_PATTERN regex to match <selected_documents> XML tags and # Reference Documents headers
    • Fix header extraction in _truncate_attachments_proportionally to handle both bracket-style and XML-style document formats

Test plan

  • All 1102 backend tests pass
  • All 41 compression tests pass
  • All 334 chat_shell tests pass (1 pre-existing unrelated failure excluded)
  • Verify notebook chat with large CJK knowledge documents falls back to RAG instead of direct injection
  • Verify notebook chat with smaller documents still uses direct injection correctly
  • Verify compression can truncate <selected_documents> content when prompt exceeds context limits
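The last check can be exercised with a hedged sketch like the following (the truncation helper is a stand-in for the actual strategy class, whose API is not shown here):

```python
import re

# Stand-in truncator: keeps the head and tail of a <selected_documents> block.
DOC_BLOCK = re.compile(
    r"(<selected_documents>)(.*?)(</selected_documents>)", re.DOTALL
)

def truncate_selected_documents(prompt: str, keep_chars: int) -> str:
    def repl(m):
        open_tag, body, close_tag = m.groups()
        if len(body) <= keep_chars:
            return m.group(0)  # small enough, leave untouched
        half = keep_chars // 2
        notice = "\n[... Middle content truncated ...]\n"
        # Re-emit the closing tag explicitly so the XML stays well-formed.
        return open_tag + body[:half] + notice + body[-half:] + close_tag
    return DOC_BLOCK.sub(repl, prompt)
```

Re-emitting the closing tag matters here; the review comment on strategies.py flags exactly this failure mode in the regex-based reconstruction.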

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved token counting for multilingual content in documents
    • Enhanced attachment format recognition and header extraction during truncation
  • Improvements

    • Optimized context window allocation for better document processing

The notebook chat page throws "prompt is too long" error when referencing
knowledge documents because of two issues:

1. Token estimation uses naive `len // 4` heuristic which dramatically
   underestimates tokens for CJK content (Chinese/Japanese/Korean chars
   use ~1.5 tokens per character, not 0.25). This causes large documents
   to pass the injection threshold when they shouldn't.

2. The compression system's AttachmentTruncationStrategy doesn't
   recognize `<selected_documents>` XML tags, so injected document
   content is invisible to compression strategies and cannot be
   truncated when the prompt exceeds context limits.

Changes:
- Add content-aware token estimation that handles CJK text properly
- Reduce MAX_INJECTION_RATIO from 0.5 to 0.3 to reserve room for
  system prompt, chat history, and dynamic context
- Add `<selected_documents>` to compression strategy's document markers
  so the compressor can detect and truncate notebook document content
- Update regex pattern and header extraction to handle both bracket-style
  and XML-style document formats
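The ratio change amounts to a simple injection gate, sketched below (function and parameter names are assumptions for illustration; the real logic lives in selected_documents.py):

```python
MAX_INJECTION_RATIO = 0.3  # was 0.5; reserves 70% for system prompt, history, etc.

def should_inject_directly(doc_tokens: int, context_window: int) -> bool:
    # Inject documents verbatim only when they fit the reserved budget;
    # otherwise the caller falls back to RAG retrieval.
    return doc_tokens <= int(context_window * MAX_INJECTION_RATIO)
```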

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Two backend modules were updated: token estimation now uses CJK character detection (1.5 tokens/char for CJK, 0.25 otherwise), MAX_INJECTION_RATIO was reduced from 0.5 to 0.3, and attachment pattern recognition was extended to handle <selected_documents> blocks with enhanced header extraction logic.

Changes

Token Estimation Logic — backend/app/services/chat/preprocessing/selected_documents.py
Replaced naive character-based token counting with a content-aware _estimate_tokens() function using CJK character detection. Differentiates between CJK (~1.5 tokens/char) and non-CJK (~0.25 tokens/char) content with a 10% safety margin. Reduced MAX_INJECTION_RATIO from 0.5 to 0.3 and expanded debug logging.

Attachment Pattern Recognition — chat_shell/chat_shell/compression/strategies.py
Extended ATTACHMENT_PATTERN regex and DOCUMENT_MARKERS to recognize attachments in <selected_documents> blocks alongside <attachment> tags. Improved header extraction during truncation to conditionally use bracketed or XML-style headers based on content position detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 With whiskers twitched and paws held high,
We count the tokens, CJK-style,
From East to West, each character flies,
While attachments hide no more behind disguise—
Better headers, safer margins, all the while! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately summarizes the main objective of the pull request: preventing prompt overflow when referencing knowledge documents through improved token estimation and compression handling.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
backend/app/services/chat/preprocessing/selected_documents.py (3)

226-231: Use finditer instead of findall for memory efficiency.

findall(text) creates a list containing every CJK character as a separate string. For large CJK-heavy documents (e.g., a 5MB Chinese document), this allocates millions of single-character strings just to count them.

♻️ More efficient counting with iterator
-    cjk_chars = len(_CJK_PATTERN.findall(text))
+    cjk_chars = sum(1 for _ in _CJK_PATTERN.finditer(text))
     non_cjk_chars = len(text) - cjk_chars
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
226 - 231, The current counting uses _CJK_PATTERN.findall(text) which builds a
list of every match; replace that with an iterator-based count using
_CJK_PATTERN.finditer(text) and compute cjk_chars by iterating and counting
matches (e.g., sum over the iterator) to avoid allocating millions of strings;
keep the subsequent non_cjk_chars and estimated calculations unchanged and
update the cjk_chars assignment in the block that currently defines cjk_chars,
non_cjk_chars, and estimated.

229-234: Extract magic numbers to named constants.

The token estimation ratios and safety margin are meaningful values that would benefit from named constants for clarity and maintainability.

♻️ Extract to named constants

Add near the top of the file:

# Token estimation ratios (conservative estimates for injection safety)
CJK_TOKENS_PER_CHAR = 1.5  # CJK characters typically produce 1-2 tokens
NON_CJK_TOKENS_PER_CHAR = 0.25  # ~4 ASCII/Latin characters per token
TOKEN_SAFETY_MARGIN = 1.1  # 10% buffer for tokenizer overhead

Then in _estimate_tokens:

-    estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25)
-    estimated = int(estimated * 1.1)
+    estimated = int(cjk_chars * CJK_TOKENS_PER_CHAR + non_cjk_chars * NON_CJK_TOKENS_PER_CHAR)
+    estimated = int(estimated * TOKEN_SAFETY_MARGIN)

As per coding guidelines, "Extract magic numbers to named constants in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
229 - 234, Extract the numeric literals in the token estimation logic into named
module-level constants and use them inside _estimate_tokens: define
CJK_TOKENS_PER_CHAR = 1.5, NON_CJK_TOKENS_PER_CHAR = 0.25 and
TOKEN_SAFETY_MARGIN = 1.1 (or similar descriptive names) near the top of the
file, then replace the magic numbers in the _estimate_tokens calculation (the
lines computing estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25) and
estimated = int(estimated * 1.1)) to reference these constants for clarity and
maintainability.

206-242: Note: Token estimation inconsistency with RAG indexer.

The new _estimate_tokens uses CJK-aware estimation (1.5 tokens/CJK char), but the RAG indexer at backend/app/services/rag/index/indexer.py:290-307 uses the naive len(text) // 4 formula. For CJK-heavy documents, these estimates can differ by ~6x.

This doesn't cause functional issues here (conservative estimation is the goal), but may be worth unifying if accurate cross-pipeline token accounting becomes important.

Consider extracting _estimate_tokens to a shared utility (e.g., backend/app/utils/tokens.py) that both modules can use for consistent estimation. As per coding guidelines, common logic should be extracted into shared utilities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
206 - 242, Extract the CJK-aware logic in _estimate_tokens into a shared utility
function (e.g., estimate_tokens) in a new module tokens.py and replace the local
implementations so both places use the same function; specifically, move the
body of _estimate_tokens into backend/app/utils/tokens.py as
estimate_tokens(text: str) -> int, export it, then update the current
_estimate_tokens caller to import and call estimate_tokens, and replace the
naive len(text) // 4 usage in the RAG indexer (the token estimation site using
len(text) // 4) to call estimate_tokens(text) instead so both pipelines share
consistent token accounting.
chat_shell/chat_shell/compression/strategies.py (1)

492-504: Header extraction logic is correct but could benefit from a clarifying comment.

The position-based comparison correctly handles both formats. In the edge case where attachment_content is empty, the conditional substitutes -1 for content_start (avoiding a call to .find(""), which would return 0), and the logic correctly falls through to header = "".

♻️ Optional: Minor clarity improvement
             # and XML-style headers (e.g., "<selected_documents># Reference Documents\n...")
             bracket_pos = full_match.find("]")
             content_start = (
                 full_match.find(attachment_content) if attachment_content else -1
             )
+            # Choose the delimiter that appears first to extract the header
             if bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start):
                 header = full_match[: bracket_pos + 1]
             elif content_start > 0:
                 header = full_match[:content_start]
             else:
+                # No recognizable header structure
                 header = ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chat_shell/chat_shell/compression/strategies.py` around lines 492 - 504,
Clarify the header extraction in the block using full_match, attachment_content,
bracket_pos, content_start and header by adding a concise comment that explains
content_start is set via .find(...) which yields -1 when attachment_content
isn't present (including empty/None cases), and that the branch ordering
(bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start))
intentionally prefers bracket-style headers when a closing ] occurs before any
attachment_content match, otherwise slices up to content_start, and falls back
to header = "" when no header marker is found; this makes the edge case of
attachment_content not found explicit for future readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61d1af2e-a46c-4e0c-adad-fec9d90f4817

📥 Commits

Reviewing files that changed from the base of the PR and between 835922b and d917086.

📒 Files selected for processing (2)
  • backend/app/services/chat/preprocessing/selected_documents.py
  • chat_shell/chat_shell/compression/strategies.py

Comment on lines +163 to 172
 # Pattern to match attachment content blocks wrapped in <attachment> XML tags,
 # <selected_documents> XML tags, or legacy format without tags
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>)?\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\](.*?)(?:</attachment>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|$))",
+    r"(?:<attachment>|<selected_documents>)?"
+    r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
+    r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
+    r"(.*?)"
+    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )

⚠️ Potential issue | 🟠 Major

Closing tag is lost during truncation.

The regex correctly matches <selected_documents> opening and closing tags, but the closing tag is consumed by the match (not a lookahead) and included in full_match.group(0). However, when _truncate_attachments_proportionally reconstructs the truncated content at line 531, it only returns header + begin_part + truncation_notice + end_part — the closing tag is dropped.

This produces malformed XML output like:

<selected_documents>
# Reference Documents
...truncated content...
[... Middle content truncated ...]
...end portion...

Missing </selected_documents>.

🐛 Proposed fix: preserve closing tag
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>|<selected_documents>)?"
+    r"(?P<open_tag><attachment>|<selected_documents>)?"
     r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
     r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
     r"(.*?)"
-    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
+    r"(?P<close_tag></attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )

Then in _truncate_attachments_proportionally, extract and append the closing tag:

# At end of truncate_match function
close_tag = match.group("close_tag") or ""
# Only include if it's an actual tag, not lookahead match
if close_tag.startswith("</"):
    return header + begin_part + truncation_notice + end_part + close_tag
return header + begin_part + truncation_notice + end_part
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chat_shell/chat_shell/compression/strategies.py` around lines 163 - 172, The
attachment-matching regex (ATTACHMENT_PATTERN) consumes the closing tag which
then gets dropped when rebuilding content in
_truncate_attachments_proportionally; fix by making the regex capture the
closing tag (e.g., as a named group "close_tag") or otherwise expose it from the
match, and then in _truncate_attachments_proportionally append that captured
close_tag (but only if it is an actual closing tag, e.g., startswith("</")) to
the reconstructed string (header + begin_part + truncation_notice + end_part +
close_tag) so the XML stays well-formed.
