fix(chat): prevent prompt overflow when referencing knowledge documents #697
2561056571 wants to merge 1 commit into wecode-ai:main
Conversation
The notebook chat page throws a "prompt is too long" error when referencing knowledge documents because of two issues:

1. Token estimation uses a naive `len // 4` heuristic which dramatically underestimates tokens for CJK content (Chinese/Japanese/Korean characters use ~1.5 tokens per character, not 0.25). This causes large documents to pass the injection threshold when they shouldn't.
2. The compression system's `AttachmentTruncationStrategy` doesn't recognize `<selected_documents>` XML tags, so injected document content is invisible to compression strategies and cannot be truncated when the prompt exceeds context limits.

Changes:

- Add content-aware token estimation that handles CJK text properly
- Reduce `MAX_INJECTION_RATIO` from 0.5 to 0.3 to reserve room for the system prompt, chat history, and dynamic context
- Add `<selected_documents>` to the compression strategy's document markers so the compressor can detect and truncate notebook document content
- Update the regex pattern and header extraction to handle both bracket-style and XML-style document formats
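The estimation approach described in point 1 can be sketched as follows. This is a minimal illustration, not the PR's actual code: the character ranges, ratios, safety margin, and function name are assumptions drawn from the description above.

```python
import re

# Hypothetical CJK matcher: ideographs, kana, and hangul (illustrative ranges)
_CJK_PATTERN = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def estimate_tokens(text: str) -> int:
    """Content-aware estimate: ~1.5 tokens per CJK char, ~0.25 otherwise."""
    cjk_chars = sum(1 for _ in _CJK_PATTERN.finditer(text))
    non_cjk_chars = len(text) - cjk_chars
    estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25)
    return int(estimated * 1.1)  # 10% safety margin for tokenizer overhead

# A CJK-heavy string is weighted ~6x more heavily than the naive heuristic:
text = "你好世界" * 100         # 400 CJK characters
naive = len(text) // 4         # 100 tokens
aware = estimate_tokens(text)  # 660 tokens
```

With the naive heuristic, a document like this sails under an injection threshold it should fail; the content-aware estimate makes the overshoot visible.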
📝 Walkthrough

Two backend modules updated: token estimation improved using CJK character detection (1.5 and 0.25 tokens per character for CJK and non-CJK text respectively), MAX_INJECTION_RATIO reduced from 0.5 to 0.3, and attachment pattern recognition extended to handle `<selected_documents>` tags.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
backend/app/services/chat/preprocessing/selected_documents.py (3)
226-231: Use `finditer` instead of `findall` for memory efficiency.

`findall(text)` creates a list containing every CJK character as a separate string. For large CJK-heavy documents (e.g., a 5MB Chinese document), this allocates millions of single-character strings just to count them.

♻️ More efficient counting with iterator

```diff
-    cjk_chars = len(_CJK_PATTERN.findall(text))
+    cjk_chars = sum(1 for _ in _CJK_PATTERN.finditer(text))
     non_cjk_chars = len(text) - cjk_chars
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines 226 - 231, The current counting uses _CJK_PATTERN.findall(text) which builds a list of every match; replace that with an iterator-based count using _CJK_PATTERN.finditer(text) and compute cjk_chars by iterating and counting matches (e.g., sum over the iterator) to avoid allocating millions of strings; keep the subsequent non_cjk_chars and estimated calculations unchanged and update the cjk_chars assignment in the block that currently defines cjk_chars, non_cjk_chars, and estimated.
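The suggested switch keeps the count identical while avoiding the intermediate list; a quick standalone check (the pattern here is a simplified stand-in for `_CJK_PATTERN`):

```python
import re

_CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")  # simplified CJK range
text = "混合 mixed 文本 text " * 1000

# findall materializes one single-character string per match...
count_list = len(_CJK_PATTERN.findall(text))
# ...finditer yields match objects lazily, so no throwaway list is built
count_iter = sum(1 for _ in _CJK_PATTERN.finditer(text))

assert count_list == count_iter == 4000
```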
229-234: Extract magic numbers to named constants.

The token estimation ratios and safety margin are meaningful values that would benefit from named constants for clarity and maintainability.

♻️ Extract to named constants

Add near the top of the file:

```python
# Token estimation ratios (conservative estimates for injection safety)
CJK_TOKENS_PER_CHAR = 1.5       # CJK characters typically produce 1-2 tokens
NON_CJK_TOKENS_PER_CHAR = 0.25  # ~4 ASCII/Latin characters per token
TOKEN_SAFETY_MARGIN = 1.1       # 10% buffer for tokenizer overhead
```

Then in `_estimate_tokens`:

```diff
-    estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25)
-    estimated = int(estimated * 1.1)
+    estimated = int(cjk_chars * CJK_TOKENS_PER_CHAR + non_cjk_chars * NON_CJK_TOKENS_PER_CHAR)
+    estimated = int(estimated * TOKEN_SAFETY_MARGIN)
```

As per coding guidelines, "Extract magic numbers to named constants in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines 229 - 234, Extract the numeric literals in the token estimation logic into named module-level constants and use them inside _estimate_tokens: define CJK_TOKENS_PER_CHAR = 1.5, NON_CJK_TOKENS_PER_CHAR = 0.25 and TOKEN_SAFETY_MARGIN = 1.1 (or similar descriptive names) near the top of the file, then replace the magic numbers in the _estimate_tokens calculation (the lines computing estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25) and estimated = int(estimated * 1.1)) to reference these constants for clarity and maintainability.
206-242: Note: Token estimation inconsistency with RAG indexer.

The new `_estimate_tokens` uses CJK-aware estimation (1.5 tokens/CJK char), but the RAG indexer at `backend/app/services/rag/index/indexer.py:290-307` uses the naive `len(text) // 4` formula. For CJK-heavy documents, these estimates can differ by ~6x.

This doesn't cause functional issues here (conservative estimation is the goal), but may be worth unifying if accurate cross-pipeline token accounting becomes important.

Consider extracting `_estimate_tokens` to a shared utility (e.g., `backend/app/utils/tokens.py`) that both modules can use for consistent estimation. As per coding guidelines, common logic should be extracted into shared utilities.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines 206 - 242, Extract the CJK-aware logic in _estimate_tokens into a shared utility function (e.g., estimate_tokens) in a new module tokens.py and replace the local implementations so both places use the same function; specifically, move the body of _estimate_tokens into backend/app/utils/tokens.py as estimate_tokens(text: str) -> int, export it, then update the current _estimate_tokens caller to import and call estimate_tokens, and replace the naive len(text) // 4 usage in the RAG indexer (the token estimation site using len(text) // 4) to call estimate_tokens(text) instead so both pipelines share consistent token accounting.

chat_shell/chat_shell/compression/strategies.py (1)
492-504: Header extraction logic is correct but could benefit from a clarifying comment.

The position-based comparison correctly handles both formats. However, consider the edge case where `attachment_content` is empty: `content_start` would be -1 (via the `if attachment_content else -1` guard), and the logic would correctly fall through to `header = ""`.

♻️ Optional: Minor clarity improvement

```diff
     # and XML-style headers (e.g., "<selected_documents># Reference Documents\n...")
     bracket_pos = full_match.find("]")
     content_start = (
         full_match.find(attachment_content) if attachment_content else -1
     )
+    # Choose the delimiter that appears first to extract the header
     if bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start):
         header = full_match[: bracket_pos + 1]
     elif content_start > 0:
         header = full_match[:content_start]
     else:
+        # No recognizable header structure
         header = ""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@chat_shell/chat_shell/compression/strategies.py` around lines 492 - 504, Clarify the header extraction in the block using full_match, attachment_content, bracket_pos, content_start and header by adding a concise comment that explains content_start is set via .find(...) which yields -1 when attachment_content isn't present (including empty/None cases), and that the branch ordering (bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start)) intentionally prefers bracket-style headers when a closing ] occurs before any attachment_content match, otherwise slices up to content_start, and falls back to header = "" when no header marker is found; this makes the edge case of attachment_content not found explicit for future readers.
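The extraction logic under discussion can be exercised standalone; the wrapper function below is hypothetical, but its body mirrors the code in the diff above:

```python
def extract_header(full_match: str, attachment_content: str) -> str:
    """Slice off the header before whichever delimiter comes first:
    a closing ']' (bracket style) or the attachment body (XML style)."""
    bracket_pos = full_match.find("]")
    content_start = (
        full_match.find(attachment_content) if attachment_content else -1
    )
    if bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start):
        return full_match[: bracket_pos + 1]
    elif content_start > 0:
        return full_match[:content_start]
    return ""

# Bracket-style: the ']' appears before the body
assert extract_header("[Attachment 1: notes.md]body", "body") == "[Attachment 1: notes.md]"
# XML-style: no ']' before the body, so slice up to the content
assert extract_header("<selected_documents># Reference Documents\nbody", "body") \
    == "<selected_documents># Reference Documents\n"
# Empty content: content_start stays -1, so the edge case falls through to ""
assert extract_header("<selected_documents>", "") == ""
```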
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 61d1af2e-a46c-4e0c-adad-fec9d90f4817
📒 Files selected for processing (2)

- backend/app/services/chat/preprocessing/selected_documents.py
- chat_shell/chat_shell/compression/strategies.py
```diff
 # Pattern to match attachment content blocks wrapped in <attachment> XML tags,
+# <selected_documents> XML tags, or legacy format without tags
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>)?\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\](.*?)(?:</attachment>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|$))",
+    r"(?:<attachment>|<selected_documents>)?"
+    r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
+    r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
+    r"(.*?)"
+    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )
```
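The extended pattern can be sanity-checked directly against both formats it now claims to handle (a standalone check using the regex from the diff; the sample strings are illustrative):

```python
import re

ATTACHMENT_PATTERN = re.compile(
    r"(?:<attachment>|<selected_documents>)?"
    r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
    r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
    r"(.*?)"
    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
    re.DOTALL,
)

# Legacy bracket-style attachment
legacy = "[Attachment 1: notes.md]some file content"
# New XML-style notebook document block
xml = "<selected_documents># Reference Documents\n## doc.md\nbody text</selected_documents>"

m1 = ATTACHMENT_PATTERN.search(legacy)
m2 = ATTACHMENT_PATTERN.search(xml)
assert m1 and m1.group(1) == "some file content"
assert m2 and m2.group(1) == "body text"
```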
Closing tag is lost during truncation.

The regex correctly matches `<selected_documents>` opening and closing tags, but the closing tag is consumed by the match (not a lookahead) and included in `full_match.group(0)`. However, when `_truncate_attachments_proportionally` reconstructs the truncated content at line 531, it only returns `header + begin_part + truncation_notice + end_part`; the closing tag is dropped.

This produces malformed XML output like:

```
<selected_documents>
# Reference Documents
...truncated content...
[... Middle content truncated ...]
...end portion...
```

Missing `</selected_documents>`.
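The loss is easy to reproduce with a stripped-down pattern that, like `ATTACHMENT_PATTERN`, consumes rather than looks ahead at the closing tag; the names here are illustrative, not the actual code:

```python
import re

# Simplified: one group for the opening tag, one for the body; the closing
# tag is matched and consumed but never captured
pattern = re.compile(r"(<selected_documents>)(.*?)(?:</selected_documents>)", re.DOTALL)

text = "<selected_documents># Reference Documents\nlots of content</selected_documents>"

def truncate(match: re.Match) -> str:
    header, body = match.group(1), match.group(2)
    # Rebuild from captured groups only: the consumed closing tag is
    # silently dropped here, leaving malformed XML
    return header + body[:20] + "[... truncated ...]"

result = pattern.sub(truncate, text)
assert "</selected_documents>" not in result  # malformed output
```

Any rebuild that uses only captured groups drops the consumed tag, which is why capturing it explicitly matters.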
🐛 Proposed fix: preserve closing tag

```diff
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>|<selected_documents>)?"
+    r"(?P<open_tag><attachment>|<selected_documents>)?"
     r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
     r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
     r"(.*?)"
-    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
+    r"(?P<close_tag></attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )
```

Then in `_truncate_attachments_proportionally`, extract and append the closing tag:

```python
# At end of truncate_match function
close_tag = match.group("close_tag") or ""
# Only include if it's an actual tag, not a lookahead match
if close_tag.startswith("</"):
    return header + begin_part + truncation_notice + end_part + close_tag
return header + begin_part + truncation_notice + end_part
```

🤖 Prompt for AI Agents
return header + begin_part + truncation_notice + end_part🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@chat_shell/chat_shell/compression/strategies.py` around lines 163 - 172, The
attachment-matching regex (ATTACHMENT_PATTERN) consumes the closing tag which
then gets dropped when rebuilding content in
_truncate_attachments_proportionally; fix by making the regex capture the
closing tag (e.g., as a named group "close_tag") or otherwise expose it from the
match, and then in _truncate_attachments_proportionally append that captured
close_tag (but only if it is an actual closing tag, e.g., startswith("</")) to
the reconstructed string (header + begin_part + truncation_notice + end_part +
close_tag) so the XML stays well-formed.
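One subtlety in the fix: when the pattern ends at a lookahead (no real closing tag), the named group still participates with a zero-width match, so `match.group("close_tag")` returns an empty string rather than `None`. A simplified check with illustrative tag names:

```python
import re

# close_tag captures either a real closing tag or a zero-width lookahead match
pattern = re.compile(r"<doc>(.*?)(?P<close_tag></doc>|(?=<doc>|$))", re.DOTALL)

m1 = pattern.search("<doc>body</doc>")
m2 = pattern.search("<doc>unterminated body")

assert m1.group("close_tag") == "</doc>"
# Lookahead alternative matched: the group is the empty string, not None
assert m2.group("close_tag") == ""
# Hence the startswith("</") guard before re-appending the tag
assert m1.group("close_tag").startswith("</")
assert not m2.group("close_tag").startswith("</")
```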
Summary
- Replace the naive `len // 4` token-estimation heuristic with content-aware, CJK-sensitive estimation
- Add `<selected_documents>` recognition to compression strategies so the compressor can truncate notebook document content when needed

Root Cause
Two bugs working together caused the prompt overflow:
1. Inaccurate token estimation: The `total_chars // 4` heuristic dramatically underestimates tokens for CJK content (Chinese characters use ~1.5 tokens/char, not 0.25). A 400k-char Chinese document estimates as 100k tokens but actually has 200k+ tokens, passing the injection threshold incorrectly.
2. Compression blind spot: The `AttachmentTruncationStrategy` only recognizes `<attachment>`, `[Attachment`, etc. markers but NOT the `<selected_documents>` tags used by notebook mode. This means injected document content is invisible to compression strategies and cannot be truncated as a safety net.

Changes
`backend/app/services/chat/preprocessing/selected_documents.py`:

- Add `_estimate_tokens()` with CJK-aware token estimation (~1.5 tokens/CJK char, ~0.25 tokens/Latin char, 10% safety margin)
- Reduce `MAX_INJECTION_RATIO` from 0.5 to 0.3 to leave room for other prompt components

`chat_shell/chat_shell/compression/strategies.py`:

- Add `<selected_documents>` to the `DOCUMENT_MARKERS` list
- Extend the `ATTACHMENT_PATTERN` regex to match `<selected_documents>` XML tags and `# Reference Documents` headers
- Update `_truncate_attachments_proportionally` to handle both bracket-style and XML-style document formats

Test plan
- Verified that compression truncates `<selected_documents>` content when the prompt exceeds context limits

Summary by CodeRabbit
Release Notes
Bug Fixes
Improvements