fix(chat): prevent prompt overflow when referencing knowledge documents#697

Open
2561056571 wants to merge 1 commit into wecode-ai:main from 2561056571:wegent/fix-knowledge-prompt-token-overflow

Conversation

@2561056571 (Collaborator) commented Mar 9, 2026

Summary

  • Fix "prompt is too long: 244418 tokens > 200000 maximum" error when referencing knowledge documents in notebook chat mode
  • Add content-aware token estimation that properly handles CJK (Chinese/Japanese/Korean) text instead of naive len // 4 heuristic
  • Reduce document injection threshold from 50% to 30% of context window to reserve room for system prompt, history, and dynamic context
  • Add <selected_documents> recognition to compression strategies so the compressor can truncate notebook document content when needed

Root Cause

Two bugs working together caused the prompt overflow:

  1. Inaccurate token estimation: The total_chars // 4 heuristic dramatically underestimates tokens for CJK content (Chinese characters use ~1.5 tokens/char, not 0.25). A 400k-char Chinese document is estimated at 100k tokens but actually contains 200k+ tokens, so it incorrectly passes the injection threshold.

  2. Compression blind spot: The AttachmentTruncationStrategy only recognizes <attachment>, [Attachment, etc. markers but NOT <selected_documents> tags used by notebook mode. This means injected document content is invisible to compression strategies and cannot be truncated as a safety net.
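To make the first bug concrete, here is a minimal illustration of how far the naive heuristic drifts for CJK text (the numbers mirror the description above; this is illustrative arithmetic, not project code):

```python
# Illustrative only: the naive heuristic vs. a CJK-aware rate.
doc = "文" * 400_000                  # hypothetical 400k-char Chinese document
naive_estimate = len(doc) // 4        # 100_000 -> looks safely under a 200k limit
cjk_aware = int(len(doc) * 1.5)       # 600_000 -> actually far over the limit
```

The naive estimate clears a 200k-token budget check by a factor of two while the real token count blows past it.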

Changes

  1. backend/app/services/chat/preprocessing/selected_documents.py:

    • Add _estimate_tokens() with CJK-aware token estimation (~1.5 tokens/CJK char, ~0.25 tokens/Latin char, 10% safety margin)
    • Reduce MAX_INJECTION_RATIO from 0.5 to 0.3 to leave room for other prompt components
    • Add CJK character detection regex pattern
  2. chat_shell/chat_shell/compression/strategies.py:

    • Add <selected_documents> to DOCUMENT_MARKERS list
    • Update ATTACHMENT_PATTERN regex to match <selected_documents> XML tags and # Reference Documents headers
    • Fix header extraction in _truncate_attachments_proportionally to handle both bracket-style and XML-style document formats

Test plan

  • All 1102 backend tests pass
  • All 41 compression tests pass
  • All 334 chat_shell tests pass (1 pre-existing unrelated failure excluded)
  • Verify notebook chat with large CJK knowledge documents falls back to RAG instead of direct injection
  • Verify notebook chat with smaller documents still uses direct injection correctly
  • Verify compression can truncate <selected_documents> content when prompt exceeds context limits
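The last check can be exercised with a hedged sketch like the following (the truncation helper is a stand-in for the actual strategy class, whose API is not shown here):

```python
import re

# Stand-in truncator: keeps the head and tail of a <selected_documents> block.
DOC_BLOCK = re.compile(
    r"(<selected_documents>)(.*?)(</selected_documents>)", re.DOTALL
)

def truncate_selected_documents(prompt: str, keep_chars: int) -> str:
    def repl(m):
        open_tag, body, close_tag = m.groups()
        if len(body) <= keep_chars:
            return m.group(0)  # small enough, leave untouched
        half = keep_chars // 2
        notice = "\n[... Middle content truncated ...]\n"
        # Re-emit the closing tag explicitly so the XML stays well-formed.
        return open_tag + body[:half] + notice + body[-half:] + close_tag
    return DOC_BLOCK.sub(repl, prompt)
```

Re-emitting the closing tag matters here; the review comment on strategies.py flags exactly this failure mode in the regex-based reconstruction.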

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved token counting for multilingual content in documents
    • Enhanced attachment format recognition and header extraction during truncation
  • Improvements

    • Optimized context window allocation for better document processing

The notebook chat page throws "prompt is too long" error when referencing
knowledge documents because of two issues:

1. Token estimation uses naive `len // 4` heuristic which dramatically
   underestimates tokens for CJK content (Chinese/Japanese/Korean chars
   use ~1.5 tokens per character, not 0.25). This causes large documents
   to pass the injection threshold when they shouldn't.

2. The compression system's AttachmentTruncationStrategy doesn't
   recognize `<selected_documents>` XML tags, so injected document
   content is invisible to compression strategies and cannot be
   truncated when the prompt exceeds context limits.

Changes:
- Add content-aware token estimation that handles CJK text properly
- Reduce MAX_INJECTION_RATIO from 0.5 to 0.3 to reserve room for
  system prompt, chat history, and dynamic context
- Add `<selected_documents>` to compression strategy's document markers
  so the compressor can detect and truncate notebook document content
- Update regex pattern and header extraction to handle both bracket-style
  and XML-style document formats
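The ratio change amounts to a simple injection gate, sketched below (function and parameter names are assumptions for illustration; the real logic lives in selected_documents.py):

```python
MAX_INJECTION_RATIO = 0.3  # was 0.5; reserves 70% for system prompt, history, etc.

def should_inject_directly(doc_tokens: int, context_window: int) -> bool:
    # Inject documents verbatim only when they fit the reserved budget;
    # otherwise the caller falls back to RAG retrieval.
    return doc_tokens <= int(context_window * MAX_INJECTION_RATIO)
```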

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

Two backend modules were updated: token estimation now uses CJK character detection (1.5 tokens/char for CJK, 0.25 otherwise), MAX_INJECTION_RATIO was reduced from 0.5 to 0.3, and attachment pattern recognition was extended to handle <selected_documents> blocks with enhanced header extraction logic.

Changes

Token Estimation Logic — backend/app/services/chat/preprocessing/selected_documents.py
Replaced naive character-based token counting with a content-aware _estimate_tokens() function using CJK character detection. Differentiates between CJK (~1.5 tokens/char) and non-CJK (~0.25 tokens/char) content with a 10% safety margin. Reduced MAX_INJECTION_RATIO from 0.5 to 0.3 and expanded debug logging.

Attachment Pattern Recognition — chat_shell/chat_shell/compression/strategies.py
Extended ATTACHMENT_PATTERN regex and DOCUMENT_MARKERS to recognize attachments in <selected_documents> blocks alongside <attachment> tags. Improved header extraction during truncation to conditionally use bracketed or XML-style headers based on content position detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 With whiskers twitched and paws held high,
We count the tokens, CJK-style,
From East to West, each character flies,
While attachments hide no more behind disguise—
Better headers, safer margins, all the while! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately summarizes the main objective of the pull request: preventing prompt overflow when referencing knowledge documents through improved token estimation and compression handling.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
backend/app/services/chat/preprocessing/selected_documents.py (3)

226-231: Use finditer instead of findall for memory efficiency.

findall(text) creates a list containing every CJK character as a separate string. For large CJK-heavy documents (e.g., a 5MB Chinese document), this allocates millions of single-character strings just to count them.

♻️ More efficient counting with iterator
-    cjk_chars = len(_CJK_PATTERN.findall(text))
+    cjk_chars = sum(1 for _ in _CJK_PATTERN.finditer(text))
     non_cjk_chars = len(text) - cjk_chars
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
226 - 231, The current counting uses _CJK_PATTERN.findall(text) which builds a
list of every match; replace that with an iterator-based count using
_CJK_PATTERN.finditer(text) and compute cjk_chars by iterating and counting
matches (e.g., sum over the iterator) to avoid allocating millions of strings;
keep the subsequent non_cjk_chars and estimated calculations unchanged and
update the cjk_chars assignment in the block that currently defines cjk_chars,
non_cjk_chars, and estimated.

229-234: Extract magic numbers to named constants.

The token estimation ratios and safety margin are meaningful values that would benefit from named constants for clarity and maintainability.

♻️ Extract to named constants

Add near the top of the file:

# Token estimation ratios (conservative estimates for injection safety)
CJK_TOKENS_PER_CHAR = 1.5  # CJK characters typically produce 1-2 tokens
NON_CJK_TOKENS_PER_CHAR = 0.25  # ~4 ASCII/Latin characters per token
TOKEN_SAFETY_MARGIN = 1.1  # 10% buffer for tokenizer overhead

Then in _estimate_tokens:

-    estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25)
-    estimated = int(estimated * 1.1)
+    estimated = int(cjk_chars * CJK_TOKENS_PER_CHAR + non_cjk_chars * NON_CJK_TOKENS_PER_CHAR)
+    estimated = int(estimated * TOKEN_SAFETY_MARGIN)

As per coding guidelines, "Extract magic numbers to named constants in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
229 - 234, Extract the numeric literals in the token estimation logic into named
module-level constants and use them inside _estimate_tokens: define
CJK_TOKENS_PER_CHAR = 1.5, NON_CJK_TOKENS_PER_CHAR = 0.25 and
TOKEN_SAFETY_MARGIN = 1.1 (or similar descriptive names) near the top of the
file, then replace the magic numbers in the _estimate_tokens calculation (the
lines computing estimated = int(cjk_chars * 1.5 + non_cjk_chars * 0.25) and
estimated = int(estimated * 1.1)) to reference these constants for clarity and
maintainability.

206-242: Note: Token estimation inconsistency with RAG indexer.

The new _estimate_tokens uses CJK-aware estimation (1.5 tokens/CJK char), but the RAG indexer at backend/app/services/rag/index/indexer.py:290-307 uses the naive len(text) // 4 formula. For CJK-heavy documents, these estimates can differ by ~6x.

This doesn't cause functional issues here (conservative estimation is the goal), but may be worth unifying if accurate cross-pipeline token accounting becomes important.

Consider extracting _estimate_tokens to a shared utility (e.g., backend/app/utils/tokens.py) that both modules can use for consistent estimation. As per coding guidelines, common logic should be extracted into shared utilities.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/chat/preprocessing/selected_documents.py` around lines
206 - 242, Extract the CJK-aware logic in _estimate_tokens into a shared utility
function (e.g., estimate_tokens) in a new module tokens.py and replace the local
implementations so both places use the same function; specifically, move the
body of _estimate_tokens into backend/app/utils/tokens.py as
estimate_tokens(text: str) -> int, export it, then update the current
_estimate_tokens caller to import and call estimate_tokens, and replace the
naive len(text) // 4 usage in the RAG indexer (the token estimation site using
len(text) // 4) to call estimate_tokens(text) instead so both pipelines share
consistent token accounting.
chat_shell/chat_shell/compression/strategies.py (1)

492-504: Header extraction logic is correct but could benefit from a clarifying comment.

The position-based comparison correctly handles both formats. In the edge case where attachment_content is empty, the conditional substitutes -1 for content_start (avoiding a call to .find(""), which would return 0), and the logic correctly falls through to header = "".

♻️ Optional: Minor clarity improvement
             # and XML-style headers (e.g., "<selected_documents># Reference Documents\n...")
             bracket_pos = full_match.find("]")
             content_start = (
                 full_match.find(attachment_content) if attachment_content else -1
             )
+            # Choose the delimiter that appears first to extract the header
             if bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start):
                 header = full_match[: bracket_pos + 1]
             elif content_start > 0:
                 header = full_match[:content_start]
             else:
+                # No recognizable header structure
                 header = ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chat_shell/chat_shell/compression/strategies.py` around lines 492 - 504,
Clarify the header extraction in the block using full_match, attachment_content,
bracket_pos, content_start and header by adding a concise comment that explains
content_start is set via .find(...) which yields -1 when attachment_content
isn't present (including empty/None cases), and that the branch ordering
(bracket_pos >= 0 and (content_start < 0 or bracket_pos < content_start))
intentionally prefers bracket-style headers when a closing ] occurs before any
attachment_content match, otherwise slices up to content_start, and falls back
to header = "" when no header marker is found; this makes the edge case of
attachment_content not found explicit for future readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61d1af2e-a46c-4e0c-adad-fec9d90f4817

📥 Commits

Reviewing files that changed from the base of the PR and between 835922b and d917086.

📒 Files selected for processing (2)
  • backend/app/services/chat/preprocessing/selected_documents.py
  • chat_shell/chat_shell/compression/strategies.py

Comment on lines +163 to 172
 # Pattern to match attachment content blocks wrapped in <attachment> XML tags,
 # <selected_documents> XML tags, or legacy format without tags
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>)?\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\](.*?)(?:</attachment>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|$))",
+    r"(?:<attachment>|<selected_documents>)?"
+    r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
+    r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
+    r"(.*?)"
+    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )

⚠️ Potential issue | 🟠 Major

Closing tag is lost during truncation.

The regex correctly matches <selected_documents> opening and closing tags, but the closing tag is consumed by the match (not a lookahead) and included in full_match.group(0). However, when _truncate_attachments_proportionally reconstructs the truncated content at line 531, it only returns header + begin_part + truncation_notice + end_part — the closing tag is dropped.

This produces malformed XML output like:

<selected_documents>
# Reference Documents
...truncated content...
[... Middle content truncated ...]
...end portion...

Missing </selected_documents>.

🐛 Proposed fix: preserve closing tag
 ATTACHMENT_PATTERN = re.compile(
-    r"(?:<attachment>|<selected_documents>)?"
+    r"(?P<open_tag><attachment>|<selected_documents>)?"
     r"(?:\[(?:Attachment \d+|File Content|Document)(?:\s*[:-]\s*[^\]]+)?\]|"
     r"# Reference Documents\s*\n\s*[^\n]*\n\s*)"
     r"(.*?)"
-    r"(?:</attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
+    r"(?P<close_tag></attachment>|</selected_documents>|(?=\[(?:Attachment \d+|File Content|Document)|<attachment>|<selected_documents>|$))",
     re.DOTALL,
 )

Then in _truncate_attachments_proportionally, extract and append the closing tag:

# At end of truncate_match function
close_tag = match.group("close_tag") or ""
# Only include if it's an actual tag, not lookahead match
if close_tag.startswith("</"):
    return header + begin_part + truncation_notice + end_part + close_tag
return header + begin_part + truncation_notice + end_part
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@chat_shell/chat_shell/compression/strategies.py` around lines 163 - 172, The
attachment-matching regex (ATTACHMENT_PATTERN) consumes the closing tag which
then gets dropped when rebuilding content in
_truncate_attachments_proportionally; fix by making the regex capture the
closing tag (e.g., as a named group "close_tag") or otherwise expose it from the
match, and then in _truncate_attachments_proportionally append that captured
close_tag (but only if it is an actual closing tag, e.g., startswith("</")) to
the reconstructed string (header + begin_part + truncation_notice + end_part +
close_tag) so the XML stays well-formed.
