fix: add Chinese noise filter patterns #49

Open
PanYuxn wants to merge 1 commit into win4r:main from PanYuxn:fix/chinese-noise-filter

PanYuxn commented Mar 4, 2026

Closes #48

Problem

noise-filter.ts contains only English patterns. Chinese greetings (你好, "hello"), denials (我不记得, "I don't remember"), meta-questions (你还记得吗, "do you still remember?"), and acknowledgments (好的, "okay"; 谢谢, "thanks") pass through unfiltered and are stored as memories, polluting retrieval results over time.

This is especially impactful with autoCapture enabled, as every 好的 and 谢谢 becomes a permanent memory entry.

Solution

Add Chinese patterns to all three filter categories:

| Category | Examples | Count |
|---|---|---|
| Denial | 我不记得, 找不到相关记忆, 我没有相关信息 | 8 patterns |
| Meta-question | 你还记得吗, 我之前提到过, 我有没有说过 | 6 patterns |
| Boilerplate | 你好, 早上好, 好的, 谢谢, 知道了 | 7+2 patterns |
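
As a rough illustration of how the three categories might be organized, here is a minimal sketch. The names `NoiseCategory`, `CHINESE_NOISE_EXAMPLES`, and `classifyNoise` are hypothetical, the regexes are built only from the examples listed above (not the PR's full pattern set), and the choice of substring matching for denials and meta-questions versus whole-string matching for greetings is an assumption.

```typescript
// Hypothetical names; not the actual exports of src/noise-filter.ts.
type NoiseCategory = "denial" | "meta-question" | "boilerplate";

// Illustrative subsets only, taken from the examples in the PR description.
const CHINESE_NOISE_EXAMPLES: Record<NoiseCategory, RegExp[]> = {
  // "I don't remember", "can't find related memories", "I have no related info"
  denial: [/我不记得/, /找不到相关记忆/, /我没有相关信息/],
  // "do you still remember?", "I mentioned before", "did I ever say"
  "meta-question": [/你还记得吗/, /我之前提到过/, /我有没有说过/],
  // "hello", "good morning", "got it" (assumed to match the whole message)
  boilerplate: [/^你好$/, /^早上好$/, /^知道了$/],
};

// Returns the first category whose patterns match, or null if the text is not noise.
function classifyNoise(text: string): NoiseCategory | null {
  const trimmed = text.trim();
  const categories = Object.keys(CHINESE_NOISE_EXAMPLES) as NoiseCategory[];
  for (const category of categories) {
    if (CHINESE_NOISE_EXAMPLES[category].some((pattern) => pattern.test(trimmed))) {
      return category;
    }
  }
  return null;
}
```

Under these assumptions, `classifyNoise("我不记得")` yields `"denial"` and `classifyNoise("你好")` yields `"boilerplate"`, while a sentence that matches none of the patterns yields `null`.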

False-positive prevention

Chinese acknowledgment words (好的, 谢谢) often appear at the start of meaningful sentences:

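  • 好的方案是使用Redis做缓存层 ("A good approach is to use Redis as the cache layer"): should NOT be filtered
  • 谢谢分享,我觉得这个思路很好 ("Thanks for sharing, I think this idea is great"): should NOT be filtered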

Solution: length-gated SHORT_BOILERPLATE_PATTERNS. Prefix matches such as 好的/谢谢 are treated as noise only when the total text length is ≤ 10 characters: short text is filler, long text is real content.
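
The length gate can be sketched roughly as follows. This is not the code from the PR: only the name `SHORT_BOILERPLATE_PATTERNS` and the 10-character threshold come from the description above, while `SHORT_BOILERPLATE_MAX_LENGTH`, `isShortBoilerplate`, the two regexes, the trimming step, and counting by Unicode code points are all assumptions made for illustration.

```typescript
// Threshold from the PR description; the constant name is hypothetical.
const SHORT_BOILERPLATE_MAX_LENGTH = 10;

// Prefix patterns for acknowledgment words (assumed shape, not the PR's actual list).
const SHORT_BOILERPLATE_PATTERNS: RegExp[] = [
  /^好的/, // "okay"
  /^谢谢/, // "thanks"
];

// Hypothetical helper: a prefix match counts as noise only for short messages.
function isShortBoilerplate(text: string): boolean {
  const trimmed = text.trim();
  // Assumption: count Unicode code points rather than UTF-16 code units.
  const length = Array.from(trimmed).length;
  if (length > SHORT_BOILERPLATE_MAX_LENGTH) {
    return false; // long text is treated as real content
  }
  return SHORT_BOILERPLATE_PATTERNS.some((pattern) => pattern.test(trimmed));
}
```

With this sketch, `isShortBoilerplate("谢谢你的帮助")` returns `true` (6 characters), while both false-positive examples above return `false` because they are 16 and 14 characters long.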

Test Results

52/52 tests pass:

  • 38 noise cases correctly filtered (9 denial + 7 meta-question + 19 boilerplate + 3 English)
  • 8 false-positive guards all pass (meaningful text kept)
  • 6 integration test assertions pass

Changes

  • src/noise-filter.ts — Add Chinese patterns + SHORT_BOILERPLATE_PATTERNS with length gate
  • test/noise-filter-chinese.mjs — New test: 52 cases covering all categories + false-positive protection

No new dependencies. Existing npm test passes.

The noise filter only contained English patterns, causing Chinese
greetings, denials, and meta-questions to pass through unfiltered
and pollute the memory database.

Add Chinese patterns for all three categories:
- Denial: 我不记得, 找不到相关记忆, 我没有相关信息, etc.
- Meta-question: 你还记得吗, 我之前提到过, etc.
- Boilerplate: 你好, 早上好, 好的, 谢谢, etc.

Use a length-gated SHORT_BOILERPLATE_PATTERNS strategy to prevent
false positives: acknowledgment prefixes (好的, 谢谢) are only
filtered when total text ≤ 10 chars, so "谢谢你的帮助" is noise
but "谢谢分享,我觉得这个思路很好" is kept.

Includes test with 52 cases (38 noise + 8 false-positive guards + 6 integration).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AliceLJY (Collaborator) commented Mar 5, 2026

PR #49 Review Summary

Tested on Gateway 2026.3.2 + claude-opus-4-6. Unit tests pass 52/52, and runtime memory_store was verified on Discord.

One issue to fix before merge:

/你知道.*吗/ in META_QUESTION_PATTERNS is too broad: it catches legitimate questions like "你知道Redis的持久化策略吗" ("Do you know Redis's persistence strategy?"). Suggest narrowing to /你知道我(说过|提过|告诉).*吗/ or removing it (the other 5 Chinese meta-question patterns already cover the intent).
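
The difference between the two regexes can be checked with a quick sketch. The variable names are hypothetical, and the second test string (你知道我说过什么吗, "do you know what I said?") is an illustrative example, not a case from the PR's test suite.

```typescript
// The pattern currently in the PR and the narrower alternative suggested above.
const broadPattern = /你知道.*吗/;
const narrowedPattern = /你知道我(说过|提过|告诉).*吗/;

// A legitimate technical question that should be kept as a memory.
const legitimateQuestion = "你知道Redis的持久化策略吗";
// A hypothetical meta-question about earlier conversation.
const metaQuestion = "你知道我说过什么吗";

const results = {
  broadOnLegitimate: broadPattern.test(legitimateQuestion), // true: the false positive
  narrowedOnLegitimate: narrowedPattern.test(legitimateQuestion), // false: question is kept
  narrowedOnMeta: narrowedPattern.test(metaQuestion), // true: still filtered
};

console.log(results);
```

The narrowed pattern requires 我 plus one of 说过/提过/告诉 right after 你知道, so general-knowledge questions no longer match.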

Everything else looks good. The SHORT_BOILERPLATE + length-gate design is clever and well-tested. Approve after the regex fix.

Development

Successfully merging this pull request may close these issues.

[Bug] Noise filter only matches English patterns — Chinese noise passes through unfiltered
