Try replacing mmseg with the jieba word segmentation library and verify the results by frankslin · Pull Request #25 · frankslin/OpenCC

frankslin · 2026-01-18T01:26:21Z

No description provided.


* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

----

* Check in a copy of Jieba dictionary in data/jieba_dict/ for OpenCC:

  * jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
  * hmm_model.utf8 (508KB): HMM model for unknown word recognition
  * user.dict.utf8: User-defined custom dictionary
  * README.md: Dictionary documentation and customization guide

----

* Implement (experimental) Jieba segmentation support

----

* Add comprehensive test suite for Jieba segmentation

Added unit tests and comparison tests following OpenCC testing patterns.

1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
   - BasicSegmentation: Validates basic Chinese word segmentation
   - ComplexPhrase: Tests multi-word phrases and proper nouns
   - EmptyString, SingleCharacter: Edge case handling
   - EnglishAndChinese: Mixed language support
   - UnknownWords: HMM model's ability to recognize unknown words

2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
   - Follows t2cngov test pattern with external JSON test cases
   - Loads test definitions from test/testcases/jieba_comparison_testcases.json
   - Compares mmseg vs Jieba segmentation outputs
   - Displays: Input, Jieba segments, Expected segments, Conversion outputs
   - Converter caching for performance optimization

3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
   - 15 comprehensive test cases covering:
     * Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
     * Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba

   Key test scenarios:
   - jieba_s2t_001: 着名 ambiguity (wearing+name vs famous)
     "生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"
   - jieba_s2t_002: Compound words (中学生, 中等身材)
     "一个中学生，一个中等身材的人"
   - jieba_t2s_001: Traditional 著名/為 conversion
     "生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"
   - Other cases: Proper nouns, modern terms, mixed content, ambiguous structures, Taiwan-specific vocabulary, long compounds, classical Chinese

4. Focused Individual Tests:
   - AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
   - TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion

Output Format:

    === Test: jieba_s2t_001 ===
    Input: 生活着名为正敏的少女
    Jieba segments: 生活/着/名为/正敏/的/少女
    Expected segs: 生活/着/名为/正敏/的/少女
    s2twp: 生活著名爲正敏的少女
    s2twp_jieba: 生活著名為正敏的少女

Benefits:
- Visual comparison of segmentation algorithms
- Easy to add new test cases (just edit JSON)
- Documents expected behavior for ambiguous cases
- Validates that Jieba improves segmentation accuracy
- Test data can be reviewed independently from code

Build System Integration:
- Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
- Automatically run with 'make test' or 'ctest'

reorder

* Fix Jieba tests in Bazel and add more examples.

* Add comprehensive Jieba segmentation documentation

Added two detailed documentation files:

1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
   - Comprehensive feasibility analysis for integrating Jieba segmentation
   - Compares two implementation approaches:
     * cppjieba (C++ native) - RECOMMENDED
     * Python embedding via pybind11 - Not recommended
   - Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
   - Detailed implementation plan with code examples
   - Performance, deployment, and maintenance comparison matrix
   - 4-phase implementation roadmap
   - Risk assessment and mitigation strategies

2. doc/JIEBA_USAGE.md
   - Complete user guide for Jieba segmentation feature
   - Compilation instructions with CMake
   - Configuration file format and examples
   - C++/CLI/Python API usage examples
   - Custom user dictionary guide
   - Performance considerations and benchmarks
   - mmseg vs Jieba comparison table
   - Troubleshooting guide
   - Limitations and best practices

Key recommendations:
- Use cppjieba for production (performance, zero dependencies)
- Enable via -DENABLE_JIEBA=ON compile flag
- Experimental feature, opt-in only

----

* Fix C++ compiler compatibility

* Fix //python/tests:test_opencc

---------

Co-authored-by: Claude <noreply@anthropic.com>
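The 着名 ambiguity in jieba_s2t_001 can be reproduced with a small self-contained sketch. It is not the code from this PR and does not use cppjieba or OpenCC. It assumes mmseg behaves like greedy forward maximum matching, and it models only the DAG plus maximum-probability-route part of Jieba (no Trie, no HMM). The names `ToyDict`, `MaxMatch`, `BestRoute`, `SplitUtf8`, and `Join` are hypothetical, and every dictionary entry and frequency is invented for illustration; 正敏 is listed as if it came from a user dictionary.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Toy dictionary: word -> frequency. All entries and counts are invented.
using Dict = std::unordered_map<std::string, double>;

Dict ToyDict() {
  return {{"生活", 5000.0}, {"着", 20000.0}, {"着名", 50.0},
          {"名为", 2000.0}, {"名", 8000.0},  {"为", 30000.0},
          {"的", 100000.0}, {"少女", 3000.0},
          {"正敏", 10.0}};  // stand-in for a user-dictionary entry
}

// Split well-formed UTF-8 into code points, each kept as its own string.
// Assumes the source file and its string literals are UTF-8 encoded.
std::vector<std::string> SplitUtf8(const std::string& text) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < text.size();) {
    const unsigned char lead = static_cast<unsigned char>(text[i]);
    const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    chars.push_back(text.substr(i, len));
    i += len;
  }
  return chars;
}

// Join words with '/' in the same style as the test output above.
std::string Join(const std::vector<std::string>& words) {
  std::string out;
  for (std::size_t i = 0; i < words.size(); ++i) {
    if (i > 0) out += "/";
    out += words[i];
  }
  return out;
}

// Greedy forward maximum matching: always take the longest dictionary word.
std::vector<std::string> MaxMatch(const std::string& text, const Dict& dict,
                                  std::size_t max_len = 4) {
  const std::vector<std::string> chars = SplitUtf8(text);
  std::vector<std::string> out;
  for (std::size_t i = 0; i < chars.size();) {
    std::size_t best = 1;
    std::string best_word = chars[i];  // fall back to a single character
    std::string cand;
    for (std::size_t j = i; j < chars.size() && j - i < max_len; ++j) {
      cand += chars[j];
      if (dict.count(cand)) { best = j - i + 1; best_word = cand; }
    }
    out.push_back(best_word);
    i += best;
  }
  return out;
}

// DAG + dynamic programming: pick the route with the highest total
// log-probability, scanning from the end of the sentence to the start.
std::vector<std::string> BestRoute(const std::string& text, const Dict& dict,
                                   std::size_t max_len = 4) {
  assert(max_len > 0);
  const std::vector<std::string> chars = SplitUtf8(text);
  const std::size_t n = chars.size();
  double total = 1.0;  // starts at 1 so an empty dictionary cannot divide by zero
  for (const auto& entry : dict) total += entry.second;
  std::vector<double> score(n + 1, 0.0);    // best log-prob of chars[i..n)
  std::vector<std::size_t> next(n + 1, n);  // end index of the first word
  for (std::size_t i = n; i-- > 0;) {
    score[i] = -std::numeric_limits<double>::infinity();
    std::string cand;
    for (std::size_t j = i; j < n && j - i < max_len; ++j) {
      cand += chars[j];
      const auto it = dict.find(cand);
      if (it == dict.end() && j > i) continue;  // multi-char edges need a dictionary hit
      const double freq = (it == dict.end()) ? 1.0 : it->second;  // floor for unknown chars
      const double s = std::log(freq / total) + score[j + 1];
      if (s > score[i]) { score[i] = s; next[i] = j + 1; }
    }
  }
  std::vector<std::string> out;
  for (std::size_t i = 0; i < n; i = next[i]) {
    std::string word;
    for (std::size_t k = i; k < next[i]; ++k) word += chars[k];
    out.push_back(word);
  }
  return out;
}
```

With this toy dictionary, `Join(MaxMatch("生活着名为正敏的少女", ToyDict()))` yields `生活/着名/为/正敏/的/少女`, because the greedy matcher takes the longest hit 着名 without looking ahead. `Join(BestRoute(...))` on the same input yields `生活/着/名为/正敏/的/少女`, matching the expected segmentation listed for jieba_s2t_001, because the frequent 着 followed by 名为 outscores the rare 着名 followed by 为. The result depends entirely on the invented frequencies, so it illustrates the mechanism rather than the behavior of the real dictionaries.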

frankslin and others added 6 commits January 17, 2026 16:41

Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

c6f4957

Implement (experimental) Jieba segmentation support

6aa9f71

Fix Jieba tests in Bazel and add more examples.

991639d

frankslin force-pushed the claude/explore-jieba-segmentation-XHvIj branch from b21537c to 3ceac6b January 18, 2026 01:30

Repository owner deleted a comment from chatgpt-codex-connector bot Jan 18, 2026

frankslin added 2 commits January 17, 2026 18:10

Fix C++ compiler compatibility

dad6363

Fix //python/tests:test_opencc

a44f08c

frankslin merged commit dc223a8 into master Jan 18, 2026
25 checks passed

frankslin deleted the claude/explore-jieba-segmentation-XHvIj branch January 18, 2026 04:50

frankslin merged 8 commits into master from claude/explore-jieba-segmentation-XHvIj


2 participants
