Merged
* jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
* hmm_model.utf8 (508KB): HMM model for unknown word recognition
* user.dict.utf8: User-defined custom dictionary
* README.md: Dictionary documentation and customization guide
Added unit tests and comparison tests following OpenCC testing patterns.
1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
- BasicSegmentation: Validates basic Chinese word segmentation
- ComplexPhrase: Tests multi-word phrases and proper nouns
- EmptyString, SingleCharacter: Edge case handling
- EnglishAndChinese: Mixed language support
- UnknownWords: HMM model's ability to recognize unknown words
2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
- Follows t2cngov test pattern with external JSON test cases
- Loads test definitions from test/testcases/jieba_comparison_testcases.json
- Compares mmseg vs Jieba segmentation outputs
- Displays: Input, Jieba segments, Expected segments, Conversion outputs
- Converter caching for performance optimization
3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
- 15 comprehensive test cases covering:
* Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
* Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba
Key test scenarios:
- jieba_s2t_001: 着名 ambiguity (着 + 名为 "be named" vs 著名 "famous")
"生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"
- jieba_s2t_002: Compound words (中学生, 中等身材)
"一个中学生,一个中等身材的人"
- jieba_t2s_001: Traditional 著名/為 conversion
"生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"
- Other cases: Proper nouns, modern terms, mixed content,
ambiguous structures, Taiwan-specific vocabulary,
long compounds, classical Chinese
4. Focused Individual Tests:
- AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
- TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion
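The converter caching mentioned for the comparison tests can be pictured as a lazily filled map from config name to a shared converter object. The following is a minimal sketch only: `FakeConverter` and `ConverterCache` are hypothetical names that do not come from this PR, and the real test presumably caches OpenCC converter objects instead of this stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for an expensive-to-build converter; not an OpenCC type.
struct FakeConverter {
  std::string config;
};

// Caches one converter per config name so repeated test cases reuse it.
class ConverterCache {
 public:
  using Factory =
      std::function<std::shared_ptr<FakeConverter>(const std::string&)>;

  explicit ConverterCache(Factory factory) : factory_(std::move(factory)) {
    assert(factory_ && "a factory is required");
  }

  // Returns the cached converter for `config`, building it on first use.
  std::shared_ptr<FakeConverter> Get(const std::string& config) {
    auto it = cache_.find(config);
    if (it != cache_.end()) {
      return it->second;
    }
    std::shared_ptr<FakeConverter> converter = factory_(config);
    cache_.emplace(config, converter);
    return converter;
  }

  // Number of distinct configs built so far.
  std::size_t Size() const { return cache_.size(); }

 private:
  Factory factory_;
  std::unordered_map<std::string, std::shared_ptr<FakeConverter>> cache_;
};
```

With this shape, a test that runs many cases against s2twp and s2twp_jieba would build each converter once and reuse it for every later case.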
Output Format:
=== Test: jieba_s2t_001 ===
Input: 生活着名为正敏的少女
Jieba segments: 生活/着/名为/正敏/的/少女
Expected segs: 生活/着/名为/正敏/的/少女
s2twp: 生活著名爲正敏的少女
s2twp_jieba: 生活著名為正敏的少女
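A report in the format above could be produced by a small formatter like the one below. This is an illustrative sketch, not the code in src/JiebaComparisonTest.cpp: the `ComparisonResult` struct, its field names, and the exact spacing of each line are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one comparison result; field names are assumptions.
struct ComparisonResult {
  std::string id;
  std::string input;
  std::vector<std::string> jieba_segments;
  std::vector<std::string> expected_segments;
  std::string mmseg_config;  // e.g. "s2twp"
  std::string mmseg_output;
  std::string jieba_config;  // e.g. "s2twp_jieba"
  std::string jieba_output;
};

// Joins segments with '/' as in the "Jieba segments:" line.
std::string JoinSegments(const std::vector<std::string>& segments) {
  std::string joined;
  for (std::size_t i = 0; i < segments.size(); ++i) {
    if (i > 0) joined += '/';
    joined += segments[i];
  }
  return joined;
}

// Renders one result as a block of labelled lines, one field per line.
std::string FormatReport(const ComparisonResult& r) {
  assert(!r.id.empty() && "a report needs a test id");
  std::ostringstream out;
  out << "=== Test: " << r.id << " ===\n"
      << "Input: " << r.input << "\n"
      << "Jieba segments: " << JoinSegments(r.jieba_segments) << "\n"
      << "Expected segs: " << JoinSegments(r.expected_segments) << "\n"
      << r.mmseg_config << ": " << r.mmseg_output << "\n"
      << r.jieba_config << ": " << r.jieba_output << "\n";
  return out.str();
}
```

Keeping the join and the formatting separate lets the same `JoinSegments` helper serve both the printed report and any equality check between actual and expected segments.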
Benefits:
- Visual comparison of segmentation algorithms
- Easy to add new test cases (just edit JSON)
- Documents expected behavior for ambiguous cases
- Validates that Jieba improves segmentation accuracy
- Test data can be reviewed independently from code
Build System Integration:
- Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
- Automatically run with 'make test' or 'ctest'
reorder
Added two detailed documentation files:
1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
- Comprehensive feasibility analysis for integrating Jieba segmentation
- Compares two implementation approaches:
* cppjieba (C++ native) - RECOMMENDED
* Python embedding via pybind11 - Not recommended
- Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
- Detailed implementation plan with code examples
- Performance, deployment, and maintenance comparison matrix
- 4-phase implementation roadmap
- Risk assessment and mitigation strategies
2. doc/JIEBA_USAGE.md
- Complete user guide for Jieba segmentation feature
- Compilation instructions with CMake
- Configuration file format and examples
- C++/CLI/Python API usage examples
- Custom user dictionary guide
- Performance considerations and benchmarks
- mmseg vs Jieba comparison table
- Troubleshooting guide
- Limitations and best practices
Key recommendations:
- Use cppjieba for production (performance, zero dependencies)
- Enable via -DENABLE_JIEBA=ON compile flag
- Experimental feature, opt-in only
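The dictionary-based part of the Jieba algorithm named in the feasibility document (word graph plus maximum-probability path) can be illustrated with a toy example. The sketch below is not cppjieba code: a hash map stands in for the Trie, the HMM step for unknown words is omitted, unknown single characters get an assumed fallback frequency of 1, and the function names and any frequencies passed in are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using Dict = std::unordered_map<std::string, double>;  // word -> frequency

// Splits a UTF-8 string into one std::string per code point
// (assumes the input is valid UTF-8).
std::vector<std::string> SplitUtf8(const std::string& text) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < text.size();) {
    const unsigned char lead = static_cast<unsigned char>(text[i]);
    std::size_t len = 1;
    if (lead >= 0xF0) len = 4;
    else if (lead >= 0xE0) len = 3;
    else if (lead >= 0xC0) len = 2;
    chars.push_back(text.substr(i, len));
    i += len;
  }
  return chars;
}

// Picks the segmentation with the highest sum of log word probabilities.
std::vector<std::string> SegmentByMaxProbability(const std::string& text,
                                                 const Dict& dict) {
  assert(!dict.empty());
  const std::vector<std::string> chars = SplitUtf8(text);
  const std::size_t n = chars.size();
  double total = 0.0;
  for (const auto& entry : dict) total += entry.second;

  // best[i]: best score for chars[i..n); next[i]: end of the first word.
  std::vector<double> best(n + 1, 0.0);
  std::vector<std::size_t> next(n + 1, n);
  for (std::size_t i = n; i-- > 0;) {  // right-to-left dynamic programming
    best[i] = -std::numeric_limits<double>::infinity();
    std::string word;
    for (std::size_t j = i; j < n; ++j) {  // edges i -> j+1 of the word graph
      word += chars[j];
      const auto it = dict.find(word);
      double freq = 0.0;
      if (it != dict.end()) freq = it->second;
      else if (j == i) freq = 1.0;  // assumed fallback for unknown characters
      else continue;                // not a dictionary word: no edge
      const double score = std::log(freq / total) + best[j + 1];
      if (score > best[i]) {
        best[i] = score;
        next[i] = j + 1;
      }
    }
  }

  // Follow the recorded word ends from the start of the text.
  std::vector<std::string> words;
  for (std::size_t i = 0; i < n; i = next[i]) {
    std::string word;
    for (std::size_t k = i; k < next[i]; ++k) word += chars[k];
    words.push_back(word);
  }
  return words;
}
```

With a toy dictionary in which 中学生 is frequent enough, the whole compound wins over 中/学生 and 中学/生, which is the kind of case jieba_s2t_002 exercises. The scan here is quadratic in the sentence length; a Trie avoids re-hashing every candidate substring.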
b21537c to
3ceac6b
frankslin added a commit that referenced this pull request Jan 21, 2026
* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)
----
* Check in a copy of Jieba dictionary in data/jieba_dict/ for OpenCC:
  * jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
  * hmm_model.utf8 (508KB): HMM model for unknown word recognition
  * user.dict.utf8: User-defined custom dictionary
  * README.md: Dictionary documentation and customization guide
----
* Implement (experimental) Jieba segmentation support
----
* Add comprehensive test suite for Jieba segmentation
reorder
* Fix Jieba tests in Bazel and add more examples.
* Add comprehensive Jieba segmentation documentation
----
* Fix C++ compiler compatibility
* Fix //python/tests:test_opencc
---------
Co-authored-by: Claude <noreply@anthropic.com>
frankslin added a commit that referenced this pull request Jan 24, 2026
frankslin added a commit that referenced this pull request Jan 25, 2026
frankslin added a commit that referenced this pull request Jan 28, 2026