Try replacing mmseg with the jieba segmentation library and verify the results #25

Merged: frankslin merged 8 commits into master from claude/explore-jieba-segmentation-XHvIj on Jan 18, 2026
@frankslin (Owner): No description provided.

frankslin and others added 6 commits January 17, 2026 16:41
* jieba.dict.utf8 (4.9MB): Main dictionary with word frequencies
* hmm_model.utf8 (508KB): HMM model for unknown word recognition
* user.dict.utf8: User-defined custom dictionary
* README.md: Dictionary documentation and customization guide
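
The main dictionary pairs each word with a frequency. As a rough illustration of how such entries could be read, the sketch below assumes each line holds a word, a frequency, and an optional part-of-speech tag separated by whitespace. That layout and the names `DictEntry` and `ParseDictLine` are assumptions made for this example; the README.md in data/jieba_dict/ is the reference for the real format.

```cpp
#include <sstream>
#include <string>

// Hypothetical in-memory form of one dictionary line.
struct DictEntry {
  std::string word;
  double freq = 0.0;
  std::string tag;  // optional part-of-speech tag; empty when absent
};

// Parses "word freq [tag]" (assumed layout). Returns false for blank lines
// or lines without a numeric frequency, leaving *entry untouched.
bool ParseDictLine(const std::string& line, DictEntry* entry) {
  std::istringstream in(line);
  DictEntry parsed;
  if (!(in >> parsed.word >> parsed.freq)) {
    return false;
  }
  in >> parsed.tag;  // stays empty if the line has no third field
  *entry = parsed;
  return true;
}
```

A line such as `中学生 1500 n` would parse into the word 中学生 with frequency 1500 and tag `n`; the numbers here are made up.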
Added unit tests and comparison tests following OpenCC testing patterns.

1. Basic Unit Tests (src/JiebaSegmentationTest.cpp):
   - BasicSegmentation: Validates basic Chinese word segmentation
   - ComplexPhrase: Tests multi-word phrases and proper nouns
   - EmptyString, SingleCharacter: Edge case handling
   - EnglishAndChinese: Mixed language support
   - UnknownWords: HMM model's ability to recognize unknown words

2. JSON-Driven Comparison Tests (src/JiebaComparisonTest.cpp):
   - Follows t2cngov test pattern with external JSON test cases
   - Loads test definitions from test/testcases/jieba_comparison_testcases.json
   - Compares mmseg vs Jieba segmentation outputs
   - Displays: Input, Jieba segments, Expected segments, Conversion outputs
   - Converter caching for performance optimization

3. Test Cases Definition (test/testcases/jieba_comparison_testcases.json):
   - 15 comprehensive test cases covering:
     * Simplified to Traditional (10 cases): s2twp vs s2twp_jieba
     * Traditional to Simplified (5 cases): tw2sp vs tw2sp_jieba

   Key test scenarios:
   - jieba_s2t_001: 着名 ambiguity (particle 着 + 名为 "named" vs 著名 "famous")
     "生活着名为正敏的少女" -> Expected: "生活/着/名为/正敏/的/少女"

   - jieba_s2t_002: Compound words (中学生, 中等身材)
     "一个中学生,一个中等身材的人"

   - jieba_t2s_001: Traditional 著名/為 conversion
     "生活著名為正敏的少女" -> Expected: "生活/著名/為/正敏/的/少女"

   - Other cases: Proper nouns, modern terms, mixed content,
     ambiguous structures, Taiwan-specific vocabulary,
     long compounds, classical Chinese

4. Focused Individual Tests:
   - AmbiguousCase_ZhaoMing: Detailed output for "着名" ambiguity
   - TraditionalToSimplified_ZhuMing: Detailed output for "著名" conversion
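
To see why the 着名 case above is hard for a longest-match segmenter, here is a minimal sketch of greedy forward maximum matching. It assumes the mmseg segmenter behaves like a plain longest-match scan, which this PR does not state; `Utf8Length`, `SegmentMaxMatch`, and the dictionary contents are hypothetical and are not OpenCC code.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Byte length of the UTF-8 code point that starts with `lead`
// (assumes the input is valid UTF-8).
std::size_t Utf8Length(unsigned char lead) {
  return lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
}

// Greedy forward maximum matching: at each position take the longest
// dictionary word that starts there, or a single character if none does.
std::vector<std::string> SegmentMaxMatch(const std::string& text,
                                         const std::set<std::string>& dict) {
  std::vector<std::string> segments;
  std::size_t pos = 0;
  while (pos < text.size()) {
    std::size_t best_len = Utf8Length(static_cast<unsigned char>(text[pos]));
    for (const std::string& word : dict) {
      // Linear scan keeps the sketch short; a real segmenter would use a trie.
      if (word.size() > best_len &&
          text.compare(pos, word.size(), word) == 0) {
        best_len = word.size();
      }
    }
    segments.push_back(text.substr(pos, best_len));
    pos += best_len;
  }
  return segments;
}
```

With a toy dictionary containing 生活, 着名, 名为, 正敏, and 少女, the input 生活着名为正敏的少女 comes out as 生活/着名/为/正敏/的/少女: once 着名 is taken greedily, 名为 can no longer be matched.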

Output Format:
  === Test: jieba_s2t_001 ===
  Input:          生活着名为正敏的少女
  Jieba segments: 生活/着/名为/正敏/的/少女
  Expected segs:  生活/着/名为/正敏/的/少女
  s2twp:          生活著名爲正敏的少女
  s2twp_jieba:    生活著名為正敏的少女
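
The segmentation lines of this report can be produced with a small formatter. The sketch below is only an illustration: `ComparisonCase` and its field names are hypothetical, since the JSON schema of the test cases is not shown here, and the conversion lines (s2twp, s2twp_jieba) are left out because they need a real converter.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Joins segments with '/' as in the "Jieba segments" and "Expected segs" lines.
std::string JoinSegments(const std::vector<std::string>& segments) {
  std::string joined;
  for (std::size_t i = 0; i < segments.size(); ++i) {
    if (i > 0) joined += '/';
    joined += segments[i];
  }
  return joined;
}

// Hypothetical record for one comparison case (field names are assumptions).
struct ComparisonCase {
  std::string id;
  std::string input;
  std::vector<std::string> expected_segments;
};

// Formats the header and segmentation lines of the report for one case.
std::string FormatReport(const ComparisonCase& test_case,
                         const std::vector<std::string>& jieba_segments) {
  std::ostringstream out;
  out << "=== Test: " << test_case.id << " ===\n"
      << "Input:          " << test_case.input << "\n"
      << "Jieba segments: " << JoinSegments(jieba_segments) << "\n"
      << "Expected segs:  " << JoinSegments(test_case.expected_segments)
      << "\n";
  return out.str();
}
```

Keeping the joined form identical for actual and expected segments means a mismatch is visible at a glance and can also be checked with a plain string comparison.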

Benefits:
- Visual comparison of segmentation algorithms
- Easy to add new test cases (just edit JSON)
- Documents expected behavior for ambiguous cases
- Validates that Jieba improves segmentation accuracy
- Test data can be reviewed independently from code

Build System Integration:
- Tests added to CMake UNITTESTS when ENABLE_JIEBA=ON
- Automatically run with 'make test' or 'ctest'

reorder
Added two detailed documentation files:

1. doc/JIEBA_SEGMENTATION_FEASIBILITY.md (559 lines)
   - Comprehensive feasibility analysis for integrating Jieba segmentation
   - Compares two implementation approaches:
     * cppjieba (C++ native) - RECOMMENDED
     * Python embedding via pybind11 - Not recommended
   - Technical analysis of Jieba's algorithm (Trie, DAG, HMM)
   - Detailed implementation plan with code examples
   - Performance, deployment, and maintenance comparison matrix
   - 4-phase implementation roadmap
   - Risk assessment and mitigation strategies

2. doc/JIEBA_USAGE.md
   - Complete user guide for Jieba segmentation feature
   - Compilation instructions with CMake
   - Configuration file format and examples
   - C++/CLI/Python API usage examples
   - Custom user dictionary guide
   - Performance considerations and benchmarks
   - mmseg vs Jieba comparison table
   - Troubleshooting guide
   - Limitations and best practices
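
The DAG step named in the feasibility analysis can be sketched as a dynamic program over character positions: every dictionary word found in the text is an edge, and the path with the highest total log-probability wins. The code below is a toy version under stated assumptions, not cppjieba's implementation: it has no trie and no HMM, the frequencies in `ToyDict` are invented, and 正敏 is placed in the dictionary to stand in for unknown-word recognition or a user-dictionary entry.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Splits UTF-8 text into one string per code point (assumes valid UTF-8).
std::vector<std::string> SplitUtf8(const std::string& text) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < text.size();) {
    const unsigned char lead = static_cast<unsigned char>(text[i]);
    const std::size_t len =
        lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    chars.push_back(text.substr(i, len));
    i += len;
  }
  return chars;
}

// Toy dictionary: word -> frequency (must be positive).
using Dict = std::unordered_map<std::string, double>;

// Example dictionary with invented frequencies.
Dict ToyDict() {
  return {{"生活", 1000.0}, {"着", 2000.0}, {"着名", 50.0}, {"名为", 800.0},
          {"为", 1500.0},   {"正敏", 10.0}, {"的", 5000.0}, {"少女", 600.0}};
}

// Walks positions right to left and keeps, for each start position, the
// outgoing word whose log-probability plus the best remaining score is highest.
std::vector<std::string> SegmentMaxProb(const std::string& text,
                                        const Dict& dict,
                                        std::size_t max_word_chars = 4) {
  const std::vector<std::string> chars = SplitUtf8(text);
  const std::size_t n = chars.size();

  double total = 1.0;  // starts at 1 so unknown characters get a nonzero share
  for (const auto& entry : dict) total += entry.second;
  const double log_total = std::log(total);

  std::vector<double> best(n + 1, 0.0);     // best[i]: score of chars[i..n)
  std::vector<std::size_t> next(n + 1, n);  // next[i]: end of word chosen at i
  for (std::size_t i = n; i-- > 0;) {
    best[i] = -std::numeric_limits<double>::infinity();
    std::string word;
    for (std::size_t j = i; j < n && j - i < max_word_chars; ++j) {
      word += chars[j];
      const auto it = dict.find(word);
      double freq;
      if (it != dict.end()) {
        freq = it->second;
      } else if (j == i) {
        freq = 1.0;  // unknown single character: allowed, but unlikely
      } else {
        continue;    // longer strings must be dictionary words
      }
      const double score = std::log(freq) - log_total + best[j + 1];
      if (score > best[i]) {
        best[i] = score;
        next[i] = j + 1;
      }
    }
  }

  std::vector<std::string> segments;
  for (std::size_t i = 0; i < n; i = next[i]) {
    std::string word;
    for (std::size_t k = i; k < next[i]; ++k) word += chars[k];
    segments.push_back(word);
  }
  return segments;
}
```

Calling `SegmentMaxProb("生活着名为正敏的少女", ToyDict())` yields 生活/着/名为/正敏/的/少女: although 着名 is in the dictionary, the path through 着 and 名为 scores higher because both are far more frequent in this toy data.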

Key recommendations:
- Use cppjieba for production (performance, zero dependencies)
- Enable via -DENABLE_JIEBA=ON compile flag
- Experimental feature, opt-in only
frankslin force-pushed the claude/explore-jieba-segmentation-XHvIj branch from b21537c to 3ceac6b on January 18, 2026 01:30
Repository owner deleted a comment from chatgpt-codex-connector bot on Jan 18, 2026
frankslin merged commit dc223a8 into master on Jan 18, 2026
25 checks passed
frankslin deleted the claude/explore-jieba-segmentation-XHvIj branch on January 18, 2026 04:50
frankslin added a commit that referenced this pull request Jan 21, 2026
* Check in a complete copy of libcppjieba from https://github.com/yanyiwu/cppjieba (MIT License)

----
* Check in a copy of Jieba dictionary in data/jieba_dict/ for OpenCC

----
* Implement (experimental) Jieba segmentation support

----
* Add comprehensive test suite for Jieba segmentation

* Fix Jieba tests in Bazel and add more examples.

* Add comprehensive Jieba segmentation documentation

----
* Fix C++ compiler compatibility
* Fix //python/tests:test_opencc

---------

Co-authored-by: Claude <noreply@anthropic.com>
frankslin added a commit that referenced this pull request Jan 24, 2026
frankslin added a commit that referenced this pull request Jan 25, 2026
frankslin added a commit that referenced this pull request Jan 28, 2026
2 participants