From 44694525880977c103a2fb949041b4835d780109 Mon Sep 17 00:00:00 2001 From: Frank Lin Date: Wed, 31 Dec 2025 16:23:59 -0800 Subject: [PATCH 1/2] Add AGENTS.md and CLAUDE.md for AI agents. Add opencc-fix-translation-workflow skill --- .../skills/opencc-fix-translation-workflow.md | 85 +++++++++++++++++++ AGENTS.md | 84 ++++++++++++++++++ CLAUDE.md | 1 + 3 files changed, 170 insertions(+) create mode 100644 .claude/skills/opencc-fix-translation-workflow.md create mode 100644 AGENTS.md create mode 100644 CLAUDE.md diff --git a/.claude/skills/opencc-fix-translation-workflow.md b/.claude/skills/opencc-fix-translation-workflow.md new file mode 100644 index 000000000..5a5368271 --- /dev/null +++ b/.claude/skills/opencc-fix-translation-workflow.md @@ -0,0 +1,85 @@ +--- +name: opencc-fix-translation-workflow +description: OpenCC translation fix and complete release workflow +tags: [opencc, workflow, debugging] +--- + +# OpenCC Translation Fix Standard Operating Procedure + +This skill describes the complete lifecycle for fixing OpenCC conversion errors (such as "方程式" becoming "方程序"), including core dictionary correction, testing, and verification. + +## 1. Problem Diagnosis + +When a conversion error is discovered (e.g., A is incorrectly converted to B): + +1. **Search for existing mappings**: + Use `grep` to search for the error source in `data/dictionary`. + ```bash + grep "error_term" data/dictionary/*.txt + ``` +2. **Identify the interference source**: + The cause is usually Maximum Forward Matching (MaxMatch): either a "longer word" that contains the target word is matched, or a "shorter word" mapping inside the target word is applied, producing the incorrect result. + *Example*: "方程式" is incorrectly converted to "方程序" because the mapping "程式" → "程序" exists, and "方程式" is not defined as a proper noun, causing it to be segmented as "方" + "程式". + +## 2. Fix Solution (Explicit Mapping) + +If the error originates from segmentation logic (as in the example above), the most robust fix is to **add an Explicit Mapping**. + +1. 
**Select the correct dictionary file**: + - For s2twp and tw2sp: `TWPhrases.txt` + +2. **Add the mapping**: + Map the term to itself to prevent incorrect segmentation or conversion. + ```text + 方程式 方程式 + ``` + *Note*: Keep the dictionary's existing sort order (if applicable). + +## 3. Test-Driven (Test Cases) + +Before the modification takes effect (that is, before rebuilding the dictionaries), add test cases that confirm the fix and prevent regressions. + +1. **Core tests**: + Edit `test/testcases/testcases.json`. + ```json + { + "id": "case_XXX", + "input": "方程式", + "expected": { + "tw2sp": "方程式" + } + } + ``` + +## 4. Build and Verify + +OpenCC uses the CMake/Make system to build dictionaries. + +1. **Rebuild dictionaries**: + ```bash + cd build/dbg # or your build directory + make Dictionaries + ``` + This step regenerates the `.ocd2` binary dictionaries. + +2. **Manual verification**: + Test directly using the generated `opencc` tool. + ```bash + echo "方程式" | ./src/tools/opencc -c root/share/opencc/tw2sp.json + # Expected output: 方程式 + ``` + +3. **Automated testing** (optional but recommended): + Run `make test` or `ctest`. + + +## 5. Commit + +The changes may be committed separately or together, but they must include: +- Dictionary text file changes (`.txt`) +- Core test changes (`test/testcases/testcases.json`) + +```bash +git add data/dictionary/TWPhrases.txt test/testcases/testcases.json +git commit -m "Fix(Dictionary): correct conversion for 'XYZ'" +``` diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..12c688b7b --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,84 @@ +# OpenCC Project Overview + +This document summarizes the Open Chinese Convert (OpenCC) project to help readers quickly become familiar with its code structure, data organization, and accompanying tools. 
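Reviewer aid: the diagnosis in the skill file above ("方程式" segmented as "方" + "程式") and its explicit-mapping fix can be reproduced with a toy greedy matcher. This is only a sketch under stated assumptions: `max_match_convert` and the two small tables are hypothetical, and the function folds segmentation and conversion into one pass, which OpenCC's real pipeline keeps as separate stages.

```python
def max_match_convert(text, mapping):
    """Greedy longest-prefix conversion over a toy dictionary (illustration only)."""
    # Longest key length bounds how far ahead each lookup needs to try.
    longest = max(map(len, mapping), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in mapping:
                out.append(mapping[piece])
                i += size
                break
        else:
            # No entry starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical tables: only the shorter mapping vs. the explicit self-mapping added.
without_fix = {"程式": "程序"}
with_fix = {"程式": "程序", "方程式": "方程式"}

print(max_match_convert("方程式", without_fix))  # 方程序
print(max_match_convert("方程式", with_fix))     # 方程式
```

With only the shorter entry present, nothing matches at "方", so the matcher advances one character and then rewrites "程式". Once the three-character entry exists it is matched first, which is why mapping the term to itself is enough.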
+ +## Project Overview +- OpenCC is an open-source Chinese Simplified-Traditional and regional variant conversion tool, supporting Simplified↔Traditional, Hong Kong/Macau/Taiwan regional differences, Japanese Shinjitai/Kyujitai character forms, and other conversion schemes. +- The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension. +- Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support. + +## Core Modules and Flow +1. **Configuration Loading (`src/Config.cpp`)** + - Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains. + - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths. + - Creates `Converter` objects that hold segmenters and conversion chains. + +2. **Segmentation (`src/MaxMatchSegmentation.cpp`)** + - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching. + - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length. + +3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)** + - The conversion chain is an ordered list of `Conversion` objects, each node relies on a dictionary to replace segments with target values through longest prefix matching. + - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition. + +4. **Dictionary System** + - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal. 
+ - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection. + - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats. + +5. **API Encapsulation** + - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion. + - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse. + - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling. + +## Data and Configuration +- Dictionaries are maintained in `data/dictionary/*.txt`, covering phrases, characters, regional differences, Japanese new characters, and other topic files; converted to `.ocd2` during build for acceleration. +- Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods. +- `data/scheme` and `data/scripts` provide dictionary compilation scripts and specification validation tools. + +### Dictionary Binary Formats: `.ocd` and `.ocd2` +- `.ocd` (legacy format) has `OPENCCDARTS1` as the file header, with the main body being serialized Darts double-array trie data, combined with `BinaryDict` structure to store key-value offsets and concatenation buffers. Loading process is detailed in `src/DartsDict.cpp` and `src/BinaryDict.cpp`. Commonly used in environments requiring `ENABLE_DARTS` for compatibility. +- `.ocd2` (default format) has `OPENCC_MARISA_0.2.5` as the file header, followed by `marisa::Trie` data, then uses the `SerializedValues` module to store all candidate value lists. 
See `src/MarisaDict.cpp`, `src/SerializedValues.cpp` for details. This format is smaller and loads faster (e.g., `NEWS.md` records `STPhrases` reduced from 4.3MB to 924KB). +- The command-line tool `opencc_dict` supports `text ↔ ocd2` (and optionally `ocd`) conversion. When adding or adjusting dictionaries, first edit `.txt`, then run the tool to generate the target format. + +## Development and Testing +- The top-level build system supports CMake, Bazel, Node.js `binding.gyp`, Python `pyproject.toml`, with cross-platform CI integration. +- `src/*Test.cpp`, `test/` directories contain Google Test-style unit tests covering dictionary matching, conversion chains, segmentation, and other key logic. +- Tools `opencc_dict`, `opencc_phrase_extract` (`src/tools/`) help developers convert dictionary formats and extract phrases. + +## Ecosystem Bindings +- Python module is located in `python/`, providing the `OpenCC` class through the C API. +- Node.js extension is in the `node/` directory, using N-API/Node-API to call the core library. +- README lists third-party Swift, Java, Go, WebAssembly and other porting projects, showcasing ecosystem breadth. + +## Common Customization Steps +1. Edit or add dictionary entries in `data/dictionary/*.txt`. +2. Use `opencc_dict` to convert to `.ocd2`. +3. Copy/modify configuration JSON in `data/config` and specify new dictionary files. +4. Load custom configuration through `SimpleConverter`, command-line tools, or language bindings to verify results. + +> For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations. + +## Browser and Third-Party Implementation Notes +- Official pure frontend execution is not directly supported; community solutions (such as `opencc-js`, `opencc-wasm`) can be referenced. 
+- For self-compiled WebAssembly, use Emscripten to write `.ocd2` to the virtual file system, call conversion in Web Worker to avoid blocking UI, and use gzip/brotli with Service Worker caching to reduce initial load cost. +- For pure JavaScript table lookup, pre-process dictionaries into JSON/Trie structures and implement longest prefix matching manually; pay attention to resource size control and avoid unnecessary string copies when converting long texts. + +### Common Deviations in Third-Party Implementations (Speculation) +- **Missing segmentation and conversion chain order**: If `group` configuration or dictionary priority is not restored, compound words may be split apart or overwritten by single characters. +- **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and multi-character word results. +- **Improper UTF-8 handling**: Overlooking multi-byte characters or surrogate pair handling can easily cause offset or truncation issues. +- **Incomplete dictionaries/configuration**: Missing segmentation dictionaries, regional differences and other `.ocd2` files will result in missing words in output. +- **Path and loading process differences**: If OpenCC's path search and configuration parsing details are not followed, the actual loaded resources will differ from official ones, naturally leading to different results. + +## Further Reading + +### Technical Documents +- **[Algorithm and Theoretical Limitations Analysis](doc/ALGORITHM_AND_LIMITATIONS.md)** - In-depth exploration of OpenCC's core algorithm (Maximum Forward Matching segmentation), conversion chain mechanism, dictionary system, and theoretical limitations faced in Chinese Simplified-Traditional conversion (one-to-many ambiguity, lack of context understanding, maintenance burden, etc.). 
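Reviewer aid: the third-party notes above suggest pre-processing dictionaries into a trie and implementing longest prefix matching by hand. A minimal sketch of that idea follows, written in Python rather than JavaScript for brevity. The names `build_trie` and `longest_prefix` and the nested-dict node layout are assumptions made for illustration, not OpenCC or `opencc-js` APIs.

```python
def build_trie(entries):
    """Build a nested-dict trie; the empty-string key marks the end of an entry."""
    root = {}
    for key, value in entries.items():
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = value
    return root


def longest_prefix(trie, text, start=0):
    """Return (matched_length, value) for the longest entry at `start`, or None."""
    node = trie
    best = None
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:
            break  # no entry continues with this character
        if "" in node:
            # Remember the match and keep walking in case a longer one exists.
            best = (end + 1 - start, node[""])
    return best


trie = build_trie({"程式": "程序", "方程式": "方程式"})
print(longest_prefix(trie, "方程式組"))
print(longest_prefix(trie, "程式碼"))
```

Python indexes strings by code point, so this sketch sidesteps the UTF-8 offset problems listed under the deviations above; a byte-oriented port would need to advance by whole characters.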
+ +### Contribution Guide +- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures. + +### Project Documents +- **[src/README.md](src/README.md)** - Detailed technical documentation for core modules. +- **[README.md](README.md)** - Project overview, installation and usage guide. diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..43c994c2d --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +@AGENTS.md From db25b4ccbe7893f967e15573a45a925c63274574 Mon Sep 17 00:00:00 2001 From: Frank Lin Date: Sun, 25 Jan 2026 21:55:53 -0800 Subject: [PATCH 2/2] Move 'Code modules and flow' section into src/README.md and remove a few unrelated sections --- AGENTS.md | 32 -------------------------------- src/README.md | 24 ++++++++++++++++++++++++ 2 files changed, 24 insertions(+), 32 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 12c688b7b..04bd2d048 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -7,30 +7,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to - The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension. - Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support. -## Core Modules and Flow -1. **Configuration Loading (`src/Config.cpp`)** - - Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains. - - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths. - - Creates `Converter` objects that hold segmenters and conversion chains. - -2. 
**Segmentation (`src/MaxMatchSegmentation.cpp`)** - - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching. - - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length. - -3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)** - - The conversion chain is an ordered list of `Conversion` objects, each node relies on a dictionary to replace segments with target values through longest prefix matching. - - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition. - -4. **Dictionary System** - - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal. - - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection. - - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats. - -5. **API Encapsulation** - - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion. - - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse. - - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling. - ## Data and Configuration - Dictionaries are maintained in `data/dictionary/*.txt`, covering phrases, characters, regional differences, Japanese new characters, and other topic files; converted to `.ocd2` during build for acceleration. 
- Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods. @@ -59,11 +35,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to > For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations. -## Browser and Third-Party Implementation Notes -- Official pure frontend execution is not directly supported; community solutions (such as `opencc-js`, `opencc-wasm`) can be referenced. -- For self-compiled WebAssembly, use Emscripten to write `.ocd2` to the virtual file system, call conversion in Web Worker to avoid blocking UI, and use gzip/brotli with Service Worker caching to reduce initial load cost. -- For pure JavaScript table lookup, pre-process dictionaries into JSON/Trie structures and implement longest prefix matching manually; pay attention to resource size control and avoid unnecessary string copies when converting long texts. - ### Common Deviations in Third-Party Implementations (Speculation) - **Missing segmentation and conversion chain order**: If `group` configuration or dictionary priority is not restored, compound words may be split apart or overwritten by single characters. - **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and multi-character word results. 
@@ -73,9 +44,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to ## Further Reading -### Technical Documents -- **[Algorithm and Theoretical Limitations Analysis](doc/ALGORITHM_AND_LIMITATIONS.md)** - In-depth exploration of OpenCC's core algorithm (Maximum Forward Matching segmentation), conversion chain mechanism, dictionary system, and theoretical limitations faced in Chinese Simplified-Traditional conversion (one-to-many ambiguity, lack of context understanding, maintenance burden, etc.). - ### Contribution Guide - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures. diff --git a/src/README.md b/src/README.md index 1c98af273..695aab021 100644 --- a/src/README.md +++ b/src/README.md @@ -1,5 +1,29 @@ # Source code +## Code Modules and Flow +1. **Configuration Loading (`src/Config.cpp`)** + - Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains. + - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths. + - Creates `Converter` objects that hold segmenters and conversion chains. + +2. **Segmentation (`src/MaxMatchSegmentation.cpp`)** + - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching. + - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length. + +3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)** + - The conversion chain is an ordered list of `Conversion` objects, each node relies on a dictionary to replace segments with target values through longest prefix matching. + - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition. + +4. 
**Dictionary System** + - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal. + - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection. + - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats. + +5. **API Encapsulation** + - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion. + - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse. + - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling. + ## Dictionary ### Interface
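Reviewer aid: the Dictionary System bullets moved into this README describe a `Dict` interface (prefix matching, all-prefix matching, traversal), a tab-delimited `TextDict`, and a sequential `DictGroup`. The Python sketch below mirrors that shape under stated assumptions: the method names are hypothetical, each key carries a single value, and "first dictionary with a match wins" is an assumed reading of "sequential collection", not a confirmed description of the C++ classes.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Optional, Tuple

Entry = Tuple[str, str]  # (key, value); hypothetical stand-in for a dictionary entry


class Dict(ABC):
    @abstractmethod
    def match_prefix(self, text: str) -> Optional[Entry]:
        """Return the longest entry whose key is a prefix of `text`, or None."""

    @abstractmethod
    def entries(self) -> Iterator[Entry]:
        """Traverse every entry."""

    def match_all_prefixes(self, text: str) -> List[Entry]:
        """Return every entry whose key is a prefix of `text`, longest first."""
        found = [(k, v) for k, v in self.entries() if text.startswith(k)]
        return sorted(found, key=lambda e: len(e[0]), reverse=True)


class TextDict(Dict):
    def __init__(self, lines: List[str]):
        # Each line is "key<TAB>value", like the tab-delimited text format.
        self._table = dict(line.split("\t", 1) for line in lines)

    def match_prefix(self, text):
        # Naive scan from the longest candidate down; fine for a sketch.
        for size in range(len(text), 0, -1):
            if text[:size] in self._table:
                return text[:size], self._table[text[:size]]
        return None

    def entries(self):
        return iter(self._table.items())


class DictGroup(Dict):
    def __init__(self, dicts: List[Dict]):
        self._dicts = dicts

    def match_prefix(self, text):
        # Assumption: dictionaries are consulted in order and the first hit wins.
        for d in self._dicts:
            hit = d.match_prefix(text)
            if hit is not None:
                return hit
        return None

    def entries(self):
        for d in self._dicts:
            yield from d.entries()


group = DictGroup([TextDict(["方程式\t方程式"]), TextDict(["程式\t程序"])])
print(group.match_prefix("方程式組"))
print(group.match_prefix("程式碼"))
```

Putting the phrase dictionary first in the group is what gives phrases priority over shorter entries in this sketch.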