From 44694525880977c103a2fb949041b4835d780109 Mon Sep 17 00:00:00 2001
From: Frank Lin <github@linshuang.info>
Date: Wed, 31 Dec 2025 16:23:59 -0800
Subject: [PATCH 1/2] Add AGENTS.md and CLAUDE.md for AI agents.

Add opencc-fix-translation-workflow skill
---
 .../skills/opencc-fix-translation-workflow.md | 85 +++++++++++++++++++
 AGENTS.md                                     | 84 ++++++++++++++++++
 CLAUDE.md                                     |  1 +
 3 files changed, 170 insertions(+)
 create mode 100644 .claude/skills/opencc-fix-translation-workflow.md
 create mode 100644 AGENTS.md
 create mode 100644 CLAUDE.md
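
Note for reviewers: the segmentation interference described in section 1 of the skill file can be reproduced with the toy model below. It is a minimal sketch of greedy forward maximum matching in Python, not OpenCC's implementation. The `convert` helper and the two in-memory tables are hypothetical stand-ins for the real dictionaries and conversion chain.

```python
from typing import Dict, List


def convert(text: str, table: Dict[str, str]) -> str:
    """Toy forward maximum matching: at each position, replace the longest key found."""
    longest = max((len(key) for key in table), default=1)
    out: List[str] = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in table:
                out.append(table[piece])
                pos += size
                break
        else:
            # No entry starts here: copy one character through unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


# Hypothetical tables for illustration only.
before_fix = {"程式": "程序"}
after_fix = {**before_fix, "方程式": "方程式"}  # explicit identity mapping

print(convert("方程式", before_fix))  # 方程序
print(convert("方程式", after_fix))   # 方程式
```

With only "程式" → "程序" in the table, the scan finds no entry starting at "方", copies it through, and then converts "程式". Adding the identity entry makes the three-character word the longest match, which is why section 2 of the skill file recommends an explicit mapping.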

diff --git a/.claude/skills/opencc-fix-translation-workflow.md b/.claude/skills/opencc-fix-translation-workflow.md
new file mode 100644
index 000000000..5a5368271
--- /dev/null
+++ b/.claude/skills/opencc-fix-translation-workflow.md
@@ -0,0 +1,85 @@
+---
+name: opencc-fix-translation-workflow
+description: Workflow for diagnosing, fixing, testing, and committing OpenCC conversion errors
+tags: [opencc, workflow, debugging]
+---
+
+# OpenCC Translation Fix Standard Operating Procedure
+
+This skill describes the complete lifecycle for fixing OpenCC conversion errors (such as "方程式" becoming "方程序"), including core dictionary correction, testing, and verification.
+
+## 1. Problem Diagnosis
+
+When a conversion error is discovered (e.g., A is incorrectly converted to B):
+
+1.  **Search for existing mappings**:
+    Use `grep` to search for the error source in `data/dictionary`.
+    ```bash
+    grep "error_term" data/dictionary/*.txt
+    ```
+2.  **Identify the interference source**:
+    The cause is usually Maximum Forward Matching (MaxMatch) segmentation: either a longer dictionary entry that contains the target word is matched first, or a shorter entry inside the target word is matched and converted on its own.
+    *Example*: "方程式" is incorrectly converted to "方程序" because the mapping "程式" → "程序" exists and "方程式" has no dictionary entry of its own, so it is segmented as "方" + "程式".
+
+## 2. Fix Solution (Explicit Mapping)
+
+If the error originates from segmentation logic (as in the example above), the most robust fix is to **add an Explicit Mapping**.
+
+1.  **Select the correct dictionary file**:
+    - For s2twp and tw2sp: `TWPhrases.txt`
+
+2.  **Add the mapping**:
+    Map the word to itself so that it is matched as a whole and left unchanged.
+    ```text
+    方程式	方程式
+    ```
+    *Note*: Keep the dictionary file's existing sort order (if applicable).
+
+## 3. Test-Driven (Test Cases)
+
+Before rebuilding the dictionaries, add test cases that verify the fix and guard against regressions.
+
+1.  **Core tests**:
+    Edit `test/testcases/testcases.json`.
+    ```json
+    {
+      "id": "case_XXX",
+      "input": "方程式",
+      "expected": {
+        "tw2sp": "方程式"
+      }
+    }
+    ```
+
+## 4. Build and Verify
+
+OpenCC uses the CMake/Make system to build dictionaries.
+
+1.  **Rebuild dictionaries**:
+    ```bash
+    cd build/dbg  # or your build directory
+    make Dictionaries
+    ```
+    This step regenerates the `.ocd2` binary dictionaries.
+
+2.  **Manual verification**:
+    Test directly using the generated `opencc` tool.
+    ```bash
+    echo "方程式" | ./src/tools/opencc -c root/share/opencc/tw2sp.json
+    # Expected output: 方程式
+    ```
+
+3.  **Automated testing** (optional but recommended):
+    Run `make test` or `ctest`.
+
+
+## 5. Commit
+
+The dictionary and test changes may go in a single commit or in separate commits, but together they must include:
+- Dictionary text file changes (`.txt`)
+- Core test changes (`test/testcases/testcases.json`)
+
+```bash
+git add data/dictionary/TWPhrases.txt test/testcases/testcases.json
+git commit -m "Fix(Dictionary): correct conversion for 'XYZ'"
+```
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 000000000..12c688b7b
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,84 @@
+# OpenCC Project Overview
+
+This document compiles the Open Chinese Convert (OpenCC) project information to help readers quickly become familiar with the code structure, data organization, and accompanying tools.
+
+## Project Overview
+- OpenCC is an open-source Chinese Simplified-Traditional and regional variant conversion tool, supporting Simplified↔Traditional, Hong Kong/Macau/Taiwan regional differences, Japanese Shinjitai/Kyujitai character forms, and other conversion schemes.
+- The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension.
+- Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support.
+
+## Core Modules and Flow
+1. **Configuration Loading (`src/Config.cpp`)**
+   - Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains.
+   - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths.
+   - Creates `Converter` objects that hold segmenters and conversion chains.
+
+2. **Segmentation (`src/MaxMatchSegmentation.cpp`)**
+   - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching.
+   - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length.
+
+3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)**
+   - The conversion chain is an ordered list of `Conversion` objects, each node relies on a dictionary to replace segments with target values through longest prefix matching.
+   - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition.
+
+4. **Dictionary System**
+   - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal.
+   - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection.
+   - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats.
+
+5. **API Encapsulation**
+   - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion.
+   - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse.
+   - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling.
+
+## Data and Configuration
+- Dictionaries are maintained in `data/dictionary/*.txt`, covering phrases, characters, regional differences, Japanese new characters, and other topic files; converted to `.ocd2` during build for acceleration.
+- Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods.
+- `data/scheme` and `data/scripts` provide dictionary compilation scripts and specification validation tools.
+
+### Dictionary Binary Formats: `.ocd` and `.ocd2`
+- `.ocd` (legacy format) has `OPENCCDARTS1` as the file header; the main body is serialized Darts double-array trie data, combined with the `BinaryDict` structure that stores key-value offsets and concatenation buffers. The loading process is detailed in `src/DartsDict.cpp` and `src/BinaryDict.cpp`. It is commonly used in environments that require `ENABLE_DARTS` for compatibility.
+- `.ocd2` (default format) has `OPENCC_MARISA_0.2.5` as the file header, followed by `marisa::Trie` data, then uses the `SerializedValues` module to store all candidate value lists. See `src/MarisaDict.cpp`, `src/SerializedValues.cpp` for details. This format is smaller and loads faster (e.g., `NEWS.md` records `STPhrases` reduced from 4.3MB to 924KB).
+- The command-line tool `opencc_dict` supports `text ↔ ocd2` (and optionally `ocd`) conversion. When adding or adjusting dictionaries, first edit `.txt`, then run the tool to generate the target format.
+
+## Development and Testing
+- The top-level build system supports CMake, Bazel, Node.js `binding.gyp`, Python `pyproject.toml`, with cross-platform CI integration.
+- `src/*Test.cpp` and the `test/` directory contain Google Test-style unit tests covering dictionary matching, conversion chains, segmentation, and other key logic.
+- Tools `opencc_dict`, `opencc_phrase_extract` (`src/tools/`) help developers convert dictionary formats and extract phrases.
+
+## Ecosystem Bindings
+- Python module is located in `python/`, providing the `OpenCC` class through the C API.
+- Node.js extension is in the `node/` directory, using N-API/Node-API to call the core library.
+- README lists third-party Swift, Java, Go, WebAssembly and other porting projects, showcasing ecosystem breadth.
+
+## Common Customization Steps
+1. Edit or add dictionary entries in `data/dictionary/*.txt`.
+2. Use `opencc_dict` to convert to `.ocd2`.
+3. Copy/modify configuration JSON in `data/config` and specify new dictionary files.
+4. Load custom configuration through `SimpleConverter`, command-line tools, or language bindings to verify results.
+
+> For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations.
+
+## Browser and Third-Party Implementation Notes
+- Official pure frontend execution is not directly supported; community solutions (such as `opencc-js`, `opencc-wasm`) can be referenced.
+- For self-compiled WebAssembly, use Emscripten to write `.ocd2` to the virtual file system, call conversion in Web Worker to avoid blocking UI, and use gzip/brotli with Service Worker caching to reduce initial load cost.
+- For pure JavaScript table lookup, pre-process dictionaries into JSON/Trie structures and implement longest prefix matching manually; pay attention to resource size control and avoid unnecessary string copies when converting long texts.
+
+### Common Deviations in Third-Party Implementations (Speculation)
+- **Missing segmentation and conversion chain order**: If `group` configuration or dictionary priority is not restored, compound words may be split apart or overwritten by single characters.
+- **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and multi-character word results.
+- **Improper UTF-8 handling**: Overlooking multi-byte characters or surrogate pair handling can easily cause offset or truncation issues.
+- **Incomplete dictionaries/configuration**: Missing segmentation dictionaries, regional differences and other `.ocd2` files will result in missing words in output.
+- **Path and loading process differences**: If OpenCC's path search and configuration parsing details are not followed, the actual loaded resources will differ from official ones, naturally leading to different results.
+
+## Further Reading
+
+### Technical Documents
+- **[Algorithm and Theoretical Limitations Analysis](doc/ALGORITHM_AND_LIMITATIONS.md)** - In-depth exploration of OpenCC's core algorithm (Maximum Forward Matching segmentation), conversion chain mechanism, dictionary system, and theoretical limitations faced in Chinese Simplified-Traditional conversion (one-to-many ambiguity, lack of context understanding, maintenance burden, etc.).
+
+### Contribution Guide
+- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures.
+
+### Project Documents
+- **[src/README.md](src/README.md)** - Detailed technical documentation for core modules.
+- **[README.md](README.md)** - Project overview, installation and usage guide.
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..43c994c2d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+@AGENTS.md

From db25b4ccbe7893f967e15573a45a925c63274574 Mon Sep 17 00:00:00 2001
From: Frank Lin <github@linshuang.info>
Date: Sun, 25 Jan 2026 21:55:53 -0800
Subject: [PATCH 2/2] Move 'Code modules and flow' section into src/README.md
 and remove a few unrelated sections

---
 AGENTS.md     | 32 --------------------------------
 src/README.md | 24 ++++++++++++++++++++++++
 2 files changed, 24 insertions(+), 32 deletions(-)
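
Note for reviewers: the "Conversion Chain" item moved into `src/README.md` describes an ordered list of dictionary-backed stages. The Python sketch below is a toy model of that idea under simplifying assumptions: it skips the separate segmentation step, and `apply_stage`, `run_chain`, and both tables are hypothetical names and entries chosen for illustration, not OpenCC code or data.

```python
from typing import Dict, List


def apply_stage(text: str, table: Dict[str, str]) -> str:
    """Replace the longest matching key at each position; copy unmatched characters."""
    longest = max((len(key) for key in table), default=1)
    out: List[str] = []
    pos = 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in table:
                out.append(table[piece])
                pos += size
                break
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)


def run_chain(text: str, chain: List[Dict[str, str]]) -> str:
    """Feed the output of each stage into the next, in list order."""
    for table in chain:
        text = apply_stage(text, table)
    return text


# Toy tables (illustrative entries, not copied from data/dictionary).
to_traditional = {"标": "標"}
to_taiwan_phrases = {"鼠標": "滑鼠"}

print(run_chain("鼠标", [to_traditional, to_taiwan_phrases]))  # 滑鼠
print(run_chain("鼠标", [to_taiwan_phrases, to_traditional]))  # 鼠標
```

Swapping the two stages changes the result because the phrase table is keyed on Traditional text. This is the kind of ordering dependence the moved section refers to as multi-stage composition.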

diff --git a/AGENTS.md b/AGENTS.md
index 12c688b7b..04bd2d048 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,30 +7,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to
 - The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension.
 - Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support.
 
-## Core Modules and Flow
-1. **Configuration Loading (`src/Config.cpp`)**
-   - Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains.
-   - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths.
-   - Creates `Converter` objects that hold segmenters and conversion chains.
-
-2. **Segmentation (`src/MaxMatchSegmentation.cpp`)**
-   - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching.
-   - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length.
-
-3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)**
-   - The conversion chain is an ordered list of `Conversion` objects, each node relies on a dictionary to replace segments with target values through longest prefix matching.
-   - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition.
-
-4. **Dictionary System**
-   - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal.
-   - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection.
-   - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats.
-
-5. **API Encapsulation**
-   - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion.
-   - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse.
-   - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling.
-
 ## Data and Configuration
 - Dictionaries are maintained in `data/dictionary/*.txt`, covering phrases, characters, regional differences, Japanese new characters, and other topic files; converted to `.ocd2` during build for acceleration.
 - Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods.
@@ -59,11 +35,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to
 
 > For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations.
 
-## Browser and Third-Party Implementation Notes
-- Official pure frontend execution is not directly supported; community solutions (such as `opencc-js`, `opencc-wasm`) can be referenced.
-- For self-compiled WebAssembly, use Emscripten to write `.ocd2` to the virtual file system, call conversion in Web Worker to avoid blocking UI, and use gzip/brotli with Service Worker caching to reduce initial load cost.
-- For pure JavaScript table lookup, pre-process dictionaries into JSON/Trie structures and implement longest prefix matching manually; pay attention to resource size control and avoid unnecessary string copies when converting long texts.
-
 ### Common Deviations in Third-Party Implementations (Speculation)
 - **Missing segmentation and conversion chain order**: If `group` configuration or dictionary priority is not restored, compound words may be split apart or overwritten by single characters.
 - **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and multi-character word results.
@@ -73,9 +44,6 @@ This document compiles the Open Chinese Convert (OpenCC) project information to
 
 ## Further Reading
 
-### Technical Documents
-- **[Algorithm and Theoretical Limitations Analysis](doc/ALGORITHM_AND_LIMITATIONS.md)** - In-depth exploration of OpenCC's core algorithm (Maximum Forward Matching segmentation), conversion chain mechanism, dictionary system, and theoretical limitations faced in Chinese Simplified-Traditional conversion (one-to-many ambiguity, lack of context understanding, maintenance burden, etc.).
-
 ### Contribution Guide
 - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures.
 
diff --git a/src/README.md b/src/README.md
index 1c98af273..695aab021 100644
--- a/src/README.md
+++ b/src/README.md
@@ -1,5 +1,29 @@
 # Source code
 
+## Code Modules and Flow
+1. **Configuration Loading (`src/Config.cpp`)**
+   - Reads JSON configuration (located in `data/config/*.json`) and parses segmenter definitions and conversion chains.
+   - Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths.
+   - Creates `Converter` objects that hold segmenters and conversion chains.
+
+2. **Segmentation (`src/MaxMatchSegmentation.cpp`)**
+   - The default segmentation type is `mmseg`, i.e., Maximum Forward Matching.
+   - Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length.
+
+3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)**
+   - The conversion chain is an ordered list of `Conversion` objects; each node uses a dictionary to replace segments with target values through longest prefix matching.
+   - Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition.
+
+4. **Dictionary System**
+   - Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal.
+   - `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection.
+   - `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats.
+
+5. **API Encapsulation**
+   - `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion.
+   - `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse.
+   - The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling.
+
 ## Dictionary
 
 ### Interface