Add CLAUDE.md, AGENTS.md and a skill doc in English #1033
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,85 @@ | ||
| --- | ||
| name: opencc-fix-translation-workflow | ||
| description: OpenCC translation fix and complete release workflow | ||
| tags: [opencc, workflow, debugging] | ||
| --- | ||
|
|
||
| # OpenCC Translation Fix Standard Operating Procedure | ||
|
|
||
| This skill describes the complete lifecycle for fixing OpenCC conversion errors (such as "方程式" becoming "方程序"), including core dictionary correction, testing, and verification. | ||
|
|
||
| ## 1. Problem Diagnosis | ||
|
|
||
| When a conversion error is discovered (e.g., A is incorrectly converted to B): | ||
|
|
||
| 1. **Search for existing mappings**: | ||
| Use `grep` to search for the error source in `data/dictionary`. | ||
| ```bash | ||
| grep "error_term" data/dictionary/*.txt | ||
| ``` | ||
| 2. **Identify the interference source**: | ||
| The usual cause lies in Maximum Forward Matching (MaxMatch) segmentation: either a longer dictionary entry contains the target word, or the mapping of a shorter entry fires inside it and produces the incorrect result. | ||
| *Example*: "方程式" is incorrectly converted to "方程序" because the mapping "程式" → "程序" exists and "方程式" is not defined as a word of its own, so it is segmented as "方" + "程式". | ||
|
|
||
| ## 2. Fix Solution (Explicit Mapping) | ||
|
|
||
| If the error originates from segmentation logic (as in the example above), the most robust fix is to **add an Explicit Mapping**. | ||
|
|
||
| 1. **Select the correct dictionary file**: | ||
| - For s2twp and tw2sp: `TWPhrases.txt` | ||
|
|
||
| 2. **Add the mapping**: | ||
| Map the word to itself so that it is matched as a whole and is not segmented or converted incorrectly. | ||
| ```text | ||
| 方程式 方程式 | ||
| ``` | ||
| *Note*: Keep the dictionary file's existing sort order (if applicable). | ||
|
|
||
| ## 3. Test-Driven (Test Cases) | ||
|
|
||
| Before rebuilding the dictionaries, add test cases that confirm the fix and guard against regressions. | ||
|
|
||
| 1. **Core tests**: | ||
| Edit `test/testcases/testcases.json`. | ||
| ```json | ||
| { | ||
| "id": "case_XXX", | ||
| "input": "方程式", | ||
| "expected": { | ||
| "tw2sp": "方程式" | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| ## 4. Build and Verify | ||
|
|
||
| OpenCC uses the CMake/Make system to build dictionaries. | ||
|
|
||
| 1. **Rebuild dictionaries**: | ||
| ```bash | ||
| cd build/dbg # or your build directory | ||
| make Dictionaries | ||
| ``` | ||
| This step regenerates the `.ocd2` binary dictionaries. | ||
|
|
||
| 2. **Manual verification**: | ||
| Test directly using the generated `opencc` tool. | ||
| ```bash | ||
| echo "方程式" | ./src/tools/opencc -c root/share/opencc/tw2sp.json | ||
| # Expected output: 方程式 | ||
| ``` | ||
|
|
||
| 3. **Automated testing** (optional but recommended): | ||
| Run `make test` or `ctest`. | ||
|
|
||
|
|
||
| ## 5. Commit | ||
|
|
||
| The changes may be split across separate commits or combined into one, but they must include: | ||
| - Dictionary text file changes (`.txt`) | ||
| - Core test changes (`test/testcases/testcases.json`) | ||
|
|
||
| ```bash | ||
| git add data/dictionary/TWPhrases.txt test/testcases/testcases.json | ||
| git commit -m "Fix(Dictionary): correct conversion for 'XYZ'" | ||
| ``` |
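
The diagnosis and fix described in sections 1 and 2 can be illustrated with a toy model of greedy longest-prefix matching. This is only a sketch under stated assumptions: the `convert` function and both dictionaries are hypothetical, written for illustration, and are not OpenCC's actual segmenter or the real contents of `TWPhrases.txt`. It shows how a shorter entry ("程式") can fire inside a longer word, and how a self-mapping for the longer word shields it.

```python
def convert(text, mapping):
    """Toy greedy longest-prefix (maximum forward matching) conversion."""
    max_len = max(map(len, mapping), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate at position i first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            key = text[i:i + size]
            if key in mapping:
                out.append(mapping[key])
                i += size
                break
        else:
            # No entry matches here: copy one character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical toy dictionaries, not the real dictionary contents.
before_fix = {"程式": "程序"}
after_fix = {"程式": "程序", "方程式": "方程式"}

print(convert("方程式", before_fix))   # 方程序 (the shorter entry fires)
print(convert("方程式", after_fix))    # 方程式 (the self-mapping wins)
print(convert("程式設計", after_fix))  # 程序設計 (other words still convert)
```

The third call suggests why an explicit mapping is a narrow fix: it only changes the behaviour for the exact word that was added, and leaves the existing "程式" mapping in effect everywhere else.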
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| # OpenCC Project Overview | ||
|
|
||
| This document summarizes the Open Chinese Convert (OpenCC) project to help readers quickly become familiar with its code structure, data organization, and accompanying tools. | ||
|
|
||
| ## Project Overview | ||
| - OpenCC is an open-source Chinese Simplified-Traditional and regional variant conversion tool, supporting Simplified↔Traditional, Hong Kong/Macau/Taiwan regional differences, Japanese Shinjitai/Kyujitai character forms, and other conversion schemes. | ||
| - The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension. | ||
| - Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support. | ||
|
|
||
| ## Data and Configuration | ||
| - Dictionaries are maintained in `data/dictionary/*.txt`, with separate files for phrases, characters, regional differences, Japanese Shinjitai, and other topics. They are converted to `.ocd2` during the build for faster loading. | ||
| - Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods. | ||
| - `data/scheme` and `data/scripts` provide dictionary compilation scripts and specification validation tools. | ||
|
|
||
| ### Dictionary Binary Formats: `.ocd` and `.ocd2` | ||
| - `.ocd` (legacy format) has `OPENCCDARTS1` as the file header, with the main body being serialized Darts double-array trie data, combined with `BinaryDict` structure to store key-value offsets and concatenation buffers. Loading process is detailed in `src/DartsDict.cpp` and `src/BinaryDict.cpp`. Commonly used in environments requiring `ENABLE_DARTS` for compatibility. | ||
| - `.ocd2` (default format) has `OPENCC_MARISA_0.2.5` as the file header, followed by `marisa::Trie` data, then uses the `SerializedValues` module to store all candidate value lists. See `src/MarisaDict.cpp`, `src/SerializedValues.cpp` for details. This format is smaller and loads faster (e.g., `NEWS.md` records `STPhrases` reduced from 4.3MB to 924KB). | ||
| - The command-line tool `opencc_dict` supports `text ↔ ocd2` (and optionally `ocd`) conversion. When adding or adjusting dictionaries, first edit `.txt`, then run the tool to generate the target format. | ||
|
|
||
| ## Development and Testing | ||
| - The top-level build system supports CMake, Bazel, Node.js `binding.gyp`, Python `pyproject.toml`, with cross-platform CI integration. | ||
| - `src/*Test.cpp`, `test/` directories contain Google Test-style unit tests covering dictionary matching, conversion chains, segmentation, and other key logic. | ||
| - Tools `opencc_dict`, `opencc_phrase_extract` (`src/tools/`) help developers convert dictionary formats and extract phrases. | ||
|
|
||
| ## Ecosystem Bindings | ||
| - Python module is located in `python/`, providing the `OpenCC` class through the C API. | ||
| - Node.js extension is in the `node/` directory, using N-API/Node-API to call the core library. | ||
| - README lists third-party Swift, Java, Go, WebAssembly and other porting projects, showcasing ecosystem breadth. | ||
|
|
||
| ## Common Customization Steps | ||
| 1. Edit or add dictionary entries in `data/dictionary/*.txt`. | ||
| 2. Use `opencc_dict` to convert to `.ocd2`. | ||
| 3. Copy/modify configuration JSON in `data/config` and specify new dictionary files. | ||
| 4. Load custom configuration through `SimpleConverter`, command-line tools, or language bindings to verify results. | ||
|
|
||
| > For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations. | ||
|
|
||
| ### Common Deviations in Third-Party Implementations (Speculation) | ||
| - **Missing segmentation and conversion chain order**: If the `group` configuration or dictionary priority is not reproduced, compound words may be split apart or overwritten by single-character mappings. | ||
| - **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and multi-character word results. | ||
| - **Improper UTF-8 handling**: Overlooking multi-byte characters or surrogate pair handling can easily cause offset or truncation issues. | ||
| - **Incomplete dictionaries/configuration**: Missing segmentation dictionaries, regional differences and other `.ocd2` files will result in missing words in output. | ||
| - **Path and loading process differences**: If OpenCC's path search and configuration parsing details are not followed, the actual loaded resources will differ from official ones, naturally leading to different results. | ||
|
|
||
| ## Further Reading | ||
|
|
||
| ### Contribution Guide | ||
| - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures. | ||
|
|
||
| ### Project Documents | ||
| - **[src/README.md](src/README.md)** - Detailed technical documentation for core modules. | ||
| - **[README.md](README.md)** - Project overview, installation and usage guide. | ||
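
As a small illustration of the two binary formats described under "Dictionary Binary Formats", the following sketch distinguishes them by their leading bytes. It assumes the header strings quoted above (`OPENCC_MARISA_0.2.5` and `OPENCCDARTS1`) appear verbatim at byte offset 0; `sniff_dict_format` is a hypothetical helper written for this example and is not part of OpenCC or its Python binding.

```python
import io

# Header strings as quoted in the overview; assumed to sit at byte offset 0.
HEADERS = {
    b"OPENCC_MARISA_0.2.5": "ocd2",
    b"OPENCCDARTS1": "ocd",
}


def sniff_dict_format(stream):
    """Return 'ocd2', 'ocd', or 'unknown' based on the stream's leading bytes."""
    prefix = stream.read(max(len(header) for header in HEADERS))
    for header, name in HEADERS.items():
        if prefix.startswith(header):
            return name
    return "unknown"


# In-memory stand-ins for real dictionary files.
print(sniff_dict_format(io.BytesIO(b"OPENCC_MARISA_0.2.5" + b"\x00" * 8)))  # ocd2
print(sniff_dict_format(io.BytesIO(b"plain text dictionary")))              # unknown
```

For a file on disk, the same helper could be called as `sniff_dict_format(open(path, "rb"))`, which may be handy for checking whether a build directory still contains legacy `.ocd` files.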
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| @AGENTS.md |