Merged
85 changes: 85 additions & 0 deletions .claude/skills/opencc-fix-translation-workflow.md
@@ -0,0 +1,85 @@
---
name: opencc-fix-translation-workflow
description: OpenCC translation fix and complete release workflow
tags: [opencc, workflow, debugging]
---

# OpenCC Translation Fix Standard Operating Procedure

This skill describes the complete lifecycle for fixing OpenCC conversion errors (such as "方程式" becoming "方程序"), including core dictionary correction, testing, and verification.

## 1. Problem Diagnosis

When a conversion error is discovered (e.g., A is incorrectly converted to B):

1. **Search for existing mappings**:
Use `grep` to search for the error source in `data/dictionary`.
```bash
grep "error_term" data/dictionary/*.txt
```
2. **Identify the interference source**:
The cause usually lies in Maximum Forward Matching (MaxMatch) segmentation: either a longer dictionary entry contains the target word, or the mapping of a shorter entry is applied to part of it.
*Example*: "方程式" is incorrectly converted to "方程序" because the mapping "程式" → "程序" exists and "方程式" has no entry of its own, so it is segmented as "方" + "程式".
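
The segmentation effect in this example can be reproduced with a small sketch. It is a simplified greedy longest-prefix conversion written in Python, not OpenCC's C++ implementation; the function name `convert_max_match` and the two miniature dictionaries are made up for illustration.

```python
def convert_max_match(text, mapping):
    """Greedy forward longest-prefix conversion over a plain dict (illustrative only)."""
    max_len = max((len(key) for key in mapping), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in mapping:
                out.append(mapping[piece])
                i += length
                break
        else:
            # No entry starts at this position: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical miniature dictionaries, not the real dictionary contents.
without_fix = {"程式": "程序"}
with_fix = {"程式": "程序", "方程式": "方程式"}

print(convert_max_match("方程式", without_fix))  # 方程序
print(convert_max_match("方程式", with_fix))     # 方程式
```

With only the shorter entry present, nothing matches at "方", so the matcher moves on and rewrites "程式". Once the longer entry exists, it wins at the first position and the shorter mapping is never consulted.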

## 2. Fix Solution (Explicit Mapping)

If the error originates from segmentation logic (as in the example above), the most robust fix is to **add an Explicit Mapping**.

1. **Select the correct dictionary file**:
- For s2twp and tw2sp: `TWPhrases.txt`

2. **Add the mapping**:
Map the vocabulary to itself to prevent incorrect segmentation or conversion.
```text
方程式 方程式
```
*Note*: Separate the key and the value with a tab, and keep the file's existing sort order (if applicable).
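
A rough sketch of adding such an entry while keeping the file ordered is shown below. It assumes the lines are tab-delimited and sorted by key in Python's default string order, which may not match the ordering the repository actually uses; `insert_entry` and the sample lines are hypothetical.

```python
import bisect


def insert_entry(lines, key, value):
    """Return a new list of 'key<TAB>value' lines with the entry in sorted position."""
    keys = [line.split("\t", 1)[0] for line in lines]
    if key in keys:
        # Already present: return the content unchanged.
        return list(lines)
    position = bisect.bisect_left(keys, key)
    return lines[:position] + [f"{key}\t{value}"] + lines[position:]


# Hypothetical excerpt, not the real TWPhrases.txt contents.
existing = ["一\t一", "方程\t方程", "程式\t程序"]
for line in insert_entry(existing, "方程式", "方程式"):
    print(line)
```

Checking the result with `git diff` before committing is still worthwhile, since a different sort convention in the real file would show up as an unexpectedly placed line.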

## 3. Test-Driven (Test Cases)

Before rebuilding the dictionaries, add test cases that verify the fix and guard against regressions.

1. **Core tests**:
Edit `test/testcases/testcases.json`.
```json
{
  "id": "case_XXX",
  "input": "方程式",
  "expected": {
    "tw2sp": "方程式"
  }
}
```
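
To make the shape of such a case concrete, here is a minimal sketch of how a case could be checked. It assumes the cases are available as a list of objects shaped like the one above; `check_cases` and the stand-in converter are hypothetical and are not the project's actual test harness.

```python
import json


def check_cases(cases, converters):
    """Return (case id, config, expected, actual) tuples for every failing expectation."""
    failures = []
    for case in cases:
        for config, expected in case["expected"].items():
            actual = converters[config](case["input"])
            if actual != expected:
                failures.append((case["id"], config, expected, actual))
    return failures


cases = json.loads(
    '[{"id": "case_XXX", "input": "方程式", "expected": {"tw2sp": "方程式"}}]'
)

# Stand-in converter that mimics the bug; a real run would call OpenCC instead.
buggy = {"tw2sp": lambda text: text.replace("程式", "程序")}
print(check_cases(cases, buggy))
```

Against the stand-in converter the case fails, which is the point: the test should fail before the dictionary fix and pass afterwards.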

## 4. Build and Verify

OpenCC uses the CMake/Make system to build dictionaries.

1. **Rebuild dictionaries**:
```bash
cd build/dbg # or your build directory
make Dictionaries
```
This step regenerates the `.ocd2` binary dictionaries.

2. **Manual verification**:
Test directly using the generated `opencc` tool.
```bash
echo "方程式" | ./src/tools/opencc -c root/share/opencc/tw2sp.json
# Expected output: 方程式
```

3. **Automated testing** (optional but recommended):
Run `make test` or `ctest`.


## 5. Commit

The dictionary and test changes may go in a single commit or in separate commits, but together they must include:
- Dictionary text file changes (`.txt`)
- Core test changes (`test/testcases/testcases.json`)

```bash
git add data/dictionary/TWPhrases.txt test/testcases/testcases.json
git commit -m "Fix(Dictionary): correct conversion for 'XYZ'"
```
52 changes: 52 additions & 0 deletions AGENTS.md
@@ -0,0 +1,52 @@
# OpenCC Project Overview

This document summarizes the Open Chinese Convert (OpenCC) project to help readers quickly get familiar with the code structure, data organization, and accompanying tools.

## Project Overview
- OpenCC is an open-source tool for converting between Simplified and Traditional Chinese and their regional variants. It supports Simplified↔Traditional conversion, Hong Kong/Macau/Taiwan regional differences, Japanese Shinjitai/Kyujitai character forms, and other conversion schemes.
- The project provides a C++ core library, C language interface, command-line tools, as well as Python, Node.js and other language bindings. The dictionary and program are decoupled for easy customization and extension.
- Main dependencies: `rapidjson` for configuration parsing, `marisa-trie` for high-performance dictionaries (`.ocd2`), optional `Darts` for legacy `.ocd` support.

## Data and Configuration
- Dictionaries are maintained in `data/dictionary/*.txt`, with separate files for phrases, characters, regional differences, Japanese Shinjitai, and other topics. They are compiled to `.ocd2` during the build for faster loading.
- Default configurations are located in `data/config/`, such as `s2t.json`, `t2s.json`, `s2tw.json`, etc., defining segmenter types, dictionaries used, and combination methods.
- `data/scheme` and `data/scripts` provide dictionary compilation scripts and specification validation tools.

### Dictionary Binary Formats: `.ocd` and `.ocd2`
- `.ocd` (legacy format) has `OPENCCDARTS1` as the file header. The body is serialized Darts double-array trie data, combined with a `BinaryDict` structure that stores key-value offsets and concatenation buffers. The loading process is detailed in `src/DartsDict.cpp` and `src/BinaryDict.cpp`. It is mainly used in builds that enable `ENABLE_DARTS` for compatibility.
- `.ocd2` (default format) has `OPENCC_MARISA_0.2.5` as the file header, followed by `marisa::Trie` data, then uses the `SerializedValues` module to store all candidate value lists. See `src/MarisaDict.cpp`, `src/SerializedValues.cpp` for details. This format is smaller and loads faster (e.g., `NEWS.md` records `STPhrases` reduced from 4.3MB to 924KB).
- The command-line tool `opencc_dict` supports `text ↔ ocd2` (and optionally `ocd`) conversion. When adding or adjusting dictionaries, first edit `.txt`, then run the tool to generate the target format.
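
The two headers quoted above are enough to tell the formats apart. The following sketch only inspects the leading bytes and does not parse the trie data; `detect_dict_format` is a hypothetical helper, and the sample file it writes is a fake with a made-up payload.

```python
import os
import tempfile

# Header strings as quoted in the format descriptions above.
HEADERS = {
    b"OPENCCDARTS1": "ocd",
    b"OPENCC_MARISA_0.2.5": "ocd2",
}


def detect_dict_format(path):
    """Guess the binary dictionary format from the leading bytes of a file."""
    longest = max(len(header) for header in HEADERS)
    with open(path, "rb") as handle:
        prefix = handle.read(longest)
    for header, name in HEADERS.items():
        if prefix.startswith(header):
            return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "Sample.ocd2")
    with open(sample, "wb") as handle:
        handle.write(b"OPENCC_MARISA_0.2.5" + b"\x00" * 8)  # fake payload
    print(detect_dict_format(sample))  # ocd2
```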

## Development and Testing
- The top level supports building with CMake, Bazel, Node.js `binding.gyp`, and Python `pyproject.toml`, with cross-platform CI integration.
- `src/*Test.cpp`, `test/` directories contain Google Test-style unit tests covering dictionary matching, conversion chains, segmentation, and other key logic.
- Tools `opencc_dict`, `opencc_phrase_extract` (`src/tools/`) help developers convert dictionary formats and extract phrases.

## Ecosystem Bindings
- Python module is located in `python/`, providing the `OpenCC` class through the C API.
- Node.js extension is in the `node/` directory, using N-API/Node-API to call the core library.
- README lists third-party Swift, Java, Go, WebAssembly and other porting projects, showcasing ecosystem breadth.

## Common Customization Steps
1. Edit or add dictionary entries in `data/dictionary/*.txt`.
2. Use `opencc_dict` to convert to `.ocd2`.
3. Copy/modify configuration JSON in `data/config` and specify new dictionary files.
4. Load custom configuration through `SimpleConverter`, command-line tools, or language bindings to verify results.
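
Step 3 can be sketched as follows. The field names (`segmentation`, `conversion_chain`, `dict`, `type`, `file`, `dicts`) follow the layout of the shipped configs such as `s2t.json` as I understand it, so compare against a file in `data/config` before relying on this; `build_config` and `MyPhrases.ocd2` are hypothetical names.

```python
import json


def build_config(name, custom_dict, base_dicts):
    """Build a config dict that consults a custom dictionary before the base ones."""
    group = {
        "type": "group",
        "dicts": [{"type": "ocd2", "file": file} for file in [custom_dict, *base_dicts]],
    }
    return {
        "name": name,
        "segmentation": {
            "type": "mmseg",
            "dict": {"type": "ocd2", "file": base_dicts[0]},
        },
        "conversion_chain": [{"dict": group}],
    }


config = build_config(
    "Custom Simplified to Traditional",
    "MyPhrases.ocd2",  # hypothetical custom dictionary
    ["STPhrases.ocd2", "STCharacters.ocd2"],
)
print(json.dumps(config, ensure_ascii=False, indent=2))
```

Placing the custom dictionary first in the group is a design choice of this sketch: it assumes earlier dictionaries in a group take priority, which should be confirmed against the `DictGroup` behavior.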

> For deeper understanding, read the module documentation in `src/README.md`, or refer to test cases in `test/` to understand conversion chain combinations.

### Common Deviations in Third-Party Implementations (Speculation)
- **Missing segmentation and conversion chain order**: If the `group` configuration or dictionary priority is not reproduced, compound words may be split apart or overwritten by single-character mappings.
- **Missing longest prefix logic**: Character-by-character replacement alone will miss idioms and other multi-character words.
- **Improper UTF-8 handling**: Overlooking multi-byte characters (or surrogate pairs in UTF-16 based runtimes) can easily cause offset or truncation issues.
- **Incomplete dictionaries/configuration**: Missing segmentation dictionaries, regional differences and other `.ocd2` files will result in missing words in output.
- **Path and loading process differences**: If OpenCC's path search and configuration parsing details are not followed, the actual loaded resources will differ from official ones, naturally leading to different results.
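
The UTF-8 point can be illustrated with a short sketch that walks a byte string character by character, using the lead byte to decide how many bytes belong to each character. This is a generic illustration of UTF-8 structure, not code taken from OpenCC; both function names are made up.

```python
def utf8_char_length(lead_byte):
    """Return the byte length of a UTF-8 character given its first byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#x}")


def split_utf8(data):
    """Split UTF-8 bytes into a list of single-character strings."""
    chars = []
    i = 0
    while i < len(data):
        length = utf8_char_length(data[i])
        chars.append(data[i:i + length].decode("utf-8"))
        i += length
    return chars


print(split_utf8("方程式 ok".encode("utf-8")))
```

An implementation that advances one byte at a time, or that assumes every CJK character takes exactly three bytes, would cut characters outside the Basic Multilingual Plane in half.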

## Further Reading

### Contribution Guide
- **[CONTRIBUTING.md](CONTRIBUTING.md)** - Complete guide on how to contribute dictionary entries to OpenCC, write test cases, and execute testing procedures.

### Project Documents
- **[src/README.md](src/README.md)** - Detailed technical documentation for core modules.
- **[README.md](README.md)** - Project overview, installation and usage guide.
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -0,0 +1 @@
@AGENTS.md
24 changes: 24 additions & 0 deletions src/README.md
@@ -1,5 +1,29 @@
# Source code

## Code Modules and Flow
1. **Configuration Loading (`src/Config.cpp`)**
- Reads JSON configuration (located in `data/config/*.json`), parses segmenter definitions and conversion chains.
- Loads different dictionary formats (plain text, `ocd2`, dictionary groups) based on the `type` field, with support for additional search paths.
- Creates `Converter` objects that hold segmenters and conversion chains.

2. **Segmentation (`src/MaxMatchSegmentation.cpp`)**
- The default segmentation type is `mmseg`, i.e., Maximum Forward Matching.
- Performs longest prefix matching using the dictionary, splitting input into `Segments`; unmatched UTF-8 fragments are preserved by character length.

3. **Conversion Chain (`src/ConversionChain.cpp`, `src/Conversion.cpp`)**
- The conversion chain is an ordered list of `Conversion` objects. Each node uses a dictionary to replace segments with target values through longest prefix matching.
- Supports advanced scenarios like phrase priority, variant character replacement, and multi-stage composition.

4. **Dictionary System**
- Abstract interface `Dict` unifies prefix matching, all-prefix matching, and dictionary traversal.
- `TextDict` (`.txt`) builds dictionaries from tab-delimited plain text; `MarisaDict` (`.ocd2`) provides high-performance trie structures; `DictGroup` can compose multiple dictionaries into a sequential collection.
- `SerializableDict` defines serialization and file loading logic, which command-line tools use to convert between different formats.

5. **API Encapsulation**
- `SimpleConverter` (high-level C++ interface) encapsulates `Config + Converter`, providing various overloads for string, pointer buffer, and partial length conversion.
- `opencc.h` exposes the C API: `opencc_open`, `opencc_convert_utf8`, etc., for language bindings and command-line reuse.
- The command-line program `opencc` (`src/tools/CommandLine.cpp`) demonstrates batch conversion, stream reading, auto-flushing, and same-file input/output handling.
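
The interplay of dictionary groups and the conversion chain can be sketched in a few lines. This is a simplified model, not the C++ code: it skips the separate segmentation step, uses plain Python dicts, and assumes that dictionaries in a group are consulted in order with the first one that matches winning. All names and dictionary contents are hypothetical.

```python
def match_prefix(text, start, dicts):
    """Return (key, value) from the first dict with a match at `start`, longest key first."""
    for mapping in dicts:
        longest = max((len(key) for key in mapping), default=0)
        for length in range(min(longest, len(text) - start), 0, -1):
            key = text[start:start + length]
            if key in mapping:
                return key, mapping[key]
    return None


def convert_stage(text, dicts):
    """Apply one chain node: replace longest prefix matches, keep everything else."""
    out = []
    i = 0
    while i < len(text):
        hit = match_prefix(text, i, dicts)
        if hit is None:
            out.append(text[i])
            i += 1
        else:
            key, value = hit
            out.append(value)
            i += len(key)
    return "".join(out)


def convert_chain(text, chain):
    """Feed the output of each stage into the next one, in order."""
    for dicts in chain:
        text = convert_stage(text, dicts)
    return text


# Hypothetical two-stage chain: phrase-level rewrite, then character-level rewrite.
chain = [
    [{"軟體": "軟件", "程式": "程序"}],
    [{"軟": "软"}],
]
print(convert_chain("軟體程式", chain))  # 软件程序
```

The order of stages matters: running the character-level stage first would turn "軟" into "软" before the phrase stage sees it, and the phrase entry "軟體" would no longer match.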

## Dictionary

### Interface