
Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?


This is the official repository for the paper:

Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee

📄 arXiv:2510.27269 (https://arxiv.org/abs/2510.27269)


Overview

This repository provides the analysis, detection, and mitigation framework proposed in the paper: Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

It is designed as a paper-aligned companion, enabling direct and faithful reproduction of the paper's main experimental results.

Specifically, this repo supports:

  • Residual-based attribution analysis of multilingual reasoning gaps (Section 3)
  • Detection of understanding failures from reasoning traces (Section 4)
  • Mitigation via Selective Translation guided by failure detection (Section 5)

Each major section of the paper is mapped to a corresponding directory with a dedicated README explaining how to reproduce the results step by step.


Requirements & Setup

Environment Setup

All experiments can be run in a Conda environment with Python 3.12.

conda create -n rlm_analysis python=3.12
conda activate rlm_analysis

pip install -e .

Next, install a modified version of fastText that is compatible with numpy>=2.0.

git clone https://github.com/deokhk/fastText_numpy_2.0.git
cd fastText_numpy_2.0
pip install -e .
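
As an optional sanity check, the following minimal Python snippet (an illustration, not part of the paper's pipeline) verifies that the patched fastText build imports cleanly alongside numpy>=2.0:

import fasttext  # the patched fork installed above
import numpy

# numpy should report a 2.x release; fasttext should import without errors.
print(numpy.__version__)
print(fasttext.__file__)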

Required External Resources

We use fastText for language identification.

Please download the fastText language identification model (lid.176.ftz) and move it to the misc/ directory:

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
mv lid.176.ftz ./misc/
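
For reference, here is a minimal sketch of querying the downloaded model through the fastText Python API (the sample sentence and k value are illustrative only, not taken from the paper's pipeline):

import fasttext

# Load the language identification model downloaded above.
model = fasttext.load_model("misc/lid.176.ftz")

# predict() returns labels of the form "__label__xx" plus confidence scores.
labels, scores = model.predict("Why do multilingual reasoning gaps emerge?", k=1)
print(labels[0].replace("__label__", ""), float(scores[0]))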

Paper ↔ Code Alignment

The repository structure is organized to closely mirror the paper’s experimental pipeline.

RLM_analysis/
├── experiments/              # Guide for reproducing the paper is provided here
├── misc/                     # Where the lid.176.ftz file should be placed
├── outputs/                  # Where outputs for all experiments will be saved
├── src/                      # Source code
├── translated_data/          # Translated queries
└── pyproject.toml            # For installing requirements

Each major component under the ./experiments directory corresponds to a section of the paper:

| Paper Section | Topic | Code Entry Point |
| --- | --- | --- |
| §3 | Why Does the Multilingual Reasoning Gap Emerge? | 3_multilingual_reasoning_gap_attribution/ |
| §4 | Detecting Understanding Failures | 4_detecting_understanding_failures/ |
| §5 | Selective Translation | 5_selective_translation/ |

Each directory contains its own README with detailed instructions for reproducing the corresponding experiments, including required inputs, scripts, and expected outputs.


Pretrained Models and Released Outputs (Hugging Face)

To facilitate reproducibility, we provide pretrained models and experiment outputs via Hugging Face Hub.


1. Experiment Outputs and Prober Checkpoints

We release experiment outputs for Qwen3-4B, including prober checkpoints and intermediate results, as a Hugging Face dataset: deokhk/multilingual_reasoning_gap_outputs

You may download and place these files directly under:

./outputs/

You can download the files using the following example:

from huggingface_hub import snapshot_download

repo_id = "deokhk/multilingual_reasoning_gap_outputs"

# Download the full dataset snapshot (prober checkpoints and
# intermediate results) into ./outputs/.
snapshot_download(
    repo_id=repo_id,
    repo_type="dataset",
    local_dir="./outputs/",
    local_dir_use_symlinks=False,  # deprecated and ignored by recent huggingface_hub releases
)

2. mmBERT Understanding Failure Detector

The fine-tuned mmBERT understanding failure detector used in Section 4 is released at the Hugging Face repository: deokhk/mmbert_ft_understandability_Qwen3-4B

After downloading, place the contents under:

./outputs/experiments/mmbert_ft_understandability/Qwen3-4B/
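
If you prefer to fetch the checkpoint programmatically, here is a minimal sketch, assuming the detector is hosted as a standard model repo with the usual transformers sequence-classification layout (both are assumptions, not stated in this README):

from huggingface_hub import snapshot_download
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Target directory, matching the path given above.
local_dir = "./outputs/experiments/mmbert_ft_understandability/Qwen3-4B/"

# Assumption: the checkpoint is hosted as a model repo (the default repo_type).
snapshot_download(
    repo_id="deokhk/mmbert_ft_understandability_Qwen3-4B",
    local_dir=local_dir,
)

# Assumption: the checkpoint follows the standard transformers layout,
# so it can be loaded as an ordinary sequence classifier.
tokenizer = AutoTokenizer.from_pretrained(local_dir)
model = AutoModelForSequenceClassification.from_pretrained(local_dir)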

Reproducing the Paper

To reproduce the full experimental pipeline, we recommend running the sections in order:

  1. Section 3 – Multilingual reasoning gap attribution
  2. Section 4 – Understanding failure detection
  3. Section 5 – Selective Translation with failure-aware routing

Please refer to the README inside each section directory for detailed, step-by-step instructions.


Citation

If you find this work useful, please cite:

@misc{kang2025multilingualreasoninggapsemerge,
      title={Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?}, 
      author={Deokhyung Kang and Seonjeong Hwang and Daehui Kim and Hyounghun Kim and Gary Geunbae Lee},
      year={2025},
      eprint={2510.27269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.27269}, 
}

License

This project is released under the Apache 2.0 License. Please refer to individual datasets and model licenses for their respective terms.
