Libv-Team/UNI-SO

๐ŸŒ Can Large Language Models Translate Spoken-Only Languages through International Phonetic Transcription?

[ 🤗 Hugging Face ]

coming soon

This work introduces two key contributions to address the challenges of applying LLMs to spoken-only languages:

  1. UNILANG Framework: A novel approach that enables LLMs to translate spoken-only languages using IPA as an intermediate representation.
  2. SOLAN Dataset: The first large-scale bilingual dataset featuring a spoken-only language, Bai, with aligned Chinese and English translations.

🧠 UNILANG Framework

UNILANG is the first framework designed to help LLMs understand and translate spoken-only languages by leveraging automatic dictionary construction and knowledge retrieval.
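
This README does not describe UNILANG's internals, so the following is only a minimal sketch of how dictionary lookup could feed retrieved glosses into an LLM prompt for an IPA-transcribed sentence. The dictionary entries are invented placeholders (not real Bai vocabulary), and `retrieve_glosses`, `build_prompt`, and the prompt wording are hypothetical names chosen for illustration, not part of the released code.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: IPA token -> English gloss.
# These entries are invented placeholders, NOT real Bai vocabulary.
TOY_DICTIONARY: Dict[str, str] = {
    "ka": "water",
    "mi": "drink",
    "tu": "person",
}


def retrieve_glosses(ipa_sentence: str, dictionary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Look up each whitespace-separated IPA token; unknown tokens map to '?'."""
    return [(token, dictionary.get(token, "?")) for token in ipa_sentence.split()]


def build_prompt(ipa_sentence: str, dictionary: Dict[str, str]) -> str:
    """Assemble a translation prompt that lists the retrieved glosses as hints."""
    glosses = retrieve_glosses(ipa_sentence, dictionary)
    hints = "\n".join(f"- {token}: {gloss}" for token, gloss in glosses)
    return (
        "Translate the following IPA-transcribed sentence into English.\n"
        f"Sentence: {ipa_sentence}\n"
        f"Dictionary hints:\n{hints}\n"
        "Translation:"
    )


if __name__ == "__main__":
    # The resulting string would be sent to one of the backbone LLMs.
    print(build_prompt("tu mi ka", TOY_DICTIONARY))
```

A real system would likely need tone-aware tokenization and fuzzy matching rather than exact whitespace-token lookup; the sketch only shows the retrieve-then-prompt shape.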

Backbone

  • Qwen2.5
  • Llama3
  • Gemma3

📚 SOLAN Dataset

SOLAN is a curated dataset for spoken-only language translation. It features the Bai language (ISO 639-3: bfs), along with professionally aligned Chinese and English translations.

🎙️ Data Collection

We developed the SOLAN App, a custom Flutter application, to collect high-quality audio samples from native Bai speakers through in-person interviews.

🛠️ Data Processing

Each audio sample is carefully segmented, transcribed using IPA, and validated by linguistic experts to ensure it is suitable for both linguistic analysis and machine learning.
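
As an illustration of what one processed sample might carry after segmentation, IPA transcription, and expert validation, here is a small sketch of a record type with basic sanity checks. The dataset's actual schema is not published in this README, so `SolanRecord` and all of its field names are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SolanRecord:
    """Hypothetical layout for one segmented sample (field names are assumed)."""
    audio_path: str          # path to the segmented audio clip
    start_sec: float         # segment start within the original recording
    end_sec: float           # segment end within the original recording
    ipa: str                 # IPA transcription of the Bai utterance
    chinese: str             # aligned Chinese translation
    english: str             # aligned English translation
    validated: bool = False  # set once a linguist has checked the sample

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the record looks usable."""
        issues = []
        if self.end_sec <= self.start_sec:
            issues.append("segment has non-positive duration")
        if not self.ipa.strip():
            issues.append("missing IPA transcription")
        if not self.chinese.strip() or not self.english.strip():
            issues.append("missing translation")
        if not self.validated:
            issues.append("not yet validated by an expert")
        return issues


if __name__ == "__main__":
    # Placeholder values only; "..." stands in for real text.
    record = SolanRecord("clip_0001.wav", 0.0, 2.4, "...", "...", "...", validated=True)
    print(record.problems())
```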

🚀 Requirements

To run the evaluation and processing scripts, install sacreBLEU:

pip install sacrebleu
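
The repository's evaluation scripts are not shown here, so the following is a hedged sketch of how corpus-level BLEU can be computed with sacreBLEU's Python API (`sacrebleu.corpus_bleu`). The helper names `split_pairs` and `score_bleu` and the sample sentences are illustrative assumptions, not the project's own code.

```python
from typing import List, Sequence, Tuple


def split_pairs(pairs: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (hypothesis, reference) pairs into two parallel, whitespace-trimmed lists."""
    hypotheses = [hyp.strip() for hyp, _ in pairs]
    references = [ref.strip() for _, ref in pairs]
    return hypotheses, references


def score_bleu(pairs: Sequence[Tuple[str, str]], tokenize: str = "13a") -> float:
    """Corpus BLEU over the pairs; pass tokenize='zh' for Chinese output."""
    import sacrebleu  # imported lazily; installed via the pip command above

    hypotheses, references = split_pairs(pairs)
    # sacreBLEU expects a list of reference streams, hence [references].
    return sacrebleu.corpus_bleu(hypotheses, [references], tokenize=tokenize).score


if __name__ == "__main__":
    # Made-up example sentences, for illustration only.
    sample = [
        ("the person drinks water", "the person drinks water"),
        ("she goes to the market", "she went to the market"),
    ]
    print(f"BLEU = {score_bleu(sample):.2f}")
```

Scoring Chinese output with the default tokenizer would treat each unsegmented sentence as very few tokens, which is why the `tokenize` argument is exposed.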
