Edeflip: Supervised Word Translation between English and Yoruba

By Ikeoluwa Abioye (Ike.23@dartmouth.edu) and Jiani Ge (Jiani.Ge.23@dartmouth.edu)

Code

This code is adapted from MUSE: Multilingual Unsupervised and Supervised Embeddings. Main changes to MUSE:
- updated deprecated code
- removed the sentence translation sections, which were not relevant to our project
- analyzed the impact of various embedding types and normalization on the results

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend the fastText Wikipedia embeddings, or training your own word embeddings on your corpus with fastText.

The data we used, along with our log files, can be found here (https://drive.google.com/drive/folders/1ZVLMym3EIjgEzSEVNQBxkKmrJMbWrm6b).
You can download the English (en) and Yoruba (yo) embeddings this way:

cd MUSE
# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Yoruba fastText Wikipedia embeddings
curl -Lo data/wiki.yo.vec "https://drive.google.com/uc?export=download&id=19vfXxahoKDTyNaJoK9grB_i8yvWzfgMj"
# Or Yoruba curated fastText embeddings
curl -Lo data/cur.yo.vec "https://drive.google.com/uc?export=download&id=13t09-KsbOefIpPEjmbYInimZArtS8lGV"
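
To sanity-check a downloaded file, it can help to read a few vectors directly. The sketch below assumes the standard fastText text layout (a header line with the vocabulary size and dimension, then one word and its values per line). The `load_vec` helper and the sample words are hypothetical and not part of this repository.

```python
import io


def load_vec(stream, max_vocab=None):
    """Read fastText text-format embeddings from an open text stream.

    Hypothetical helper for illustration; assumes a "<n_words> <dim>" header
    followed by one "<word> <v1> ... <v_dim>" entry per line.
    """
    header = stream.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        word = " ".join(parts[:-dim])  # everything before the last dim values
        vectors[word] = [float(x) for x in parts[-dim:]]
        if max_vocab is not None and len(vectors) >= max_vocab:
            break
    return dim, vectors


# Tiny in-memory example laid out like a .vec file (made-up values).
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nbawo 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
```

For a real file, pass an open handle instead, for example `open("data/wiki.yo.vec", encoding="utf-8")`, and set `max_vocab` to keep memory use small.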

Align monolingual word embeddings

Supervised: using a training bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
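
As a rough illustration of a single Procrustes step, the sketch below solves for the orthogonal matrix that best maps source vectors onto their dictionary translations. It uses NumPy on synthetic data. The function name `procrustes` and the arrays are assumptions for this example, not the code in supervised.py.

```python
import numpy as np


def procrustes(X, Y):
    """Return the orthogonal W that minimizes ||X @ W.T - Y||_F.

    X: (n, d) source vectors for the dictionary pairs.
    Y: (n, d) target vectors for the same pairs.
    """
    # The closed-form solution is W = U V^T, where U S V^T is the SVD of Y^T X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt


# Synthetic check: Y is an exact rotation of X, so W should recover it.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
Y = X @ Q.T
W = procrustes(X, Y)
```

In an iterative variant, one would presumably use the current mapping to induce a new dictionary from nearest neighbors and then solve the same problem again for each refinement round.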

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, run:

# for Wikipedia Yoruba embeddings
python supervised.py --src_lang en --tgt_lang yo --src_emb data/wiki.en.vec --tgt_emb data/wiki.yo.vec --n_refinement 5 --dico_train default --normalize_embeddings center,renorm --cuda false

# for curated Yoruba embeddings
python supervised.py --src_lang en --tgt_lang yo --src_emb data/wiki.en.vec --tgt_emb data/cur.yo.vec --n_refinement 5 --dico_train default --normalize_embeddings center,renorm --cuda false
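
Both commands pass `--normalize_embeddings center,renorm`. The NumPy sketch below shows what that setting is assumed to do: `center` subtracts the mean vector and `renorm` rescales every vector to unit length, applied in the order listed. This is an illustration under that assumption, not the project's implementation. Check the MUSE source for the exact behavior.

```python
import numpy as np


def normalize_embeddings(emb, types="center,renorm"):
    """Apply comma-separated normalization steps in order (illustrative only)."""
    emb = np.asarray(emb, dtype=np.float64).copy()
    for step in types.split(","):
        if step == "center":
            # Subtract the mean vector so the embedding cloud is centered at 0.
            emb -= emb.mean(axis=0, keepdims=True)
        elif step == "renorm":
            # Scale every row to unit L2 norm.
            emb /= np.linalg.norm(emb, axis=1, keepdims=True)
        elif step:
            raise ValueError(f"unknown normalization step: {step}")
    return emb


# Made-up 2-dimensional vectors for demonstration.
toy = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]])
normalized = normalize_embeddings(toy, "center,renorm")
```

The order matters: centering after renormalizing would leave vectors that no longer have unit length.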

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of cross-lingual word embeddings on several tasks:

Cross-lingual

python evaluate.py --src_lang en --tgt_lang yo --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-yo.yo.vec --max_vocab 200000 --cuda false --normalize_embeddings center,renorm

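The script reports precision at k: the fraction of source words for which a correct translation appears among the top k retrieved target words.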
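
To make the metric concrete, the sketch below computes precision at k with plain cosine nearest-neighbor retrieval in NumPy. The function `precision_at_k`, the toy vectors, and the `gold` dictionary are assumptions for illustration. The retrieval method used by evaluate.py may differ.

```python
import numpy as np


def precision_at_k(src_mapped, tgt, gold, k=1):
    """Fraction of source words with a gold translation in the top k neighbors.

    src_mapped: (n_src, d) source vectors already mapped into the target space.
    tgt:        (n_tgt, d) target vectors.
    gold:       dict mapping a source row index to a set of correct target indices.
    """
    src_n = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    hits = 0
    for s, answers in gold.items():
        scores = tgt_n @ src_n[s]          # cosine similarity to every target word
        top_k = np.argsort(-scores)[:k]    # indices of the k most similar targets
        hits += bool(answers & set(top_k.tolist()))
    return hits / len(gold)


# Made-up vectors: source word 0 translates to target 0, source word 1 to target 2.
tgt = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src = np.array([[0.9, 0.1], [0.1, 0.9]])
gold = {0: {0}, 1: {2}}
p1 = precision_at_k(src, tgt, gold, k=1)
p2 = precision_at_k(src, tgt, gold, k=2)
```

Here the correct translation of the second source word is only its second-nearest neighbor, so it counts as a hit at k=2 but not at k=1.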

You can visualize cross-lingual nearest neighbors using this Colab notebook: https://colab.research.google.com/drive/12b6cxewcDWo4MEafPiDwDFWCap918cuy
