Juexiao Zhang*, Yubei Chen*, Brian Cheung and Bruno Olshausen
This repository hosts our work Word Embedding Visualization via Dictionary Learning.
The arXiv preprint is available at http://arxiv.org/abs/1910.03833.
An outline of the files contained in this repository:
- sparsify_Pytorch.py is our library for dictionary learning.
- WordFactor_reproduce.ipynb and FactorGroup_reproduce.ipynb are our reorganized and updated notebooks, which show how to learn a dictionary containing the word factors and reproduce the results. The reader can refer to the notebooks for more details.
- The data directory stores the corpus data used for training, for example the text8 corpus, which is available for download.
- The embeddings directory stores the pretrained word embeddings, for example GloVe.
- The results directory stores the results you can obtain from running the notebooks. Taking the provided results as an example, basis.pt stores the trained dictionary elements, i.e. the word factors. nmed_factor_cooc.npy is the normalized factor co-occurrence matrix and sym_labels_knn20_c175.npy stores the factor clustering labels; both are obtained from FactorGroup_reproduce.ipynb.
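The result files described above can also be inspected outside the notebooks. The following is a minimal sketch, assuming the .npy files can be read with numpy.load, that basis.pt was written with torch.save, and that the label file holds one integer cluster id per factor. The helper names load_basis and load_factor_groups are hypothetical and not part of this repository.

```python
from collections import defaultdict
from pathlib import Path

import numpy as np


def load_basis(result_dir):
    """Load the trained dictionary (word factors).

    Assumption: basis.pt was saved with torch.save.
    """
    import torch  # imported lazily so the NumPy helper works without PyTorch

    return torch.load(str(Path(result_dir) / "basis.pt"), map_location="cpu")


def load_factor_groups(result_dir,
                       cooc_name="nmed_factor_cooc.npy",
                       labels_name="sym_labels_knn20_c175.npy"):
    """Return (co-occurrence matrix, labels, {label: [factor indices]})."""
    result_dir = Path(result_dir)
    cooc = np.load(result_dir / cooc_name)
    labels = np.load(result_dir / labels_name)

    # Assumption: labels holds one integer cluster id per factor.
    groups = defaultdict(list)
    for factor_idx, label in enumerate(labels.tolist()):
        groups[int(label)].append(factor_idx)
    return cooc, labels, dict(groups)
```

Assuming the provided files sit directly in results/glove-text8-reproduce-1k-factors, calling load_factor_groups on that directory would return the co-occurrence matrix together with a mapping from each cluster label to the indices of the factors assigned to it.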
Follow the instructions in the notebooks, particularly WordFactor_reproduce.ipynb, and place the corpus and the embeddings in data/ and embeddings/ respectively. The reader should then be able to reproduce the results in results/glove-text8-reproduce-1k-factors. Please refer to the notebooks for specific instructions.
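As a quick sanity check that an embedding file placed in embeddings/ is readable, it can be parsed directly. GloVe vectors are distributed as plain text, one token per line followed by its space-separated vector components. The sketch below relies only on that format; the function name load_glove_text is hypothetical, and the notebooks may load the embeddings differently.

```python
def load_glove_text(path, max_words=None):
    """Parse a GloVe-style text file into (words, vectors).

    Each line is expected to look like: token v1 v2 ... vd
    """
    words, vectors = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            words.append(parts[0])
            vectors.append([float(x) for x in parts[1:]])
            # Optionally stop early to keep memory use small.
            if max_words is not None and len(words) >= max_words:
                break
    return words, vectors
```

Passing max_words limits parsing to the first few entries, which is usually enough to confirm the file path and the vector dimension before running the full notebook.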
This project is tested with:
- Python 3.7
- PyTorch 1.1.0
- scipy 1.2.0
- scikit-learn 0.21.2
- matplotlib 3.1.0
- plotly 4.1.1
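To compare a local environment against the versions listed above, something like the following sketch may help. It assumes the packages were installed under their usual PyPI distribution names (torch for PyTorch) and uses importlib.metadata, which is in the standard library from Python 3.8 onward; on Python 3.7 the importlib_metadata backport would be needed instead. The helper name compare_versions is hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8

# Versions this project reports being tested with.
TESTED_VERSIONS = {
    "torch": "1.1.0",
    "scipy": "1.2.0",
    "scikit-learn": "0.21.2",
    "matplotlib": "3.1.0",
    "plotly": "4.1.1",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {package: (tested_version, installed_version_or_None)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        status = "ok" if installed == wanted else "differs"
        print(f"{name}: tested {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily mean the notebooks will fail; the list only records the configuration the project was tested with.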
@article{DBLP:journals/corr/abs-1910-03833,
author = {Juexiao Zhang and
Yubei Chen and
Brian Cheung and
Bruno A. Olshausen},
title = {Word Embedding Visualization Via Dictionary Learning},
journal = {CoRR},
volume = {abs/1910.03833},
year = {2019},
url = {http://arxiv.org/abs/1910.03833},
eprinttype = {arXiv},
eprint = {1910.03833},
timestamp = {Wed, 16 Oct 2019 16:25:53 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1910-03833.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
