Conversions between embedding formats & evaluations against industry standard datasets.
python2 evaluate.py -h
This script is used for various embedding evaluation tasks, such as:
- analogy task
- linear translation

Supported input formats:
- GloVe text and binary, with optional bias and context terms
- word2vec text and binary

The list can be expanded; you can easily write your own custom input format.
For GloVe binary vectors, the corresponding vocabulary file must also be provided. Both 32-bit and 64-bit floating point precision can be used.
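As a rough illustration of what such a loader looks like, here is a minimal sketch of reading GloVe binary vectors together with their vocabulary using numpy. It assumes each row stores the word vector followed by a single bias value and that only word vectors (no appended context vectors) are present; the function name and file names are hypothetical and not taken from evaluate.py.

```python
import numpy as np

def load_glove_binary(vector_path, vocab_path, dtype=np.float64):
    # use dtype=np.float32 for files written with 32-bit precision
    # vocabulary file: one "word count" pair per line
    with open(vocab_path) as f:
        words = [line.split()[0] for line in f]
    raw = np.fromfile(vector_path, dtype=dtype)
    # assume each row holds the word vector followed by one bias value
    dim = raw.size // len(words) - 1
    raw = raw.reshape(len(words), dim + 1)
    return words, raw[:, :dim], raw[:, dim]  # vocabulary, vectors, biases

# example (hypothetical file names):
# words, vectors, biases = load_glove_binary('vectors.bin', 'vocab.txt')
```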
The script reads questions from stdin and answers them line by line. The answers are written to stdout; additional debug info goes to stderr. A question can be any linear combination of input terms, such as:
- king -man +woman
- frog
- chinese + river
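Conceptually, answering such a question means combining the term vectors with their signs and ranking all other words by similarity. The sketch below shows this with plain cosine similarity over a numpy matrix; the function and variable names are illustrative, not the actual internals of evaluate.py.

```python
import numpy as np

def answer(query_terms, words, vectors, topn=5):
    # query_terms: list of (sign, word) pairs,
    # e.g. [(+1, 'king'), (-1, 'man'), (+1, 'woman')] for "king -man +woman"
    index = dict((w, i) for i, w in enumerate(words))
    target = np.zeros(vectors.shape[1])
    for sign, word in query_terms:
        target += sign * vectors[index[word]]
    # cosine similarity of the combined vector against every row of the matrix
    sims = np.dot(vectors, target) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))
    # exclude the query words themselves from the candidate answers
    exclude = set(index[w] for _, w in query_terms)
    ranked = [i for i in np.argsort(-sims) if i not in exclude]
    return [(words[i], sims[i]) for i in ranked[:topn]]
```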
The following metrics and similarities are supported:
- cos: standard cosine similarity
- cos_r: the answers are the same as with cosine similarity, but the true similarity of the outcomes is shown. If you do not care about the similarity scores, just the answers, you can use plain cos because it is slightly faster.
- eucl: square of the standard Euclidean metric
- eucl_r: the standard Euclidean metric
- eucl_norm: Euclidean metric, but the vectors are normalized first. This should give the same answers as cos or cos_r.
- cos_mul: the so-called cos-mul metric, used in analogy tasks (see the sketch below)
- cos_mul0: by default, cos-mul operates on (1+cos), since the cosine similarity varies from -1 to 1. With mul0, positive vectors are assumed.
- arccos: arc length distance on the unit sphere
- eucl_mul: multiplicative Euclidean
- angle: same as cosine similarity, but you can see the actual angle
The list can be expanded; you can easily write your own custom metric or similarity.
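For instance, the cos-mul scoring for an analogy a : b :: c : ? can be written in a few lines of numpy, following the description above (each cosine is shifted to (1+cos) so the values stay positive before multiplying and dividing). This is a hedged sketch, not the literal implementation in the script.

```python
import numpy as np

def cos_mul(vectors, a, b, c, eps=1e-3):
    # Score every row of `vectors` for the analogy a : b :: c : ?
    # a, b, c are the embedding vectors of the three given words.
    norms = np.linalg.norm(vectors, axis=1)
    def shifted_cos(v):
        # shift the cosine from [-1, 1] into [0, 2] so products stay positive
        return 1.0 + np.dot(vectors, v) / (norms * np.linalg.norm(v))
    return shifted_cos(b) * shifted_cos(c) / (shifted_cos(a) + eps)
```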
Using the translate.py script, you can generate a linear transformation between embeddings.
If you have two embeddings and a transformation between them, then you can query in one language and retrieve answers in the other.
Let's say you have an English source embedding, a German target embedding, and a linear transformation matrix between them; then:
king - man + woman = Königin
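The sketch below shows one common way such a transformation can be obtained and applied: a least-squares fit between aligned source and target vectors from a seed dictionary, followed by a nearest-neighbour lookup in the target space. It is an assumption-laden illustration, not the actual interface of translate.py.

```python
import numpy as np

def learn_translation(src_vecs, tgt_vecs):
    # src_vecs, tgt_vecs: aligned (n_pairs, dim) matrices built from a seed dictionary
    W, _, _, _ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W  # maps a source-space vector v to approximately np.dot(v, W)

def translate_query(query_vec, W, tgt_words, tgt_vecs, topn=1):
    # map the (already combined) source-language query vector into target space
    mapped = np.dot(query_vec, W)
    sims = np.dot(tgt_vecs, mapped) / (np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(mapped))
    best = np.argsort(-sims)[:topn]
    return [(tgt_words[i], float(sims[i])) for i in best]
```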
Requirements:
- numpy
- scipy