LangTagger is a language tagger that uses a probabilistic model for language classification. It consists of two models: LangTag(S), the simple model, and LangTag(C), the combined model. LangTag(S) is trained only on the QALD 7 training dataset and therefore supports English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian and Persian. LangTag(C) is trained on all QALD 3 to QALD 9 training datasets and therefore supports two additional languages: Portuguese and Russian.
To assess the efficiency of the different models and frameworks, we designed three benchmarks that differ in text length and domain: (1) short texts (rdfs:labels), (2) QA and (3) long texts (dbo:abstracts).
Short: The short text benchmark uses the first 10,000 entity rdfs:labels of each language returned by the DBpedia SPARQL endpoint, where that many are available, excluding resources containing digits. It is designed to measure how well the different approaches identify the language of a label. We tested English, German, Russian, Italian, Spanish, French and Portuguese.
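The label selection described above could be expressed roughly as the following query builder. This is only a sketch under stated assumptions: the function name `build_label_query` is hypothetical, the digit exclusion is applied to the label text with a regex `FILTER`, and only simple alphabetic language tags are accepted. The query actually used for the benchmark may differ.

```python
def build_label_query(lang: str, limit: int = 10_000) -> str:
    """Build a SPARQL query for up to `limit` rdfs:labels in one language.

    Assumption: "excluding resources containing digits" is approximated by
    dropping every label whose text contains a digit.
    """
    if not lang.isalpha():
        raise ValueError(f"unexpected language tag: {lang!r}")
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        "  ?resource rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        '  FILTER (!regex(str(?label), "[0-9]"))\n'
        "}\n"
        f"LIMIT {limit}"
    )


# Query text for the German labels; sending it to an endpoint is left out here.
query = build_label_query("de")
```

The `LIMIT` only caps the result size, so languages with fewer matching labels simply return fewer rows.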
QA: The QA benchmark uses all questions in the Question Answering over Linked Data (QALD) datasets in two forms: keywords (K) and full questions (F). It is designed to evaluate the efficiency of the different approaches in the question answering (QA) domain. The models are assessed on detecting the language of a question that contains a knowledge base resource. The QALD test benchmark covers the following languages:
- QALD 1 : English.
- QALD 2 : English.
- QALD 3 : English, German, French, Spanish, Italian and Dutch.
- QALD 4 : English, German, French, Spanish, Italian, Dutch and Romanian.
- QALD 5 : English, German, French, Spanish, Italian, Dutch and Romanian.
- QALD 6 : English, German, French, Spanish, Italian, Dutch, Romanian and Persian.
- QALD 7 : English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian and Persian.
- QALD 8 : English.
- QALD 9 : English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian, Persian, Portuguese and Russian.
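The language coverage listed above can be treated as data to check which languages each model can see. The sketch below assumes that the training sets cover the same languages as the lists above; the names `QALD_LANGUAGES` and `supported_languages` are illustrative and not part of LangTagger.

```python
# Languages per QALD version, copied from the list above.
_BASE = {"English", "German", "French", "Spanish", "Italian", "Dutch"}
QALD_LANGUAGES = {
    1: {"English"},
    2: {"English"},
    3: set(_BASE),
    4: _BASE | {"Romanian"},
    5: _BASE | {"Romanian"},
    6: _BASE | {"Romanian", "Persian"},
    7: _BASE | {"Romanian", "Persian", "Brazilian Portuguese", "Hindi"},
    8: {"English"},
    9: _BASE | {"Romanian", "Persian", "Brazilian Portuguese", "Hindi",
                "Portuguese", "Russian"},
}


def supported_languages(versions):
    """Return the union of the languages covered by the given QALD versions."""
    langs = set()
    for version in versions:
        langs |= QALD_LANGUAGES[version]
    return langs


simple = supported_languages([7])             # LangTag(S): QALD 7 only
combined = supported_languages(range(3, 10))  # LangTag(C): QALD 3 to 9
extra = combined - simple                     # languages only LangTag(C) covers
```

Under that assumption, `extra` contains exactly the languages that only appear in QALD 9.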
Long: The long text benchmark uses the dbo:abstracts of the top 10,000 resources returned by the DBpedia SPARQL endpoint, where that many are available. It is designed to evaluate the efficiency of the different language identification approaches on long resource texts. We tested English, German, Russian, Italian, Spanish and French.
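All three benchmarks come down to running a detector over labelled texts and recording how often it is right and how long it takes. A minimal sketch of such a harness follows. It assumes each approach is wrapped in a callable that maps a text to a language code and that runtime is averaged per text; `evaluate` and `toy_detect` are hypothetical names, and the toy detector only stands in for a real approach.

```python
import time
from typing import Callable, Iterable, Tuple


def evaluate(detect: Callable[[str], str],
             samples: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (accuracy, mean runtime in seconds per text) for one detector."""
    correct = 0
    total = 0
    elapsed = 0.0
    for text, expected in samples:
        start = time.perf_counter()
        predicted = detect(text)
        elapsed += time.perf_counter() - start  # time only the detector call
        correct += predicted == expected
        total += 1
    if total == 0:
        raise ValueError("no samples to evaluate")
    return correct / total, elapsed / total


def toy_detect(text: str) -> str:
    """Placeholder detector: guesses German when the text contains an 'ü'."""
    return "de" if "ü" in text else "en"


samples = [
    ("Wer ist der Bürgermeister von Berlin?", "de"),  # full question
    ("Who is the mayor of Berlin?", "en"),            # full question
    ("Bürgermeister Berlin", "de"),                   # keyword form
    ("Hauptstadt Deutschland", "de"),                 # keyword form, no 'ü'
]
accuracy, mean_runtime = evaluate(toy_detect, samples)
```

Timing only the `detect` call keeps data loading out of the measurement, which matters when per-text runtimes are in the millisecond range.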
Results achieved by the different approaches on all languages of the QALD test benchmark for full (F) and keyword (K) questions:
| Approach | QALD 1 (F) | QALD 2 (F) | QALD 3 (K) | QALD 3 (F) | QALD 4 (K) | QALD 4 (F) | QALD 5 (K) | QALD 5 (F) | QALD 6 (K) | QALD 6 (F) | QALD 7 (K) | QALD 7 (F) | QALD 8 (K) | QALD 8 (F) | QALD 9 (K) | QALD 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.0 | 1.0 | 0.70 | 0.99 | 0.77 | 0.99 | 0.77 | 1.00 | 0.76 | 0.99 | 0.67 | 0.98 | 0.48 | 1.00 | 0.70 | 0.97 |
| LangTag(C) | 1.0 | 1.0 | 0.86 | 0.99 | 0.90 | 0.99 | 0.92 | 1.00 | 0.81 | 0.99 | 0.93 | 1.00 | 0.70 | 1.00 | 0.84 | 0.97 |
| langdetect | 0.96 | 0.96 | 0.65 | 0.93 | 0.76 | 0.92 | 0.72 | 0.92 | 0.68 | 0.91 | 0.76 | 0.95 | 0.51 | 1.00 | 0.65 | 0.82 |
| Tika | 0.96 | 0.93 | 0.61 | 0.88 | 0.70 | 0.90 | 0.66 | 0.91 | 0.63 | 0.89 | 0.72 | 0.91 | 0.56 | 0.97 | 0.61 | 0.80 |
| openNLP | 0.96 | 0.97 | 0.48 | 0.89 | 0.62 | 0.89 | 0.61 | 0.85 | 0.48 | 0.75 | 0.62 | 0.90 | 0.39 | 0.95 | 0.41 | 0.73 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.96 | 0.76 | 0.95 | 0.76 | 0.94 | 0.75 | 0.93 | 0.83 | 0.97 | 0.56 | 1.00 | 0.81 | 0.95 |
| langdetect(12) | 0.96 | 0.93 | 0.67 | 0.90 | 0.76 | 0.91 | 0.72 | 0.91 | 0.69 | 0.89 | 0.75 | 0.92 | 0.58 | 1.00 | 0.66 | 0.82 |
| langid | 0.98 | 0.94 | 0.62 | 0.93 | 0.72 | 0.94 | 0.64 | 0.95 | 0.68 | 0.91 | 0.64 | 0.93 | 0.65 | 1.00 | 0.64 | 0.82 |

| Approach | QALD 1 (F) | QALD 2 (F) | QALD 3 (K) | QALD 3 (F) | QALD 4 (K) | QALD 4 (F) | QALD 5 (K) | QALD 5 (F) | QALD 6 (K) | QALD 6 (F) | QALD 7 (K) | QALD 7 (F) | QALD 8 (K) | QALD 8 (F) | QALD 9 (K) | QALD 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.69 | 1.00 | 0.80 | 1.00 | 0.77 | 1.00 | 0.80 | 1.00 | 0.60 | 1.00 | 0.48 | 1.00 | 0.72 | 1.00 |
| LangTag(C) | 1.0 | 1.0 | 0.87 | 1.00 | 0.98 | 1.00 | 0.93 | 1.00 | 0.83 | 1.00 | 0.93 | 1.00 | 0.70 | 1.00 | 0.87 | 1.00 |
| langdetect | 0.96 | 0.96 | 0.53 | 0.96 | 0.68 | 0.94 | 0.67 | 0.94 | 0.70 | 0.95 | 0.65 | 0.93 | 0.51 | 1.00 | 0.68 | 0.92 |
| Tika | 0.96 | 0.93 | 0.51 | 0.97 | 0.68 | 0.92 | 0.61 | 0.91 | 0.65 | 0.94 | 0.67 | 0.96 | 0.56 | 0.95 | 0.64 | 0.93 |
| openNLP | 0.96 | 0.97 | 0.52 | 0.97 | 0.70 | 0.92 | 0.67 | 0.91 | 0.63 | 0.94 | 0.62 | 0.96 | 0.39 | 0.95 | 0.58 | 0.93 |
| openNLP(12) | 0.98 | 0.98 | 0.70 | 0.98 | 0.82 | 0.96 | 0.82 | 0.98 | 0.83 | 0.98 | 1.00 | 0.79 | 1.00 | 0.56 | 0.80 | 0.98 |
| langdetect(12) | 0.98 | 0.93 | 0.55 | 0.93 | 0.72 | 0.90 | 0.69 | 0.98 | 0.72 | 0.96 | 0.74 | 0.88 | 0.56 | 1.00 | 0.66 | 0.93 |
| langid | 0.98 | 0.94 | 0.52 | 0.94 | 0.60 | 0.96 | 0.61 | 0.96 | 0.67 | 0.94 | 0.55 | 0.95 | 0.65 | 1.00 | 0.59 | 0.94 |

| Approach | QALD 3 (K) | QALD 3 (F) | QALD 4 (K) | QALD 4 (F) | QALD 5 (K) | QALD 5 (F) | QALD 6 (K) | QALD 6 (F) | QALD 7 (K) | QALD 7 (F) | QALD 9 (K) | QALD 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.87 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| LangTag(C) | 0.80 | 1.00 | 0.88 | 1.00 | 0.93 | 1.00 | 0.86 | 0.99 | 0.90 | 1.00 | 0.88 | 1.00 |
| langdetect | 0.80 | 0.95 | 0.80 | 0.92 | 0.77 | 0.91 | 0.71 | 0.88 | 0.74 | 0.95 | 0.81 | 0.94 |
| Tika | 0.79 | 0.95 | 0.78 | 0.92 | 0.79 | 0.94 | 0.71 | 0.88 | 0.69 | 0.95 | 0.81 | 0.94 |
| openNLP | 0.42 | 0.88 | 0.54 | 0.80 | 0.59 | 0.81 | 0.39 | 0.74 | 0.48 | 0.79 | 0.48 | 0.82 |
| openNLP(12) | 0.68 | 0.92 | 0.70 | 0.84 | 0.77 | 0.85 | 0.54 | 0.80 | 0.76 | 0.93 | 0.72 | 0.92 |
| langdetect(12) | 0.80 | 0.95 | 0.78 | 0.90 | 0.83 | 0.93 | 0.75 | 0.85 | 0.76 | 0.90 | 0.81 | 0.94 |
| langid | 0.70 | 0.93 | 0.82 | 0.94 | 0.75 | 0.95 | 0.71 | 0.92 | 0.67 | 0.90 | 0.78 | 0.94 |

| Approach | QALD 3 (K) | QALD 3 (F) | QALD 4 (K) | QALD 4 (F) | QALD 5 (K) | QALD 5 (F) | QALD 6 (K) | QALD 6 (F) | QALD 7 (K) | QALD 7 (F) | QALD 9 (K) | QALD 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.72 | 0.98 | 0.72 | 1.00 | 0.79 | 1.00 | 0.74 | 0.99 | 0.62 | 0.97 | 0.66 | 0.99 |
| LangTag(C) | 0.88 | 1.00 | 0.84 | 1.00 | 0.93 | 1.00 | 0.78 | 0.99 | 0.90 | 1.00 | 0.80 | 0.99 |
| langdetect | 0.61 | 0.90 | 0.84 | 0.98 | 0.86 | 0.96 | 0.69 | 0.92 | 0.88 | 1.00 | 0.77 | 0.94 |
| Tika | 0.61 | 0.89 | 0.76 | 0.98 | 0.79 | 0.93 | 0.65 | 0.93 | 0.79 | 1.00 | 0.73 | 0.96 |
| openNLP | 0.51 | 0.86 | 0.62 | 0.90 | 0.58 | 0.80 | 0.59 | 0.82 | 0.65 | 0.90 | 0.62 | 0.82 |
| openNLP(12) | 0.70 | 0.94 | 0.76 | 0.96 | 0.68 | 0.96 | 0.68 | 0.90 | 0.93 | 0.97 | 0.78 | 0.91 |
| langdetect(12) | 0.73 | 0.88 | 0.76 | 0.98 | 0.68 | 0.93 | 0.66 | 0.91 | 0.83 | 0.95 | 0.76 | 0.92 |
| langid | 0.75 | 0.96 | 0.88 | 0.96 | 0.75 | 1.00 | 0.78 | 0.94 | 0.79 | 0.97 | 0.84 | 0.92 |

| Approach | EN | DE | RU | IT | ES | FR | PT | AVG accuracy | AVG runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 83 | 243 | 10,000 | 782 | 227 | - | - |
| LangTag(S) | 0.21 | 0.91 | - | 0.25 | 0.09 | 0.34 | 0.36 | 0.36 | 0.00162 |
| LangTag(C) | 0.26 | 0.88 | 0.12 | 0.35 | 0.15 | 0.36 | 0.44 | 0.34 | 0.00186 |
| langdetect | 0.40 | 0.43 | 0.57 | 0.63 | 0.31 | 0.59 | 0.43 | 0.48 | 0.01761 |
| Tika | 0.24 | 0.39 | 0.50 | 0.68 | 0.15 | 0.59 | 0.35 | 0.41 | 0.41428 |
| openNLP | 0.16 | 0.18 | 0.12 | 0.30 | 0.15 | 0.33 | 0.25 | 0.21 | 0.01125 |
| openNLP(12) | 0.75 | 0.37 | 0.98 | 0.80 | 0.37 | 0.59 | 0.52 | 0.62 | 0.05361 |
| langdetect(12) | 0.35 | 0.51 | 0.59 | 0.67 | 0.32 | 0.59 | 0.43 | 0.49 | 0.03611 |
| langid | 0.69 | 0.33 | 0.56 | 0.44 | 0.33 | 0.57 | 0.28 | 0.45 | 0.01651 |

| Approach | EN | DE | RU | IT | ES | FR | AVG accuracy | AVG runtime (s) |
|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 285 | 10,000 | 10,000 | 10,000 | - | - |
| LangTag(S) | 0.96 | 0.99 | - | 0.99 | 0.99 | 0.99 | 0.98 | 0.00267 |
| LangTag(C) | 0.96 | 0.99 | 0.86 | 0.99 | 0.99 | 0.99 | 0.96 | 0.00287 |
| langdetect | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.01657 |
| Tika | 0.95 | 0.99 | 0.95 | 0.99 | 0.98 | 0.99 | 0.97 | 0.43918 |
| openNLP | 0.79 | 0.81 | 0.13 | 0.76 | 0.78 | 0.71 | 0.66 | 0.01427 |
| openNLP(12) | 0.95 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.18625 |
| langdetect(12) | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.02183 |
| langid | 0.96 | 0.97 | 0.94 | 0.99 | 0.98 | 0.99 | 0.97 | 0.03579 |

| Approach | QALD 3 (K) | QALD 3 (F) | QALD 4 (K) | QALD 4 (F) | QALD 5 (K) | QALD 5 (F) | QALD 6 (K) | QALD 6 (F) | QALD 7 (K) | QALD 7 (F) | QALD 8 (K) | QALD 8 (F) | QALD 9 (K) | QALD 9 (F) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.0003 | 0.0003 | 0.0006 | 0.0003 | 0.0006 | 0.0003 | 0.0002 | 0.0002 | 0.0019 | 0.0004 | 0.0041 | 0.0014 | 0.0001 | 0.0002 |
| LangTag(C) | 0.0018 | 0.0012 | 0.0026 | 0.0021 | 0.0029 | 0.0022 | 0.0017 | 0.0011 | 0.0036 | 0.0031 | 0.0131 | 0.0120 | 0.0017 | 0.0012 |
| langdetect | 0.0087 | 0.0063 | 0.0079 | 0.0042 | 0.0072 | 0.0057 | 0.0078 | 0.0054 | 0.0082 | 0.0041 | 0.0092 | 0.0021 | 0.0075 | 0.0116 |
| Tika | 1.5677 | 1.4068 | 1.4021 | 1.4009 | 1.6072 | 1.3928 | 1.5981 | 1.3978 | 1.4379 | 1.3955 | 1.4213 | 1.3778 | 1.9081 | 1.4836 |
| openNLP | 0.0027 | 0.0011 | 0.0036 | 0.0039 | 0.0035 | 0.0030 | 0.0023 | 0.0011 | 0.0058 | 0.0062 | 0.0032 | 0.0026 | 0.0012 | 0.0014 |

Model size in megabytes (MB) and kilobytes (KB) of the different approaches on the QALD test benchmarks:

| Approach | Model Size | #Languages |
|---|---|---|
| LangTag(S) | 8.2 KB | 10 |
| LangTag(C) | 9.7 KB | 12 |
| langdetect | 981.5 KB | 55 |
| Tika | 74.9 MB | 18 |
| openNLP | 10.6 MB | 103 |
| langid | 1.9 MB | 97 |