
LangTagger

LangTagger is a language tagger that uses a probabilistic model for language classification. It ships with two models: LangTag(S), the simple model, and LangTag(C), the combined model. LangTag(S) is trained only on the QALD 7 training dataset and therefore supports English, German, French, Spanish, Brazilian Portuguese, Dutch, Hindi, Romanian and Persian. LangTag(C) is trained on all of the QALD 3 to QALD 9 training datasets and therefore supports two more languages than LangTag(S): Portuguese and Russian.
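
The repository does not spell out the internals of the probabilistic model, but the general idea behind this family of classifiers can be sketched with character n-gram statistics: each language gets an n-gram frequency model, and a text is assigned to the language whose model gives it the highest likelihood. The following is a minimal illustrative sketch, not the actual LangTagger implementation:

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character n-grams over a lowercased, space-padded string."""
    t = f" {text.lower()} "
    return [t[i:i + n] for i in range(len(t) - n + 1)]

def train(samples):
    """samples: dict mapping language code -> list of training sentences.
    Returns per-language relative n-gram frequencies."""
    models = {}
    for lang, sentences in samples.items():
        counts = Counter(g for s in sentences for g in ngrams(s))
        total = sum(counts.values())
        models[lang] = {g: c / total for g, c in counts.items()}
    return models

def classify(models, text, floor=1e-6):
    """Pick the language maximizing the log-likelihood of the text's
    n-grams; unseen n-grams get a small floor probability."""
    best_lang, best_score = None, float("-inf")
    for lang, freqs in models.items():
        score = sum(math.log(freqs.get(g, floor)) for g in ngrams(text))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang
```

With a couple of training sentences per language, `classify` already separates short English and German inputs, which is exactly the hard case the benchmarks below probe.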

Benchmarks

To assess the efficiency of the different models and frameworks, we designed three benchmarks that differ in text length and domain: (1) short texts (rdfs:labels), (2) QA and (3) long texts (dbo:abstracts).

Short: The short text benchmark uses the first 10,000 entity rdfs:labels of each language returned by the DBpedia SPARQL endpoint, where available, excluding resources containing digits. It is designed to measure the efficiency of the different approaches at identifying the language of a label. We used English, German, Russian, Italian, Spanish, French and Portuguese for this test.
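
The exact query used to collect the labels is not given in this README; a query of the following shape against the public DBpedia endpoint (`https://dbpedia.org/sparql`) would produce such a set, with the digit exclusion expressed as a regex filter:

```python
def label_query(lang, limit=10000):
    """Build a SPARQL query fetching entity rdfs:labels in one language
    from DBpedia, skipping labels that contain digits.
    Illustrative only; the benchmark's actual query is not published here."""
    return f"""
    SELECT ?label WHERE {{
        ?s rdfs:label ?label .
        FILTER(lang(?label) = "{lang}")
        FILTER(!regex(?label, "[0-9]"))
    }} LIMIT {limit}
    """
```

The query string can then be sent to the endpoint with any HTTP client or SPARQL wrapper, once per benchmark language.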

QA: The QA benchmark uses all questions in the Question Answering over Linked Data (QALD) datasets in two forms, Keywords (K) and Full Questions (F). It is designed to evaluate the efficiency of the different approaches in the Question Answering (QA) domain. The models are assessed on detecting the language of a question that contains a knowledge base resource. The QALD test benchmark covers the following languages:

  • QALD 1: English.
  • QALD 2: English.
  • QALD 3: English, German, French, Spanish, Italian and Dutch.
  • QALD 4: English, German, French, Spanish, Italian, Dutch and Romanian.
  • QALD 5: English, German, French, Spanish, Italian, Dutch and Romanian.
  • QALD 6: English, German, French, Spanish, Italian, Dutch, Romanian and Persian.
  • QALD 7: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian and Persian.
  • QALD 8: English.
  • QALD 9: English, German, French, Spanish, Brazilian Portuguese, Italian, Dutch, Hindi, Romanian, Persian, Portuguese and Russian.

Long: The long text benchmark uses the dbo:abstracts of the top 10,000 resources returned by the DBpedia SPARQL endpoint, where available. It is designed to evaluate the efficiency of different language identification approaches on long resource texts. We used English, German, Russian, Italian, Spanish and French for this test.
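
All three benchmarks report the same two metrics that appear in the tables below: accuracy and average per-call runtime. A minimal evaluation harness for any detector function could look like this (the `detector`/`samples` names are illustrative, not part of LangTagger's API):

```python
import time

def evaluate(detector, samples):
    """Compute accuracy and mean per-call runtime in seconds for a
    language detector over (text, gold_language) pairs."""
    correct, elapsed = 0, 0.0
    for text, gold in samples:
        start = time.perf_counter()
        predicted = detector(text)
        elapsed += time.perf_counter() - start
        correct += (predicted == gold)
    return correct / len(samples), elapsed / len(samples)
```

Running the same harness over each benchmark split, once per approach, yields directly comparable accuracy/runtime pairs.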

Evaluation

Results achieved by the different approaches on all languages of the QALD test benchmark, for Keyword (K) and Full (F) questions:

| QALD | 1 (F) | 2 (F) | 3 (K/F) | 4 (K/F) | 5 (K/F) | 6 (K/F) | 7 (K/F) | 8 (K/F) | 9 (K/F) |
|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.0 | 1.0 | 0.70/0.99 | 0.77/0.99 | 0.77/1.00 | 0.76/0.99 | 0.67/0.98 | 0.48/1.00 | 0.70/0.97 |
| LangTag(C) | 1.0 | 1.0 | 0.86/0.99 | 0.90/0.99 | 0.92/1.00 | 0.81/0.99 | 0.93/1.00 | 0.70/1.00 | 0.84/0.97 |
| langdetect | 0.96 | 0.96 | 0.65/0.93 | 0.76/0.92 | 0.72/0.92 | 0.68/0.91 | 0.76/0.95 | 0.51/1.00 | 0.65/0.82 |
| Tika | 0.96 | 0.93 | 0.61/0.88 | 0.70/0.90 | 0.66/0.91 | 0.63/0.89 | 0.72/0.91 | 0.56/0.97 | 0.61/0.80 |
| openNLP | 0.96 | 0.97 | 0.48/0.89 | 0.62/0.89 | 0.61/0.85 | 0.48/0.75 | 0.62/0.90 | 0.39/0.95 | 0.41/0.73 |
| openNLP(12) | 0.98 | 0.98 | 0.70/0.96 | 0.76/0.95 | 0.76/0.94 | 0.75/0.93 | 0.83/0.97 | 0.56/1.00 | 0.81/0.95 |
| langdetect(12) | 0.96 | 0.93 | 0.67/0.90 | 0.76/0.91 | 0.72/0.91 | 0.69/0.89 | 0.75/0.92 | 0.58/1.00 | 0.66/0.82 |
| langid | 0.98 | 0.94 | 0.62/0.93 | 0.72/0.94 | 0.64/0.95 | 0.68/0.91 | 0.64/0.93 | 0.65/1.00 | 0.64/0.82 |

Results achieved by the different approaches on the English questions of the QALD test benchmark:

| QALD | 1 (F) | 2 (F) | 3 (K/F) | 4 (K/F) | 5 (K/F) | 6 (K/F) | 7 (K/F) | 8 (K/F) | 9 (K/F) |
|---|---|---|---|---|---|---|---|---|---|
| LangTag(S) | 1.00 | 1.00 | 0.69/1.00 | 0.80/1.00 | 0.77/1.00 | 0.80/1.00 | 0.60/1.00 | 0.48/1.00 | 0.72/1.00 |
| LangTag(C) | 1.0 | 1.0 | 0.87/1.00 | 0.98/1.00 | 0.93/1.00 | 0.83/1.00 | 0.93/1.00 | 0.70/1.00 | 0.87/1.00 |
| langdetect | 0.96 | 0.96 | 0.53/0.96 | 0.68/0.94 | 0.67/0.94 | 0.70/0.95 | 0.65/0.93 | 0.51/1.00 | 0.68/0.92 |
| Tika | 0.96 | 0.93 | 0.51/0.97 | 0.68/0.92 | 0.61/0.91 | 0.65/0.94 | 0.67/0.96 | 0.56/0.95 | 0.64/0.93 |
| openNLP | 0.96 | 0.97 | 0.52/0.97 | 0.70/0.92 | 0.67/0.91 | 0.63/0.94 | 0.62/0.96 | 0.39/0.95 | 0.58/0.93 |
| openNLP(12) | 0.98 | 0.98 | 0.70/0.98 | 0.82/0.96 | 0.82/0.98 | 0.83/0.98 | 1.00/0.79 | 1.00/0.56 | 0.80/0.98 |
| langdetect(12) | 0.98 | 0.93 | 0.55/0.93 | 0.72/0.90 | 0.69/0.98 | 0.72/0.96 | 0.74/0.88 | 0.56/1.00 | 0.66/0.93 |
| langid | 0.98 | 0.94 | 0.52/0.94 | 0.60/0.96 | 0.61/0.96 | 0.67/0.94 | 0.55/0.95 | 0.65/1.00 | 0.59/0.94 |

Results achieved by the different approaches on the German questions of the QALD test benchmark:

| QALD | 3 (K/F) | 4 (K/F) | 5 (K/F) | 6 (K/F) | 7 (K/F) | 9 (K/F) |
|---|---|---|---|---|---|---|
| LangTag(S) | 0.87/1.00 | 0.88/1.00 | 0.93/1.00 | 0.86/0.99 | 0.90/1.00 | 0.88/1.00 |
| LangTag(C) | 0.80/1.00 | 0.88/1.00 | 0.93/1.00 | 0.86/0.99 | 0.90/1.00 | 0.88/1.00 |
| langdetect | 0.80/0.95 | 0.80/0.92 | 0.77/0.91 | 0.71/0.88 | 0.74/0.95 | 0.81/0.94 |
| Tika | 0.79/0.95 | 0.78/0.92 | 0.79/0.94 | 0.71/0.88 | 0.69/0.95 | 0.81/0.94 |
| openNLP | 0.42/0.88 | 0.54/0.80 | 0.59/0.81 | 0.39/0.74 | 0.48/0.79 | 0.48/0.82 |
| openNLP(12) | 0.68/0.92 | 0.70/0.84 | 0.77/0.85 | 0.54/0.80 | 0.76/0.93 | 0.72/0.92 |
| langdetect(12) | 0.80/0.95 | 0.78/0.90 | 0.83/0.93 | 0.75/0.85 | 0.76/0.90 | 0.81/0.94 |
| langid | 0.70/0.93 | 0.82/0.94 | 0.75/0.95 | 0.71/0.92 | 0.67/0.90 | 0.78/0.94 |

Results achieved by the different approaches on the French questions of the QALD test benchmark:

| QALD | 3 (K/F) | 4 (K/F) | 5 (K/F) | 6 (K/F) | 7 (K/F) | 9 (K/F) |
|---|---|---|---|---|---|---|
| LangTag(S) | 0.72/0.98 | 0.72/1.00 | 0.79/1.00 | 0.74/0.99 | 0.62/0.97 | 0.66/0.99 |
| LangTag(C) | 0.88/1.00 | 0.84/1.00 | 0.93/1.00 | 0.78/0.99 | 0.90/1.00 | 0.80/0.99 |
| langdetect | 0.61/0.90 | 0.84/0.98 | 0.86/0.96 | 0.69/0.92 | 0.88/1.00 | 0.77/0.94 |
| Tika | 0.61/0.89 | 0.76/0.98 | 0.79/0.93 | 0.65/0.93 | 0.79/1.00 | 0.73/0.96 |
| openNLP | 0.51/0.86 | 0.62/0.90 | 0.58/0.80 | 0.59/0.82 | 0.65/0.90 | 0.62/0.82 |
| openNLP(12) | 0.70/0.94 | 0.76/0.96 | 0.68/0.96 | 0.68/0.90 | 0.93/0.97 | 0.78/0.91 |
| langdetect(12) | 0.73/0.88 | 0.76/0.98 | 0.68/0.93 | 0.66/0.91 | 0.83/0.95 | 0.76/0.92 |
| langid | 0.75/0.96 | 0.88/0.96 | 0.75/1.00 | 0.78/0.94 | 0.79/0.97 | 0.84/0.92 |

Results achieved by the different approaches on entity rdfs:labels (short text benchmark):

| Approach | EN | DE | RU | IT | ES | FR | PT | Avg. Accuracy | Avg. Runtime (s) |
|---|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 83 | 243 | 10,000 | 782 | 227 | | |
| LangTag(S) | 0.21 | 0.91 | - | 0.25 | 0.09 | 0.34 | 0.36 | 0.36 | 0.00162 |
| LangTag(C) | 0.26 | 0.88 | 0.12 | 0.35 | 0.15 | 0.36 | 0.44 | 0.34 | 0.00186 |
| langdetect | 0.40 | 0.43 | 0.57 | 0.63 | 0.31 | 0.59 | 0.43 | 0.48 | 0.01761 |
| Tika | 0.24 | 0.39 | 0.50 | 0.68 | 0.15 | 0.59 | 0.35 | 0.41 | 0.41428 |
| openNLP | 0.16 | 0.18 | 0.12 | 0.30 | 0.15 | 0.33 | 0.25 | 0.21 | 0.01125 |
| openNLP(12) | 0.75 | 0.37 | 0.98 | 0.80 | 0.37 | 0.59 | 0.52 | 0.62 | 0.05361 |
| langdetect(12) | 0.35 | 0.51 | 0.59 | 0.67 | 0.32 | 0.59 | 0.43 | 0.49 | 0.03611 |
| langid | 0.69 | 0.33 | 0.56 | 0.44 | 0.33 | 0.57 | 0.28 | 0.45 | 0.01651 |

Results achieved by the different approaches on abstracts (long text benchmark):

| Approach | EN | DE | RU | IT | ES | FR | Avg. Accuracy | Avg. Runtime (s) |
|---|---|---|---|---|---|---|---|---|
| #Resources | 10,000 | 10,000 | 285 | 10,000 | 10,000 | 10,000 | | |
| LangTag(S) | 0.96 | 0.99 | - | 0.99 | 0.99 | 0.99 | 0.98 | 0.00267 |
| LangTag(C) | 0.96 | 0.99 | 0.86 | 0.99 | 0.99 | 0.99 | 0.96 | 0.00287 |
| langdetect | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.01657 |
| Tika | 0.95 | 0.99 | 0.95 | 0.99 | 0.98 | 0.99 | 0.97 | 0.43918 |
| openNLP | 0.79 | 0.81 | 0.13 | 0.76 | 0.78 | 0.71 | 0.66 | 0.01427 |
| openNLP(12) | 0.95 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.98 | 0.18625 |
| langdetect(12) | 0.95 | 0.99 | 0.95 | 0.99 | 0.99 | 0.99 | 0.97 | 0.02183 |
| langid | 0.96 | 0.97 | 0.94 | 0.99 | 0.98 | 0.99 | 0.97 | 0.03579 |

Average runtime in seconds (s) of the different approaches on the QALD test benchmarks:

| QALD | 3 (K/F) | 4 (K/F) | 5 (K/F) | 6 (K/F) | 7 (K/F) | 8 (K/F) | 9 (K/F) |
|---|---|---|---|---|---|---|---|
| LangTag(S) | 0.0003/0.0003 | 0.0006/0.0003 | 0.0006/0.0003 | 0.0002/0.0002 | 0.0019/0.0004 | 0.0041/0.0014 | 0.0001/0.0002 |
| LangTag(C) | 0.0018/0.0012 | 0.0026/0.0021 | 0.0029/0.0022 | 0.0017/0.0011 | 0.0036/0.0031 | 0.0131/0.0120 | 0.0017/0.0012 |
| langdetect | 0.0087/0.0063 | 0.0079/0.0042 | 0.0072/0.0057 | 0.0078/0.0054 | 0.0082/0.0041 | 0.0092/0.0021 | 0.0075/0.0116 |
| Tika | 1.5677/1.4068 | 1.4021/1.4009 | 1.6072/1.3928 | 1.5981/1.3978 | 1.4379/1.3955 | 1.4213/1.3778 | 1.9081/1.4836 |
| openNLP | 0.0027/0.0011 | 0.0036/0.0039 | 0.0035/0.0030 | 0.0023/0.0011 | 0.0058/0.0062 | 0.0032/0.0026 | 0.0012/0.0014 |

Model size in megabytes (MB) or kilobytes (KB) and number of supported languages for the different approaches:

| Approach | Model Size | #Languages |
|---|---|---|
| LangTag(S) | 8.2 KB | 10 |
| LangTag(C) | 9.7 KB | 12 |
| langdetect | 981.5 KB | 55 |
| Tika | 74.9 MB | 18 |
| openNLP | 10.6 MB | 103 |
| langid | 1.9 MB | 97 |
