Czech and Slovak algorithms. #149
gaboull wants to merge 3 commits into snowballstem:master from gaboull:master
romanian UTF_8,ISO_8859_2 romanian,ro,rum,ron
russian UTF_8,KOI8_R russian,ru,rus
serbian UTF_8 serbian,sr,srp
slovak UTF_8,ISO_8859_2 slovak,sk,svk
The two- and three-letter codes should be those specified by ISO 639:
https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
So for Slovak that should be: slovak,sk,slk,slo. (Those for Czech above are also wrong, but I've opened #151 to merge the Czech stemmer, since I can write that one up for the website.)
The process for submitting a new stemmer is documented in … The Czech stemmer is already on the website, so I know that it comes from a paper and who implemented it. I can easily fill that in, and I've created a test vocabulary from Wikipedia data (in #151). I don't know any background to the Slovak algorithm here, though.
)
)

define lower_case as repeat (
I'm not sure this is necessary: the stemmers usually receive lower-cased input, no? And Unicode-aware case folding generally does the right thing with Slovak.
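
To illustrate the point about case folding, here is a minimal sketch of lower-casing Slovak words in the caller before they reach a stemmer. The helper name `normalize_for_stemmer` and the sample words are assumptions made for illustration and are not part of this PR or of Snowball. The NFC step is an extra precaution for input that uses combining diacritics, not something the comment above asks for.

```python
import unicodedata


def normalize_for_stemmer(word: str) -> str:
    """Hypothetical helper: prepare a word before handing it to a stemmer."""
    # Compose base letter + combining mark into one code point first,
    # so "C" + U+030C becomes the single letter "Č".
    composed = unicodedata.normalize("NFC", word)
    # str.lower() is Unicode-aware, so Š, Ť, Ý, Ľ, Č, Ô are handled
    # without a per-letter table in the stemmer itself.
    return composed.lower()


if __name__ == "__main__":
    for sample in ["ŠŤASTNÝ", "Ľudia", "ČÍSLO", "Ôsmy"]:
        print(sample, "->", normalize_for_stemmer(sample))
```

If the calling code already does something like this, a separate `lower_case` routine inside the algorithm would repeat work the caller has done.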
No description provided.