add ukrainian words #18

Closed
tggo wants to merge 1 commit into snowballstem:master from tggo:master

Conversation

tggo commented Mar 4, 2021

No description provided.

ojwb (Member) commented Nov 8, 2022

@stefanvodita The test vocabulary here is 19,613 words and there are no details of where it's from.

In #22 you added a test vocabulary derived using the wikipedia-most-common-words script with 282,296 words - it looks like you just used the default MIN_FREQ of 10.

The choice of length is a bit of a balance - we want to cover enough of the vocabulary that the stemmer is well exercised, and that the list is useful for seeing the effects of proposed changes to the algorithm; an overly long list needlessly increases the time tests take to run and storage requirements for anyone checking out this repo. There's not a particular target number of words as what is appropriate depends on the language's vocabulary size and how inflected it is.

I don't know much about the Ukrainian language, but the list here is short compared to most existing lists here (only nepali/voc.txt is shorter, but at 4000 words I suspect that's much too short). Your proposed list however seems on the long side compared to most languages (only tamil/voc.txt and arabic/voc.txt are longer).

My thinking is that it probably makes sense to use Wikipedia-derived data instead of the list here, but we probably want to try to find a higher MIN_FREQ value to use. One way to compare two MIN_FREQ values is to generate a list with each, then diff the two lists to see which words the higher MIN_FREQ value excludes. If the lower MIN_FREQ value adds mostly junk, the higher MIN_FREQ is probably the better choice. Some foreign words and/or proper nouns are OK (it's useful to consider how the stemmer affects both of these after all).
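The comparison of two MIN_FREQ values can be sketched roughly as follows. This is a minimal illustration, not the actual wikipedia-most-common-words script: the helper names `vocab_at` and `excluded_by_raising` are hypothetical, and the frequency table is made-up toy data rather than real Wikipedia counts. It assumes a word is kept when its count is at least MIN_FREQ.

```python
from collections import Counter


def vocab_at(counts, min_freq):
    """Return the set of words that occur at least min_freq times."""
    return {word for word, n in counts.items() if n >= min_freq}


def excluded_by_raising(counts, low, high):
    """Return, sorted, the words kept at threshold `low` but dropped at `high`."""
    return sorted(vocab_at(counts, low) - vocab_at(counts, high))


# Toy frequency table (invented counts, for illustration only).
counts = Counter({"слово": 120, "мова": 45, "київ": 10, "xyzzy": 10, "тест": 9})

# Words that MIN_FREQ=10 includes but MIN_FREQ=11 would exclude.
print(excluded_by_raising(counts, 10, 11))  # ['xyzzy', 'київ']
```

Reading through the excluded words by eye then shows whether the lower threshold is mostly adding genuine vocabulary or mostly junk.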

@stefanvodita

Hi @ojwb! Before publishing #22 I had experimented a bit with MIN_FREQ, and I had another look now.
Increasing MIN_FREQ to 11 seems to exclude mostly genuine words, but I don't have prior experience to benchmark against or knowledge of Ukrainian, so this is just my best guess.
I assume it's better to err on the side of vocabularies that are too large rather than too small, so maybe we can go with MIN_FREQ 10.

ojwb (Member) commented Sep 26, 2023

Closing - there's current work in snowballstem/snowball#178 on a Ukrainian stemmer, but I think the vocabulary here is really too small to provide the breadth of testing we want, and a longer list would be a better option (e.g. the one from #22, which was generated from Wikipedia data).

ojwb closed this Sep 26, 2023

3 participants