add ukrainian words by tggo · Pull Request #18 · snowballstem/snowball-data

tggo · 2021-03-04T14:03:10Z

No description provided.

ojwb · 2022-11-08T04:57:15Z

@stefanvodita The test vocabulary here is 19,613 words and there are no details of where it's from.

In #22 you added a test vocabulary derived using the wikipedia-most-common-words script with 282,296 words - it looks like you just used the default MIN_FREQ of 10.

The choice of length is a bit of a balance - we want to cover enough of the vocabulary that the stemmer is well exercised, and that the list is useful for seeing the effects of proposed changes to the algorithm; an overly long list needlessly increases the time tests take to run and storage requirements for anyone checking out this repo. There's not a particular target number of words as what is appropriate depends on the language's vocabulary size and how inflected it is.

I don't know much about the Ukrainian language, but the list here is short compared to most existing lists here (only nepali/voc.txt is shorter, but at 4000 words I suspect that's much too short). Your proposed list however seems on the long side compared to most languages (only tamil/voc.txt and arabic/voc.txt are longer).

My thinking is that it probably makes sense to use wikipedia-derived data instead of the list here, but we probably want to try to find a higher MIN_FREQ value to use. One way to compare two MIN_FREQ values is to generate a list with each, then diff the two lists to see which words the higher MIN_FREQ value excludes. If the lower MIN_FREQ value adds mostly junk, the higher MIN_FREQ is probably the better choice. Some foreign words and/or proper nouns are OK (it's useful to consider how the stemmer affects both of these, after all).
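A minimal sketch of that comparison, for illustration only. It does not call the wikipedia-most-common-words script; it assumes the word frequencies are already available as a mapping from word to count, and that MIN_FREQ is an inclusive lower bound on the count. The function name and the sample counts are hypothetical.

```python
def excluded_by_higher_threshold(counts, low, high):
    """Return the words kept at MIN_FREQ=low but dropped at MIN_FREQ=high.

    Assumes a word is kept when its count is >= the threshold
    (an assumption about how MIN_FREQ is applied, not taken from the script).
    """
    kept_low = {word for word, count in counts.items() if count >= low}
    kept_high = {word for word, count in counts.items() if count >= high}
    # The set difference is exactly what a diff of the two lists would show.
    return sorted(kept_low - kept_high)


# Hypothetical counts, not real Wikipedia data.
counts = {"слово": 120, "мова": 45, "київ": 30, "xyzzy": 10, "qwrt": 12}

dropped = excluded_by_higher_threshold(counts, low=10, high=20)
for word in dropped:
    print(word)
```

Reading through the dropped words by eye is then enough to judge whether the lower threshold mostly adds junk or mostly adds genuine vocabulary.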

stefanvodita · 2022-11-26T11:00:25Z

Hi @ojwb! Before publishing #22 I had experimented a bit with MIN_FREQ, and I've had another look now.
Increasing MIN_FREQ to 11 seems to exclude mostly genuine words, but I don't have prior experience to benchmark against or knowledge of Ukrainian, so this is just my best guess.
I assume it's better to err on the side of vocabularies that are too large rather than too small, so maybe we can go with MIN_FREQ 10.

ojwb · 2023-09-26T20:56:31Z

Closing - there's current work in snowballstem/snowball#178 on a Ukrainian stemmer, but I think the vocabulary here is really too small to provide the breadth of testing we want, and a longer list would be a better option (e.g. the one from #22, which was generated from wikipedia data).

add ukrainian words

f860c60

ojwb mentioned this pull request Sep 20, 2023

Add ukrainian stemmer snowballstem/snowball#178

Open

ojwb closed this Sep 26, 2023

tggo wants to merge 1 commit into snowballstem:master from tggo:master
