Conversation
|
@stefanvodita The test vocabulary here is 19,613 words and there are no details of where it's from. In #22 you added a test vocabulary derived using the

The choice of length is a bit of a balance: we want to cover enough of the vocabulary that the stemmer is well exercised and that the list is useful for seeing the effects of proposed changes to the algorithm, but an overly long list needlessly increases the time tests take to run and the storage requirements for anyone checking out this repo. There's no particular target number of words, as what is appropriate depends on the language's vocabulary size and how inflected it is.

I don't know much about the Ukrainian language, but the list here is short compared to most existing lists here (only

My thoughts are that it probably makes sense to use Wikipedia-derived data instead of the list here, but we probably want to try to find a higher
|
Hi @ojwb! Before publishing #22 I had experimented a bit with MIN_FREQ, and I have had another look now.
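For anyone following along, here is a minimal sketch of the kind of frequency thresholding that a setting like MIN_FREQ suggests. It is not the script used for #22: the function name `build_vocabulary`, the tokenizing regex, the default threshold and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical default; the value actually used for #22 is not stated here.
MIN_FREQ = 2

# Runs of Unicode letters. This is a simplification: it splits words at
# apostrophes and hyphens, which a real tokenizer may want to keep.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(lines, min_freq=MIN_FREQ):
    """Return the sorted list of lowercased words seen at least min_freq times."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return sorted(word for word, n in counts.items() if n >= min_freq)


# Tiny made-up sample standing in for text extracted from Wikipedia.
sample = [
    "Київ це столиця України",
    "Київ стоїть на Дніпрі",
    "столиця України це Київ",
]

# Raising the threshold shrinks the list, lowering it grows the list.
for freq in (1, 2, 3):
    print(freq, len(build_vocabulary(sample, freq)))
```

Running the same loop over the real corpus with a few candidate thresholds gives the list length for each, which is the number being balanced against test time and repository size.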
|
Closing - there's current work in snowballstem/snowball#178 on a Ukrainian stemmer but if I think the vocabulary here is really too small to provide the breadth of testing we want, and a longer list would be a better option (e.g. the one from #22, which was generated from Wikipedia data).