Add Ukrainian stemmer vocabulary data #34
- Add voc.txt (57,868 words) from Ukrainian Wikipedia dump (2025-01-02)
- Add output.txt with stemmed equivalents
- Add COPYING with generation instructions and CC BY-SA license

Closes snowballstem/snowball#265
Pull request overview
Adds Ukrainian stemmer test data (vocabulary + expected stemming output) derived from a Ukrainian Wikipedia dump, along with a subdirectory-specific COPYING file describing generation steps and licensing.
Changes:
- Add `ukrainian/voc.txt` (Ukrainian vocabulary list from Wikipedia frequency data).
- Add `ukrainian/output.txt` (stemmer output corresponding line-by-line to `voc.txt`).
- Add `ukrainian/COPYING` (generation commands, dump date, and licensing reference).
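Because `output.txt` is expected to correspond line-by-line to `voc.txt`, a reviewer may want a quick alignment check before merging. The following is a minimal sketch, not part of this PR: the helper name `pair_lines` is hypothetical, and it only verifies that the two files have the same number of lines, not that each stem is correct.

```python
from pathlib import Path


def pair_lines(voc_text, output_text):
    """Pair each vocabulary word with the stem on the same line.

    Raises ValueError if the two texts have different line counts,
    which would mean the files are no longer aligned.
    """
    voc = voc_text.splitlines()
    out = output_text.splitlines()
    if len(voc) != len(out):
        raise ValueError(
            f"line count mismatch: {len(voc)} words vs {len(out)} stems"
        )
    return list(zip(voc, out))


def pair_files(voc_path, output_path):
    """Convenience wrapper that reads both files as UTF-8 (an assumption)."""
    return pair_lines(
        Path(voc_path).read_text(encoding="utf-8"),
        Path(output_path).read_text(encoding="utf-8"),
    )
```

A call such as `pair_files("ukrainian/voc.txt", "ukrainian/output.txt")` would then either return the word/stem pairs or fail loudly if a line was dropped or duplicated in one of the files.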
Reviewed changes
Copilot reviewed 1 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ukrainian/COPYING | Documents how the Ukrainian voc/output data was generated and specifies licensing/source. |
| ukrainian/voc.txt | Adds the Ukrainian vocabulary dataset used for stemming validation. |
| ukrainian/output.txt | Adds the expected stemmed forms for the vocabulary list. |
Explicitly state that both files are licensed as CC BY-SA, as output.txt is a derivative of voc.txt.
Explicitly state the license version for Wikipedia text content.
Pull request overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated 3 comments.
- Use full language name in stemwords invocation (-l ukrainian)
- Match standard Wikipedia license phrasing
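For readers unfamiliar with the `stemwords` invocation referred to above, here is a hedged sketch of how the output file might be regenerated from Python. The `-l`, `-i`, and `-o` options are the language, input, and output flags of Snowball's `stemwords` tool; the binary location `./stemwords`, the file paths, and the helper names are assumptions for illustration, and the exact command recorded in `COPYING` may differ.

```python
import subprocess


def stemwords_command(language, input_path, output_path, binary="./stemwords"):
    """Build the argument list for a stemwords run.

    `binary` is an assumed path to a locally built stemwords executable.
    """
    return [binary, "-l", language, "-i", input_path, "-o", output_path]


def run_stemwords(language, input_path, output_path, binary="./stemwords"):
    """Run stemwords and raise CalledProcessError if it exits non-zero."""
    cmd = stemwords_command(language, input_path, output_path, binary)
    subprocess.run(cmd, check=True)
```

Passing the full language name, as in `stemwords_command("ukrainian", "voc.txt", "output.txt")`, mirrors the change described in the commit message.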
Pull request overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.
Summary
- `ukrainian/voc.txt`: 57,868 words from Ukrainian Wikipedia (frequency threshold 300)
- `ukrainian/output.txt`: stemmed equivalents
- `ukrainian/COPYING`: generation instructions and CC BY-SA license

The vocabulary was generated from the `ukwiki-latest-pages-articles.xml.bz2` dump dated 2025-01-02 using `scripts/wikipedia-dump-to-freq` with `Cyrillic` script and threshold 10, then filtered with `scripts/freq-to-voc` at threshold 300.

Related PRs
Branch name: `feature/#265-ukrainian-stemmer` (matches across all three repos, as required by CONTRIBUTING).
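To illustrate the frequency-threshold filtering mentioned in the Summary, the sketch below counts Cyrillic-script words in a text and keeps those that occur at least a given number of times. It is an illustration only and does not reproduce `scripts/wikipedia-dump-to-freq` or `scripts/freq-to-voc`: the tokenization regex, the lowercasing, the inclusive threshold comparison, and both function names are assumptions.

```python
import re
from collections import Counter

# Assumption: a word is a run of Cyrillic letters, optionally joined by
# an apostrophe-like character, as is common in Ukrainian spelling.
CYRILLIC_WORD = re.compile(
    r"[\u0400-\u04FF]+(?:['\u2019\u02BC][\u0400-\u04FF]+)*"
)


def word_frequencies(text):
    """Count lowercased Cyrillic-script words in `text`."""
    return Counter(m.group(0).lower() for m in CYRILLIC_WORD.finditer(text))


def filter_by_threshold(freqs, threshold):
    """Return the sorted words whose count is at least `threshold`."""
    return sorted(word for word, count in freqs.items() if count >= threshold)
```

In the workflow described above, a low threshold (10) is applied when building the frequency list and a higher one (300) when selecting the final vocabulary; with this sketch that would correspond to calling `filter_by_threshold` twice with different values.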