
Add Ukrainian stemmer vocabulary data #34

Open
polaz wants to merge 5 commits into snowballstem:master from polaz:feature/#265-ukrainian-stemmer


polaz commented Jan 30, 2026

Summary

  • Add ukrainian/voc.txt — 57,868 words from Ukrainian Wikipedia (frequency threshold 300)
  • Add ukrainian/output.txt — stemmed equivalents
  • Add ukrainian/COPYING — generation instructions and CC BY-SA license

The vocabulary was generated from the ukwiki-latest-pages-articles.xml.bz2 dump dated 2025-01-02 using scripts/wikipedia-dump-to-freq with Cyrillic script and threshold 10, then filtered with scripts/freq-to-voc at threshold 300.
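The thresholding step can be pictured with a minimal Python sketch. It is not the code of scripts/freq-to-voc: the `<count> <word>` line format, the inclusive comparison, the function names, and the sample counts are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_freq_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (count, word) pairs from lines assumed to look like '<count> <word>'."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            count = int(parts[0])
        except ValueError:
            continue  # first field is not a number
        yield count, parts[1]


def filter_by_threshold(lines: Iterable[str], threshold: int) -> List[str]:
    """Keep words seen at least `threshold` times (inclusive is an assumption)."""
    words = {word for count, word in parse_freq_lines(lines) if count >= threshold}
    return sorted(words)


# Hypothetical frequency data, not taken from the real dump.
sample = ["512 мова", "299 рідкісне", "300 слово"]
kept = filter_by_threshold(sample, 300)
```

With the sample above, only the words counted 300 times or more survive, which mirrors how a higher threshold shrinks the frequency list into a smaller vocabulary.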

Related PRs

Branch name: feature/#265-ukrainian-stemmer (matches across all three repos as required by CONTRIBUTING).

- Add voc.txt (57,868 words) from Ukrainian Wikipedia dump (2025-01-02)
- Add output.txt with stemmed equivalents
- Add COPYING with generation instructions and CC BY-SA license

Closes snowballstem/snowball#265

Copilot AI left a comment

Pull request overview

Adds Ukrainian stemmer test data (vocabulary + expected stemming output) derived from a Ukrainian Wikipedia dump, along with a subdirectory-specific COPYING file describing generation steps and licensing.

Changes:

  • Add ukrainian/voc.txt (Ukrainian vocabulary list from Wikipedia frequency data).
  • Add ukrainian/output.txt (stemmer output corresponding line-by-line to voc.txt).
  • Add ukrainian/COPYING (generation commands, dump date, and licensing reference).
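The line-by-line correspondence between voc.txt and output.txt can be checked with a short Python sketch. The helper names are hypothetical, and the sample stems are placeholders chosen for illustration, not actual output of the Ukrainian stemmer.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def pair_lines(voc: List[str], output: List[str]) -> List[Tuple[str, str]]:
    """Pair each vocabulary word with the stem on the same line number."""
    if len(voc) != len(output):
        raise ValueError(
            f"line count mismatch: {len(voc)} words vs {len(output)} stems"
        )
    return list(zip(voc, output))


# Placeholder data; real use would be
# pair_lines(read_lines("ukrainian/voc.txt"), read_lines("ukrainian/output.txt"))
pairs = pair_lines(["книга", "книги"], ["книг", "книг"])
```

A mismatch in line counts raises an error early, which is usually the first symptom of a stale output.txt.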

Reviewed changes

Copilot reviewed 1 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ukrainian/COPYING | Documents how the Ukrainian voc/output data was generated and specifies licensing/source. |
| ukrainian/voc.txt | Adds the Ukrainian vocabulary dataset used for stemming validation. |
| ukrainian/output.txt | Adds the expected stemmed forms for the vocabulary list. |

polaz added 2 commits January 30, 2026 14:40
- Explicitly state that both files are licensed as CC BY-SA, as output.txt is a derivative of voc.txt.
- Explicitly state the license version for Wikipedia text content.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 3 changed files in this pull request and generated 3 comments.


polaz added 2 commits January 30, 2026 18:50
- Use full language name in stemwords invocation (-l ukrainian)
- Match standard Wikipedia license phrasing
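For reference, a stemwords run with the full language name might be scripted as in the sketch below. The `-l`, `-i`, and `-o` options are the language, input file, and output file options of stemwords; the binary location `./stemwords`, the helper names, and the file paths are assumptions, and the run only works with a build that includes the Ukrainian stemmer.

```python
import subprocess
from typing import List


def stemwords_command(
    language: str, voc_path: str, out_path: str, binary: str = "./stemwords"
) -> List[str]:
    """Build the argument list for a stemwords run (binary path is an assumption)."""
    return [binary, "-l", language, "-i", voc_path, "-o", out_path]


def regenerate_output(language: str, voc_path: str, out_path: str) -> None:
    """Run stemwords and fail loudly if it exits with a non-zero status."""
    subprocess.run(stemwords_command(language, voc_path, out_path), check=True)


# Only the command is built here; call regenerate_output(...) to actually run it.
cmd = stemwords_command("ukrainian", "ukrainian/voc.txt", "ukrainian/output.txt")
```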

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.


