refactor: generalize dataset indexing from language-based to dataset_id-based #34

federetyk · 2026-01-16T10:42:41Z

Addresses #33

Description

This PR generalizes dataset indexing within tasks from the Language enum to arbitrary string identifiers (dataset_id). The current architecture limits each task to at most one dataset per language, which rules out tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a languages_to_dataset_ids() method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new get_dataset_language() method maps datasets back to their language for proper per-language result aggregation, returning None for cross-lingual or multilingual datasets.
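As a rough illustration of the two hooks described above, here is a minimal sketch, not the code in this PR. The Language members, the use of the enum value as the default dataset_id, the list-of-strings stand-in for Dataset, and the SimpleTask and MultiDatasetTask classes are all assumptions made for the example.

```python
from __future__ import annotations

from enum import Enum


class Language(Enum):
    # Hypothetical members; the real enum lives in the library.
    EN = "en"
    ES = "es"


class Task:
    """Simplified stand-in for the Task base class."""

    def __init__(self, languages: list[Language], split: str = "test") -> None:
        self.languages = list(languages)
        # Datasets are keyed by dataset_id (str) rather than by Language.
        self.datasets: dict[str, list[str]] = {
            dataset_id: self.load_dataset(dataset_id, split)
            for dataset_id in self.languages_to_dataset_ids(self.languages)
        }

    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        # Assumed default: one dataset per language, identified by the enum value.
        return [language.value for language in languages]

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Assumed default: invert the 1:1 mapping; unknown ids map to None.
        try:
            return Language(dataset_id)
        except ValueError:
            return None

    def load_dataset(self, dataset_id: str, split: str) -> list[str]:
        raise NotImplementedError


class SimpleTask(Task):
    """Existing-style task: relies on the default 1:1 mapping."""

    def load_dataset(self, dataset_id: str, split: str) -> list[str]:
        return [f"{dataset_id}/{split}"]  # placeholder for real data loading


class MultiDatasetTask(Task):
    """Hypothetical task with two English datasets and one cross-lingual one."""

    def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
        ids: list[str] = []
        if Language.EN in languages:
            ids += ["en_a", "en_b"]
        if Language.EN in languages and Language.ES in languages:
            ids.append("en-es")
        return ids

    def get_dataset_language(self, dataset_id: str) -> Language | None:
        # Monolingual datasets report their language; cross-lingual ones do not.
        return Language.EN if dataset_id.startswith("en_") else None

    def load_dataset(self, dataset_id: str, split: str) -> list[str]:
        return [f"{dataset_id}/{split}"]  # placeholder for real data loading
```

With this shape, a task that never overrides either method keeps one dataset per language, while a task that overrides both can expose any number of identifiers per language.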

Changes:

Rename lang_datasets: dict[Language, Dataset] to datasets: dict[str, Dataset] in Task base class
Add languages_to_dataset_ids(languages) -> list[str] method with default backward-compatible mapping
Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split) across all tasks
Add get_dataset_language(dataset_id) -> Language | None method for per-language aggregation
Add language field to MetricsResult to track dataset language
Update _aggregate_per_language() to group by the language field, skipping datasets marked as cross-lingual or multilingual
Update all task implementations to use the new method signature
Add unit test for multi-dataset task scenarios
Fix minor issues in some files in examples/
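The per-language aggregation change listed above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the MetricsResult fields other than language, the standalone aggregate_per_language function, and the choice of an unweighted mean are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class Language(Enum):
    # Hypothetical members; the real enum lives in the library.
    EN = "en"
    ES = "es"


@dataclass
class MetricsResult:
    # Only `language` is named in the PR; the other fields are assumed.
    dataset_id: str
    score: float
    language: Language | None = None


def aggregate_per_language(results: list[MetricsResult]) -> dict[Language, float]:
    """Average scores per language, skipping datasets with no single language."""
    grouped: dict[Language, list[float]] = defaultdict(list)
    for result in results:
        if result.language is None:
            # Cross-lingual or multilingual dataset: excluded from this view.
            continue
        grouped[result.language].append(result.score)
    return {language: mean(scores) for language, scores in grouped.items()}


results = [
    MetricsResult("en_a", 0.5, Language.EN),
    MetricsResult("en_b", 1.0, Language.EN),
    MetricsResult("en-es", 0.0, None),
    MetricsResult("es", 0.25, Language.ES),
]
per_language = aggregate_per_language(results)
```

Grouping on the language field instead of the dataset key means two datasets in the same language are combined into one per-language figure, and a dataset whose language is None does not affect any of them.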

All tests pass, and examples/run_multiple_models.py produces results consistent with the main branch.

Checklist

Added new tests for new functionality
Tested locally with example tasks
Code follows project style guidelines
Documentation updated
No new warnings introduced

federetyk added 5 commits January 15, 2026 11:26

refactor: generalize dataset indexing from language-based to dataset_id-based

b00e4c5

fix: solve issues in example files

17b1897

fix: add language field to MetricsResult for proper per-language aggregation

e16f8dd

style: update docstrings to comply with NumPy style

e254bc2

chore: merge upstream changes (v0.3.0, task renames, test refactor)

40810c2

This was referenced Jan 16, 2026

[FEATURE] Add MELO Benchmark datasets as a ranking task for job title normalization #30

Open

feat: add new ranking tasks for melo #37

Open
