refactor: generalize dataset indexing from language-based to dataset_id-based #34
Addresses #33
Description
This PR generalizes dataset indexing within tasks from the `Language` enum to arbitrary string identifiers (`dataset_id`). The current architecture limits each task to at most one dataset per language, which prevents supporting tasks with multiple monolingual datasets per language, cross-lingual datasets, or multilingual datasets.

The refactor introduces a `languages_to_dataset_ids()` method with a default 1:1 mapping that preserves backward compatibility for existing tasks. Tasks that require more complex dataset structures can override this method to return custom identifiers. A new `get_dataset_language()` method maps datasets back to their language for proper per-language result aggregation, returning `None` for cross-lingual or multilingual datasets.

Changes:
- Changed `lang_datasets: dict[Language, Dataset]` to `datasets: dict[str, Dataset]` in Task base class
- Added `languages_to_dataset_ids(languages) -> list[str]` method with default backward-compatible mapping
- Renamed `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)` across all tasks
- Added `get_dataset_language(dataset_id) -> Language | None` method for per-language aggregation
- Added `language` field to `MetricsResult` to track dataset language
- Updated `_aggregate_per_language()` to group by the `language` field, skipping datasets marked as cross-lingual or multilingual
- Updated `examples/`

All tests pass, and the output of `examples/run_multiple_models.py` is consistent with the main branch.

Checklist