Added a datatrove-based pipeline for filtering tokenized data using scores. #235
base: master
Conversation
BlueCrescent commented on Jul 25, 2025
- Included an example configuration file.
- Added datatrove and pydantic-settings to requirements.
- Note that modalities is also required for the pipeline to work, but it is not included in the requirements file.
Pull Request Overview
This PR implements a data filtering pipeline using datatrove for filtering tokenized data based on scores. The pipeline processes JSONL files containing scores for data samples and filters corresponding tokenized datasets based on configurable thresholds.
- Adds a complete datatrove-based filtering pipeline with score parsing and data filtering components (a sketch of the idea follows this list)
- Introduces configuration management using pydantic-settings for both local and Slurm execution environments
- Updates dependencies to include datatrove and pydantic-settings
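To make the filtering idea concrete, here is a minimal sketch of threshold-based filtering over a JSONL score file. All names and the file layout are illustrative assumptions; the actual PR implements this as datatrove pipeline steps (`ScoresParser`, `DataFiltering`).

```python
import json
from pathlib import Path


# Hypothetical sketch, not the PR's code: keep only samples whose scores
# meet every configured threshold.
def passing_indices(scores_jsonl: Path, thresholds: dict[str, float]) -> list[int]:
    keep: list[int] = []
    with scores_jsonl.open() as f:
        for idx, line in enumerate(f):
            scores = json.loads(line)  # e.g. {"quality": 0.83, "toxicity": 0.1}
            if all(scores.get(name, float("-inf")) >= cutoff
                   for name, cutoff in thresholds.items()):
                keep.append(idx)
    return keep


# Example: keep rows whose "quality" score is at least 0.8.
# passing_indices(Path("scores/shard_00.jsonl"), {"quality": 0.8})
```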
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/ml_filter/data_processing/score_based_filtering/step_score_parsing.py | Implements ScoresParser class for reading JSONL score files and mapping to tokenized data |
| src/ml_filter/data_processing/score_based_filtering/step_data_filtering.py | Implements DataFiltering class for filtering datasets based on score thresholds |
| src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py | Main pipeline orchestration with configuration management and execution settings |
| pyproject.toml | Adds datatrove and pydantic-settings dependencies |
| configs/data_processing/example_filter_pipeline_config.yaml | Example configuration file for the filtering pipeline |
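For orientation, configuration via pydantic-settings usually follows a pattern like the sketch below. The field names are guesses for illustration, not the schema from the example YAML file.

```python
from pathlib import Path

from pydantic_settings import BaseSettings


# Illustrative only -- these field names are assumptions, not the PR's schema.
class FilterPipelineSettings(BaseSettings):
    scores_folder: Path            # JSONL files with per-sample scores
    tokenized_data_folder: Path    # packed datasets to be filtered
    output_folder: Path            # where filtered datasets are written
    thresholds: dict[str, float]   # score name -> minimum value to keep a sample


# Values can come from init kwargs or environment variables; with extra
# setup they could be loaded from a YAML file like the shipped example config.
settings = FilterPipelineSettings(
    scores_folder=Path("scores"),
    tokenized_data_folder=Path("data/tokenized"),
    output_folder=Path("data/filtered"),
    thresholds={"quality": 0.8},
)
```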
Comments suppressed due to low confidence (1)
src/ml_filter/data_processing/score_based_filtering/filter_pipeline.py:241
- [nitpick] The error message could be more helpful by providing an example of how to use the FilterPipelineBuilder class directly or where to find documentation.
"and use the FilterPipelineBuilder class directly."
(Six resolved Copilot review threads — two each on step_data_filtering.py, step_score_parsing.py, and filter_pipeline.py; most are marked outdated.)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…g pipeline and adapted the codebase for new changes from main
… execution settings
…dle duplicates in score parsing
```python
        document = self.get_document_from_dict(doc_content, filepath, 0)
        return [document]

    def _parse_scores_jsonl_file(self, filepath: str) -> tuple[str, list[dict[str, float]]]:
```
The scores are emitted in lexicographic order of the document IDs: IDs such as sample1, sample2, sample10 will be reordered to sample1, sample10, sample2, so the thresholds get applied to the wrong rows in the packed dataset. Please preserve the original file order (e.g. rely on insertion order or track the original line index when deduplicating).
Fixed in a0698c2
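For illustration, an order-preserving deduplication might look like the sketch below (the record keys are assumptions; the actual change is in a0698c2). A plain dict preserves insertion order in Python 3.7+, so first occurrences stay in file order and no lexicographic sort ever happens.

```python
import json


# Hypothetical sketch: deduplicate score rows by document ID while keeping
# the original file order; "sample2" stays before "sample10".
def parse_scores_preserving_order(filepath: str) -> list[dict[str, float]]:
    rows_by_id: dict[str, dict[str, float]] = {}
    with open(filepath) as f:
        for line in f:
            record = json.loads(line)  # assumed shape: {"id": ..., "scores": {...}}
            # First occurrence wins; later duplicates are ignored.
            rows_by_id.setdefault(record["id"], record["scores"])
    return list(rows_by_id.values())
```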
(Two resolved review threads on step_score_parsing.py.)
```python
        output_folder (Path): The folder where the filtered datasets will be saved.
        thresholds (dict[str, float]): A dictionary where keys are score names and values are the
            thresholds to filter samples.
        hash_to_base_file_mapping_csv (Path): A CSV file mapping base file hashes to their corresponding paths.
```
Seems like an artifact
Removed in f2e8f24
(One resolved review thread on step_data_filtering.py.)
```python
        sbatch_args = values.get("sbatch_args") or {}
        if isinstance(sbatch_args, _DictConfig):
            sbatch_args = OmegaConf.to_container(sbatch_args, resolve=True)  # type: ignore[arg-type]
```
Won't this throw an error unless you import OmegaConf?
OmegaConf is imported.
Hmm? `from omegaconf import DictConfig as _DictConfig` does not import `OmegaConf`. I am not sure why the code is not throwing an error here; `from omegaconf import DictConfig as _DictConfig, OmegaConf` should be the way.
Fixed in e791792
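For reference, a minimal sketch of the corrected import and conversion (the surrounding validator is abbreviated into a standalone function here):

```python
from omegaconf import DictConfig as _DictConfig, OmegaConf


def normalize_sbatch_args(values: dict) -> dict:
    """Abbreviated sketch of the validator logic discussed above."""
    sbatch_args = values.get("sbatch_args") or {}
    if isinstance(sbatch_args, _DictConfig):
        # Convert an OmegaConf DictConfig into a plain dict.
        sbatch_args = OmegaConf.to_container(sbatch_args, resolve=True)
    return sbatch_args
```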
(One resolved, outdated review thread on filter_pipeline.py.)
```diff
@@ -0,0 +1,173 @@
+import json
+import logging
```
Unused import, please remove it
Removed in 85e3f5c
| """ | ||
| Maps a base file path to the corresponding tokenized data path. | ||
| Args: | ||
| base_file_path (str): The path of the base file. |
Please update the docstrings to reflect the new changes
Fixed in 1c3656c
```python
_TOKENIZER_CACHE: dict[str, Any] = {}

HEADER_SIZE = 64  # Mimics EmbeddedStreamData.HEADER_SIZE_IN_BYTES (simplified for tests)
DATA_SECTION_LEN_BYTES = 8
```
Unused constants `DATA_SECTION_LEN_BYTES` and `TOKEN_SIZE_DESC_LEN_BYTES`
Removed in 1c3656c
```python
    from modalities.dataloader.filter_packed_data import filter_dataset
except ImportError:
    logging.error("The filtering pipeline requires the 'modalities' package to be installed.")
    exit(1)
```
Using `exit(1)` is not ideal; I would say something like

```python
try:
    from modalities.dataloader.filter_packed_data import filter_dataset
except ImportError as exc:
    raise ImportError(
        "The filtering pipeline requires the optional dependency 'modalities'. "
        "Install it via `pip install modalities` and try again."
    ) from exc
```

would be better.
Fixed in 379df23
| """ | ||
|
|
||
| name = "ScoresParser" | ||
| # type = "Parser" |
Please remove this line altogether
Fixed in 379df23
(One resolved review thread on step_score_parsing.py.)
AbasKhan left a comment
Apart from a minor change, the rest looks really good. Well done!
```python
    ]
    return pipeline


if __name__ == "__main__":
```
Do we need this here? I think we should have an entry point in main.py instead.
Added in e19f4a0
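A rough sketch of what such a main.py entry point could look like (`run_filter_pipeline` is an assumed helper name, not necessarily what e19f4a0 added):

```python
# main.py -- hypothetical entry point; the actual names may differ.
import argparse

from ml_filter.data_processing.score_based_filtering.filter_pipeline import (
    run_filter_pipeline,  # assumed helper exposed by the pipeline module
)


def main() -> None:
    parser = argparse.ArgumentParser(description="Score-based filtering of tokenized data")
    parser.add_argument("config", help="Path to the pipeline YAML config")
    args = parser.parse_args()
    run_filter_pipeline(args.config)


if __name__ == "__main__":
    main()
```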
AbasKhan left a comment
I think we can merge it, but I would suggest adding Mehdi or Max as a second reviewer.