Conversation

@tyoc213 (Contributor) commented Feb 26, 2025

#42 allows calling the script on full subsets of https://huggingface.co/datasets/HuggingFaceTB/finemath/ (like finemath-3plus):

python data_prep/convert_dataset_hf.py \
  --dataset HuggingFaceTB/finemath --data_subset finemath-3plus \
  --out_root my-copy-math-finemath3plus --splits train_finemath_3plus val_finemath_3plus \
  --concat_tokens 1024 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>'

Note

The only thing I am not sure about: the original dataset only has a train split, so I don't know whether the derived splits guarantee that the same row is not used in both the training and validation sets.

I kept the other options (concat_tokens, tokenizer, eos_text) as they were.

The splits look like this:


Big splits for

  • finemath 3 plus is about 21.4M rows
  • finemath 4 plus is about 6.7M rows
  • infiwebmath 3 plus is about 13.9M rows
  • infiwebmath 4 plus is about 6.3M rows

which result in:

  • train_finemath_3plus and val_finemath_3plus (19M and 1.9M rows)
  • train_finemath_4plus and val_finemath_4plus (6M and 600k rows)
  • train_infiwebmath_3plus and val_infiwebmath_3plus (12M and 1.2M rows)
  • train_infiwebmath_4plus and val_infiwebmath_4plus (5M and 500k rows)

And also some normal "tiny" splits

  • train 3M rows
  • train_small 100k rows
  • train_xsmall 30k rows
  • train_xxsmall 10k rows

and

  • val 300k rows
  • val_small 10k rows
  • val_xsmall 3k rows
  • val_xxsmall 100 rows

@matdmiller (Contributor) commented:
from datasets import load_dataset, DatasetDict

# Load without streaming
dataset = load_dataset("HuggingFaceTB/finemath", "finemath-4plus")

# Get the appropriate split
if "train" in dataset:
    full_dataset = dataset["train"]
else:
    full_dataset = dataset

# Split with a fixed seed
split_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)

# Define paths for the new datasets
train_path = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus/train"
valid_path = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus/valid"

# # Save using 'save_to_disk' with the 'parquet' format specified
# split_dataset["train"].save_to_disk(
#     train_path, 
#     max_shard_size="300MB",  # Same chunk size as original dataset
#     storage_options={"file_format": "parquet"}
# )

# split_dataset["test"].save_to_disk(
#     valid_path, 
#     max_shard_size="300MB",  # Same chunk size as original dataset
#     storage_options={"file_format": "parquet"}
# )

combined = DatasetDict({
    "train": split_dataset["train"],
    "valid": split_dataset["test"]
})
combined.save_to_disk(
    "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus",
    max_shard_size="300MB",
    storage_options={"file_format": "parquet"}
)

combined.push_to_hub("matdmiller/finemath-4plus-split", private=True, max_shard_size="300MB")

print("Dataset split and saved in chunked parquet format")

The script above runs successfully, including the upload. It still needs some additional changes, though: it should create sub-datasets like the original finemath (full dataset, 1M, 100k, 10k, 1k rows), and it should put the train and valid datasets into separate folders. I think we also want it to generate the dataset dict with dataset info describing what is in the repo.
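A rough sketch of the sub-dataset step, continuing from the `split_dataset` object in the script above; the size tiers (full, 1M, 100k, 10k, 1k), the 10% validation ratio, and the output layout are assumptions based on this description, not final names:

```python
from datasets import DatasetDict

# Hypothetical size tiers mirroring the original finemath-style sub-datasets.
SUBSET_SIZES = {"full": None, "1M": 1_000_000, "100k": 100_000, "10k": 10_000, "1k": 1_000}

out_root = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus"

for name, n_rows in SUBSET_SIZES.items():
    train = split_dataset["train"]
    valid = split_dataset["test"]
    if n_rows is not None:
        # select(range(n)) keeps the first n rows of the already-shuffled split;
        # validation gets roughly a tenth to preserve the 0.1 test_size ratio.
        train = train.select(range(min(n_rows, len(train))))
        valid = valid.select(range(min(n_rows // 10, len(valid))))
    subset = DatasetDict({"train": train, "valid": valid})
    # Write train and valid under separate folders per tier, e.g. .../1M/train, .../1M/valid
    for split_name, ds in subset.items():
        ds.save_to_disk(f"{out_root}/{name}/{split_name}", max_shard_size="300MB")
```

Each tier then lives in its own folder with train and valid separated, which also makes it straightforward to record per-tier row counts in a dataset info file afterwards.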

We also want to do the same thing for this dataset, but we want to remove aya from the mixture first since our models are not multilingual: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture
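A minimal sketch of that filtering, assuming the mixture exposes a `source` column naming each row's originating subset and that aya rows can be recognized by "aya" appearing in that field; both assumptions should be checked against the actual schema:

```python
from datasets import load_dataset

mixture = load_dataset("allenai/tulu-3-sft-olmo-2-mixture", split="train")

# Drop anything whose source mentions aya (multilingual data we don't want).
filtered = mixture.filter(
    lambda example: "aya" not in str(example.get("source", "")).lower(),
    num_proc=8,
)

print(f"kept {len(filtered)} of {len(mixture)} rows")
```

The filtered dataset can then go through the same split/save/push steps as the finemath script above.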

@tyoc213 (Contributor, Author) commented Mar 5, 2025

@matdmiller should that be done in this PR, or should I make a separate one specific to tulu 3?

@tyoc213 (Contributor, Author) commented Mar 22, 2025

To tokenize as suggested, we need to create the text column. We have two options to achieve that:

  • we can do that in the post_process step of our first script, or
  • we can use convert_finetuning_dataset.py, but for the moment this fails with the errors below (for these two preproc functions):
$ python data_prep/convert_finetuning_dataset.py --dataset tyoc213/split-NuminaMath-CoT  --data_subset 1k --splits train test --out_root out --tokenizer HuggingFaceTB/SmolLM2-135M --preprocessor preproc:pre_numina --num_workers 1
Converting train to MDS format...
train: 2it [00:05,  2.84s/it]
Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py", line 116, in <module>
    convert_finetuning_dataset_from_args(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 329, in convert_finetuning_dataset_from_args
    convert_finetuning_dataset(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 216, in convert_finetuning_dataset
    for sample in tqdm(samples, desc=split_name):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
               ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 84, in generate_samples
    yield {k: v[idx] for k, v in batch.items()}
              ~^^^^^
IndexError: list index out of range

After removing the out folder, running on the other dataset:

python data_prep/convert_finetuning_dataset.py --dataset tyoc213/split-tulu-3-sft-olmo-2-mixture  --data_subset 1k --splits train test --out_root oooiiiuuu --tokenizer HuggingFaceTB/SmolLM2-135M --preprocessor preproc:pre_tulu --num_workers 1
Converting train to MDS format...
train: 0it [00:04, ?it/s]
Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py", line 116, in <module>
    convert_finetuning_dataset_from_args(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 329, in convert_finetuning_dataset_from_args
    convert_finetuning_dataset(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 216, in convert_finetuning_dataset
    for sample in tqdm(samples, desc=split_name):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
               ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 77, in generate_samples
    for batch in loader:
                 ^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 43, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 398, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 172, in collate
    key: collate(
         ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 207, in collate
    raise RuntimeError("each element in list of batch should be of equal size")
RuntimeError: each element in list of batch should be of equal size
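Both failures come out of the batched iteration: the DataLoader's default_collate (and the per-batch indexing in generate_samples) assumes every field in a batch has the same length, so ragged list columns or preprocessors that emit differently shaped rows trip it. For reference, here is a minimal sketch of what a pre_tulu-style preprocessor could look like, assuming the rows carry a `messages` list of role/content dicts and that the goal is flat prompt/response strings; the actual preproc functions in this PR may differ:

```python
from typing import Any


def pre_tulu(example: dict[str, Any]) -> dict[str, str]:
    """Flatten a chat-style `messages` list into prompt/response strings.

    Assumes each message is a dict with 'role' and 'content' keys; the last
    message is the assistant reply and becomes the response, everything
    before it becomes the prompt.
    """
    messages = example["messages"]
    prompt_parts = [f"{m['role']}: {m['content']}" for m in messages[:-1]]
    return {
        "prompt": "\n".join(prompt_parts),
        "response": messages[-1]["content"],
    }
```

Emitting fixed, flat string fields like this sidesteps the "each element in list of batch should be of equal size" error, since the collate step only has to batch strings rather than ragged lists.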

@tyoc213 tyoc213 force-pushed the dataset-finemath branch 2 times, most recently from b7edac8 to 694003a on March 22, 2025 05:50
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from b4a5e1a to ecb59fb on April 7, 2025 02:26
@tyoc213 tyoc213 force-pushed the dataset-finemath branch 5 times, most recently from 7152b34 to 6b41a39 on April 24, 2025 19:05
@tyoc213 tyoc213 force-pushed the dataset-finemath branch 3 times, most recently from f09325a to 23bb866 on May 31, 2025 01:42
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from 7141fd2 to e462b6a on June 7, 2025 06:33
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from f64438d to 5fc8ee2 on June 17, 2025 03:00
tyoc213 and others added 14 commits August 30, 2025 10:50
Big splits for

- finemath 3 plus is about 21.4M rows
- finemath 4 plus is about 6.7M rows
- infiwebmath 3 plus is about 13.9M rows
- infiwebmath 4 plus is about 6.3M rows

which result in:

* train_finemath_3plus and val_finemath_3plus
* train_finemath_4plus and val_finemath_4plus
* train_infiwebmath_3plus and val_infiwebmath_3plus
* train_infiwebmath_4plus and val_infiwebmath_4plus

And also some normal "tiny" splits

* train 3M rows
* train_small 100k rows
* train_xsmall 30k rows
* train_xxsmall 10k rows

and

* val 300k rows
* val_small 10k rows
* val_xsmall 3k rows
* val_xxsmall 100 rows
As finemath only contains a train split, we need to generate the
training and validation splits from the IterableDataset, and the only
tools we have are shuffle, skip, and take.

Basically we follow this structure to create the corresponding datasets (a minimal sketch follows this list):

* train/val pairs, where the training amount is 9 times the validation amount
* for validation, skip 9 times the validation amount, then take the validation amount
* for training, take the training amount from the start
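A minimal sketch of that scheme, assuming a streamed finemath subset; the seed, buffer size, and the 10k validation amount are placeholders for illustration, not the values used in the PR:

```python
from datasets import load_dataset

val_amount = 10_000              # placeholder validation size
train_amount = 9 * val_amount    # training is 9x the validation amount

# Stream the single original split and shuffle with a fixed seed so that
# both derived splits see the same ordering of rows.
stream = load_dataset(
    "HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True
)
stream = stream.shuffle(seed=42, buffer_size=10_000)

# Training: the first train_amount rows of the shuffled stream.
train_split = stream.take(train_amount)

# Validation: skip past the training window, then take val_amount rows,
# so the two derived splits never share a row.
val_split = stream.skip(train_amount).take(val_amount)
```

With a fixed seed the shuffled order is reproducible, so the skip/take windows stay disjoint between the two derived splits.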
Take the dataset as is.
Annotate the number of rows of each train/val pair.
Exported DataSplitConstants, DatsetConstants, and CONSTS, and added
add_dataset_config in `convert_dataset_hf` to allow adding new datasets
dynamically.
Tokenization works OK with finemath as it already has the `text` column. We can:

* post-process when downloading the dataset in `split_hf_dataset.py`, or
* call convert_finetuning_dataset and create the preprocessors at
`data_prep/preproc` and start from there (this is a WIP)
…that no errors are shown because of different sizes in the batch (as samples are not padded up to a length)
@tyoc213 tyoc213 closed this Nov 7, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Dec 5, 2025
tyoc213 added a commit that referenced this pull request Dec 5, 2025
tyoc213 added a commit that referenced this pull request Feb 5, 2026