Conversation

@tyoc213 (Contributor) commented Feb 26, 2025

#42 allows calling the script on full subsets of https://huggingface.co/datasets/HuggingFaceTB/finemath/ (like finemath-3plus):

python data_prep/convert_dataset_hf.py \
  --dataset HuggingFaceTB/finemath --data_subset finemath-3plus \
  --out_root my-copy-math-finemath3plus --splits train_finemath_3plus val_finemath_3plus \
  --concat_tokens 1024 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>'

Note

The only thing I am not sure about: the original dataset only has a train split, so I don't know whether the derived splits guarantee that the same row is not used in both the training and validation sets.

I kept the other options (concat_tokens, tokenizer, eos_text) as they were.

The splits look like this:


Big splits for

  • finemath 3 plus is about 21.4M rows
  • finemath 4 plus is about 6.7M rows
  • infiwebmath 3 plus is about 13.9M rows
  • infiwebmath 4 plus is about 6.3M rows

which result in:

  • train_finemath_3plus and val_finemath_3plus (19M and 1.9M rows)
  • train_finemath_4plus and val_finemath_4plus (6M and 600k rows)
  • train_infiwebmath_3plus and val_infiwebmath_3plus (12M and 1.2M rows)
  • train_infiwebmath_4plus and val_infiwebmath_4plus (5M and 500k rows)

And also some normal "tiny" splits

  • train 3M rows
  • train_small 100k rows
  • train_xsmall 30k rows
  • train_xxsmall 10k rows

and

  • val 300k rows
  • val_small 10k rows
  • val_xsmall 3k rows
  • val_xxsmall 100 rows

@matdmiller (Contributor) commented:
from datasets import load_dataset, DatasetDict

# Load without streaming
dataset = load_dataset("HuggingFaceTB/finemath", "finemath-4plus")

# Get the appropriate split
if "train" in dataset:
    full_dataset = dataset["train"]
else:
    full_dataset = dataset

# Split with a fixed seed
split_dataset = full_dataset.train_test_split(test_size=0.1, seed=42)

# Define paths for the new datasets
train_path = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus/train"
valid_path = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus/valid"

# # Save using 'save_to_disk' with the 'parquet' format specified
# split_dataset["train"].save_to_disk(
#     train_path, 
#     max_shard_size="300MB",  # Same chunk size as original dataset
#     storage_options={"file_format": "parquet"}
# )

# split_dataset["test"].save_to_disk(
#     valid_path, 
#     max_shard_size="300MB",  # Same chunk size as original dataset
#     storage_options={"file_format": "parquet"}
# )

combined = DatasetDict({
    "train": split_dataset["train"],
    "valid": split_dataset["test"]
})
combined.save_to_disk(
    "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus",
    max_shard_size="300MB",
    storage_options={"file_format": "parquet"}
)

combined.push_to_hub("matdmiller/finemath-4plus-split", private=True, max_shard_size="300MB")

print("Dataset split and saved in chunked parquet format")

The script above runs successfully, including the upload. It still needs some additional changes, though: it should create sub-datasets like the original finemath (full dataset, 1M, 100k, 10k, 1k rows), and it should put the train and valid datasets into separate folders. I think we also want it to generate the dataset dict with dataset info describing what is in the repo.
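A rough sketch of the sub-dataset step, continuing from the `split_dataset` object in the script above; the size tiers (full, 1M, 100k, 10k, 1k), the 10% validation ratio, and the output layout are assumptions based on this description, not final names:

```python
from datasets import DatasetDict

# Hypothetical size tiers mirroring the original finemath-style sub-datasets.
SUBSET_SIZES = {"full": None, "1M": 1_000_000, "100k": 100_000, "10k": 10_000, "1k": 1_000}

out_root = "/home/mathewmiller/projects/llm_foundry/datasets/finemath/finemath-4plus"

for name, n_rows in SUBSET_SIZES.items():
    train = split_dataset["train"]
    valid = split_dataset["test"]
    if n_rows is not None:
        # select(range(n)) keeps the first n rows of the already-shuffled split;
        # validation gets roughly a tenth to preserve the 0.1 test_size ratio.
        train = train.select(range(min(n_rows, len(train))))
        valid = valid.select(range(min(n_rows // 10, len(valid))))
    subset = DatasetDict({"train": train, "valid": valid})
    # Write train and valid under separate folders per tier, e.g. .../1M/train, .../1M/valid
    for split_name, ds in subset.items():
        ds.save_to_disk(f"{out_root}/{name}/{split_name}", max_shard_size="300MB")
```

Each tier then lives in its own folder with train and valid separated, which also makes it straightforward to record per-tier row counts in a dataset info file afterwards.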

We also want to do the same thing for this dataset, but we want to remove aya from the mixture first since our models are not multilingual: https://huggingface.co/datasets/allenai/tulu-3-sft-olmo-2-mixture
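A minimal sketch of that filtering, assuming the mixture exposes a `source` column naming each row's originating subset and that aya rows can be recognized by "aya" appearing in that field; both assumptions should be checked against the actual schema:

```python
from datasets import load_dataset

mixture = load_dataset("allenai/tulu-3-sft-olmo-2-mixture", split="train")

# Drop anything whose source mentions aya (multilingual data we don't want).
filtered = mixture.filter(
    lambda example: "aya" not in str(example.get("source", "")).lower(),
    num_proc=8,
)

print(f"kept {len(filtered)} of {len(mixture)} rows")
```

The filtered dataset can then go through the same split/save/push steps as the finemath script above.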

@tyoc213 (Contributor, Author) commented Mar 5, 2025

@matdmiller should that be done in this PR, or should I make a separate one specific to tulu 3?

@tyoc213 (Contributor, Author) commented Mar 22, 2025

To tokenize as suggested, we need to create the text column. We have two options to achieve that:

  • we can do that in the post_process step of our first script, or
  • we can use convert_finetuning_dataset.py, but for the moment this fails with the errors below (for these two preproc functions):
$ python data_prep/convert_finetuning_dataset.py --dataset tyoc213/split-NuminaMath-CoT  --data_subset 1k --splits train test --out_root out --tokenizer HuggingFaceTB/SmolLM2-135M --preprocessor preproc:pre_numina --num_workers 1
Converting train to MDS format...
train: 2it [00:05,  2.84s/it]
Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py", line 116, in <module>
    convert_finetuning_dataset_from_args(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 329, in convert_finetuning_dataset_from_args
    convert_finetuning_dataset(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 216, in convert_finetuning_dataset
    for sample in tqdm(samples, desc=split_name):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
               ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 84, in generate_samples
    yield {k: v[idx] for k, v in batch.items()}
              ~^^^^^
IndexError: list index out of range

After removing the out folder, running on the other dataset:

python data_prep/convert_finetuning_dataset.py --dataset tyoc213/split-tulu-3-sft-olmo-2-mixture  --data_subset 1k --splits train test --out_root oooiiiuuu --tokenizer HuggingFaceTB/SmolLM2-135M --preprocessor preproc:pre_tulu --num_workers 1
Converting train to MDS format...
train: 0it [00:04, ?it/s]
Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py", line 116, in <module>
    convert_finetuning_dataset_from_args(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 329, in convert_finetuning_dataset_from_args
    convert_finetuning_dataset(
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 216, in convert_finetuning_dataset
    for sample in tqdm(samples, desc=split_name):
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
               ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 77, in generate_samples
    for batch in loader:
                 ^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
    data.reraise()
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/_utils.py", line 715, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 43, in fetch
    return self.collate_fn(data)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 398, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 172, in collate
    key: collate(
         ^^^^^^^^
  File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/torch/utils/data/_utils/collate.py", line 207, in collate
    raise RuntimeError("each element in list of batch should be of equal size")
RuntimeError: each element in list of batch should be of equal size
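Both failures come out of the batched iteration: the DataLoader's default_collate (and the per-batch indexing in generate_samples) assumes every field in a batch has the same length, so ragged list columns or preprocessors that emit differently shaped rows trip it. For reference, here is a minimal sketch of what a pre_tulu-style preprocessor could look like, assuming the rows carry a `messages` list of role/content dicts and that the goal is flat prompt/response strings; the actual preproc functions in this PR may differ:

```python
from typing import Any


def pre_tulu(example: dict[str, Any]) -> dict[str, str]:
    """Flatten a chat-style `messages` list into prompt/response strings.

    Assumes each message is a dict with 'role' and 'content' keys; the last
    message is the assistant reply and becomes the response, everything
    before it becomes the prompt.
    """
    messages = example["messages"]
    prompt_parts = [f"{m['role']}: {m['content']}" for m in messages[:-1]]
    return {
        "prompt": "\n".join(prompt_parts),
        "response": messages[-1]["content"],
    }
```

Emitting fixed, flat string fields like this sidesteps the "each element in list of batch should be of equal size" error, since the collate step only has to batch strings rather than ragged lists.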

@tyoc213 tyoc213 force-pushed the dataset-finemath branch 2 times, most recently from b7edac8 to 694003a on March 22, 2025 05:50
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from b4a5e1a to ecb59fb on April 7, 2025 02:26
@tyoc213 tyoc213 force-pushed the dataset-finemath branch 5 times, most recently from 7152b34 to 6b41a39 on April 24, 2025 19:05
@tyoc213 tyoc213 force-pushed the dataset-finemath branch 3 times, most recently from f09325a to 23bb866 on May 31, 2025 01:42
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from 7141fd2 to e462b6a on June 7, 2025 06:33
@tyoc213 tyoc213 force-pushed the dataset-finemath branch from f64438d to 5fc8ee2 on June 17, 2025 03:00
tyoc213 and others added 14 commits August 30, 2025 10:50
Big splits for

- finemath 3 plus is about 21.4M rows
- finemath 4 plus is about 6.7M rows
- infiwebmath 3 plus is about 13.9M rows
- infiwebmath 4 plus is about 6.3M rows

which result in:

* train_finemath_3plus and val_finemath_3plus
* train_finemath_4plus and val_finemath_4plus
* train_infiwebmath_3plus and val_infiwebmath_3plus
* train_infiwebmath_4plus and val_infiwebmath_4plus

And also some normal "tiny" splits

* train 3M rows
* train_small 100k rows
* train_xsmall 30k rows
* train_xxsmall 10k rows

and

* val 300k rows
* val_small 10k rows
* val_xsmall 3k rows
* val_xxsmall 100 rows
As finemath only contains a train split, we need to generate the
training and validation splits from the IterableDataset, and the only
tools we have are shuffle, skip, and take.

Basically we follow this structure to create the corresponding datasets (a minimal sketch follows this list):

* train/val pairs, where the training amount is 9 times the validation amount
* for validation, skip 9 times the validation amount, then take the validation amount
* for training, take the training amount from the start
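A minimal sketch of that scheme, assuming a streamed finemath subset; the seed, buffer size, and the 10k validation amount are placeholders for illustration, not the values used in the PR:

```python
from datasets import load_dataset

val_amount = 10_000              # placeholder validation size
train_amount = 9 * val_amount    # training is 9x the validation amount

# Stream the single original split and shuffle with a fixed seed so that
# both derived splits see the same ordering of rows.
stream = load_dataset(
    "HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True
)
stream = stream.shuffle(seed=42, buffer_size=10_000)

# Training: the first train_amount rows of the shuffled stream.
train_split = stream.take(train_amount)

# Validation: skip past the training window, then take val_amount rows,
# so the two derived splits never share a row.
val_split = stream.skip(train_amount).take(val_amount)
```

With a fixed seed the shuffled order is reproducible, so the skip/take windows stay disjoint between the two derived splits.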
Take the dataset as is.
Annotate the number of rows of each train/val pair.
Exported DataSplitConstants, DatsetConstants, and CONSTS, and added
add_dataset_config in `convert_dataset_hf` to allow adding new datasets
dynamically.
Tokenization works OK with finemath as it already has the `text` column. We can:

* post-process when downloading the dataset in `split_hf_dataset.py`, or
* call convert_finetuning_dataset and create the preprocessors at
`data_prep/preproc` and start from there (this is a WIP)
…that no errors are shown because of different sizes in the batch (as samples are not padded up to a length)
@tyoc213 tyoc213 closed this Nov 7, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Nov 26, 2025
tyoc213 added a commit that referenced this pull request Dec 5, 2025
tyoc213 added a commit that referenced this pull request Dec 5, 2025
tyoc213 added a commit that referenced this pull request Feb 5, 2026