#42 add some splits for HuggingFaceTB/finemath dataset #43
Conversation
The script above runs successfully, including the upload. It needs some additional changes though: create sub-datasets like the original finemath for the full dataset, 1M, 100k, 10k, 1k; put the train and valid datasets into separate folders; and generate the datasets dict with dataset info on what is in the repo (a sketch of one possible approach is below). We also want to do the same thing for this dataset, but we do want to remove …
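For the sub-dataset and folder layout, here is a minimal sketch of one way to do it with the `datasets` library. The repo id, config names, and split sizes are illustrative assumptions, not this PR's actual code; it assumes a recent `datasets` version where `push_to_hub` accepts `config_name` and `split`.

```python
# Sketch only: publish size-limited configs with separate train/valid splits.
from datasets import load_dataset

full = load_dataset("HuggingFaceTB/finemath", "finemath-3plus", split="train")

for name, size in {"1M": 1_000_000, "100k": 100_000, "10k": 10_000, "1k": 1_000}.items():
    subset = full.select(range(size))
    pair = subset.train_test_split(test_size=0.1, seed=42)
    # Each config gets its own folder in the repo; train and valid are written
    # as separate split files, and push_to_hub updates the dataset card metadata.
    pair["train"].push_to_hub("tyoc213/split-finemath", config_name=name, split="train")
    pair["test"].push_to_hub("tyoc213/split-finemath", config_name=name, split="valid")
```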
@matdmiller should that be on this PR, or should we make another one specific for tulu 3?
To tokenize as suggested, we need to create the preprocessor and run:
$ python data_prep/convert_finetuning_dataset.py --dataset tyoc213/split-NuminaMath-CoT --data_subset 1k --splits train test --out_root out --tokenizer HuggingFaceTB/SmolLM2-135M --preprocessor preproc:pre_numina --num_workers 1
Converting train to MDS format...
train: 2it [00:05, 2.84s/it]
Traceback (most recent call last):
File "/Users/devworks/github.com/llm-foundry/scripts/data_prep/convert_finetuning_dataset.py", line 116, in <module>
convert_finetuning_dataset_from_args(
File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 329, in convert_finetuning_dataset_from_args
convert_finetuning_dataset(
File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 216, in convert_finetuning_dataset
for sample in tqdm(samples, desc=split_name):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/devworks/github.com/llm-foundry/.venv/lib/python3.12/site-packages/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
^^^^^^^^
File "/Users/devworks/github.com/llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py", line 84, in generate_samples
yield {k: v[idx] for k, v in batch.items()}
~^^^^^
IndexError: list index out of range
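One possible cause (an assumption, not confirmed in this thread) is that some batch has columns of different lengths, so `v[idx]` runs past the shorter column. A quick diagnostic sketch to check the 1k subset for ragged batches before conversion:

```python
# Diagnostic sketch: verify every column in each batch has the same length.
from datasets import load_dataset

ds = load_dataset("tyoc213/split-NuminaMath-CoT", "1k", split="train")
batch_size = 512
for start in range(0, len(ds), batch_size):
    batch = ds[start:start + batch_size]  # dict: column name -> list of values
    lengths = {col: len(vals) for col, vals in batch.items()}
    if len(set(lengths.values())) != 1:
        print(f"ragged batch at offset {start}: {lengths}")
```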
Big splits for:
- finemath 3 plus is about 21.4M rows
- finemath 4 plus is about 6.7M rows
- infiwebmath 3 plus is about 13.9M rows
- infiwebmath 4 plus is about 6.3M rows

which result in:
* train_finemath_3plus and val_finemath_3plus
* train_finemath_4plus and val_finemath_4plus
* train_infiwebmath_3plus and val_infiwebmath_3plus
* train_infiwebmath_4plus and val_infiwebmath_4plus

And also some normal "tiny" splits:
* train 3M rows
* train_small 100k rows
* train_xsmall 30k rows
* train_xxsmall 10k rows

and:
* val 300k rows
* val_small 10k rows
* val_xsmall 3k rows
* val_xxsmall 100 rows
As finemath only contains the train split, we need to generate the training and validation splits from the IterableDataset, and the only tools we have are: shuffle, skip, take. Basically we follow this structure to create each corresponding dataset (see the sketch after this list):
* pairs of val and training splits (training is 9 times the validation amount)
* if validation: skip 9 times the amount and take the amount
* if training: take the amount
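A minimal sketch of this scheme on the streaming train split; the amounts are illustrative, not the exact split sizes used here:

```python
# Sketch of the shuffle/skip/take scheme on the streaming (IterableDataset) train split.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/finemath", "finemath-3plus", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=10_000)

val_amount = 10_000            # illustrative
train_amount = 9 * val_amount  # training gets 9 times the validation amount

train_split = ds.take(train_amount)                   # training: take the amount
val_split = ds.skip(9 * val_amount).take(val_amount)  # validation: skip 9x, then take
```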
Take the dataset as is. Annotate number of rows of train/val pair
…nd ablations of 100k, 10k, 1k
Exported DataSplitConstants, DatasetConstants and CONSTS and added add_dataset_config in `convert_dataset_hf` to allow dynamically adding new datasets.
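A rough sketch of how the exported constants can be used to register a new dataset. The values and the repo name are placeholders, and since `add_dataset_config` is the new helper from this PR (its exact signature may differ), the sketch registers the entry directly, following the existing c4/the_pile entries in `convert_dataset_hf`:

```python
# Sketch: register a new dataset for convert_dataset_hf by extending CONSTS.
from llmfoundry.command_utils.data_prep.convert_dataset_hf import (
    CONSTS,
    DatasetConstants,
    DataSplitConstants,
)

finemath = DatasetConstants(
    chars_per_sample=2163,  # placeholder: average characters per sample
    chars_per_token=4,      # placeholder: rough chars-per-token estimate
)
finemath.splits['train_small'] = DataSplitConstants(
    hf_split='train_small',
    folder_split='train_small',
    raw_samples=100_000,
    truncated_samples=None,
)
CONSTS['tyoc213/split-finemath'] = finemath  # hypothetical repo id
```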
Tokenization works OK with finemath as it already has the `text` column. We can either:
* post-process when downloading the dataset in `split_hf_dataset.py`, or
* call convert_finetuning_dataset and create the preprocessors at `data_prep/preproc` and start from there (this is a WIP; a sketch follows below)
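For the second option, a minimal sketch of what a preprocessor in `data_prep/preproc.py` could look like. It assumes the split-NuminaMath-CoT rows keep the upstream `problem`/`solution` columns; this is not the PR's actual code:

```python
# data_prep/preproc.py -- sketch of a prompt/response preprocessor usable as
# --preprocessor preproc:pre_numina with convert_finetuning_dataset.py.
def pre_numina(example: dict) -> dict:
    # llm-foundry finetuning preprocessors map one raw example to a
    # prompt/response pair; the column names are an assumption here.
    return {
        'prompt': example['problem'],
        'response': example['solution'],
    }
```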
… stand alone from scripts folder
…ng that sets a row per sample in concat tokens path
…that no errors are shown because of different sizes in batch (as not padded up to a length)
#42 allows calling for full subsets of https://huggingface.co/datasets/HuggingFaceTB/finemath/ (like finemath-3plus).

Note
The only thing I am not sure is a problem is that there is only `train` in the original split, so we have to guarantee the training and validation sets don't use the same rows. I have used the other options as they were, like concat tokens, tokenizer, and eos text. The splits are as listed above.