In train_diloco_torch.py, the validation set is loaded with streaming=True.
As a result, evaluation never terminates: the streamed dataset is an IterableDataset, which has no __len__, so the evaluation loop has no defined end.
    ds = (
        load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
        if c4_tiny
        else load_dataset(
            "allenai/c4",
            "en",
            streaming=True,
            data_files={
                "train": "en/c4-train.*.json.gz",
                "validation": "en/c4-validation.00000-of-00008.json.gz",
            },
        )
    )
Two possible fixes: cap evaluation at a fixed number of samples (for example, 1000) when computing perplexity, or simply load the validation split with streaming=False so that it has a defined length.
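
A minimal sketch of the first option, capping evaluation at a fixed number of samples. The names `take_samples`, `evaluate_perplexity`, and `loss_fn` are hypothetical and do not come from train_diloco_torch.py. It also assumes `loss_fn` returns the mean per-token cross-entropy of one sample, so exponentiating the average over samples only approximates corpus-level perplexity.

```python
import math
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_samples(stream: Iterable[T], max_samples: int = 1000) -> Iterator[T]:
    """Yield at most `max_samples` items from a possibly endless stream."""
    return islice(stream, max_samples)


def evaluate_perplexity(
    stream: Iterable[T],
    loss_fn: Callable[[T], float],  # hypothetical: mean per-token loss of one sample
    max_samples: int = 1000,
) -> float:
    """Approximate perplexity over the first `max_samples` items of `stream`."""
    total_loss = 0.0
    count = 0
    for sample in take_samples(stream, max_samples):
        total_loss += loss_fn(sample)
        count += 1
    if count == 0:
        return float("nan")  # nothing to evaluate
    # Perplexity is the exponential of the average cross-entropy loss.
    return math.exp(total_loss / count)
```

With the Hugging Face datasets library, `ds["validation"].take(1000)` on the streamed split should give the same bound without a helper. The second option, passing streaming=False for the validation split, avoids the cap entirely at the cost of downloading that shard up front.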