Streaming validation dataset will lead to infinite loop #42

In `train_diloco_torch.py`, the validation set is loaded with `streaming=True` whenever `c4_tiny` is not set.
As a result, evaluation never terminates: the streamed split is an `IterableDataset`, which has no `__len__`, so the evaluation loop has no known number of batches to stop at.

```python
ds = (
    load_dataset("PrimeIntellect/c4-tiny", "en", ignore_verifications=True)
    if c4_tiny
    else load_dataset(
        "allenai/c4",
        "en",
        streaming=True,
        data_files={
            "train": "en/c4-train.*.json.gz",
            "validation": "en/c4-validation.00000-of-00008.json.gz",
        },
    )
)
```

Two possible fixes: evaluate perplexity on a fixed number of samples (e.g. 1000), or simply load the validation split with `streaming=False`.
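
A minimal sketch of the first option, capping evaluation at a fixed number of samples. It is not taken from the repo: `take_samples` and `fake_validation_stream` are hypothetical names, and the generator only stands in for the streamed validation split so the example runs without downloading C4.

```python
from itertools import count, islice


def take_samples(stream, max_samples=1000):
    """Return an iterator over at most `max_samples` items of `stream`."""
    return islice(stream, max_samples)


def fake_validation_stream():
    # Hypothetical stand-in for a streaming validation split: it never ends.
    for i in count():
        yield {"text": f"sample {i}"}


# Evaluation now sees a bounded number of samples and terminates.
eval_samples = list(take_samples(fake_validation_stream(), max_samples=1000))
print(len(eval_samples))
```

With the real dataset, `IterableDataset.take(n)` from `datasets` should give the same bound, e.g. `ds["validation"].take(1000)`, assuming the split is the streamed `IterableDataset` shown above.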

