Skip to content

Conversation

@enssow
Copy link
Contributor

@enssow enssow commented Dec 11, 2025

Description

Draft PR to showcase how to enable sharding (experimentation underway to optimise performance)

Issue Number

DRAFT
Closes #1384

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a hegdedoc in the github issue with all the configurations and runs for this experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@github-actions github-actions bot added infra Issues related to infrastructure performance Work related to performance improvements labels Dec 11, 2025
@enssow enssow marked this pull request as draft December 11, 2025 14:22
@enssow enssow mentioned this pull request Dec 11, 2025
6 tasks
@enssow
Copy link
Contributor Author

enssow commented Dec 17, 2025

Latest version can read and write with the zipstore format but needs tidying up (produces lots of print statements to narrow down the issues when trying to run the evaluation from a .yml file)

@enssow
Copy link
Contributor Author

enssow commented Dec 19, 2025

  • Specify type of store to create during an inference with uv run --offline inference --from_run_id zs581tqh --samples 1 --streams_output ERA5 --zarr-store local or uv run --offline inference --from_run_id zs581tqh --samples 1 --streams_output ERA5 --zarr-store zip
  • Code to run evaluation remains the same with no extra sections of config required (the type of store is detected using the extension) e.g. uv run evaluate --config ../test_evaluate_zip.yml (I tested the latest code with runs: v6l27pog (zip) and bvu0y897 (local)
  • Backwards Compatible: /p/project1/weatherai/owens1/WeatherGenerator/.venv/lib/python3.12/site-packages/zarr/core/group.py:568: ZarrUserWarning: Both zarr.json (Zarr format 3) and .zgroup (Zarr format 2) metadata objects exist at file:///p/scratch/weatherai/shared_work/results/v8pqzh4y/validation_chkpt00000_rank0000.zarr. Zarr format 3 will be used. warnings.warn(msg, category=ZarrUserWarning, stacklevel=1) get this warning but it does plot (tested with run v8pqzh4y)

@enssow
Copy link
Contributor Author

enssow commented Dec 19, 2025

Integration test fails, need to investigate this further!

@clessig
Copy link
Collaborator

clessig commented Dec 20, 2025

Latest version can read and write with the zipstore format but needs tidying up (produces lots of print statements to narrow down the issues when trying to run the evaluation from a .yml file)

What is the performance impact for running evaluation with using zipstore?

Copy link
Collaborator

@tjhunter tjhunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small style comments. Otherwise, it looks ready to try out on a larger scale.

When testing into the logic, I moved the creation of the zarr store straight into inference. Otherwise, I am personally confused when we write during inference and when we just calculate validation losses.

@grassesi : do you know why we don't write data when calling validation during training?

"""Convert raw dask arrays into chunked dask-aware xarray dataset."""
chunks = (chunk_nsamples, *self.data.shape[1:])

chunks = (chunk_nsamples, *(max(x // 4, 1) for x in self.data.shape[1:]))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this formula coming from? Someone else should be able to update it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted out into a scale factor parameter now for easier updating

self.data_root: zarr.Group | None = None
self._store: LocalStore | None = None
# determine type using extension
ext = self._store_path.suffix[1:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are repeating yourself

if self.type == "zip":
if self.create:
_logger.info("Creating zipstore")
self._store = ZipStore(self._store_path, mode="a")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append is more permissive, but there is a cost in opening and closing the store. I would rather have the store opened for the full duration of the write rather than for each chunk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode = "w" works with the new develop code, before I had some issues with overwriting so was using append - thanks for fixing this :)

@clessig
Copy link
Collaborator

clessig commented Dec 29, 2025

Some small style comments. Otherwise, it looks ready to try out on a larger scale.

When testing into the logic, I moved the creation of the zarr store straight into inference. Otherwise, I am personally confused when we write during inference and when we just calculate validation losses.

@grassesi : do you know why we don't write data when calling validation during training?

We don't write during validation since it's slows down the validation, generates additional files/storage, and is rarely used. There's a config parameter that controls this:

log_validation: 0
(in the process of being renamed).

@github-actions github-actions bot added the initiative Large piece of work covering multiple sprint label Dec 31, 2025
@enssow
Copy link
Contributor Author

enssow commented Dec 31, 2025

Latest version can read and write with the zipstore format but needs tidying up (produces lots of print statements to narrow down the issues when trying to run the evaluation from a .yml file)

What is the performance impact for running evaluation with using zipstore?

Hi Christian, sorry I missed this before the holidays - the latest tests are all in the hedge doc: https://gitlab.jsc.fz-juelich.de/hedgedoc/3X10vtdVQumQZy4qpBiHWg - there seems to be an small imporevemnt when using the ZipStore to evaluate

@enssow enssow marked this pull request as ready for review January 2, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infra Issues related to infrastructure initiative Large piece of work covering multiple sprint performance Work related to performance improvements

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

zarr3 compaction

5 participants