Using Zarr as a processing step to improve IO #126
Further speed improvements could be sought by adding the Numba compiler to the various simple computations. For instance, convert the radiation code to NumPy and compile it with Numba. An example mixing both: https://examples.dask.org/applications/stencils-with-numba.html and the official documentation: https://numba.pydata.org/
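A minimal sketch of what that could look like, assuming the computation is already a plain loop over 1-D NumPy arrays; the function name and formula below are illustrative placeholders, not the actual radiation code:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def scale_radiation(sw_in, cos_illum, cos_zenith):
    """Illustrative element-wise rescaling over 1-D arrays (placeholder,
    not the real radiation scheme). Numba compiles the loop to machine
    code, removing Python-level overhead on long timeseries."""
    out = np.empty_like(sw_in)
    for i in range(sw_in.size):
        if cos_zenith[i] > 0.0:
            out[i] = sw_in[i] * cos_illum[i] / cos_zenith[i]
        else:
            out[i] = 0.0
    return out
```

The first call pays a one-off compilation cost; `cache=True` keeps the compiled version between runs.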
The current work seems to be running well. I tested downscaling a 1960-2024 timeseries for 8 points in a single shot; it took less than a minute on a large server, with the server's capacity close to fully used. This was not even possible with the previous implementation. I have now added a tool to convert a stack of netcdf files into a zarr store. Conversion from netcdf to zarr can be tricky. @joelfiddes you may want to check a bit what I coded. I am about to merge my dev branch.
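For reference, a hedged sketch of what such a conversion can look like with xarray; the file pattern, chunk sizes and store path are assumptions, not the exact code in the dev branch:

```python
import xarray as xr

# Open the stack of netcdf files lazily (pattern is hypothetical).
ds = xr.open_mfdataset("inputs/climate/SURF_*.nc", combine="by_coords", parallel=True)

# Rechunk so that extracting a point reads long, contiguous time blocks.
ds = ds.chunk({"time": 8760, "latitude": -1, "longitude": -1})

# Drop chunk encoding inherited from netcdf; mismatched encodings are one of
# the things that make netcdf -> zarr conversion tricky.
for var in ds.data_vars:
    ds[var].encoding.pop("chunks", None)

# Write a consolidated Zarr store (faster to open later).
ds.to_zarr("inputs/climate/era5_surf.zarr", mode="w", consolidated=True)
```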
As part of the branch tps_zarr I implemented a conversion of ERA5 netcdf files to a Zarr store. Then I developed two parallelization workflows. Finally, downscaled data are stored as netcdf or zarr (the latter option is only available when using Dask).
Quick benchmarking on my local machine (50 points to downscale, 1.5 years) shows a really good improvement:
I am not sure how this scales up on larger computers or servers, but the Dask settings change the speed and memory usage by a fair amount. Dask combined with outputting to netcdf brings no gain in terms of time.
TimeSplitter is not implemented in this version. This is where the Zarr -> Dask -> Zarr workflow (sketched below) may shine and handle very large datasets better than manual splitting with TimeSplitter and multicore. This has not been tested yet.
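To make the idea concrete, here is a minimal sketch of a Zarr -> Dask -> Zarr flow; the point selection stands in for the downscaling step, and the paths, worker counts and memory limit are assumptions:

```python
import xarray as xr
from dask.distributed import Client

if __name__ == "__main__":
    # The Dask settings (workers, threads, memory limit) are what changes
    # speed and memory usage the most in practice.
    client = Client(n_workers=4, threads_per_worker=2, memory_limit="4GB")

    # Lazy open: nothing is read until the final write.
    era5 = xr.open_zarr("inputs/climate/era5_surf.zarr", consolidated=True)

    # Placeholder for the downscaling step: pick the grid cells nearest to a
    # handful of points, keeping everything lazy.
    subset = era5.sel(latitude=[46.1, 46.2], longitude=[7.5, 7.6], method="nearest")

    # Drop encoding chunks inherited from the source store to avoid conflicts.
    for var in subset.data_vars:
        subset[var].encoding.pop("chunks", None)

    # Streaming write back to Zarr, chunk by chunk, so the full timeseries
    # never needs to fit in memory at once.
    subset.to_zarr("outputs/downscaled.zarr", mode="w", consolidated=True)

    client.close()
```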
Warning: the config file has been updated to include more options. It will be ported to the documentation if merged to main.