
Script for time coarsening existing training datasets#794

Open
mcgibbon wants to merge 19 commits into main from feature/time_coarsen_dataset

Conversation

Contributor

@mcgibbon commented Feb 4, 2026

This PR adds `scripts/time_coarsen` for time-coarsening existing training datasets. This will allow us to test the ability of our model to run with 1-day and possibly 5-day timesteps.

Note that stats will also need to be re-generated after time-coarsening. This can be run separately or incorporated into this workflow later.

Changes:

  • Added `scripts/time_coarsen` for time-coarsening training datasets

  • Tests added
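
For illustration, time coarsening by block-averaging consecutive timesteps could be sketched with xarray as below. This is a minimal sketch, not the actual contents of `scripts/time_coarsen`; the function name, variable names, and coarsening factor are all hypothetical:

```python
# Hedged sketch of time coarsening via block averaging with xarray.
# The function name, variables, and factor here are hypothetical; the
# real scripts/time_coarsen may differ (e.g. in coordinate handling).
import numpy as np
import pandas as pd
import xarray as xr


def time_coarsen(ds: xr.Dataset, factor: int) -> xr.Dataset:
    """Average every `factor` consecutive timesteps into one."""
    return ds.coarsen(time=factor, boundary="trim").mean()


# Toy 6-hourly dataset coarsened to daily (factor of 4).
times = pd.date_range("2000-01-01", periods=24, freq="6h")
ds = xr.Dataset({"t2m": ("time", np.arange(24.0))}, coords={"time": times})
daily = time_coarsen(ds, factor=4)  # 24 six-hourly steps -> 6 daily steps
```

Under the same assumptions, a 5-day timestep from 6-hourly data would use a factor of 20.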


```shell
set -e

SCRIPT_PATH=$(git rev-parse --show-prefix)  # relative to the root of the repository
```
Collaborator

I think these dataset processing scripts should run on GCP, e.g. using argo

```shell
output=$(argo submit compute_dataset_argo_workflow.yaml \
```

Contributor Author

Is this a comment that the script should also work when using GCP, or is it a request to refactor the workflow to use GCP? If it's a request, is it blocking? And would that mean these scripts should be moved into `data_process`, since they would then use the same infrastructure?

I went this route because we already have the datasets on weka: I found it more intuitive to launch and debug (no need to rebuild an image to test new package code), and it avoids an extra copy of the dataset in the cloud. If the idea is that we would rather not use Beaker CPU resources for dataset computation, or that we want a copy of these datasets to exist in the cloud, then I think it would make sense to refactor it.

Collaborator

My preference would be to refactor the workflow to use GCP. This would improve consistency with other data processing workflows and avoid using the GPU cluster for data processing. Also, having a copy of the daily dataset in the cloud is a good thing (space on weka is more limited, so we regularly have to delete datasets there). But this is not a blocking request.

Member

@spencerkclark left a comment

The core mechanics of this script look good to me. I'm happy to review again if/when we decide to move to running this on GCP instead of Beaker.

Contributor Author

@mcgibbon left a comment

Updated.

I notice our other scripts use `ds.partition` and lazy datasets to write, @spencerkclark. Are those things I should be using here? Perhaps we could refactor those tools into some `fme.core.io` helper functions to re-use them more easily across our data processing scripts?

Member

@spencerkclark left a comment

Thanks @mcgibbon. A couple of minor additional suggestions, but the updates look great.

> I notice our other scripts use `ds.partition` and lazy datasets to write, @spencerkclark. Are those things I should be using here? Perhaps we could refactor those tools into some `fme.core.io` helper functions to re-use them more easily across our data processing scripts?

Potentially, but I think we can cross that bridge when we get to it. Since our FME image does not have dask or xpartition installed, that approach would not be readily compatible with running on Beaker. The main reason to switch would be if we try to use dask for laziness, or to speed things up when we port this script to the cloud and find that we run into out-of-memory issues. But until that happens I see no harm in keeping things simple.
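
As a point of reference, the simple eager pattern being kept here, contrasted with the lazy dask-backed alternative, might look like the sketch below. All variable names and paths are hypothetical, and this is not the project's `ds.partition` helper:

```python
# Hypothetical sketch contrasting the simple eager approach with the
# lazy dask-backed alternative; names and paths here are made up.
import numpy as np
import xarray as xr

ds = xr.Dataset({"t2m": (("time", "x"), np.arange(12.0).reshape(4, 3))})

# Eager: arrays live fully in memory. Simple, but bounded by RAM.
coarse = ds.coarsen(time=2, boundary="trim").mean()

# Lazy alternative (requires dask and zarr to be installed): chunk
# first, then the coarsening stays lazy and is computed chunk-by-chunk
# only when written out.
#   lazy = ds.chunk({"time": 2}).coarsen(time=2, boundary="trim").mean()
#   lazy.to_zarr("coarse_dataset.zarr", mode="w")
```

The eager path avoids the dask/xpartition dependency entirely, which is what makes it compatible with the current FME image on Beaker.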

mcgibbon and others added 2 commits February 27, 2026 14:58
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>