
Script for time coarsening existing training datasets#794

Open
mcgibbon wants to merge 19 commits into main from feature/time_coarsen_dataset

Conversation

Contributor

@mcgibbon commented Feb 4, 2026

This PR adds `scripts/time_coarsen` for time-coarsening existing training datasets. This will allow us to test the ability of our model to run with 1-day and possibly 5-day timesteps.

Note that stats will also need to be re-generated after time-coarsening. This can be run separately or incorporated into this workflow later.

Changes:

  • Added `scripts/time_coarsen` for time-coarsening training datasets

  • Tests added
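
For illustration, time coarsening by block-averaging consecutive timesteps could be sketched with xarray as below. This is a minimal sketch, not the actual contents of `scripts/time_coarsen`; the function name, variable names, and coarsening factor are all hypothetical:

```python
# Hedged sketch of time coarsening via block averaging with xarray.
# The function name, variables, and factor here are hypothetical; the
# real scripts/time_coarsen may differ (e.g. in coordinate handling).
import numpy as np
import pandas as pd
import xarray as xr


def time_coarsen(ds: xr.Dataset, factor: int) -> xr.Dataset:
    """Average every `factor` consecutive timesteps into one."""
    return ds.coarsen(time=factor, boundary="trim").mean()


# Toy 6-hourly dataset coarsened to daily (factor of 4).
times = pd.date_range("2000-01-01", periods=24, freq="6h")
ds = xr.Dataset({"t2m": ("time", np.arange(24.0))}, coords={"time": times})
daily = time_coarsen(ds, factor=4)  # 24 six-hourly steps -> 6 daily steps
```

Under the same assumptions, a 5-day timestep from 6-hourly data would use a factor of 20.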


```shell
set -e

SCRIPT_PATH=$(git rev-parse --show-prefix)  # relative to the root of the repository
```
Collaborator

I think these dataset processing scripts should run on GCP, e.g. using argo

```shell
output=$(argo submit compute_dataset_argo_workflow.yaml \
```

Contributor Author

Is this a comment that the script should also work when using GCP, or is it a request to refactor the workflow to use GCP? If it's a request, is it blocking? And would that mean these scripts should be moved into `data_process`, since they would then use the same infrastructure?

I went this route because we already have the datasets on weka: I found it more intuitive to launch and debug (no need to rebuild an image to test new package code), and it avoids an extra copy of the dataset in the cloud. If the idea is that we would rather not use Beaker CPU resources for dataset computation, or that we want a copy of these datasets to exist in the cloud, then I think it would make sense to refactor it.

Collaborator

My preference would be to refactor the workflow to use GCP. This would improve consistency with other data processing workflows and avoid using the GPU cluster for data processing. Also, having a copy of the daily dataset in the cloud is a good thing (space on weka is more limited, so we regularly have to delete datasets there). But this is not a blocking request.

Member

@spencerkclark left a comment

The core mechanics of this script look good to me. I'm happy to review again if/when we decide to move to running this on GCP instead of Beaker.

Contributor Author

@mcgibbon left a comment

Updated.

I notice our other scripts use `ds.partition` and lazy datasets to write, @spencerkclark. Are those things I should be using here? Perhaps we could refactor those tools into some `fme.core.io` helper functions to re-use them more easily across our data processing scripts?

Member

@spencerkclark left a comment

Thanks @mcgibbon. A couple of minor additional suggestions, but the updates look great.

> I notice our other scripts use `ds.partition` and lazy datasets to write, @spencerkclark. Are those things I should be using here? Perhaps we could refactor those tools into some `fme.core.io` helper functions to re-use them more easily across our data processing scripts?

Potentially, but I think we can cross that bridge when we get to it. Since our FME image does not have dask or xpartition installed, that approach would not be readily compatible with running on Beaker. The main reason to switch would be if we try to use dask for laziness, or to speed things up when we port this script to the cloud and find that we run into out-of-memory issues. But until that happens I see no harm in keeping things simple.
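
As a point of reference, the simple eager pattern being kept here, contrasted with the lazy dask-backed alternative, might look like the sketch below. All variable names and paths are hypothetical, and this is not the project's `ds.partition` helper:

```python
# Hypothetical sketch contrasting the simple eager approach with the
# lazy dask-backed alternative; names and paths here are made up.
import numpy as np
import xarray as xr

ds = xr.Dataset({"t2m": (("time", "x"), np.arange(12.0).reshape(4, 3))})

# Eager: arrays live fully in memory. Simple, but bounded by RAM.
coarse = ds.coarsen(time=2, boundary="trim").mean()

# Lazy alternative (requires dask and zarr to be installed): chunk
# first, then the coarsening stays lazy and is computed chunk-by-chunk
# only when written out.
#   lazy = ds.chunk({"time": 2}).coarsen(time=2, boundary="trim").mean()
#   lazy.to_zarr("coarse_dataset.zarr", mode="w")
```

The eager path avoids the dask/xpartition dependency entirely, which is what makes it compatible with the current FME image on Beaker.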

mcgibbon and others added 2 commits February 27, 2026 14:58
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>