Conversation

@grassesi
Contributor

@grassesi grassesi commented Dec 2, 2025

Description

Small utility to record timings; recorded timings will be logged to the metrics file.

Issue Number

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 7, 2025

@grassesi : could you give an example of what is logged and how one can access it (will it be on MLflow or plot_train)?

@grassesi
Contributor Author

grassesi commented Dec 8, 2025

It is mostly a toy application for myself; the timing logic has not been tested in a multi-node setting. But here is the basic logic:

  • timers are hierarchical.
  • use my_timer.record() to record a timing event, or my_timer.record("subtimer") to record it on a subtimer. This starts a measurement on that timer.
  • a measurement is completed when a) a timing event is recorded on a running timer, or b) a timing event is recorded on a different "sibling" subtimer.
  • my_timer.reset() triggers a reset on a timer and all its subtimers, returning statistics for each affected timer and clearing their measurements.
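The rules above can be sketched as follows. This is a minimal, hypothetical reconstruction of the described API (the class name and internals are my guesses, only record()/reset() and the stopping rules come from the PR description):

```python
# Hypothetical sketch of the hierarchical timer described above.
# record()/reset() semantics follow the PR description; internals are assumed.
import time


class Timer:
    def __init__(self, name="root"):
        self.name = name
        self.children = {}
        self._start = None        # start of the currently running measurement
        self._measurements = []   # completed durations in seconds

    def record(self, child=None):
        """Record a timing event on this timer or on a named subtimer."""
        now = time.perf_counter()
        if child is None:
            if self._start is not None:  # rule (a): event on a running timer
                self._measurements.append(now - self._start)
            self._start = now
        else:
            # rule (b): an event on one subtimer stops any running sibling
            for name, sibling in self.children.items():
                if name != child and sibling._start is not None:
                    sibling._measurements.append(now - sibling._start)
                    sibling._start = None
            sub = self.children.setdefault(child, Timer(f"{self.name}.{child}"))
            sub.record()

    def reset(self):
        """Return {timer_name: stats} for this timer and all subtimers, then clear them."""
        stats = {}
        if self._measurements:
            n = len(self._measurements)
            stats[self.name] = {
                "count": n,
                "mean": sum(self._measurements) / n,
                "max": max(self._measurements),
            }
        self._measurements = []
        self._start = None
        for sub in self.children.values():
            stats.update(sub.reset())
        return stats
```

For example, alternating record("forward") and record("backward") inside a training step would complete one measurement per phase, and a reset() at logging time would return one stats dict per subtimer.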

Here is how it works in application:

  • Two "well known" timers, root.training and root.inference, are used to collect rough timing statistics for the different steps in the training and inference loops.
  • For each step a subtimer records events.
  • Whenever metrics are recorded with TrainLogger, the timers are reset and their recorded times are stored in metrics with names like timing.root.train.<subtimer>.<statistic>.
  • These are then treated as any other metric would.
  • timing/plot_timings.py shows how to plot from the metric file to produce a plot like this:
[plot of per-step timing statistics produced by timing/plot_timings.py]
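To illustrate the metric-naming scheme, here is a hedged sketch of how timing metrics could be pulled out of a metrics file for plotting. The JSON-lines file format, the collect_timings helper, and the exact metric names are my assumptions, not the PR's actual plot_timings.py:

```python
# Hypothetical sketch: group timing metrics from a JSON-lines metrics file.
# The file format and the "timing.root.train.<subtimer>.<statistic>" naming
# scheme are assumptions based on the PR description.
import json
from collections import defaultdict


def collect_timings(lines, prefix="timing.root.train."):
    """Return {subtimer: {statistic: [values per logged step]}}."""
    series = defaultdict(lambda: defaultdict(list))
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            if not name.startswith(prefix):
                continue  # skip non-timing metrics such as loss
            subtimer, statistic = name[len(prefix):].rsplit(".", 1)
            series[subtimer][statistic].append(value)
    return series
```

The resulting per-subtimer series could then be passed to any plotting library to produce a figure like the one attached.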

Review comment on the dependency list:

"weathergen-evaluate",
"weathergen-readers-extra",
"pyarrow>=22.0.0",

Collaborator

keep all our packages at the bottom
@tjhunter
Collaborator

As discussed with @grassesi , this draft is great for understanding the general scope of writing timers. We should have a quick design session before starting to implement our own version. A good example of all the subtleties to deal with is src/nanotron/logging/timers.py in nanotron. Maybe we should consider copy/pasting their implementation (or another one).
