Conversation

@grassesi
Contributor

@grassesi grassesi commented Dec 2, 2025

Description

Small utility to record timings; recorded timings will be logged to the metrics file.

Issue Number

Is this PR a draft? Mark it as draft.

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@clessig
Collaborator

clessig commented Dec 7, 2025

@grassesi : could you give an example of what is logged and how one can access it (will it be on MLflow or plot_train)?

@grassesi
Contributor Author

grassesi commented Dec 8, 2025

It is mostly a toy application for myself; the timing logic has not been tested in a multi-node setting. But here is the basic logic:

  • timers are hierarchical.
  • use my_timer.record() to record a timing event, or my_timer.record("subtimer") to record it on a subtimer. This starts a measurement on that timer.
  • a measurement is completed when a) a timing event is recorded on a running timer, or b) a timing event is recorded on a different "sibling" subtimer.
  • my_timer.reset() triggers a reset on a timer and all its subtimers, returning statistics for each affected timer and clearing their measurements.
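The rules above can be sketched as follows. This is a minimal, hypothetical reconstruction of the described API (the class name and internals are my guesses, only record()/reset() and the stopping rules come from the PR description):

```python
# Hypothetical sketch of the hierarchical timer described above.
# record()/reset() semantics follow the PR description; internals are assumed.
import time


class Timer:
    def __init__(self, name="root"):
        self.name = name
        self.children = {}
        self._start = None        # start of the currently running measurement
        self._measurements = []   # completed durations in seconds

    def record(self, child=None):
        """Record a timing event on this timer or on a named subtimer."""
        now = time.perf_counter()
        if child is None:
            if self._start is not None:  # rule (a): event on a running timer
                self._measurements.append(now - self._start)
            self._start = now
        else:
            # rule (b): an event on one subtimer stops any running sibling
            for name, sibling in self.children.items():
                if name != child and sibling._start is not None:
                    sibling._measurements.append(now - sibling._start)
                    sibling._start = None
            sub = self.children.setdefault(child, Timer(f"{self.name}.{child}"))
            sub.record()

    def reset(self):
        """Return {timer_name: stats} for this timer and all subtimers, then clear them."""
        stats = {}
        if self._measurements:
            n = len(self._measurements)
            stats[self.name] = {
                "count": n,
                "mean": sum(self._measurements) / n,
                "max": max(self._measurements),
            }
        self._measurements = []
        self._start = None
        for sub in self.children.values():
            stats.update(sub.reset())
        return stats
```

For example, alternating record("forward") and record("backward") inside a training step would complete one measurement per phase, and a reset() at logging time would return one stats dict per subtimer.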

Here is how it works in application:

  • Two "well known" timers, root.training and root.inference, are used to collect rough timing statistics for the different steps in the training and inference loops.
  • For each step a subtimer records events.
  • Whenever metrics are recorded with TrainLogger, the timers are reset and their recorded times are stored in metrics with names like timing.root.train.<subtimer>.<statistic>.
  • These are then treated as any other metric would.
  • timing/plot_timings.py shows how to plot from the metric file to produce a plot like this:
[plot of per-step timing statistics produced by timing/plot_timings.py]
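To illustrate the metric-naming scheme, here is a hedged sketch of how timing metrics could be pulled out of a metrics file for plotting. The JSON-lines file format, the collect_timings helper, and the exact metric names are my assumptions, not the PR's actual plot_timings.py:

```python
# Hypothetical sketch: group timing metrics from a JSON-lines metrics file.
# The file format and the "timing.root.train.<subtimer>.<statistic>" naming
# scheme are assumptions based on the PR description.
import json
from collections import defaultdict


def collect_timings(lines, prefix="timing.root.train."):
    """Return {subtimer: {statistic: [values per logged step]}}."""
    series = defaultdict(lambda: defaultdict(list))
    for line in lines:
        record = json.loads(line)
        for name, value in record.items():
            if not name.startswith(prefix):
                continue  # skip non-timing metrics such as loss
            subtimer, statistic = name[len(prefix):].rsplit(".", 1)
            series[subtimer][statistic].append(value)
    return series
```

The resulting per-subtimer series could then be passed to any plotting library to produce a figure like the one attached.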

Review comment on the dependency list:

"weathergen-evaluate",
"weathergen-readers-extra",
"pyarrow>=22.0.0",

Collaborator

keep all our packages at the bottom
@tjhunter
Collaborator

As discussed with @grassesi , this draft is great for understanding the general scope of writing timers. We should have a quick design session before starting to implement our own version. A good example of all the subtleties to deal with is src/nanotron/logging/timers.py in nanotron. Maybe we should consider copy/pasting their implementation (or another one).
