Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
```python
def get_diagnostics(self) -> dict:
    return self._channels_last_diagnostics.to_dict()

def run_instance(self, timer) -> TensorDict:
```
Issue: This API gives two ways to get diagnostics, and it's unexpected that they're gathered separately from run_instance.
Suggestion: rather than add a new get_diagnostics method whose results aren't specific to any run_instance invocation, update the return type of run_instance to incorporate these diagnostics somehow. I guess the two types of data returned are "regression" and "diagnostic" data, so maybe a type that has two dict[str: Tensor] attributes, one for each of these?
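A minimal sketch of what that two-attribute result type could look like (`InstanceResult` and its field names are placeholders, and a plain `float` stands in for `torch.Tensor` so the sketch runs without torch):

```python
from dataclasses import dataclass, field

# Placeholder: a float stands in for torch.Tensor in this sketch.
Tensor = float

@dataclass
class InstanceResult:
    """Hypothetical run_instance return type bundling both kinds of data."""
    regression: dict[str, Tensor] = field(default_factory=dict)   # compared against saved regression files
    diagnostics: dict[str, Tensor] = field(default_factory=dict)  # e.g. memory-format conversion counts

result = InstanceResult(
    regression={"output": 1.0},
    diagnostics={"channels_last_conversions": 0.0},
)
```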
```python
skips = []
aux = x
for name, block in self.enc.items():
    with enc_timer.child(name):
```
Is it helpful to have a child timer for each block here? Maybe the answer is yes? But alternatively you might consider just having the encoder and decoder children and leaving it there. I found the child blocks were most helpful when they were a) large enough to be a meaningful fraction of the time, where I might consider optimizing them, and b) understandable and interpretable as a named set of operations. Right now the plot has a lot of unreadable boxes.
The boxes are in order of execution though - it looks like a lot of time is being spent in the first and last blocks (the highest resolution/least coarsening?). So maybe this is a useful piece of information to get out of the benchmark.

Maybe it would be more helpful to group the timings, e.g. all of the "###x###_blocks" together, separate from the _up time. Another possible way to accomplish it: if the called modules also take in the Timer, they can invoke the .child call inside their own forward function. That way a given "kind" of call always goes into the same bucket to be optimized.
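A toy sketch of the "modules take the Timer and open their own child scope" idea (`Timer` and `Block` here are stand-ins, not the repo's classes):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Timer:
    """Toy timer: accumulates wall time per named bucket across repeated calls."""
    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def child(self, name):
        start = time.perf_counter()
        try:
            yield self
        finally:
            self.totals[name] += time.perf_counter() - start

class Block:
    def __init__(self, kind):
        self.kind = kind  # e.g. "32x32_block" or "up"

    def forward(self, x, timer):
        # The module itself opens the child scope, so every call of this
        # "kind" lands in the same bucket regardless of call site.
        with timer.child(self.kind):
            return x + 1

timer = Timer()
blocks = [Block("32x32_block"), Block("32x32_block"), Block("up")]
x = 0
for b in blocks:
    x = b.forward(x, timer)
```

Both `32x32_block` calls accumulate into one bucket, which is the "same kind, same bucket" behavior described above.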
```python
def test_run_benchmark():
    def benchmark_fn(timer):
        torch.cuda._sleep(100_000_000)
        return {}
```
This expects a tensor dict to be returned; if not, there will be an error when test_regression checks for the key.
Suggestion (optional): Maybe we can update the return type to TensorDict | None, and have None act as a sentinel that indicates regression is not implemented for this checkpoint? Like, it should not be None (and we should check-raise) if the "get regression initial condition" method is implemented.
Re-noting this optional suggestion since you didn't reply to it; it's still optional.
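A sketch of the None-sentinel contract under discussion (class and method names here are illustrative, not the actual API):

```python
from typing import Dict, Optional

TensorDict = Dict[str, float]  # stand-in for the real dict[str, torch.Tensor]

class Benchmark:
    """Hypothetical base class showing the None-sentinel contract."""
    def get_regression_initial_condition(self):
        raise NotImplementedError  # subclasses opt in to regression testing

    def run_instance(self, timer) -> Optional[TensorDict]:
        return None  # None => regression is not implemented for this benchmark

def check_regression_contract(benchmark, timer) -> Optional[TensorDict]:
    """Raise if regression is implemented but run_instance returned the sentinel."""
    result = benchmark.run_instance(timer)
    supports_regression = True
    try:
        benchmark.get_regression_initial_condition()
    except NotImplementedError:
        supports_regression = False
    if supports_regression and result is None:
        raise ValueError("run_instance returned None but regression is implemented")
    return result
```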
fme/core/benchmark/test_benchmark.py
```python
    os.path.join(DIR, "testdata", f"{benchmark_name}-regression.pt"),
)
if "diagnostics" in regression_result:
    # Split into two files so we don't have to check nested dicts
```
Issue: There's no constraint that regression_result contain "output" as a key, and if it doesn't this will error.
Suggestion: Update validate_tensor_dict to work on nested dicts instead, using a recursive helper function. It makes this code a bit cleaner/clearer, and avoids needing twice as many regression files.
Updated to work on nested dicts
```python
class SongUNetv2BenchmarkBf16(SongUNetv2Benchmark):
    @classmethod
    def new(cls) -> Self:
```
Comment: I am wary of any inheritance with features added/changed but this is kinda nice.
```python
register_benchmark("songunetv2")(SongUNetv2Benchmark)
register_benchmark("songunetv2_bf16")(SongUNetv2BenchmarkBf16)
register_benchmark("songunetv2_apex")(SongUNetv2BenchmarkApex)
register_benchmark("songunetv2_apex_bf16")(SongUNetv2BenchmarkApexBf16)
```
Question: How long do these added benchmarks take to run?
~14 s for the regular benchmark and twice as long for the bf16 regression. Since it's on CPU, that is very slow for the unets, and apparently bf16 on CPU is also extra slow.
I thought the benchmarks only run on GPU / that it raises an error if we try to benchmark on CPU? The CUDA timer will raise an error if it's run on CPU.
Are you saying the regression tests take 14s? That's quite long, is there any way to get code pathway coverage with a smaller config?
```python
for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
```
Comment: I don't suggest changing it this way, but wanted to note that if you call `enc_timer.child(level_name)` multiple times with the same `level_name`, it will add the times together when it shows the plot - for example if you had kept the loops as you had them the first time but transformed the names to level names. The main difference is the json logs will indicate the child was called more times. I do slightly prefer this way but wanted to point it out as a feature.
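For reference, `itertools.groupby` only merges *consecutive* items with the same key, so this relies on the block dict being ordered by level. With hypothetical block names:

```python
import itertools

# Hypothetical encoder block names in registration order.
names = ["32x32_block0", "32x32_block1", "16x16_down", "16x16_block0"]
levels = [
    (level, list(group))
    for level, group in itertools.groupby(names, key=lambda n: n.split("_", 1)[0])
]
# Two groups: one per resolution-level prefix.
```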
```python
for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
```
If you decide to add further child timers you can do `with enc_timer.child(level_name) as level_timer:`.

A test seems to be failing.
fme/core/benchmark/benchmark.py
```python
root_avg = avg_time(root)
root_count = root.count


def avg_total_time_per_iter(t: TimerResult) -> float:
```
Edited this function to add up the times if a timer was repeated. Moved so that root.count was available to reference.
Issue: I'm a little confused when I read "average total time", since these are contradictory, and it's not obvious to me what "per iter" means at first read (I can tell you mean per benchmark iteration, but at first I thought it meant per child call).
Suggestions: rename to avg_time_per_root or avg_time_per_root_iter to make it clearer, or back to avg_time to leave it vague but not misleading/confusing.
```diff
 for _ in range(iters):
     with timer:
-        benchmark.run_instance(timer)
+        benchmark_result = benchmark.run_instance(timer)
```
Issue: It's a little confusing having this called a benchmark_result but not being the same type as the returned value which is a BenchmarkResult. Maybe iter_result or last_iter_result instead?
Suggestion (optional, better in a later PR if at all): Which also raises a question, do we want just the last diagnostics or do we want the total/average across results? You could keep a running total owned by this method (instead of by the benchmark), similarly to how the local instance of timer is meant to keep the running total of times.
> do we want just the last diagnostics or do we want the total/average across results
Good point, probably the latter but in a later PR
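A sketch of the "running total owned by the run loop" idea, with a plain callable standing in for `benchmark.run_instance` and floats standing in for tensors:

```python
from collections import defaultdict

def run_benchmark(benchmark_fn, iters):
    """Accumulate per-iteration diagnostics instead of keeping only the last."""
    totals = defaultdict(float)
    for _ in range(iters):
        iter_result = benchmark_fn()  # stand-in for benchmark.run_instance(timer)
        for key, value in iter_result.items():
            totals[key] += value
    # Report the per-iteration average, analogous to how the local timer
    # instance accumulates times across iterations.
    return {key: total / iters for key, total in totals.items()}

averages = run_benchmark(lambda: {"conversions": 2.0}, iters=4)
```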
fme/core/benchmark/run.py
```python
def _json_default(obj):
    """json.dumps ``default`` hook that converts tensors to Python scalars."""
    if isinstance(obj, torch.Tensor):
        return obj.item()
```
Question: What will happen if the tensor is not a scalar? Is that behavior expected?
Good catch, fixed so non-scalar tensors are also handled.
This is a good example of why I was wary of using AI code tools to help write functions... I had a feeling I'd get complacent and stuff like this would slip through.
fme/core/benchmark/test_benchmark.py
```diff
 @pytest.mark.parametrize("benchmark_name", BENCHMARKS.keys())
-def test_regression(benchmark_name: str):
+def test_regression(benchmark_name: str, very_fast_only: bool):
```
Question: Why is this now required? It means we no longer get very-fast coverage for the existing benchmarks also. Is it possible to set up the regression with a small enough problem (maybe a smaller spatial domain in this case) that it runs fast enough?
fme/core/testing/regression.py
```python
def _assert_close(
    x: NestedTensorDict, y: NestedTensorDict, prefix: str, **assert_close_kwargs
```
Issue: The prefix logic is confusing, and I don't understand what its use case is. It seems to be hard-coded to "" below. Can we remove it?
fme/core/testing/regression.py
```python
if isinstance(v, torch.Tensor):
    assert isinstance(
        y_val, torch.Tensor
    ), f"Expected tensor at '{key_path}' but got dict"
```
Issue: This assertion error will be wrong/misleading if y_val is another non-tensor type, such as None. I know in practice the way you're currently using this function that can't happen (since it always comes from a loaded dict), but the call signature of this helper function doesn't know that.
Suggestion: Maybe use a "got {type(y_val)}"?
```python
return cls._new_with_params(
    img_resolution=32,
    B=1,
    in_channels=6,
    out_channels=4,
    label_dim=0,
    model_channels=64,
    channel_mult=[1, 2, 2, 2],
    use_apex_gn=False,
)
```
Sorry for not catching this earlier.
Issue: the configuration here is identical to the configuration for new. You want to make the new configuration one that is helpful for debugging/diagnosing optimization effects on GPU and fully occupies the GPU. In new_for_regression, you want a config that runs sufficiently fast for regression testing on CPU, and produces a small enough checkpoint to save to the repo.
Suggestion:
```diff
-        return cls._new_with_params(
-            img_resolution=32,
-            B=1,
-            in_channels=6,
-            out_channels=4,
-            label_dim=0,
-            model_channels=64,
-            channel_mult=[1, 2, 2, 2],
-            use_apex_gn=False,
-        )
+        return cls._new_with_params(
+            img_resolution=8,
+            B=1,
+            in_channels=6,
+            out_channels=4,
+            label_dim=0,
+            model_channels=16,
+            channel_mult=[1, 2],
+            use_apex_gn=False,
+        )
```
(assuming len(channel_mult) is what determines the depth of the u-net)
I would suggest similar changes in the other benchmark(s). With the faster runtime you could consider including regression on more of the benchmarks.
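If `len(channel_mult)` does determine the depth (the parenthetical's assumption), the suggested regression config shakes out roughly as:

```python
# Hypothetical reading of the suggested regression config: one resolution
# level per channel_mult entry, with channels-per-level = model_channels * mult.
model_channels = 16
channel_mult = [1, 2]
level_channels = [model_channels * m for m in channel_mult]
num_levels = len(channel_mult)  # shallower u-net than the 4-level GPU config
```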
Thanks, the existing two are now on par with the duration of the others (<.1s). I removed the --very-fast-only part.
I also updated the regular new params to match the apex benchmarks for better comparison.
I'll still exclude the 'with apex' benchmarks from regression testing as the unit testing environment doesn't have apex installed (it takes a while).
(For some reason I couldn't reply to the original comment.) Left as is, since this isn't a regression test and the later references in …
This PR adds benchmarks for the SongUNetv2 module. The variants are with/without using bfloat16 and apex group norm.
It also adds the option to record additional diagnostics beyond timer and memory. For this benchmark I wanted to record the number of memory format conversions that occur, to check that this is zero for properly configured (allowed set of nchannels) SongUNetv2 models using apex group norm. If this is nonzero, the output also includes which layers trigger the conversion.
e.g. (example diagnostics output omitted)

Example of child timings within a decoder block: (timing plot image omitted)
The regression tests only cover the cases without apex group norm because the testing environment doesn't have apex installed.