
Add benchmark for SongUNetv2 #871

Open · AnnaKwa wants to merge 34 commits into main from feature/songunetv2-benchmark

Conversation


@AnnaKwa AnnaKwa commented Feb 24, 2026

This PR adds benchmarks for the SongUNetv2 module. The variants cover running with and without bfloat16, and with and without apex group norm.
It also adds the option of recording additional diagnostics beyond the timer and memory. For this benchmark I wanted to record the number of memory format conversions that occur, to check that it is zero for properly configured SongUNetv2 models (i.e. with an allowed set of n_channels) using apex group norm. If the count is nonzero, the output also lists which layers trigger the conversions.

e.g.

  "diagnostics": {
    "n_channels_last_weights": 119,
    "n_total_4d_weights": 119,
    "n_conversions_to_channels_last": 0,
    "n_conversions_from_channels_last": 0,
    "conversion_layers": []
  }
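As a sketch of how the weight-format part of such a diagnostic can be gathered (illustrative only; the PR's implementation also counts runtime conversions and records the triggering layers):

```python
import torch
from torch import nn

def channels_last_weight_stats(model: nn.Module) -> dict:
    """Count 4D weights and how many are stored channels-last.

    Hypothetical helper, not the PR's code: it inspects parameter memory
    format, whereas the real diagnostics also track conversions at runtime.
    """
    stats = {"n_total_4d_weights": 0, "n_channels_last_weights": 0}
    for _name, param in model.named_parameters():
        if param.dim() == 4:  # conv weights are NCHW-shaped 4D tensors
            stats["n_total_4d_weights"] += 1
            if param.is_contiguous(memory_format=torch.channels_last):
                stats["n_channels_last_weights"] += 1
    return stats

# A freshly built conv stores weights NCHW-contiguous...
model = nn.Conv2d(3, 8, kernel_size=3)
print(channels_last_weight_stats(model))
# ...and converting the module flips its 4D weights to channels-last.
model = model.to(memory_format=torch.channels_last)
print(channels_last_weight_stats(model))
```

For a well-configured model one would expect every 4D weight to end up channels-last, matching the n_channels_last_weights == n_total_4d_weights pattern in the output above.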
[screenshot: diagnostics output]

example of child timings within decoder block:
[screenshot: child timings within the decoder block]

The regression tests only cover the cases without apex group norm because the testing environment doesn't have apex installed.

Comment on lines 194 to 197
def get_diagnostics(self) -> dict:
    return self._channels_last_diagnostics.to_dict()

def run_instance(self, timer) -> TensorDict:
Contributor:

Issue: This API gives two ways to get diagnostics, and it's unexpected that they're gathered separately from run_instance.

Suggestion: rather than add a new get_diagnostics method whose results aren't specific to any run_instance invocation, update the return type of run_instance to incorporate these diagnostics somehow. I guess the two types of data returned are "regression" and "diagnostic" data, so maybe a type that has two dict[str, Tensor] attributes, one for each of these?
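One shape this suggestion could take (the class and attribute names here are hypothetical, not from the PR):

```python
from dataclasses import dataclass, field

# Hypothetical return type for run_instance, splitting regression-tested
# tensors from run-level diagnostics. Illustrative names only.
@dataclass
class InstanceResult:
    regression: dict = field(default_factory=dict)   # name -> Tensor
    diagnostics: dict = field(default_factory=dict)  # counters, layer lists, ...

result = InstanceResult(
    regression={"output": [0.0, 1.0]},  # would be tensors in practice
    diagnostics={"n_conversions_to_channels_last": 0},
)
```

The regression harness would then only ever compare result.regression, while the benchmark runner reports result.diagnostics alongside timings.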

skips = []
aux = x
for name, block in self.enc.items():
    with enc_timer.child(name):
Contributor:

Is it helpful to have a child timer for each block here? Maybe the answer is yes? But alternatively you might consider just having the encoder and decoder children and leaving it there. I found the child blocks were most helpful when they were a) large enough to be a meaningful amount of the time, where I might consider optimizing it, and b) understandable and interpretable by a name for a set of operations. Right now the plot has a lot of unreadable boxes.

The boxes are in order of execution though - it looks like what's happening is a lot of time is being spent in the first and last blocks (the highest resolution/least coarsening?). So maybe this is a useful piece of information to get out of the benchmark.

@mcgibbon (Contributor):

Maybe it would be more helpful to have, for example, the blocks for "###x###_blocks" grouped all together, separate from the _up time.

Another possible way to accomplish it: if the called modules also take in the Timer, they can invoke the .child call inside their own forward function. That way a given "kind" of call always goes into the same bucket to be optimized.
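A minimal sketch of that pattern, using a stand-in Timer (the real benchmark Timer's API is assumed here, not shown in this PR excerpt):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Timer:
    """Minimal stand-in for the benchmark Timer (hypothetical API)."""

    def __init__(self):
        self.totals = defaultdict(float)  # name -> accumulated seconds
        self.counts = defaultdict(int)    # name -> number of child calls

    @contextmanager
    def child(self, name):
        start = time.perf_counter()
        try:
            yield self
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

class Block:
    """Toy module: its forward opens a child scope named after its kind,
    so every call of the same kind lands in the same timing bucket."""

    def __init__(self, kind):
        self.kind = kind

    def forward(self, x, timer):
        with timer.child(self.kind):
            return x + 1

timer = Timer()
x = 0
for block in [Block("enc_block"), Block("enc_block"), Block("enc_up")]:
    x = block.forward(x, timer)
# timer.counts is now {"enc_block": 2, "enc_up": 1}
```

The caller no longer needs to know the block names; each module kind accumulates into its own bucket regardless of where it is invoked.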

def test_run_benchmark():
    def benchmark_fn(timer):
        torch.cuda._sleep(100_000_000)
        return {}
Contributor Author:

This expects a tensor dict to be returned; if not, there will be an error when test_regression checks for the key.

Contributor:

Suggestion (optional): Maybe we can update the return type to TensorDict | None, and have None act as a sentinel that indicates regression is not implemented for this checkpoint? Like, it should not be None (and we should check-raise) if the "get regression initial condition" method is implemented.

Contributor:

Re-noting this optional suggestion since you didn't reply on it, it's still optional.

os.path.join(DIR, "testdata", f"{benchmark_name}-regression.pt"),
)
if "diagnostics" in regression_result:
    # Split into two files so we don't have to check nested dicts
Contributor:

Issue: There's no constraint that regression_result contain "output" as a key, and if it doesn't this will error.

Suggestion: Update validate_tensor_dict to work on nested dicts instead, using a recursive helper function. It makes this code a bit cleaner/clearer, and avoids needing twice as many regression files.
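A sketch of the recursive approach (the function name and the nested-dict shape are taken from the snippets in this thread; the final code may differ):

```python
import torch

def validate_tensor_dict(expected: dict, actual: dict, prefix: str = "") -> None:
    """Recursively compare two nested dicts whose leaves are tensors.

    Illustrative sketch of the reviewer's suggestion, not the PR's code.
    """
    assert expected.keys() == actual.keys(), (
        f"Key mismatch at '{prefix or '<root>'}': "
        f"{sorted(expected)} vs {sorted(actual)}"
    )
    for key, value in expected.items():
        key_path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts (e.g. "output" and "diagnostics").
            validate_tensor_dict(value, actual[key], prefix=key_path)
        else:
            torch.testing.assert_close(actual[key], value)
```

With this, one regression file can hold both the "output" and "diagnostics" sub-dicts, and the helper walks whatever nesting it finds.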

Contributor Author:

Updated to work on nested dicts


class SongUNetv2BenchmarkBf16(SongUNetv2Benchmark):
    @classmethod
    def new(cls) -> Self:
Contributor:

Comment: I am wary of any inheritance with features added/changed but this is kinda nice.

Comment on lines +347 to +350
register_benchmark("songunetv2")(SongUNetv2Benchmark)
register_benchmark("songunetv2_bf16")(SongUNetv2BenchmarkBf16)
register_benchmark("songunetv2_apex")(SongUNetv2BenchmarkApex)
register_benchmark("songunetv2_apex_bf16")(SongUNetv2BenchmarkApexBf16)
Contributor:

Question: How long do these added benchmarks take to run?

Contributor Author:

~14 s for the regular benchmark regression and about twice that for the bf16 one. It runs on CPU, which is very slow for the UNets, and apparently bf16 on CPU is also extra slow.

Contributor:

I thought the benchmarks only run on GPU / that it raises an error if we try to benchmark on CPU? The CUDA timer will raise an error if it's run on CPU.

Are you saying the regression tests take 14s? That's quite long, is there any way to get code pathway coverage with a smaller config?

for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
Contributor:

Comment: I don't suggest changing it this way, but wanted to note that if you call enc_timer.child(level_name): multiple times with the same level_name, it will add the times together when it shows the plot. For example if you had kept the loops as you had it the first time but transformed the names to level names. The main difference is the json logs will indicate the child was called more times. I do slightly prefer this way but wanted to point it out as a feature.
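For illustration, the grouping in the snippet above behaves like this on hypothetical block names following the "<res>_<kind>" pattern implied by the split("_", 1)[0] key (the real names live in self.enc):

```python
import itertools

# Hypothetical encoder block names; assumed for illustration only.
names = ["32x32_conv", "32x32_block0", "16x16_down", "16x16_block0", "16x16_block1"]

levels = {
    level: list(group)
    for level, group in itertools.groupby(names, key=lambda n: n.split("_", 1)[0])
}
# Note: groupby only merges *consecutive* items with the same key, so this
# relies on self.enc being an ordered mapping already grouped by level.
```

Each level name then becomes one child timer, with all of that resolution's blocks timed under it.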

for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
Contributor:

If you decide to add further child timers you can do with enc_timer.child(level_name) as level_timer:

@mcgibbon (Contributor):

A test seems to be failing.

root_avg = avg_time(root)
root_count = root.count

def avg_total_time_per_iter(t: TimerResult) -> float:
Contributor Author:

Edited this function to add up the times if a timer was repeated. Moved it so that root.count was available to reference.

Contributor:

Issue: I'm a little confused when I read "average total time", since these are contradictory, and it's not obvious to me what "per iter" means at first read (I can tell you mean per benchmark iteration, but at first I thought it meant per child call).

Suggestions: rename to avg_time_per_root or avg_time_per_root_iter to make it clearer, or back to avg_time to leave it vague but not misleading/confusing.

for _ in range(iters):
with timer:
benchmark.run_instance(timer)
benchmark_result = benchmark.run_instance(timer)
Contributor:

Issue: It's a little confusing having this called a benchmark_result but not being the same type as the returned value which is a BenchmarkResult. Maybe iter_result or last_iter_result instead?

Suggestion (optional, better in a later PR if at all): Which also raises a question, do we want just the last diagnostics or do we want the total/average across results? You could keep a running total owned by this method (instead of by the benchmark), similarly to how the local instance of timer is meant to keep the running total of times.

Contributor Author:

do we want just the last diagnostics or do we want the total/average across results
Good point, probably the latter but in a later PR

def _json_default(obj):
    """json.dumps ``default`` hook that converts tensors to Python scalars."""
    if isinstance(obj, torch.Tensor):
        return obj.item()
Contributor:

Question: What will happen if the tensor is not a scalar? Is that behavior expected?

Contributor Author:

Good catch, fixed so non-scalar tensors are also handled.

This is a good example of why I was wary of using AI code tools to help write functions... I had a feeling I'd get complacent and stuff like this would slip through.


@pytest.mark.parametrize("benchmark_name", BENCHMARKS.keys())
def test_regression(benchmark_name: str):
def test_regression(benchmark_name: str, very_fast_only: bool):
Contributor:

Question: Why is this now required? It means we no longer get very-fast coverage for the existing benchmarks also. Is it possible to set up the regression with a small enough problem (maybe a smaller spatial domain in this case) that it runs fast enough?



def _assert_close(
    x: NestedTensorDict, y: NestedTensorDict, prefix: str, **assert_close_kwargs
Contributor:

Issue: The prefix logic is confusing, and I don't understand what its use case is. It seems to be hard-coded to "" below. Can we remove it?

if isinstance(v, torch.Tensor):
    assert isinstance(
        y_val, torch.Tensor
    ), f"Expected tensor at '{key_path}' but got dict"
Contributor:

Issue: This assertion error will be wrong/misleading if y_val is another non-tensor type, such as None. I know in practice the way you're currently using this function that can't happen (since it always comes from a loaded dict), but the call signature of this helper function doesn't know that.

Suggestion: Maybe use a "got {type(y_val)}"?

Comment on lines 269 to 278
return cls._new_with_params(
    img_resolution=32,
    B=1,
    in_channels=6,
    out_channels=4,
    label_dim=0,
    model_channels=64,
    channel_mult=[1, 2, 2, 2],
    use_apex_gn=False,
)
Contributor:

Sorry for not catching this earlier.

Issue: the configuration here is identical to the configuration for new. You want to make the new configuration one that is helpful for debugging/diagnosing optimization effects on GPU and fully occupies the GPU. In new_for_regression, you want a config that runs sufficiently fast for regression testing on CPU, and produces a small enough checkpoint to save to the repo.

Suggestion:

Suggested change
return cls._new_with_params(
    img_resolution=32,
    B=1,
    in_channels=6,
    out_channels=4,
    label_dim=0,
    model_channels=64,
    channel_mult=[1, 2, 2, 2],
    use_apex_gn=False,
)
return cls._new_with_params(
    img_resolution=8,
    B=1,
    in_channels=6,
    out_channels=4,
    label_dim=0,
    model_channels=16,
    channel_mult=[1, 2],
    use_apex_gn=False,
)

(assuming len(channel_mult) is what determines the depth of the u-net)

I would suggest similar changes in the other benchmark(s). With the faster runtime you could consider including regression on more of the benchmarks.

@AnnaKwa (Contributor Author) commented Feb 27, 2026:

Thanks, the existing two are now on par with the duration of the others (<0.1 s). I removed the --very-fast-only part.

I also updated the regular new params to match the apex benchmarks for better comparison.

I'll still exclude the 'with apex' benchmarks from regression testing, as the unit testing environment doesn't have apex installed (installing it takes a while).


AnnaKwa commented Feb 27, 2026

Suggestion (optional): Maybe we can update the return type to TensorDict | None, and have None act as a sentinel that indicates regression is not implemented for this checkpoint? Like, it should not be None (and we should check-raise) if the "get regression initial condition" method is implemented.

(For some reason I couldn't reply to the original comment.)

Left as is, since this isn't a regression test and the later references in BenchmarkResult assume a dict return type (diagnostics=benchmark_result.get("diagnostics", {})).
