Conversation
Co-authored-by: Cursor <cursoragent@cursor.com>
```python
def get_diagnostics(self) -> dict:
    return self._channels_last_diagnostics.to_dict()

def run_instance(self, timer) -> TensorDict:
```
Issue: This API gives two ways to get diagnostics, and it's unexpected that they're gathered separately from run_instance.
Suggestion: rather than add a new get_diagnostics method whose results aren't specific to any run_instance invocation, update the return type of run_instance to incorporate these diagnostics somehow. I guess the two types of data returned are "regression" and "diagnostic" data, so maybe a type that has two dict[str: Tensor] attributes, one for each of these?
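A minimal sketch of what that two-attribute result type could look like (`InstanceResult` and its field names are placeholders, and a plain `float` stands in for `torch.Tensor` so the sketch runs without torch):

```python
from dataclasses import dataclass, field

# Placeholder: a float stands in for torch.Tensor in this sketch.
Tensor = float

@dataclass
class InstanceResult:
    """Hypothetical run_instance return type bundling both kinds of data."""
    regression: dict[str, Tensor] = field(default_factory=dict)   # compared against saved regression files
    diagnostics: dict[str, Tensor] = field(default_factory=dict)  # e.g. memory-format conversion counts

result = InstanceResult(
    regression={"output": 1.0},
    diagnostics={"channels_last_conversions": 0.0},
)
```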
```python
skips = []
aux = x
for name, block in self.enc.items():
    with enc_timer.child(name):
```
Is it helpful to have a child timer for each block here? Maybe the answer is yes? But alternatively you might consider just having the encoder and decoder children and leaving it there. I found the child blocks were most helpful when they were a) large enough to be a meaningful fraction of the time, where I might consider optimizing them, and b) understandable and interpretable as a named set of operations. Right now the plot has a lot of unreadable boxes.
The boxes are in order of execution though - it looks like a lot of time is being spent in the first and last blocks (the highest resolution/least coarsening?). So maybe this is a useful piece of information to get out of the benchmark.

Maybe it would be more helpful to group the timings, e.g. all of the "###x###_blocks" together, separate from the _up time. Another possible way to accomplish it: if the called modules also take in the Timer, they can invoke the .child call inside their own forward function. That way a given "kind" of call always goes into the same bucket to be optimized.
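A toy sketch of the "modules take the Timer and open their own child scope" idea (`Timer` and `Block` here are stand-ins, not the repo's classes):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class Timer:
    """Toy timer: accumulates wall time per named bucket across repeated calls."""
    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def child(self, name):
        start = time.perf_counter()
        try:
            yield self
        finally:
            self.totals[name] += time.perf_counter() - start

class Block:
    def __init__(self, kind):
        self.kind = kind  # e.g. "32x32_block" or "up"

    def forward(self, x, timer):
        # The module itself opens the child scope, so every call of this
        # "kind" lands in the same bucket regardless of call site.
        with timer.child(self.kind):
            return x + 1

timer = Timer()
blocks = [Block("32x32_block"), Block("32x32_block"), Block("up")]
x = 0
for b in blocks:
    x = b.forward(x, timer)
```

Both `32x32_block` calls accumulate into one bucket, which is the "same kind, same bucket" behavior described above.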
```python
def test_run_benchmark():
    def benchmark_fn(timer):
        torch.cuda._sleep(100_000_000)
        return {}
```
This expects a tensor dict to be returned; if not, there will be an error when test_regression checks for the key.
Suggestion (optional): Maybe we can update the return type to TensorDict | None, and have None act as a sentinel that indicates regression is not implemented for this checkpoint? Like, it should not be None (and we should check-raise) if the "get regression initial condition" method is implemented.
Re-noting this optional suggestion since you didn't reply to it; it's still optional.
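A sketch of the None-sentinel contract under discussion (class and method names here are illustrative, not the actual API):

```python
from typing import Dict, Optional

TensorDict = Dict[str, float]  # stand-in for the real dict[str, torch.Tensor]

class Benchmark:
    """Hypothetical base class showing the None-sentinel contract."""
    def get_regression_initial_condition(self):
        raise NotImplementedError  # subclasses opt in to regression testing

    def run_instance(self, timer) -> Optional[TensorDict]:
        return None  # None => regression is not implemented for this benchmark

def check_regression_contract(benchmark, timer) -> Optional[TensorDict]:
    """Raise if regression is implemented but run_instance returned the sentinel."""
    result = benchmark.run_instance(timer)
    supports_regression = True
    try:
        benchmark.get_regression_initial_condition()
    except NotImplementedError:
        supports_regression = False
    if supports_regression and result is None:
        raise ValueError("run_instance returned None but regression is implemented")
    return result
```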
fme/core/benchmark/test_benchmark.py
```python
    os.path.join(DIR, "testdata", f"{benchmark_name}-regression.pt"),
)
if "diagnostics" in regression_result:
    # Split into two files so we don't have to check nested dicts
```
Issue: There's no constraint that regression_result contain "output" as a key, and if it doesn't this will error.
Suggestion: Update validate_tensor_dict to work on nested dicts instead, using a recursive helper function. It makes this code a bit cleaner/clearer, and avoids needing twice as many regression files.
Updated to work on nested dicts
```python
class SongUNetv2BenchmarkBf16(SongUNetv2Benchmark):
    @classmethod
    def new(cls) -> Self:
```
Comment: I am wary of any inheritance with features added/changed but this is kinda nice.
```python
register_benchmark("songunetv2")(SongUNetv2Benchmark)
register_benchmark("songunetv2_bf16")(SongUNetv2BenchmarkBf16)
register_benchmark("songunetv2_apex")(SongUNetv2BenchmarkApex)
register_benchmark("songunetv2_apex_bf16")(SongUNetv2BenchmarkApexBf16)
```
Question: How long do these added benchmarks take to run?
~14 s for the regular benchmark and twice as long for the bf16 regression. Since it's on CPU, that is very slow for the unets, and apparently bf16 on CPU is also extra slow.
I thought the benchmarks only run on GPU / that it raises an error if we try to benchmark on CPU? The CUDA timer will raise an error if it's run on CPU.
Are you saying the regression tests take 14s? That's quite long, is there any way to get code pathway coverage with a smaller config?
```python
for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
```
Comment: I don't suggest changing it this way, but wanted to note that if you call `enc_timer.child(level_name)` multiple times with the same `level_name`, it will add the times together when it shows the plot - for example if you had kept the loops as you had them the first time but transformed the names to level names. The main difference is the json logs will indicate the child was called more times. I do slightly prefer this way but wanted to point it out as a feature.
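For reference, `itertools.groupby` only merges *consecutive* items with the same key, so this relies on the block dict being ordered by level. With hypothetical block names:

```python
import itertools

# Hypothetical encoder block names in registration order.
names = ["32x32_block0", "32x32_block1", "16x16_down", "16x16_block0"]
levels = [
    (level, list(group))
    for level, group in itertools.groupby(names, key=lambda n: n.split("_", 1)[0])
]
# Two groups: one per resolution-level prefix.
```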
```python
for level_name, level_blocks in itertools.groupby(
    self.enc.items(), key=lambda item: item[0].split("_", 1)[0]
):
    with enc_timer.child(level_name):
```
If you decide to add further child timers you can do `with enc_timer.child(level_name) as level_timer:`.

A test seems to be failing.
fme/core/benchmark/benchmark.py
```python
root_avg = avg_time(root)
root_count = root.count


def avg_total_time_per_iter(t: TimerResult) -> float:
```
Edited this function to add up the times if a timer was repeated. Moved so that root.count was available to reference.
Issue: I'm a little confused when I read "average total time", since these are contradictory, and it's not obvious to me what "per iter" means at first read (I can tell you mean per benchmark iteration, but at first I thought it meant per child call).
Suggestions: rename to avg_time_per_root or avg_time_per_root_iter to make it clearer, or back to avg_time to leave it vague but not misleading/confusing.
```diff
 for _ in range(iters):
     with timer:
-        benchmark.run_instance(timer)
+        benchmark_result = benchmark.run_instance(timer)
```
Issue: It's a little confusing having this called a benchmark_result but not being the same type as the returned value which is a BenchmarkResult. Maybe iter_result or last_iter_result instead?
Suggestion (optional, better in a later PR if at all): Which also raises a question, do we want just the last diagnostics or do we want the total/average across results? You could keep a running total owned by this method (instead of by the benchmark), similarly to how the local instance of timer is meant to keep the running total of times.
> do we want just the last diagnostics or do we want the total/average across results
Good point, probably the latter but in a later PR
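A sketch of the "running total owned by the run loop" idea, with a plain callable standing in for `benchmark.run_instance` and floats standing in for tensors:

```python
from collections import defaultdict

def run_benchmark(benchmark_fn, iters):
    """Accumulate per-iteration diagnostics instead of keeping only the last."""
    totals = defaultdict(float)
    for _ in range(iters):
        iter_result = benchmark_fn()  # stand-in for benchmark.run_instance(timer)
        for key, value in iter_result.items():
            totals[key] += value
    # Report the per-iteration average, analogous to how the local timer
    # instance accumulates times across iterations.
    return {key: total / iters for key, total in totals.items()}

averages = run_benchmark(lambda: {"conversions": 2.0}, iters=4)
```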
fme/core/benchmark/run.py
```python
def _json_default(obj):
    """json.dumps ``default`` hook that converts tensors to Python scalars."""
    if isinstance(obj, torch.Tensor):
        return obj.item()
```
Question: What will happen if the tensor is not a scalar? Is that behavior expected?
Good catch, fixed so non-scalar tensors are also handled.
This is a good example of why I was wary of using AI code tools to help write functions... I had a feeling I'd get complacent and stuff like this would slip through.
fme/core/benchmark/test_benchmark.py
```diff
 @pytest.mark.parametrize("benchmark_name", BENCHMARKS.keys())
-def test_regression(benchmark_name: str):
+def test_regression(benchmark_name: str, very_fast_only: bool):
```
Question: Why is this now required? It means we no longer get very-fast coverage for the existing benchmarks also. Is it possible to set up the regression with a small enough problem (maybe a smaller spatial domain in this case) that it runs fast enough?
fme/core/testing/regression.py
```python
def _assert_close(
    x: NestedTensorDict, y: NestedTensorDict, prefix: str, **assert_close_kwargs
```
Issue: The prefix logic is confusing, and I don't understand what its use case is. It seems to be hard-coded to "" below. Can we remove it?
fme/core/testing/regression.py
```python
if isinstance(v, torch.Tensor):
    assert isinstance(
        y_val, torch.Tensor
    ), f"Expected tensor at '{key_path}' but got dict"
```
Issue: This assertion error will be wrong/misleading if y_val is another non-tensor type, such as None. I know in practice the way you're currently using this function that can't happen (since it always comes from a loaded dict), but the call signature of this helper function doesn't know that.
Suggestion: Maybe use a "got {type(y_val)}"?
```python
return cls._new_with_params(
    img_resolution=32,
    B=1,
    in_channels=6,
    out_channels=4,
    label_dim=0,
    model_channels=64,
    channel_mult=[1, 2, 2, 2],
    use_apex_gn=False,
)
```
Sorry for not catching this earlier.
Issue: the configuration here is identical to the configuration for new. You want to make the new configuration one that is helpful for debugging/diagnosing optimization effects on GPU and fully occupies the GPU. In new_for_regression, you want a config that runs sufficiently fast for regression testing on CPU, and produces a small enough checkpoint to save to the repo.
Suggestion:
```diff
-        return cls._new_with_params(
-            img_resolution=32,
-            B=1,
-            in_channels=6,
-            out_channels=4,
-            label_dim=0,
-            model_channels=64,
-            channel_mult=[1, 2, 2, 2],
-            use_apex_gn=False,
-        )
+        return cls._new_with_params(
+            img_resolution=8,
+            B=1,
+            in_channels=6,
+            out_channels=4,
+            label_dim=0,
+            model_channels=16,
+            channel_mult=[1, 2],
+            use_apex_gn=False,
+        )
```
(assuming len(channel_mult) is what determines the depth of the u-net)
I would suggest similar changes in the other benchmark(s). With the faster runtime you could consider including regression on more of the benchmarks.
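If `len(channel_mult)` does determine the depth (the parenthetical's assumption), the suggested regression config shakes out roughly as:

```python
# Hypothetical reading of the suggested regression config: one resolution
# level per channel_mult entry, with channels-per-level = model_channels * mult.
model_channels = 16
channel_mult = [1, 2]
level_channels = [model_channels * m for m in channel_mult]
num_levels = len(channel_mult)  # shallower u-net than the 4-level GPU config
```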
Thanks, the existing two are now on par with the duration of the others (<.1s). I removed the --very-fast-only part.
I also updated the regular new params to match the apex benchmarks for better comparison.
I'll still exclude the 'with apex' benchmarks from regression testing as the unit testing environment doesn't have apex installed (it takes a while).
(For some reason I couldn't reply to the original comment.) Left as is, since this isn't a regression test and the later references in …
This PR adds benchmarks for the SongUNetv2 module. The variants are with/without using bfloat16 and apex group norm.
It also adds the option to record additional diagnostics beyond timer and memory. For this benchmark I wanted to record the number of memory format conversions that occur, to check that this is zero for properly configured (allowed set of nchannels) SongUNetv2 models using apex group norm. If this is nonzero, the output also includes which layers trigger the conversion.
e.g. (example diagnostics output omitted)

Example of child timings within a decoder block: (timing plot image omitted)
The regression tests only cover the cases without apex group norm because the testing environment doesn't have apex installed.