[1/n] Add generalized event types and GPU Performance Monitoring counter event support #1212
+185
−92
Overview
Originally, this was part 1 of splitting PR #1148. It adds support for a new kind of GPU counter event that is published to the timeline as a time series. In the process I realized we should add more generic event types for accelerators rather than being tied to CUDA-specific naming. That naming has historically led to each new accelerator adding its own events, which is a maintenance burden.
The `ActivityType` enum class definition now contains aliases for older events such as `CUDA_RUNTIME` and `MTIA_RUNTIME`, to name a few. The dynamic plugin will leverage generic events so upstream source code changes will not be required for new accelerators.
Details
Accelerator-Agnostic Event Types
The change reorganizes the `ActivityType` enum class to introduce generic, accelerator-agnostic event types that work across all hardware backends (CUDA, MTIA, HPU, XPU, etc.). Device-specific types are now deprecated aliases pointing to their generic counterparts. There are a few corner-case exceptions, such as `MTIA_INSIGHT` and `CUDA_SYNC`; see the header.
New Generic Event Types
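| Generic type | Deprecated aliases |
| --- | --- |
| `RUNTIME` | `CUDA_RUNTIME`, `MTIA_RUNTIME`, `GLOW_RUNTIME`, `XPU_RUNTIME`, `PRIVATEUSE1_RUNTIME`, `HPU_OP` |
| `DRIVER` | `CUDA_DRIVER`, `PRIVATEUSE1_DRIVER` |
| `CONCURRENT_KERNEL` | `MTIA_CCP_EVENTS` |
| `GPU_PM_COUNTER` | (none, new type) |

Guidance for future use of ActivityTypes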
Existing code using deprecated aliases will continue to work, but new code should use the generic types.
I have not changed the usage of these types in the code base yet. That can happen in a follow-up change.
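As a rough illustration of the guidance above, the sketch below shows how deprecated aliases and generic types can coexist in one `enum class`. The enumerator names come from this description, but the numeric values, the ordering, the underlying type, and the helper `isRuntimeEvent` are assumptions made for illustration; this is not the actual kineto header.

```cpp
#include <cstdint>

// Hypothetical, trimmed-down stand-in for kineto's ActivityType.
// Values, ordering, and underlying type are illustrative assumptions.
enum class ActivityType : std::uint8_t {
  // Generic, accelerator-agnostic types.
  RUNTIME = 0,
  DRIVER,
  CONCURRENT_KERNEL,
  GPU_PM_COUNTER,

  // Deprecated device-specific aliases: each shares the value of its
  // generic counterpart, so both names denote the same enumerator value.
  CUDA_RUNTIME = RUNTIME,
  MTIA_RUNTIME = RUNTIME,
  CUDA_DRIVER = DRIVER,
  MTIA_CCP_EVENTS = CONCURRENT_KERNEL,
};

// Hypothetical helper: new code compares against the generic type only.
constexpr bool isRuntimeEvent(ActivityType t) {
  return t == ActivityType::RUNTIME;
}

// Old call sites that pass an alias keep compiling and behave the same.
static_assert(isRuntimeEvent(ActivityType::CUDA_RUNTIME),
              "alias maps to RUNTIME");
static_assert(ActivityType::CUDA_DRIVER == ActivityType::DRIVER,
              "alias maps to DRIVER");
```

Because an alias and its generic type are the same value, a `switch` over `ActivityType` cannot list both as separate `case` labels; that is one practical reason to prefer the generic names in new code.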
Notes
`defaultActivityTypesArray` is now `constexpr`, enabling compile-time evaluation and eliminating runtime overhead.
GPU PM Counter Events
This is a straightforward change to emit Chrome Trace counter events for counters obtained from the GPU. The event can be leveraged by any accelerator backend.
The values of the counters are embedded as key/value pairs in the output JSON.
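To make the output shape concrete, here is a minimal sketch of formatting one counter sample as a Chrome Trace counter event (phase `"C"`, with the counter values under `args`). The `CounterSample` struct, the `toCounterEventJson` function, and the counter names are hypothetical and are not taken from the kineto sources; the sketch also assumes names need no JSON escaping.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for one sample of GPU counters; not a kineto type.
struct CounterSample {
  std::string name;   // event name shown on the counter track
  std::int64_t tsUs;  // timestamp in microseconds
  int pid;            // process (or device) id that owns the track
  std::vector<std::pair<std::string, double>> values;  // counter -> value
};

// Formats the sample as a Chrome Trace counter event ("ph": "C").
// Assumption: names contain no characters that need JSON escaping.
std::string toCounterEventJson(const CounterSample& s) {
  std::ostringstream out;
  out << "{\"ph\": \"C\", \"name\": \"" << s.name << "\", \"pid\": " << s.pid
      << ", \"ts\": " << s.tsUs << ", \"args\": {";
  for (std::size_t i = 0; i < s.values.size(); ++i) {
    if (i > 0) {
      out << ", ";
    }
    // Each counter becomes one key/value pair inside "args".
    out << "\"" << s.values[i].first << "\": " << s.values[i].second;
  }
  out << "}}";
  return out.str();
}
```

A trace viewer plots each key in `args` as its own series over time, which is what turns successive samples into a time series on the timeline.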
Testing
The `ParseTest.ActivityTypes` test validates that aliases in the Config string are correctly converted to the underlying base type. Since the enum class uses aliases, all existing code in kineto that uses the original activity types continues to compile and work as expected.
The test above also checks that the default activity types are unchanged:
https://github.com/pytorch/kineto/blob/main/libkineto/test/ConfigTest.cpp#L89-L108
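The property being tested can be sketched in isolation as follows. This is not the code in `ConfigTest.cpp`: the enum is a cut-down stand-in, and the lowercase config names and the `parseActivityTypes` function are assumptions made for illustration.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <unordered_map>

// Cut-down stand-in for the real enum; values are assumptions.
enum class ActivityType {
  RUNTIME,
  DRIVER,
  CUDA_RUNTIME = RUNTIME,  // deprecated alias
  MTIA_RUNTIME = RUNTIME,  // deprecated alias
  CUDA_DRIVER = DRIVER,    // deprecated alias
};

// Hypothetical parser: maps comma-separated names to activity types.
// Unknown names are silently skipped in this sketch.
std::set<ActivityType> parseActivityTypes(const std::string& csv) {
  // The config spellings below are assumed, not kineto's actual strings.
  static const std::unordered_map<std::string, ActivityType> kNames = {
      {"runtime", ActivityType::RUNTIME},
      {"driver", ActivityType::DRIVER},
      {"cuda_runtime", ActivityType::CUDA_RUNTIME},
      {"mtia_runtime", ActivityType::MTIA_RUNTIME},
      {"cuda_driver", ActivityType::CUDA_DRIVER},
  };
  std::set<ActivityType> result;
  std::istringstream in(csv);
  std::string token;
  while (std::getline(in, token, ',')) {
    auto it = kNames.find(token);
    if (it != kNames.end()) {
      result.insert(it->second);  // aliases collapse to the base type
    }
  }
  return result;
}
```

With this sketch, parsing `"cuda_runtime,mtia_runtime"` yields a set holding only `ActivityType::RUNTIME`, since both names resolve to the same underlying value.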