Hi team, I am one of the maintainers of OpenLIT, and we support observability for GPUs (both NVIDIA and AMD). It is all OpenTelemetry-native, so the metrics can be sent to any platform like Grafana or any OSS OTel tool. This is an example for LLM O11y.
I see this repo houses deep learning examples, so I'm not sure if it's the right place for it?
Let me know and I can add it. Also, are you folks by any chance looking for an updated version of https://lambdalabs.com/blog/keeping-an-eye-on-your-gpus-2?srsltid=AfmBOoqoMEvfPZLRFZ-ZhL0Get6F7rMOHWVOaDo2ovkY6dtcN2vpyFPV ?
Would love to contribute!