Training Systems & AI Infrastructure Engineer
I work on large-scale training systems and AI infrastructure, focusing on distributed training, performance optimization, and reproducible workflows.
My interests span system-level tooling, backend integration, and benchmarking for model training at scale.
- Large-scale model training systems
- Distributed & heterogeneous computing
- Performance benchmarking & diagnostics
- Training infrastructure tooling
- Languages: Python, C++, Bash
- Frameworks: PyTorch, Megatron-style training systems
- Systems: Distributed training, GPU performance, profiling
- Tooling: Benchmarking, preflight checks, reproducibility
- Unified training entry point and tooling for multi-backend systems
- Training performance benchmarking and diagnostics
- Distributed training workflow design
📫 Contact
- Email: xmpeng.dev@gmail.com
- GitHub: https://github.com/xmpeng-dev