diff --git a/content/en/events/upcoming-events/gsoc-2026.md b/content/en/events/upcoming-events/gsoc-2026.md
index ed2529cea1..5ec96bca71 100644
--- a/content/en/events/upcoming-events/gsoc-2026.md
+++ b/content/en/events/upcoming-events/gsoc-2026.md
@@ -297,3 +297,55 @@ Tracking issue: https://github.com/kubeflow/sdk/issues/238
 - Familiarity with the Kubeflow SDK and Trainer codebase.
 - Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
 - Engage and contribute to Kubeflow community on Slack and GitHub.
+
+### Project 10: Dynamic LLM Trainer Framework for Kubeflow
+
+**Components:**
+[kubeflow/trainer](https://www.github.com/kubeflow/trainer),
+[kubeflow/sdk](https://www.github.com/kubeflow/sdk)
+
+**Mentors:**
+[@tariq-hasan](https://github.com/tariq-hasan),
+[@andreyvelich](https://github.com/andreyvelich)
+
+**Contributor:**
+
+**Details:**
+
+Kubeflow Trainer provides Kubernetes-native distributed ML training with a Python-first experience. It currently supports LLM fine-tuning through TorchTune as a built-in backend, but TorchTune is no longer actively adding new features, limiting support for emerging models and post-training methods (DPO, PPO, ORPO).
+
+This project proposes a **Dynamic LLM Trainer Framework** that decouples Kubeflow Trainer from any single fine-tuning backend. The goal is to introduce a pluggable architecture enabling multiple frameworks to integrate seamlessly while preserving backward compatibility and a simple Python SDK. This builds on the existing plugin architecture in `pkg/runtime/framework/plugins/torch/` and extends the `BuiltinTrainer` pattern in the SDK.
+
+**The framework will provide:**
+
+- A backend-agnostic LLM Trainer interface, symmetric to TrainingRuntime on the control plane
+- Dynamic backend registration for in-tree and external frameworks
+- TorchTune refactored as a first-class pluggable backend
+- Faster day-0/day-1 support for new models and fine-tuning strategies
+- Backward compatibility for existing TorchTune-based workflows
+
+**Initial backends to explore:**
+
+| Backend | Rationale |
+|---------|-----------|
+| TorchTune | Preserve existing functionality |
+| TRL | Industry standard for SFT/DPO/PPO |
+| Unsloth | ~2× faster, ~70% lower memory |
+| LlamaFactory | 100+ model support |
+
+Beyond in-tree backends, the SDK should support external framework registration, mirroring how TrainingRuntime enables custom runtimes.
+
+This project is well-suited for contributors interested in ML systems, API design, and bridging modern LLM tooling with production Kubernetes platforms.
+
+Tracking issue: [kubeflow/trainer#2839](https://github.com/kubeflow/trainer/issues/2839)
+
+**Difficulty:** Hard
+
+**Size:** 350 hours (Large)
+
+**Skills Required/Preferred:**
+- Python, Go
+- Familiarity with Kubernetes and Kubeflow Trainer architecture
+- Experience with LLM fine-tuning frameworks (TRL, TorchTune, Unsloth)
+- Understanding of distributed training concepts
+- Interest in API and framework design