Merged
52 changes: 52 additions & 0 deletions content/en/events/upcoming-events/gsoc-2026.md
@@ -297,3 +297,55 @@ Tracking issue: https://github.com/kubeflow/sdk/issues/238
- Familiarity with the Kubeflow SDK and Trainer codebase.
- Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
- Engage and contribute to Kubeflow community on Slack and GitHub.

### Project 10: Dynamic LLM Trainer Framework for Kubeflow

**Components:**
[kubeflow/trainer](https://github.com/kubeflow/trainer),
[kubeflow/sdk](https://github.com/kubeflow/sdk)

**Mentors:**
[@tariq-hasan](https://github.com/tariq-hasan),
[@andreyvelich](https://github.com/andreyvelich)
Comment on lines +307 to +309 (Member):
@kramaranya @astefanutti @szaher @abhijeet-dhumal Do you want to help @tariq-hasan with mentoring a student for this work as well?
I guess that could be nice work for Kubeflow SDK enhancements.


**Contributor:**

**Details:**

Kubeflow Trainer provides Kubernetes-native distributed ML training with a Python-first experience. It currently supports LLM fine-tuning through TorchTune as a built-in backend, but TorchTune is no longer actively adding new features, limiting support for emerging models and post-training methods (DPO, PPO, ORPO).

This project proposes a **Dynamic LLM Trainer Framework** that decouples Kubeflow Trainer from any single fine-tuning backend. The goal is to introduce a pluggable architecture enabling multiple frameworks to integrate seamlessly while preserving backward compatibility and a simple Python SDK. This builds on the existing plugin architecture in `pkg/runtime/framework/plugins/torch/` and extends the `BuiltinTrainer` pattern in the SDK.
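To make the idea concrete, here is a minimal Python sketch of what a backend-agnostic trainer interface might look like on the SDK side. All names here (`LLMTrainerBackend`, `FineTuneConfig`, `container_command`, the `sft_recipe` string) are illustrative assumptions for this proposal, not existing Kubeflow Trainer or TorchTune APIs:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    """Minimal, backend-neutral fine-tuning request; fields are illustrative."""
    model: str        # e.g. "meta-llama/Llama-3.2-1B"
    method: str       # e.g. "sft", "dpo", "ppo"
    num_nodes: int = 1


class LLMTrainerBackend(ABC):
    """Contract each fine-tuning framework would implement to plug in."""

    name: str  # registry key, e.g. "torchtune", "trl", "unsloth"

    @abstractmethod
    def container_command(self, config: FineTuneConfig) -> list[str]:
        """Translate the generic config into the backend's launch command."""


class TorchTuneBackend(LLMTrainerBackend):
    """TorchTune refactored as one pluggable backend among several."""

    name = "torchtune"

    def container_command(self, config: FineTuneConfig) -> list[str]:
        # Illustrative only; the real TorchTune CLI takes recipe + config args.
        return ["tune", "run", f"{config.method}_recipe", "--model", config.model]
```

The SDK's existing `BuiltinTrainer` path would then dispatch to whichever backend the user selects, rather than assuming TorchTune.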

**The framework will provide:**

- A backend-agnostic LLM Trainer interface, symmetric to TrainingRuntime on the control plane
- Dynamic backend registration for in-tree and external frameworks
- TorchTune refactored as a first-class pluggable backend
- Faster day-0/day-1 support for new models and fine-tuning strategies
- Backward compatibility for existing TorchTune-based workflows

**Initial backends to explore:**

| Backend | Rationale |
|---------|-----------|
| TorchTune | Preserves existing functionality |
| TRL | Industry standard for SFT/DPO/PPO |
| Unsloth | Claims ~2× faster training and ~70% lower memory use |
| LlamaFactory | Supports 100+ models |

Beyond in-tree backends, the SDK should support external framework registration, mirroring how TrainingRuntime enables custom runtimes.
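A rough sketch of how such external registration could work in the SDK. The registry, the `register_backend` decorator, and the `TRLBackend` command shown are hypothetical illustrations for this proposal, not real Kubeflow or TRL APIs:

```python
from typing import Dict

# Global registry mapping backend names to their implementation classes.
_BACKENDS: Dict[str, type] = {}


def register_backend(cls: type) -> type:
    """Class decorator exposing a backend (in-tree or external) by name."""
    _BACKENDS[cls.name] = cls
    return cls


def get_backend(name: str):
    """Instantiate a registered backend, failing loudly on unknown names."""
    if name not in _BACKENDS:
        raise ValueError(
            f"unknown LLM trainer backend {name!r}; registered: {sorted(_BACKENDS)}"
        )
    return _BACKENDS[name]()


@register_backend
class TRLBackend:
    """Stand-in for an externally registered TRL backend (illustrative only)."""

    name = "trl"

    def launch_args(self, model: str, method: str) -> list[str]:
        # Not TRL's real CLI; shown only to make the registry concrete.
        return ["trl-train", "--method", method, "--model", model]
```

An external package could call `register_backend` at import time, mirroring how TrainingRuntime lets platform teams register custom runtimes on the control plane.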

This project is well-suited for contributors interested in ML systems, API design, and bridging modern LLM tooling with production Kubernetes platforms.

Tracking issue: [kubeflow/trainer#2839](https://github.com/kubeflow/trainer/issues/2839)

**Difficulty:** Hard

**Size:** 350 hours (Large)

**Skills Required/Preferred:**
- Python, Go
- Familiarity with Kubernetes and Kubeflow Trainer architecture
- Experience with LLM fine-tuning frameworks (TRL, TorchTune, Unsloth)
- Understanding of distributed training concepts
- Interest in API and framework design