gsoc: add kubeflow sdk mcp as gsoc 2026 project idea #4290
google-oss-prow[bot] merged 1 commit into kubeflow:master
/area gsoc
Pull request overview
This PR proposes adding "MCP Server for Kubeflow SDK" as a project idea for Google Summer of Code 2026. The project aims to create a Model Context Protocol (MCP) server to enhance developer experience by enabling LLM-powered interaction with Kubeflow SDK training jobs, including status monitoring, debugging, and natural language interaction capabilities.
Changes:
- Added a new GSoC 2026 project proposal for implementing an MCP server for Kubeflow SDK
- Defined project scope including training log exposure, job management CRUD operations, LLM-enhanced validation, and telemetry analysis
- Specified project difficulty as Medium with either 175 or 350 hours commitment
andreyvelich left a comment:
/cc @akshaychitneni @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-trainer-team
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, kubeflow/kubeflow-trainer-team. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
@jaiakash I would like to co-mentor this project with you.
Force-pushed from a959cea to 93c2674.
Hi @jaiakash, may I opt in to co-mentor this effort?
**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)

**Mentors:** [TBD], [@jaiakash](https://www.github.com/jaiakash)
Please add @dhanishaphadate and @abhijeet-dhumal as co-mentors.
Hi @dhanishaphadate @abhijeet-dhumal, thanks for volunteering. I have added you both as co-mentors. Looking forward to a successful journey ahead.
Most of us do use LLMs to create/debug code for jobs, models, etc., but as of now there is no mechanism for the LLM to see the `TrainJob` status, debug a crash loop or provide consolidate metric about our previous task. We want to extend and improve the Developer Experience (DX) with the introduction of a **Model Context Protocol (MCP)** server for Kubeflow ecosystem.

This project aims to initiate the Kubeflow MCP project with first-principles features. The MCP server will help with training jobs, their lifecycle, and associated observability data, enabling natural-language interaction through MCP-compatible clients. Tracking issue: [kubeflow/sdk#238](https://github.com/kubeflow/sdk/issues/238)
You need to update the scope for this project.
Hi @abhijeet-dhumal @dhanishaphadate, please review the scope. For now I have kept it open-ended. We need some good ideas for the initial draft before merging.
We really need to merge it ASAP, but we can always refine the exact scope and features during the next phase of the GSoC timeline.
Thanks for the feedback @andreyvelich! Here's a suggested scope based on the existing MVP and KEP-936:
Core Deliverables (Required):
- MCP tools for TrainJob lifecycle: `fine_tune()`, `get_training_job()`, `list_training_jobs()`, `delete_training_job()`
- Pre-flight validation: `get_cluster_resources()`, `estimate_resources()`, `check_training_prerequisites()`
- Job observability: `get_training_logs()`, `get_job_events()`
- Storage setup: `setup_training_storage()` (PVC creation for checkpoints)
Stretch Goals:
- Policy-based access control (persona-based RBAC)
- Custom trainer support via `run_custom_training()` and `run_container_job()`
- Integration with Model Registry MCP catalog
- Checkpoint listing (requires spawning helper pods)
- Progress tracking (KEP-2779)
@jaiakash let me know if this aligns with your vision for the project scope.
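For illustration only, here is a minimal sketch of how the TrainJob lifecycle tools named in the scope above might be exposed as name-addressable tools. It deliberately uses neither the MCP SDK nor the Kubeflow SDK: the in-memory store, the `tool` decorator, the `call_tool` dispatcher, the argument names, and the "Created" status string are all assumptions made for this sketch and are not taken from the existing MVP or KEP.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class InMemoryTrainJobStore:
    """Hypothetical stand-in for the cluster; a real server would call the Kubeflow SDK."""
    jobs: Dict[str, Dict[str, Any]] = field(default_factory=dict)


STORE = InMemoryTrainJobStore()
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def tool(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Register a function under its own name so a client can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def fine_tune(name: str, model: str) -> Dict[str, Any]:
    # Hypothetical signature: record a new job in an assumed "Created" state.
    job = {"name": name, "model": model, "status": "Created"}
    STORE.jobs[name] = job
    return job


@tool
def get_training_job(name: str) -> Dict[str, Any]:
    # Return the stored job, or a structured error the LLM client can read.
    job = STORE.jobs.get(name)
    if job is None:
        return {"error": f"TrainJob {name!r} not found"}
    return job


@tool
def list_training_jobs() -> Dict[str, Any]:
    # Return job names in sorted order for a stable listing.
    return {"jobs": sorted(STORE.jobs)}


@tool
def delete_training_job(name: str) -> Dict[str, Any]:
    # Report whether a job with this name actually existed.
    removed = STORE.jobs.pop(name, None) is not None
    return {"deleted": removed}


def call_tool(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a tool call by name, returning a structured error for unknown tools."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": f"unknown tool {name!r}"}
    return handler(**arguments)
```

Returning errors as data rather than raising is a design choice for this sketch: it keeps every tool result in a form that an MCP-compatible client could pass back to the model unchanged.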
Looks good to me, will add in PR.
Looks good to me as well.
Job observability: `get_training_logs()`, `get_job_events()`
Since we have added this, I can scope the #4294 project.
Force-pushed from d5926fd to 6e9c363.
**Contributor:**

**Details:**
Small suggestion for the Details section - we could add more context about what already exists:
We already have a KEP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components.
This sets clearer expectations that contributors will be building on existing work rather than starting from scratch.
This is not a strong opinion, though; it can be skipped if not needed.
Yeah, can you share a draft for that? It would be helpful.
Does this draft look good?
Details:
The Kubeflow SDK allows users with limited Kubernetes knowledge to use standard Python APIs to interact with the Kubeflow ecosystem. [Documentation](https://sdk.kubeflow.org/en/latest/index.html)
Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.
We have a [KEP](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
Core Deliverables:
- MCP tools for TrainJob lifecycle (fine_tune, get_training_job, list_training_jobs, delete_training_job)
- Pre-flight validation (get_cluster_resources, estimate_resources, check_training_prerequisites)
- Job observability (get_training_logs, get_job_events)
- Storage setup (setup_training_storage)
Stretch Goals:
- Policy-based access control (persona-based RBAC)
- Custom trainer support (run_custom_training, run_container_job)
- Integration with Model Registry MCP catalog
- Progress tracking (pending KEP-2779)
Tracking issue: https://github.com/kubeflow/sdk/issues/238
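As a rough illustration of the pre-flight validation deliverable in the draft above, the sketch below compares what a job requests against what is free in the cluster and returns a list of human-readable problems. The `Resources` type, the field names, and the signature of `check_training_prerequisites` are assumptions made for this example; the proposal names the tool but does not specify its inputs or outputs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource summary; a real check would read this from the cluster."""
    gpus: int
    memory_gib: float


def check_training_prerequisites(required: Resources, available: Resources) -> Dict[str, object]:
    """Return ok=True when the request fits, otherwise list each shortfall."""
    problems: List[str] = []
    if required.gpus > available.gpus:
        problems.append(
            f"need {required.gpus} GPUs, cluster has {available.gpus} free"
        )
    if required.memory_gib > available.memory_gib:
        problems.append(
            f"need {required.memory_gib} GiB memory, cluster has {available.memory_gib} GiB free"
        )
    return {"ok": not problems, "problems": problems}
```

Reporting every shortfall at once, instead of stopping at the first, lets an LLM client explain the full gap to the user before any job is submitted.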
Looks good to me, @dhanishaphadate what about you?
Yes, adding details about what currently exists is a good idea. Once the initial work is complete and the foundation is in place, we can add those details later when the project is published on the website.
Thank you @jaiakash and @dhanishaphadate!! Excited to work on this together! 🎉
cc @andreyvelich @abhijeet-dhumal @dhanishaphadate
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
### Project 2: MCP Server for Kubeflow SDK

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)
The URLs use https://www.github.com/ but the majority of GitHub links in this file use https://github.com/ without the "www" subdomain (e.g., lines 79, 116, 146, 159, 161, 178, 192, 194). For consistency with the established convention in this file, remove "www." from these URLs.
**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)

**Mentors:** [@jaiakash](https://www.github.com/jaiakash), [@dhanishaphadate](https://www.github.com/dhanishaphadate), [@abhijeet-dhumal](https://www.github.com/abhijeet-dhumal)
The mentor GitHub URLs use https://www.github.com/ but should use https://github.com/ without the "www" subdomain for consistency with other mentor links (see line 178 in Project 3).
Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
Missing article before "existing". The sentence should read "We have a [kubeflow/community#936]...and an existing MVP for this project."
We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
* Experience with LLM / MCP development.
* Familiarity with the Kubeflow SDK and Trainer codebase.
* Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
* Engage and contribute to Kubeflow community on Slack and GitHub.
The bullet point style uses asterisks (*) while Project 1 uses hyphens (-) for its "Skills Required/Preferred" list (lines 126-130). For consistency within the document, consider using hyphens instead of asterisks.
* Familiarity with the Kubeflow SDK and Trainer codebase.
* Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
* Engage and contribute to Kubeflow community on Slack and GitHub.

---
This horizontal rule separator is inconsistent with the document structure. Project 1 does not have a separator after it, and Project 3 presumably doesn't either. This separator should be removed to maintain consistency with the other project entries.
/lgtm
Please rebase after #4286. The deadline is close. @andreyvelich
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Force-pushed from ef7299d to a50f3ac.
Done, please review @andreyvelich @abhijeet-dhumal @dhanishaphadate
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
The reference format is inconsistent with markdown conventions. The link text "kubeflow/community#936" should be a descriptive text, not a repository reference notation. Consider changing this to match the pattern used elsewhere in the document, such as: "We have an open issue tracking this work and an existing MVP for this project."
We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
We have an [open tracking issue](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
- Integration with Model Registry MCP catalog
- Progress tracking (pending [KEP-937](https://github.com/kubeflow/community/pull/937))

Tracking issue: https://github.com/kubeflow/sdk/issues/238
The tracking issue link format is inconsistent with other projects in this file. Project 2 uses the format [kubeflow/katib#2605](https://github.com/kubeflow/katib/issues/2605) (line 152), but this project uses a plain URL. For consistency, consider formatting this as: Tracking issue: [kubeflow/sdk#238](https://github.com/kubeflow/sdk/issues/238)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich
This PR proposes "MCP for Kubeflow SDK" as a project idea for GSoC 2026.
Related Issue: kubeflow/sdk#238