gsoc: add kubeflow sdk mcp as gsoc 2026 project idea#4290

Merged
google-oss-prow[bot] merged 1 commit into kubeflow:master from jaiakash:gsoc26-sdk-mcp
Feb 2, 2026
@jaiakash
Member

This PR proposes "MCP for Kubeflow SDK" as a project idea for GSoC 2026.

Related Issue: kubeflow/sdk#238

Copilot AI review requested due to automatic review settings January 26, 2026 14:24
@google-oss-prow google-oss-prow bot added the area/community AREA: Community Docs label Jan 26, 2026
@jaiakash
Member Author

/area gsoc
/area llm
/area sdk

@google-oss-prow google-oss-prow bot added the area/gsoc AREA: Google Summer of Code label Jan 26, 2026
@google-oss-prow

@jaiakash: The label(s) area/llm, area/sdk cannot be applied, because the repository doesn't have them.

In response to this:

/area gsoc
/area llm
/area sdk

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot AI left a comment

Pull request overview

This PR proposes adding "MCP Server for Kubeflow SDK" as a project idea for Google Summer of Code 2026. The project aims to create a Model Context Protocol (MCP) server to enhance developer experience by enabling LLM-powered interaction with Kubeflow SDK training jobs, including status monitoring, debugging, and natural language interaction capabilities.

Changes:

  • Added a new GSoC 2026 project proposal for implementing an MCP server for Kubeflow SDK
  • Defined project scope including training log exposure, job management CRUD operations, LLM-enhanced validation, and telemetry analysis
  • Specified project difficulty as Medium with either 175 or 350 hours commitment

Member

@andreyvelich left a comment

/cc @akshaychitneni @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-trainer-team

@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, kubeflow/kubeflow-trainer-team.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @akshaychitneni @kubeflow/kubeflow-sdk-team @kubeflow/kubeflow-trainer-team

@andreyvelich
Member

cc @franciscojavierarceo

@dhanishaphadate
Contributor

dhanishaphadate commented Jan 26, 2026

@jaiakash I would like to co-mentor this project with you.

@abhijeet-dhumal
Member

Hi @jaiakash, may I opt in to co-mentor this effort?


**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)

**Mentors:** [TBD], [@jaiakash](https://www.github.com/jaiakash)
Member

Please add @dhanishaphadate and @abhijeet-dhumal as co-mentors.

Member Author

Hi @dhanishaphadate @abhijeet-dhumal, thanks for volunteering; I have added you both as co-mentors. Looking forward to a successful journey ahead.


Most of us do use LLMs to create/debug code for jobs, models, etc., but as of now there is no mechanism for the LLM to see the `TrainJob` status, debug a crash loop or provide consolidate metric about our previous task. We want to extend and improve the Developer Experience (DX) with the introduction of a **Model Context Protocol (MCP)** server for Kubeflow ecosystem.

This project aims to initiate the Kubeflow MCP project with first-principles features. The MCP server will help with training jobs, their lifecycle, and associated observability data, enabling natural-language interaction through MCP-compatible clients. Tracking issue: [kubeflow/sdk#238](https://github.com/kubeflow/sdk/issues/238)
Member

You need to update the scope for this project.

Member Author

Hi @abhijeet-dhumal @dhanishaphadate, please review the scope; for now I have kept it open-ended. We need some good ideas for the initial draft before merging.

We REALLY need to merge it ASAP, but we can always refine the exact scope and features during the next phase of the GSoC timeline.

Member

Thanks for the feedback @andreyvelich! Here's a suggested scope based on the existing MVP and KEP-936:

Core Deliverables (Required):

  • MCP tools for TrainJob lifecycle: fine_tune(), get_training_job(), list_training_jobs(), delete_training_job()
  • Pre-flight validation: get_cluster_resources(), estimate_resources(), check_training_prerequisites()
  • Job observability: get_training_logs(), get_job_events()
  • Storage setup: setup_training_storage() (PVC creation for checkpoints)

Stretch Goals:

  • Policy-based access control (persona-based RBAC)
  • Custom trainer support via run_custom_training() and run_container_job()
  • Integration with Model Registry MCP catalog
  • Checkpoint listing (requires spawning helper pods)
  • Progress tracking (KEP-2779)

@jaiakash let me know if this aligns with your vision for the project scope.
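
To make the lifecycle tools in the list above more concrete, here is a minimal sketch of how they might be registered and dispatched by name. Only the tool names (`list_training_jobs`, `get_training_job`, `delete_training_job`) come from the proposed scope; the `tool` decorator, the `call_tool` dispatcher, the in-memory `_JOBS` store, and all signatures and return shapes are assumptions for illustration. It deliberately uses neither the Kubeflow SDK nor an MCP library.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: maps a tool name to the function a client request would invoke.
TOOLS: Dict[str, Callable[..., Any]] = {}

# Hypothetical in-memory stand-in for the cluster; a real server would query Kubeflow.
_JOBS: Dict[str, Dict[str, str]] = {
    "bert-finetune": {"name": "bert-finetune", "status": "Running"},
    "llama-sft": {"name": "llama-sft", "status": "Failed"},
}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register fn under its own name so it can be dispatched by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def list_training_jobs() -> List[Dict[str, str]]:
    """Return copies of all known jobs, sorted by name for stable output."""
    return [dict(_JOBS[name]) for name in sorted(_JOBS)]


@tool
def get_training_job(name: str) -> Dict[str, str]:
    """Return one job, or a structured error instead of raising."""
    job = _JOBS.get(name)
    if job is None:
        return {"error": f"TrainJob '{name}' not found"}
    return dict(job)


@tool
def delete_training_job(name: str) -> Dict[str, Any]:
    """Delete a job and report whether anything was actually removed."""
    return {"name": name, "deleted": _JOBS.pop(name, None) is not None}


def call_tool(tool_name: str, **arguments: Any) -> Any:
    """Dispatch a tool call by name and keyword arguments."""
    if tool_name not in TOOLS:
        return {"error": f"unknown tool '{tool_name}'"}
    return TOOLS[tool_name](**arguments)
```

Returning structured errors rather than raising is one possible design choice here: it gives an LLM client something it can read and act on, which fits the "improve error handling" goal discussed in this thread.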

Member Author

Looks good to me, will add in PR.

Member

Thanks @jaiakash 🙌

Contributor

Looks good to me as well.

Contributor

Job observability: get_training_logs(), get_job_events()

As we have added this, I can scope the #4294 project.


**Contributor:**

**Details:**
Member

Small suggestion for the Details section - we could add more context about what already exists:

We already have a KEP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components.

This sets clearer expectations that contributors will be building on existing work rather than starting from scratch.

But not a strong opinion; this can be skipped if not needed.

Member Author

Yeah, can you share a draft for that? It would be helpful.

Member

Does this draft look good?


Details:

The Kubeflow SDK allows users with limited Kubernetes knowledge to use standard Python APIs to interact with the Kubeflow ecosystem. Documentation(https://sdk.kubeflow.org/en/latest/index.html)

Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

We have a KEP(https://github.com/kubeflow/community/issues/936) and existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.

Core Deliverables:

- MCP tools for TrainJob lifecycle (fine_tune, get_training_job, list_training_jobs, delete_training_job)
- Pre-flight validation (get_cluster_resources, estimate_resources, check_training_prerequisites)
- Job observability (get_training_logs, get_job_events)
- Storage setup (setup_training_storage)

Stretch Goals:
- Policy-based access control (persona-based RBAC)
- Custom trainer support (run_custom_training, run_container_job)
- Integration with Model Registry MCP catalog
- Progress tracking (pending KEP-2779)

Tracking issue: https://github.com/kubeflow/sdk/issues/238
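
As a rough illustration of the pre-flight validation deliverable in the draft above, the sketch below shows one way a `check_training_prerequisites` tool could compare a request against free capacity. Only the tool name comes from the draft; the `Resources` shape, its field names, the signature, and the returned dictionary are assumptions, and a real implementation would read capacity from the cluster rather than take it as an argument.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Resources:
    """Hypothetical resource shape; the field names are assumptions."""
    gpus: int
    memory_gib: int


def check_training_prerequisites(requested: Resources, available: Resources) -> Dict[str, object]:
    """Compare a request against free capacity and list every shortfall."""
    problems: List[str] = []
    if requested.gpus > available.gpus:
        problems.append(f"need {requested.gpus} GPUs, only {available.gpus} available")
    if requested.memory_gib > available.memory_gib:
        problems.append(
            f"need {requested.memory_gib} GiB memory, only {available.memory_gib} GiB available"
        )
    # "ok" is True only when no shortfall was recorded.
    return {"ok": not problems, "problems": problems}
```

Collecting all shortfalls instead of stopping at the first lets a client report everything the user has to fix in one round trip.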

Member Author

Looks good to me, @dhanishaphadate what about you?

Contributor

Yes, adding details about what currently exists is a good idea. Once the initial work is complete and the foundation is in place, we can add those details later when the project is published on the website.

@abhijeet-dhumal
Member

Thank you @jaiakash and @dhanishaphadate !! Excited to work on this together! 🎉

@jaiakash
Member Author

cc @andreyvelich @abhijeet-dhumal @dhanishaphadate
PTAL. We can merge this now.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


### Project 2: MCP Server for Kubeflow SDK

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)
Copilot AI Jan 30, 2026

The URLs use https://www.github.com/ but the majority of GitHub links in this file use https://github.com/ without the "www" subdomain (e.g., lines 79, 116, 146, 159, 161, 178, 192, 194). For consistency with the established convention in this file, remove "www." from these URLs.
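
If it helps, the change suggested above could be applied mechanically. This is a minimal sketch using only the standard library; the helper name `strip_www` is hypothetical and not part of any existing tooling in the repository.

```python
import re

# Matches only the "www." form of GitHub links; other hosts are left untouched.
_WWW_GITHUB = re.compile(r"https://www\.github\.com/")


def strip_www(markdown: str) -> str:
    """Rewrite https://www.github.com/ links to https://github.com/."""
    return _WWW_GITHUB.sub("https://github.com/", markdown)
```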

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk), [kubeflow/trainer](https://www.github.com/kubeflow/trainer)

**Mentors:** [@jaiakash](https://www.github.com/jaiakash), [@dhanishaphadate](https://www.github.com/dhanishaphadate), [@abhijeet-dhumal](https://www.github.com/abhijeet-dhumal)
Copilot AI Jan 30, 2026

The mentor GitHub URLs use https://www.github.com/ but should use https://github.com/ without the "www" subdomain for consistency with other mentor links (see line 178 in Project 3).

Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
Copilot AI Jan 30, 2026

Missing article before "existing". The sentence should read "We have a [kubeflow/community#936]...and an existing MVP for this project."

Suggested change
- We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
+ We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.

Comment on lines +168 to +171
* Experience with LLM / MCP development.
* Familiarity with the Kubeflow SDK and Trainer codebase.
* Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
* Engage and contribute to Kubeflow community on Slack and GitHub.
Copilot AI Jan 30, 2026

The bullet point style uses asterisks (*) while Project 1 uses hyphens (-) for its "Skills Required/Preferred" list (lines 126-130). For consistency within the document, consider using hyphens instead of asterisks.

* Familiarity with the Kubeflow SDK and Trainer codebase.
* Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
* Engage and contribute to Kubeflow community on Slack and GitHub.
---
Copilot AI Jan 30, 2026

This horizontal rule separator is inconsistent with the document structure. Project 1 does not have a separator after it, and Project 3 presumably doesn't either. This separator should be removed to maintain consistency with the other project entries.

@abhijeet-dhumal
Member

/lgtm
Nitpick: some of the Copilot review comments seem legit, wdyt @jaiakash? 👀

@juliusvonkohout
Member

Please rebase after #4286. The deadline is close, @andreyvelich.

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
@jaiakash
Member Author

jaiakash commented Feb 2, 2026

Please rebase after #4286. The deadline is close, @andreyvelich.

Done. Please review, @andreyvelich @abhijeet-dhumal @dhanishaphadate.

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


Most of us use LLMs to create/debug code for jobs, models, etc., but currently there is no mechanism for the LLM to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to extend and improve the Developer Experience (DX) with a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
Copilot AI Feb 2, 2026

The reference format is inconsistent with markdown conventions. The link text "kubeflow/community#936" should be a descriptive text, not a repository reference notation. Consider changing this to match the pattern used elsewhere in the document, such as: "We have an open issue tracking this work and an existing MVP for this project."

Suggested change
- We have a [kubeflow/community#936](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.
+ We have an [open tracking issue](https://github.com/kubeflow/community/issues/936) and an existing MVP for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components like Model Registry.

- Integration with Model Registry MCP catalog
- Progress tracking (pending [KEP-937](https://github.com/kubeflow/community/pull/937))

Tracking issue: https://github.com/kubeflow/sdk/issues/238
Copilot AI Feb 2, 2026

The tracking issue link format is inconsistent with other projects in this file. Project 2 uses the format [kubeflow/katib#2605](https://github.com/kubeflow/katib/issues/2605) (line 152), but this project uses a plain URL. For consistency, consider formatting this as: Tracking issue: [kubeflow/sdk#238](https://github.com/kubeflow/sdk/issues/238)
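
For reference, the shorthand link format described above can be derived from a plain issue URL. The sketch below is a small standard-library illustration; the helper name `issue_link` is hypothetical, and it only handles `https://github.com/<owner>/<repo>/issues/<number>` URLs, returning anything else unchanged.

```python
import re

# Captures owner, repository, and issue number from a plain GitHub issue URL.
_ISSUE_URL = re.compile(r"https://github\.com/([\w.-]+)/([\w.-]+)/issues/(\d+)")


def issue_link(url: str) -> str:
    """Format an issue URL as a Markdown link with owner/repo#number text."""
    match = _ISSUE_URL.fullmatch(url)
    if match is None:
        return url  # Not an issue URL in the expected shape; leave it as-is.
    owner, repo, number = match.groups()
    return f"[{owner}/{repo}#{number}]({url})"
```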

Member

@andreyvelich left a comment

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 168a82a into kubeflow:master Feb 2, 2026
12 of 13 checks passed
Shekharrajak pushed a commit to Shekharrajak/website that referenced this pull request Feb 2, 2026
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
ederign pushed a commit to ederign/website that referenced this pull request Feb 2, 2026
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Labels

approved area/community AREA: Community Docs area/gsoc AREA: Google Summer of Code lgtm size/M

6 participants