kubeflow-mcp

An MCP (Model Context Protocol) server that exposes Kubeflow training operations as tools for AI assistants. Built with FastMCP.

⚠️ Work in Progress: This project is under active development. The current implementation covers the core training tools but has not yet been tested against a live Kubernetes cluster. Additional tools, modules, and tests are still being added, so expect breaking changes.

What It Does

This server allows AI assistants (Claude, Cursor, etc.) to manage distributed model training on Kubernetes through natural conversation. Instead of writing SDK code manually, users describe what they want and the assistant translates that into Kubeflow operations.

Example: "Fine-tune Llama-3.2 on the Alpaca dataset with LoRA" → the assistant calls fine_tune(), which handles the TrainerClient and TorchTune setup.

Project Structure

kubeflow-mcp/
├── server.py                        # FastMCP server entry point
├── core/
│   └── resources.py                 # Cluster inspection (pods, nodes)
├── clients/
│   └── trainer/
│       ├── __init__.py              # Module metadata
│       └── training.py              # Training tools (fine_tune, custom, container)
└── pyproject.toml

Available Tools

Core

Tool	Description	Status
`get_cluster_resources()`	List pods and nodes in the cluster	✅ Implemented
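
As a rough illustration of what listing pods and nodes can look like with the Kubernetes Python client, here is a minimal sketch. The function names (summarize_pods, summarize_nodes, get_cluster_resources_sketch) and the returned fields are assumptions made for this example, not the code in core/resources.py. The client calls used (config.load_kube_config, CoreV1Api.list_pod_for_all_namespaces, CoreV1Api.list_node) are part of the kubernetes package.

```python
from types import SimpleNamespace


def summarize_pods(pods):
    """Reduce pod objects to name, namespace, and phase."""
    return [
        {
            "name": pod.metadata.name,
            "namespace": pod.metadata.namespace,
            "phase": pod.status.phase,
        }
        for pod in pods
    ]


def summarize_nodes(nodes):
    """Reduce node objects to name and allocatable resources."""
    return [
        {
            "name": node.metadata.name,
            "allocatable": dict(node.status.allocatable or {}),
        }
        for node in nodes
    ]


def get_cluster_resources_sketch():
    """Query the cluster through the local kubeconfig (needs cluster access)."""
    # Imported lazily so the summarizers above run without the package installed.
    from kubernetes import client, config

    config.load_kube_config()  # reads ~/.kube/config by default
    v1 = client.CoreV1Api()
    return {
        "pods": summarize_pods(v1.list_pod_for_all_namespaces().items),
        "nodes": summarize_nodes(v1.list_node().items),
    }


# Stand-in object so the summarizer can be tried without a cluster.
fake_pod = SimpleNamespace(
    metadata=SimpleNamespace(name="trainer-0", namespace="kubeflow"),
    status=SimpleNamespace(phase="Running"),
)
print(summarize_pods([fake_pod]))
```

Keeping the summarizing step separate from the API calls makes it possible to unit-test the output shape without a cluster.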

Trainer

Tool	Description	Status
`fine_tune()`	Zero-code fine-tuning of HuggingFace models via TorchTune	✅ Implemented
`run_custom_training()`	Run user-provided Python training code on K8s	✅ Implemented
`run_container_training()`	Deploy a pre-built Docker image for training	✅ Implemented
`estimate_resources()`	Estimate GPU/memory requirements for a model	🚧 Planned
`list_training_jobs()`	List all training jobs	🚧 Planned
`get_training_job()`	Get details of a specific job	🚧 Planned
`get_training_logs()`	Stream logs from a training job	🚧 Planned
`delete_training_job()`	Delete a training job	🚧 Planned

Two-Phase Confirmation

All training tools use a safety pattern to prevent accidental resource consumption:

1. Preview: call the tool with confirmed=False (the default). It returns a preview of the job configuration and does not submit anything.
2. Execute: after the user approves the preview, call the tool again with confirmed=True to submit the job.

Requirements

Python ≥ 3.10
Access to a Kubernetes cluster with Kubeflow installed
A kubeconfig at ~/.kube/config that grants access to that cluster

Dependencies

Package	Version	Purpose
`fastmcp`	≥ 3.1.1	MCP protocol server framework
`kubeflow`	≥ 0.3.0	Kubeflow Training SDK
`kubernetes`	≥ 35.0.0	Kubernetes Python client

Setup

# From your local clone of the repository, install dependencies with uv
cd kubeflow-mcp
uv sync

# Run the server
uv run server.py

Client Configuration

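Claude Desktop (claude_desktop_config.json; on macOS it lives at ~/Library/Application Support/Claude/claude_desktop_config.json, on Windows at %APPDATA%\Claude\claude_desktop_config.json):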

{
  "mcpServers": {
    "kubeflow": {
      "command": "uv",
      "args": ["--directory", "/path/to/kubeflow-mcp", "run", "server.py"]
    }
  }
}

Cursor (.cursor/mcp.json or Cursor Settings):

{
  "mcpServers": {
    "kubeflow-mcp": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/kubeflow-mcp",
        "run",
        "python",
        "server.py"
      ]
    }
  }
}

What's Implemented

The PoC currently covers the core training workflow from the design spec:

Server entry point (server.py) — FastMCP server with tool registration, server-level instructions, and debug logging
Cluster inspection (core/resources.py) — get_cluster_resources() with get_pods() and get_nodes() helpers via the Kubernetes Python client
Three training tools (clients/trainer/training.py):
- fine_tune() — BuiltinTrainer + TorchTune + LoRA config
- run_custom_training() — AST validation, security checks, importlib-based func_code → Callable bridge
- run_container_training() — CustomTrainerContainer deployment
Two-phase confirmation on all training tools
Modular package structure matching the design's core/ + clients/trainer/ layout
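
To make the run_custom_training() item above more concrete, here is a minimal sketch of an AST check followed by an importlib-based bridge from source text to a callable. The specific rules (exactly one top-level function, a small deny list of imports) and the names validate_func_code, load_callable, and BLOCKED_MODULES are assumptions for illustration; the actual checks in clients/trainer/training.py may differ.

```python
import ast
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical deny list, chosen only for this example.
BLOCKED_MODULES = {"subprocess", "socket"}


def validate_func_code(func_code: str) -> str:
    """Return the name of the single top-level function, or raise ValueError."""
    try:
        tree = ast.parse(func_code)
    except SyntaxError as exc:
        raise ValueError(f"invalid Python: {exc}") from exc

    functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(functions) != 1:
        raise ValueError("expected exactly one top-level function")

    # Walk the whole tree so imports nested inside the function are seen too.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in BLOCKED_MODULES:
                raise ValueError(f"import of '{name}' is not allowed")

    return functions[0].name


def load_callable(func_code: str):
    """Validate the source, write it to a module file, and import the function."""
    func_name = validate_func_code(func_code)
    module_path = Path(tempfile.mkdtemp()) / "user_train.py"
    module_path.write_text(func_code)

    spec = importlib.util.spec_from_file_location("user_train", module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the module's top-level code
    return getattr(module, func_name)


train = load_callable("def train():\n    return 'ok'\n")
print(train())
```

Loading from a real file, rather than calling exec() on a string, keeps the function's source retrievable through inspect.getsource(). A static deny list like this one is easy to bypass and is not a sandbox, so it should be treated as a first filter only.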

Beyond the Design Spec

Several enhancements were added on top of what KEP-936 DESIGN.md specifies:

Enhancement	What	Why
`Annotated` parameter metadata	Every tool parameter uses `Annotated[type, "description with examples"]`	The DESIGN.md uses bare type annotations (`model: str`). In MCP, these descriptions become the parameter schema the LLM sees — concrete examples like `'meta-llama/Llama-3.2-1B'` help the LLM construct valid calls
Rich tool docstrings	Each tool has a detailed description covering purpose and return format	The DESIGN.md has developer-facing docstrings ("Internally calls: TrainerClient.train(...)"). The PoC rewrites these to be LLM-facing — describing what the tool does and what output to expect
Debug logging	`get_logger("kubeflow-mcp.server.context.to_client")` at DEBUG level	Provides visibility into every message sent to the MCP client during development — essential for debugging tool responses and LLM behavior
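
The Annotated pattern from the first row can be illustrated with the standard library alone. In this sketch, fine_tune_signature_sketch and parameter_descriptions are hypothetical names, and the description strings are examples. How FastMCP converts this metadata into the tool schema is not shown here.

```python
from typing import Annotated, get_type_hints


def fine_tune_signature_sketch(
    model: Annotated[str, "HuggingFace model ID, e.g. 'meta-llama/Llama-3.2-1B'"],
    confirmed: Annotated[bool, "Set to True only after the user approves the preview"] = False,
) -> dict:
    """Preview or submit a fine-tuning job (body omitted in this sketch)."""
    return {"model": model, "confirmed": confirmed}


def parameter_descriptions(func) -> dict[str, str]:
    """Collect the first Annotated metadata entry for each parameter."""
    hints = get_type_hints(func, include_extras=True)
    return {
        name: hint.__metadata__[0]
        for name, hint in hints.items()
        if name != "return" and hasattr(hint, "__metadata__")
    }


descriptions = parameter_descriptions(fine_tune_signature_sketch)
for name, text in descriptions.items():
    print(f"{name}: {text}")
```

With bare annotations such as model: str, the same lookup would return an empty dictionary, which is the gap the enhancement addresses.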

What's Still Missing

Testing — No unit or integration tests yet
Discovery tools (list_training_jobs, get_training_job)
Monitoring tools (get_training_logs, get_training_events)
Lifecycle tools (delete, suspend, resume)
estimate_resources() implementation
Persona-based tool filtering
Authentication / multi-tenant support
Error handling improvements in core/resources.py
