An MCP (Model Context Protocol) server that exposes Kubeflow training operations as tools for AI assistants. Built with FastMCP.
⚠️ Work in Progress — This project is under active development. The current implementation covers core training tools but is not yet tested against a live Kubernetes cluster. Additional tools, modules, and tests are still being added. Expect breaking changes.
This server allows AI assistants (Claude, Cursor, etc.) to manage distributed model training on Kubernetes through natural conversation. Instead of writing SDK code manually, users describe what they want and the assistant translates that into Kubeflow operations.
Example: "Fine-tune Llama-3.2 on the Alpaca dataset with LoRA" → the assistant calls `fine_tune()`, which handles all the TrainerClient/TorchTune setup.
kubeflow-mcp/
├── server.py              # FastMCP server entry point
├── core/
│   └── resources.py       # Cluster inspection (pods, nodes)
├── clients/
│   └── trainer/
│       ├── __init__.py    # Module metadata
│       └── training.py    # Training tools (fine_tune, custom, container)
└── pyproject.toml
| Tool | Description | Status |
|---|---|---|
| `get_cluster_resources()` | List pods and nodes in the cluster | ✅ Implemented |
| Tool | Description | Status |
|---|---|---|
| `fine_tune()` | Zero-code fine-tuning of HuggingFace models via TorchTune | ✅ Implemented |
| `run_custom_training()` | Run user-provided Python training code on K8s | ✅ Implemented |
| `run_container_training()` | Deploy a pre-built Docker image for training | ✅ Implemented |
| `estimate_resources()` | Estimate GPU/memory requirements for a model | 🚧 Planned |
| `list_training_jobs()` | List all training jobs | 🚧 Planned |
| `get_training_job()` | Get details of a specific job | 🚧 Planned |
| `get_training_logs()` | Stream logs from a training job | 🚧 Planned |
| `delete_training_job()` | Delete a training job | 🚧 Planned |
All training tools use a two-phase safety pattern to prevent accidental resource consumption:
- Preview: call with `confirmed=False` (default) → returns a config preview
- Execute: after user approval, call with `confirmed=True` → submits the job
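The preview/execute split can be sketched in plain Python. This is a minimal illustration, not the project's code: `fine_tune_sketch`, its `submit` callable, and the shape of the returned dictionaries are all hypothetical, and the real tools may structure their previews differently.

```python
from typing import Any, Callable, Optional


def fine_tune_sketch(
    model: str,
    dataset: str,
    confirmed: bool = False,
    submit: Optional[Callable[[dict], str]] = None,
) -> dict[str, Any]:
    """Hypothetical tool body showing the two-phase confirmation flow."""
    config = {"model": model, "dataset": dataset}

    if not confirmed:
        # Phase 1: nothing is submitted; the caller only sees the planned config.
        return {
            "status": "preview",
            "config": config,
            "message": "Call again with confirmed=True to submit this job.",
        }

    if submit is None:
        raise ValueError("a submit callable is required when confirmed=True")

    # Phase 2: the user approved the preview, so the job is handed to the backend.
    job_name = submit(config)
    return {"status": "submitted", "job_name": job_name, "config": config}
```

Passing the submission step in as a callable keeps the sketch runnable without a cluster; in a real tool that step would be the SDK call that creates the training job.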
- Python ≥ 3.10
- Access to a Kubernetes cluster with Kubeflow installed
- `~/.kube/config` configured
| Package | Version | Purpose |
|---|---|---|
| `fastmcp` | ≥ 3.1.1 | MCP protocol server framework |
| `kubeflow` | ≥ 0.3.0 | Kubeflow Training SDK |
| `kubernetes` | ≥ 35.0.0 | Kubernetes Python client |
# Clone and install with uv
cd kubeflow-mcp
uv sync
# Run the server
uv run server.py

Claude Desktop (`~/.claude/claude_desktop_config.json`):
{
"mcpServers": {
"kubeflow": {
"command": "uv",
"args": ["run", "server.py"]
}
}
}

Cursor (`.cursor/mcp.json` or Cursor Settings):
{
"mcpServers": {
"kubeflow-mcp": {
"command": "uv",
"args": [
"--directory",
"path to the project ",
"run",
"python",
"server.py"
]
}
}
}

The PoC currently covers the core training workflow from the design spec:
- Server entry point (`server.py`): FastMCP server with tool registration, server-level instructions, and debug logging
- Cluster inspection (`core/resources.py`): `get_cluster_resources()` with `get_pods()` and `get_nodes()` helpers via the Kubernetes Python client
- 3 training tools (`clients/trainer/training.py`):
  - `fine_tune()`: BuiltinTrainer + TorchTune + LoRA config
  - `run_custom_training()`: AST validation, security checks, importlib-based func_code → Callable bridge
  - `run_container_training()`: CustomTrainerContainer deployment
- Two-phase confirmation on all training tools
- Modular package structure matching the design's `core/` + `clients/trainer/` layout
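The "AST validation, importlib-based func_code → Callable bridge" step of `run_custom_training()` might look roughly like the sketch below. Everything in it is an assumption made for illustration: the deny-list, the helper names `validate_func_code` and `load_callable`, and the module name `user_training` are hypothetical, and the project's actual security checks are not described in this README.

```python
import ast
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable

# Illustrative deny-list only; the real checks may be stricter or different.
FORBIDDEN_MODULES = {"os", "subprocess", "socket"}


def validate_func_code(func_code: str, func_name: str) -> None:
    """Raise ValueError on a syntax error, a denied import, or a missing function."""
    try:
        tree = ast.parse(func_code)
    except SyntaxError as exc:
        raise ValueError(f"invalid Python: {exc}") from exc

    # Walk the whole tree so imports nested inside functions are also caught.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            roots = [(node.module or "").split(".")[0]]
        else:
            continue
        for root in roots:
            if root in FORBIDDEN_MODULES:
                raise ValueError(f"import of '{root}' is not allowed")

    # The entry point must be defined at module level.
    top_level = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    if func_name not in top_level:
        raise ValueError(f"function '{func_name}' not found at module level")


def load_callable(func_code: str, func_name: str) -> Callable:
    """Validate the code, write it to a real file, and import the function from it."""
    validate_func_code(func_code, func_name)
    # The file is intentionally left on disk so that anything which later reads
    # the function's source from its file (e.g. inspect.getsource) still can.
    path = Path(tempfile.mkdtemp()) / "user_training.py"
    path.write_text(func_code)
    spec = importlib.util.spec_from_file_location("user_training", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the module's top-level code
    return getattr(module, func_name)
```

A static deny-list like this only catches obvious cases and is not a sandbox; it is shown to illustrate the shape of the bridge, not as sufficient protection for untrusted code.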
Several enhancements were added on top of what KEP-936 DESIGN.md specifies:
| Enhancement | What | Why |
|---|---|---|
| Annotated parameter metadata | Every tool parameter uses `Annotated[type, "description with examples"]` | The DESIGN.md uses bare type annotations (`model: str`). In MCP, these descriptions become the parameter schema the LLM sees; concrete examples like `'meta-llama/Llama-3.2-1B'` help the LLM construct valid calls |
| Rich tool docstrings | Each tool has a detailed description covering purpose and return format | The DESIGN.md has developer-facing docstrings ("Internally calls: TrainerClient.train(...)"). The PoC rewrites these to be LLM-facing, describing what the tool does and what output to expect |
| Debug logging | `get_logger("kubeflow-mcp.server.context.to_client")` at DEBUG level | Provides visibility into every message sent to the MCP client during development, which is essential for debugging tool responses and LLM behavior |
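To make the "Annotated parameter metadata" row concrete, the sketch below attaches descriptions to parameters with `typing.Annotated` and reads them back using only the standard library. The function `fine_tune_signature_sketch` and the helper `describe_parameters` are hypothetical; how FastMCP itself turns such metadata into a tool schema is not shown here.

```python
from typing import Annotated, get_type_hints


def fine_tune_signature_sketch(
    model: Annotated[str, "HuggingFace model ID, e.g. 'meta-llama/Llama-3.2-1B'"],
    confirmed: Annotated[bool, "False returns a preview; True submits the job"] = False,
) -> dict:
    """Hypothetical signature; only the parameter annotations matter here."""
    return {"model": model, "confirmed": confirmed}


def describe_parameters(func) -> dict[str, str]:
    """Collect the first string attached to each Annotated parameter."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    descriptions = {}
    for name, hint in hints.items():
        if name == "return":
            continue
        metadata = getattr(hint, "__metadata__", ())
        text = next((m for m in metadata if isinstance(m, str)), None)
        if text is not None:
            descriptions[name] = text
    return descriptions
```

Calling `describe_parameters(fine_tune_signature_sketch)` yields one description per parameter, which is the kind of text a bare `model: str` annotation cannot carry.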
- Testing: no unit or integration tests yet
- Discovery tools (`list_training_jobs`, `get_training_job`)
- Monitoring tools (`get_training_logs`, `get_training_events`)
- Lifecycle tools (`delete`, `suspend`, `resume`)
- `estimate_resources()` implementation
- Persona-based tool filtering
- Authentication / multi-tenant support
- Error handling improvements in `core/resources.py`
- KEP-936 Design Spec — Full design document for this project
- FastMCP Documentation — MCP server framework
- Kubeflow Training SDK — Underlying SDK