devjpt23/kubeflow-mcp

kubeflow-mcp

An MCP (Model Context Protocol) server that exposes Kubeflow training operations as tools for AI assistants. Built with FastMCP.

⚠️ Work in Progress: This project is under active development. The current implementation covers the core training tools but has not yet been tested against a live Kubernetes cluster. Additional tools, modules, and tests are still being added. Expect breaking changes.

What It Does

This server allows AI assistants (Claude, Cursor, etc.) to manage distributed model training on Kubernetes through natural conversation. Instead of writing SDK code manually, users describe what they want and the assistant translates that into Kubeflow operations.

Example: "Fine-tune Llama-3.2 on the Alpaca dataset with LoRA" → the assistant calls fine_tune(), which handles all the TrainerClient/TorchTune setup.

Project Structure

kubeflow-mcp/
├── server.py                        # FastMCP server entry point
├── core/
│   └── resources.py                 # Cluster inspection (pods, nodes)
├── clients/
│   └── trainer/
│       ├── __init__.py              # Module metadata
│       └── training.py              # Training tools (fine_tune, custom, container)
└── pyproject.toml

Available Tools

Core

| Tool | Description | Status |
|------|-------------|--------|
| get_cluster_resources() | List pods and nodes in the cluster | ✅ Implemented |
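
As a rough illustration of what a cluster inspection tool like this might return, the sketch below lists pods and nodes through an object that exposes the CoreV1Api methods list_pod_for_all_namespaces() and list_node(). The returned dictionary shape and the FakeCoreApi stand-in are assumptions made for the example, not the contents of core/resources.py.

```python
from types import SimpleNamespace


def get_pods(core_api):
    """Return name, namespace, and phase for every pod the API object lists."""
    pods = core_api.list_pod_for_all_namespaces().items
    return [
        {
            "name": pod.metadata.name,
            "namespace": pod.metadata.namespace,
            "phase": pod.status.phase,
        }
        for pod in pods
    ]


def get_nodes(core_api):
    """Return name and allocatable resources for every node."""
    nodes = core_api.list_node().items
    return [
        {"name": node.metadata.name, "allocatable": dict(node.status.allocatable or {})}
        for node in nodes
    ]


def get_cluster_resources(core_api):
    """Combine both listings into one dictionary (assumed output shape)."""
    return {"pods": get_pods(core_api), "nodes": get_nodes(core_api)}


class FakeCoreApi:
    """Hypothetical stand-in for CoreV1Api so the sketch runs without a cluster."""

    def list_pod_for_all_namespaces(self):
        pod = SimpleNamespace(
            metadata=SimpleNamespace(name="trainer-0", namespace="default"),
            status=SimpleNamespace(phase="Running"),
        )
        return SimpleNamespace(items=[pod])

    def list_node(self):
        node = SimpleNamespace(
            metadata=SimpleNamespace(name="gpu-node-1"),
            status=SimpleNamespace(allocatable={"cpu": "8", "nvidia.com/gpu": "1"}),
        )
        return SimpleNamespace(items=[node])


if __name__ == "__main__":
    print(get_cluster_resources(FakeCoreApi()))
```

Against a real cluster, the same functions could be given kubernetes.client.CoreV1Api() after calling kubernetes.config.load_kube_config(), which reads ~/.kube/config.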

Trainer

| Tool | Description | Status |
|------|-------------|--------|
| fine_tune() | Zero-code fine-tuning of HuggingFace models via TorchTune | ✅ Implemented |
| run_custom_training() | Run user-provided Python training code on K8s | ✅ Implemented |
| run_container_training() | Deploy a pre-built Docker image for training | ✅ Implemented |
| estimate_resources() | Estimate GPU/memory requirements for a model | 🚧 Planned |
| list_training_jobs() | List all training jobs | 🚧 Planned |
| get_training_job() | Get details of a specific job | 🚧 Planned |
| get_training_logs() | Stream logs from a training job | 🚧 Planned |
| delete_training_job() | Delete a training job | 🚧 Planned |

Two-Phase Confirmation

All training tools use a safety pattern to prevent accidental resource consumption:

  1. Preview — Call with confirmed=False (default) → returns a config preview
  2. Execute — After user approval, call with confirmed=True → submits the job
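
The two phases above can be sketched in plain Python. This is a minimal illustration, not the project's code: the submit parameter is a hypothetical callable standing in for the actual job submission, and the keys of the returned dictionaries are assumptions.

```python
def fine_tune(model: str, dataset: str, confirmed: bool = False, submit=None) -> dict:
    """Return a preview unless the caller has explicitly confirmed."""
    # Build the configuration once so the preview and the submitted job match.
    config = {"model": model, "dataset": dataset, "method": "lora"}

    if not confirmed:
        # Phase 1: nothing is submitted; the assistant shows this to the user.
        return {
            "status": "preview",
            "config": config,
            "message": "Call again with confirmed=True to submit this job.",
        }

    # Phase 2: only reached after the user has approved the preview.
    job_name = submit(config)  # hypothetical submission hook
    return {"status": "submitted", "job_name": job_name, "config": config}


if __name__ == "__main__":
    preview = fine_tune("meta-llama/Llama-3.2-1B", "tatsu-lab/alpaca")
    print(preview["status"])

    result = fine_tune(
        "meta-llama/Llama-3.2-1B",
        "tatsu-lab/alpaca",
        confirmed=True,
        submit=lambda cfg: "example-job",  # placeholder instead of a real cluster call
    )
    print(result["status"], result["job_name"])
```

Defaulting confirmed to False means a call that omits the flag can never consume cluster resources; the assistant has to opt in explicitly.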

Requirements

  • Python ≥ 3.10
  • Access to a Kubernetes cluster with Kubeflow installed
  • ~/.kube/config configured

Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| fastmcp | ≥ 3.1.1 | MCP protocol server framework |
| kubeflow | ≥ 0.3.0 | Kubeflow Training SDK |
| kubernetes | ≥ 35.0.0 | Kubernetes Python client |

Setup

# Clone and install with uv
git clone https://github.com/devjpt23/kubeflow-mcp.git
cd kubeflow-mcp
uv sync

# Run the server
uv run server.py

Client Configuration

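Claude Desktop (claude_desktop_config.json, located in ~/Library/Application Support/Claude/ on macOS and %APPDATA%\Claude\ on Windows). If Claude Desktop does not launch the command from the project directory, add the --directory argument shown in the Cursor configuration: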

{
  "mcpServers": {
    "kubeflow": {
      "command": "uv",
      "args": ["run", "server.py"]
    }
  }
}

Cursor (.cursor/mcp.json or Cursor Settings):

{
  "mcpServers": {
    "kubeflow-mcp": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/kubeflow-mcp",
        "run",
        "python",
        "server.py"
      ]
    }
  }
}

What's Implemented

This proof of concept (PoC) currently covers the core training workflow from the design spec (KEP-936 DESIGN.md):

  • Server entry point (server.py) — FastMCP server with tool registration, server-level instructions, and debug logging
  • Cluster inspection (core/resources.py) — get_cluster_resources() with get_pods() and get_nodes() helpers via the Kubernetes Python client
  • 3 training tools (clients/trainer/training.py):
    • fine_tune() — BuiltinTrainer + TorchTune + LoRA config
    • run_custom_training() — AST validation, security checks, importlib-based func_code → Callable bridge
    • run_container_training() — CustomTrainerContainer deployment
  • Two-phase confirmation on all training tools
  • Modular package structure matching the design's core/ + clients/trainer/ layout
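
To make the run_custom_training() item more concrete, here is a minimal sketch of AST validation followed by an importlib-based bridge from a source string to a callable. The deny-list, the "one top-level function" rule, and the names validate_func_code and load_callable are assumptions for illustration; the checks that clients/trainer/training.py actually performs are not described in this README.

```python
import ast
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical deny-list; the project's real security checks may differ.
FORBIDDEN_MODULES = {"subprocess", "socket", "ctypes"}


def validate_func_code(func_code: str) -> str:
    """Parse the source and return the name of its single top-level function."""
    tree = ast.parse(func_code)  # raises SyntaxError on invalid Python

    # Allow only imports and function definitions at module level, so that
    # loading the module does not run arbitrary top-level statements.
    for node in tree.body:
        if not isinstance(node, (ast.FunctionDef, ast.Import, ast.ImportFrom)):
            raise ValueError("only imports and one function are allowed at top level")

    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(funcs) != 1:
        raise ValueError("func_code must define exactly one top-level function")

    # Reject imports of deny-listed modules anywhere in the code.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in FORBIDDEN_MODULES:
                raise ValueError(f"import of '{name}' is not allowed")

    return funcs[0].name


def load_callable(func_code: str):
    """Write validated source to a file and import the function from it."""
    func_name = validate_func_code(func_code)
    # The file is intentionally left on disk so the function's source stays
    # retrievable with inspect.getsource() after loading.
    path = Path(tempfile.mkdtemp()) / "user_training.py"
    path.write_text(func_code)
    spec = importlib.util.spec_from_file_location("user_training", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)


if __name__ == "__main__":
    train = load_callable("def train(lr):\n    return lr * 2\n")
    print(train(0.5))
```

Loading from a file rather than calling exec() on the string gives the resulting function a real source location, which matters for any downstream code that inspects the function's source. A deny-list like this is not a sandbox; it only catches obvious cases.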

Beyond the Design Spec

Several enhancements were added on top of what KEP-936 DESIGN.md specifies:

| Enhancement | What | Why |
|-------------|------|-----|
| Annotated parameter metadata | Every tool parameter uses Annotated[type, "description with examples"] | The DESIGN.md uses bare type annotations (model: str). In MCP, these descriptions become the parameter schema the LLM sees; concrete examples like 'meta-llama/Llama-3.2-1B' help the LLM construct valid calls |
| Rich tool docstrings | Each tool has a detailed description covering purpose and return format | The DESIGN.md has developer-facing docstrings ("Internally calls: TrainerClient.train(...)"). The PoC rewrites these to be LLM-facing, describing what the tool does and what output to expect |
| Debug logging | get_logger("kubeflow-mcp.server.context.to_client") at DEBUG level | Provides visibility into every message sent to the MCP client during development, which is essential for debugging tool responses and LLM behavior |
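
The Annotated pattern from the table can be illustrated with the standard library alone. The sketch below does not use FastMCP; it only shows where the description strings live and how a framework could read them. The parameter descriptions, the dataset ID, and the describe_parameters helper are illustrative assumptions.

```python
from typing import Annotated, get_type_hints


def fine_tune(
    model: Annotated[str, "HuggingFace model ID, e.g. 'meta-llama/Llama-3.2-1B'"],
    dataset: Annotated[str, "HuggingFace dataset ID, e.g. 'tatsu-lab/alpaca'"],
    confirmed: Annotated[bool, "Set to True only after the user approves the preview"] = False,
) -> dict:
    """Preview or submit a LoRA fine-tuning job and return a status dictionary."""
    # Placeholder body; the real tool would build and submit a training job.
    return {"model": model, "dataset": dataset, "confirmed": confirmed}


def describe_parameters(func) -> dict:
    """Collect the first Annotated metadata string for each parameter."""
    # include_extras=True keeps the Annotated wrapper instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    return {
        name: hint.__metadata__[0]
        for name, hint in hints.items()
        if name != "return" and hasattr(hint, "__metadata__")
    }


if __name__ == "__main__":
    for name, description in describe_parameters(fine_tune).items():
        print(f"{name}: {description}")
```

With a bare annotation such as model: str, the same lookup would yield nothing for that parameter, which is the gap the enhancement addresses.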

What's Still Missing

  • Testing — No unit or integration tests yet
  • Discovery tools (list_training_jobs, get_training_job)
  • Monitoring tools (get_training_logs, get_training_events)
  • Lifecycle tools (delete, suspend, resume)
  • estimate_resources() implementation
  • Persona-based tool filtering
  • Authentication / multi-tenant support
  • Error handling improvements in core/resources.py
