Merged
39 changes: 39 additions & 0 deletions ROADMAP.md
@@ -1,5 +1,44 @@
# Kubeflow Trainer ROADMAP

## 2026

- Scheduling & Scalability
- Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
- KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
- Support Multi-Node NVLink (MNNVL) for TrainJob: https://github.com/kubeflow/trainer/issues/3264
- First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
multi-cluster job dispatching, topology-aware scheduling, and other features.
- Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
- MPI and HPC on Kubernetes
Contributor:
This looks great! Flux supports Intel MPI and PMIx.

Contributor:
Most of the issues described in #2751 are not relevant to Flux.

- Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
- IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
- PMIx Investigation with Flux or Slurm plugins
- Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
- Observability & Reliability
- TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
- Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
- TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
- Elastic TrainJobs: https://github.com/kubeflow/trainer/issues/2903
- Add controller-level Prometheus metrics and ServiceMonitor: https://github.com/kubeflow/trainer/issues/3429
- Default Grafana dashboard for Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/3430
- Distributed Data Cache
- Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
- Integration with OptimizationJob
- Explore RDMA with AI Schedulers and Data Cache
Contributor:
RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.

Member Author (@andreyvelich, Feb 24, 2026):
We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.

We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.

We can also explore how our MPI support might be helpful in this context.

- LLM Fine-Tuning Enhancements
- Automatic configuration of GPU requests for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
- Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
- New Kubeflow Trainer Runtimes
- Distributed JAX: https://github.com/kubeflow/trainer/issues/2442
- Distributed XGBoost: https://github.com/kubeflow/trainer/issues/2598
- Tensor Parallelism with Megatron-LM: https://github.com/kubeflow/trainer/issues/3178
- Slurm Runtime Integration: https://github.com/kubeflow/trainer/issues/2249
- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
Contributor:
This one is particularly interesting to me! We are working on agentic, state-machine orchestration, and I have already run a study deploying Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!

- Enhance lifecycle management and mutability of Runtimes: https://github.com/kubeflow/trainer/pull/3428

## 2025

- Kubeflow Trainer v2 general availability: https://github.com/kubeflow/trainer/issues/2170