# feat(docs): Kubeflow Trainer ROADMAP 2026 (#3242)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
# Kubeflow Trainer ROADMAP

## 2026

- Scheduling & Scalability
  - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
  - KAI Scheduler Integration: https://github.com/kubeflow/trainer/issues/2628
  - Support Multi-Node NVLink (MNNVL) for TrainJob: https://github.com/kubeflow/trainer/issues/3264
  - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
    multi-cluster job dispatching, topology-aware scheduling, and other features.
  - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
- MPI and HPC on Kubernetes
  - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
  - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
  - PMIx Investigation with Flux or Slurm plugins
  - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
- Observability & Reliability
  - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
  - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
  - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
  - Elastic TrainJobs: https://github.com/kubeflow/trainer/issues/2903
  - Add controller-level Prometheus metrics and ServiceMonitor: https://github.com/kubeflow/trainer/issues/3429
  - Default Grafana dashboard for Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/3430
- Distributed Data Cache
  - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
  - Integration with OptimizationJob
  - Explore RDMA with AI Schedulers and Data Cache
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy. We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200. We can also explore how our MPI support might be helpful in this context. |
- LLM Fine-Tuning Enhancements
  - Automatic configuration of GPU requests for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
  - Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
- New Kubeflow Trainer Runtimes
  - Distributed JAX: https://github.com/kubeflow/trainer/issues/2442
  - Distributed XGBoost: https://github.com/kubeflow/trainer/issues/2598
  - Tensor Parallelism with Megatron-LM: https://github.com/kubeflow/trainer/issues/3178
  - Slurm Runtime Integration: https://github.com/kubeflow/trainer/issues/2249
- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
  frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
> **Contributor:** This one is specifically interesting to me! We are working on agentic, state-machine orchestration, and I already ran a study to deploy Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!
- Enhance lifecycle management and mutability of Runtimes: https://github.com/kubeflow/trainer/pull/3428

## 2025

- Kubeflow Trainer v2 general availability: https://github.com/kubeflow/trainer/issues/2170
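As a concrete illustration of the Kueue integration item above, the linked Kueue documentation describes submitting TrainJobs through a LocalQueue via a queue-name label. The sketch below is not part of this PR; the queue and job names are illustrative, and the runtime name assumes one of the runtimes shipped with Kubeflow Trainer v2:

```yaml
# Minimal sketch (assumed names): a TrainJob admitted through a Kueue LocalQueue.
# Kueue holds the job until quota in the queue's ClusterQueue is available.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-example          # illustrative name
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # assumed LocalQueue in this namespace
spec:
  runtimeRef:
    name: torch-distributed      # a ClusterTrainingRuntime installed with Trainer
```

Topology-aware scheduling and multi-cluster dispatching, as listed in the roadmap, would layer on top of this same admission flow.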
> **Contributor:** This looks great! Flux supports Intel MPI and PMIx.
>
> **Member:** Most of the issues stated in #2751 are not relevant to Flux.