From 09ef71eb923ea7dd3399bd11aa561da4397cf844 Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Mon, 23 Feb 2026 23:52:05 +0000
Subject: [PATCH 1/7] feat(docs): Kubeflow Trainer ROADMAP 2026

Signed-off-by: Andrey Velichkevich
---
 ROADMAP.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/ROADMAP.md b/ROADMAP.md
index dec8321a3f..53bbbc8b1b 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,5 +1,40 @@
 # Kubeflow Trainer ROADMAP
 
+## 2026
+
+- Distributed AI Scheduling Enhancements
+  - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
+  - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
+  - Enhanced Multi-Node NVLink Support
+  - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
+    multi-cluster job dispatching, topology-aware scheduling, and other features.
+- MPI and HPC on Kubernetes
+  - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
+  - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
+  - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
+  - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
+- Observability and Reliability
+  - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
+  - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
+  - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
+  - Elastic TrainJobs: https://github.com/kubeflow/trainer/issues/2903
+- Distributed Data Cache
+  - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
+  - Integration with OptimizationJob
+  - Explore RDMA with AI Schedulers and Data Cache
+- LLM Fine-Tuning Enhancements
+  - Predictive GPU Capacity Planning for TrainJobs
+  - Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
+- New Kubeflow Trainer Runtimes
+  - Distributed JAX: https://github.com/kubeflow/trainer/issues/2442
+  - Distributed XGBoost: https://github.com/kubeflow/trainer/issues/2598
+  - Tensor Parallelism with Megatron-LM: https://github.com/kubeflow/trainer/issues/3178
+  - Slurm Runtime Integration: https://github.com/kubeflow/trainer/issues/2249
+- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
+  frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
+- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
+- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
+
 ## 2025
 
 - Kubeflow Trainer v2 general availability: https://github.com/kubeflow/trainer/issues/2170

From 2afb5c5d1e922d1ad001bf195ac66474371244d5 Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Tue, 24 Feb 2026 19:05:58 +0000
Subject: [PATCH 2/7] Add Scalability feature

Signed-off-by: Andrey Velichkevich
---
 ROADMAP.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index 53bbbc8b1b..7cb5dfb6ed 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -2,18 +2,19 @@
 
 ## 2026
 
-- Distributed AI Scheduling Enhancements
+- Scheduling & Scalability
   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
   - Enhanced Multi-Node NVLink Support
   - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
     multi-cluster job dispatching, topology-aware scheduling, and other features.
+  - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
 - MPI and HPC on Kubernetes
   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
   - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
   - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
-- Observability and Reliability
+- Observability & Reliability
   - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779
   - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
   - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899

From 4e99cf7a41544a793bb3e65acb5b2cc93d291e4d Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Mon, 2 Mar 2026 16:03:28 +0000
Subject: [PATCH 3/7] Add issue for Multi-Node NVLink

Signed-off-by: Andrey Velichkevich
---
 ROADMAP.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index 7cb5dfb6ed..8fd53e254a 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -5,14 +5,14 @@
 - Scheduling & Scalability
   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
-  - Enhanced Multi-Node NVLink Support
+  - Support Multi-Node NVLink (MNNVL) for TrainJob: https://github.com/kubeflow/trainer/issues/3264
   - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
     multi-cluster job dispatching, topology-aware scheduling, and other features.
   - Enhanced Scalability for Massively Distributed TrainJobs: https://github.com/kubeflow/trainer/issues/2318
 - MPI and HPC on Kubernetes
   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
-  - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
+  - PMIx Investigation with Flux or Slurm plugins
   - Enhance MPI Orchestration: https://github.com/kubeflow/trainer/issues/2751
 - Observability & Reliability
   - TrainJob Progress Tracking & Metrics Exposure: https://github.com/kubeflow/trainer/issues/2779

From 86dd4973226090002d756ca66291d81e35b867e4 Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Tue, 17 Mar 2026 00:31:03 +0000
Subject: [PATCH 4/7] Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich
---
 ROADMAP.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index 8fd53e254a..979aca69d6 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -24,7 +24,7 @@
   - Integration with OptimizationJob
   - Explore RDMA with AI Schedulers and Data Cache
 - LLM Fine-Tuning Enhancements
-  - Predictive GPU Capacity Planning for TrainJobs
+  - Predictive GPU Capacity Planning for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
   - Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
 - New Kubeflow Trainer Runtimes
   - Distributed JAX: https://github.com/kubeflow/trainer/issues/2442

From e7245e73f41ad495b9bb08a3e2fc540fecc72fd4 Mon Sep 17 00:00:00 2001
From: Andrey Velichkevich
Date: Fri, 20 Mar 2026 17:40:50 +0000
Subject: [PATCH 5/7] Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis
Signed-off-by: Andrey Velichkevich
---
 ROADMAP.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index 979aca69d6..6c20d0d4be 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -24,7 +24,7 @@
   - Integration with OptimizationJob
   - Explore RDMA with AI Schedulers and Data Cache
 - LLM Fine-Tuning Enhancements
-  - Predictive GPU Capacity Planning for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
+  - Automatic configuration of GPU requests for TrainJobs: https://github.com/kubeflow/trainer/issues/3328
   - Build Dynamic BuiltinTrainers and LLM Fine-Tuning Blueprints: https://github.com/kubeflow/trainer/issues/2839
 - New Kubeflow Trainer Runtimes
   - Distributed JAX: https://github.com/kubeflow/trainer/issues/2442

From b7e785890935fda55b9a550edeb609549d879f86 Mon Sep 17 00:00:00 2001
From: andreyvelich
Date: Wed, 15 Apr 2026 11:42:11 +0100
Subject: [PATCH 6/7] Add item for Runtime lifecycle management

Signed-off-by: andreyvelich
---
 ROADMAP.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/ROADMAP.md b/ROADMAP.md
index 6c20d0d4be..6458fa064b 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -35,6 +35,7 @@
   frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
 - Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
 - Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
+- Enhance lifecycle management and mutability of Runtimes: https://github.com/kubeflow/trainer/pull/3428
 
 ## 2025
 

From 4ec37872337ec2dca924bee7b08a4c27e98ea837 Mon Sep 17 00:00:00 2001
From: andreyvelich
Date: Wed, 15 Apr 2026 11:47:33 +0100
Subject: [PATCH 7/7] Add Observability items

Signed-off-by: andreyvelich
---
 ROADMAP.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ROADMAP.md b/ROADMAP.md
index 6458fa064b..79b9230179 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -19,6 +19,8 @@
   - Transparent Checkpoint/Restore for GPU-Accelerated TrainJobs: https://github.com/kubeflow/trainer/issues/2245
   - TTLs and ActiveDeadlineSeconds for TrainJobs: https://github.com/kubeflow/trainer/issues/2899
   - Elastic TrainJobs: https://github.com/kubeflow/trainer/issues/2903
+  - Add controller-level Prometheus metrics and ServiceMonitor: https://github.com/kubeflow/trainer/issues/3429
+  - Default Grafana dashboard for Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/3430
 - Distributed Data Cache
   - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
   - Integration with OptimizationJob