fix(hami-scheduler): compatible with app's stop fast behavior#2712

Merged
eball merged 1 commit into main from gpu/fix/app_stop_compat
Mar 18, 2026
dkeven (Member) commented Mar 18, 2026

  • Background
    After "appservice: stop app fast if pod was hami schudule failed when resume" (#2699), app-service immediately stops any Application that hami-scheduler reports as unschedulable. However, hami-scheduler also reports unschedulable, and makes kube-scheduler retry scheduling, in many retryable cases, such as a node being locked by another pod. In addition, because HAMi's informer is asynchronous, device occupation stats may not be updated immediately, so a pod may only be scheduled on the next retry. Two changes make HAMi compatible with this new logic:
    1. Add a new event type with reason InsufficientGPU, dedicated to the case where no available GPU resources can be found for the pod being scheduled, separating it from the other, retryable cases.
    2. When a pod is deleted by HAMi-scheduler itself, update the in-memory device usage immediately rather than relying on the pod informer to update the state, avoiding potential race conditions with the deployment controller.
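
The two changes above can be illustrated with a minimal Go sketch. It is not the code from this PR or from HAMi: only the reason string `InsufficientGPU` comes from the description, while `FailureKind`, `ReasonRetryable`, `DeviceUsage`, `ShouldStopFast`, and the `apiDelete` callback are hypothetical names chosen for illustration, and the real scheduler tracks considerably more state.

```go
package main

import "sync"

// ReasonInsufficientGPU is the reason named in the PR description.
// ReasonRetryable is a hypothetical stand-in for all other failure events.
const (
	ReasonInsufficientGPU = "InsufficientGPU"
	ReasonRetryable       = "FilteringFailed" // assumed name, not from the PR
)

// FailureKind is a hypothetical classification of scheduling failures.
type FailureKind int

const (
	NodeLocked     FailureKind = iota // node locked by another pod: retryable
	StaleUsage                        // informer has not caught up yet: retryable
	NoGPUAvailable                    // no device can fit the pod: not retryable
)

// EventReason maps a failure to the event reason that would be emitted.
func EventReason(k FailureKind) string {
	if k == NoGPUAvailable {
		return ReasonInsufficientGPU
	}
	return ReasonRetryable
}

// ShouldStopFast sketches how a consumer such as app-service might react:
// stop only on the dedicated reason and let other failures be retried.
func ShouldStopFast(reason string) bool {
	return reason == ReasonInsufficientGPU
}

// DeviceUsage is a simplified in-memory view of GPU memory in use.
type DeviceUsage struct {
	mu   sync.Mutex
	used map[string]int            // device ID -> MiB in use
	pods map[string]map[string]int // pod key -> device ID -> MiB held
}

func NewDeviceUsage() *DeviceUsage {
	return &DeviceUsage{used: map[string]int{}, pods: map[string]map[string]int{}}
}

// Assign records that pod holds mib MiB on device.
func (d *DeviceUsage) Assign(pod, device string, mib int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pods[pod] == nil {
		d.pods[pod] = map[string]int{}
	}
	d.pods[pod][device] += mib
	d.used[device] += mib
}

// Release frees everything held by pod. It is idempotent, so a later
// informer delete event for the same pod is a harmless no-op.
func (d *DeviceUsage) Release(pod string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for device, mib := range d.pods[pod] {
		d.used[device] -= mib
	}
	delete(d.pods, pod)
}

// Used returns the MiB currently accounted to device.
func (d *DeviceUsage) Used(device string) int {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.used[device]
}

// DeletePod deletes the pod through apiDelete (a hypothetical callback
// standing in for the API call) and releases its devices right away
// instead of waiting for the informer to observe the deletion.
func (d *DeviceUsage) DeletePod(pod string, apiDelete func(string) error) error {
	if err := apiDelete(pod); err != nil {
		return err
	}
	d.Release(pod)
	return nil
}
```

Making `Release` idempotent is the design choice that lets the immediate update and the eventual informer event coexist: whichever arrives second finds nothing left to free, so usage is never subtracted twice.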

  • Target Version for Merge
    1.12.5, 1.12.6

  • Related Issues
    none

  • PRs Involving Sub-Systems
    fix(scheduler): compatible with app's stop fast behavior HAMi#17

  • Other information:
    none


eball merged commit 5f84fcb into main on Mar 18, 2026
12 checks passed