fix(scheduler): compatible with app's stop fast behavior #17

Merged
dkeven merged 2 commits into feat/nvshare from scheduler/fix/appsvc_stopfast_compat
Mar 18, 2026

@dkeven dkeven commented Mar 18, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

After beclab/Olares#2699, app-service immediately stops an Application that hami-scheduler reports as unschedulable. However, hami-scheduler also reports unschedulable in many retryable cases, such as a node being locked by another pod, and relies on kube-scheduler to retry scheduling. In addition, because HAMi's informer is asynchronous, device occupation stats may not be updated immediately, so a pod may be scheduled only on the next retry. This PR makes two changes to keep HAMi compatible with the new behavior:

  1. Add a new event type with the reason InsufficientGPU, dedicated to the case where no available GPU resources can be found for the pod being scheduled, so that it can be distinguished from the ordinary retryable cases.
  2. When a pod is deleted by HAMi-scheduler itself, update the in-memory device usage immediately instead of waiting for the pod informer to update the state, to avoid potential race conditions with the deployment controller.

@dkeven dkeven merged commit f62448f into feat/nvshare Mar 18, 2026
1 check passed
@dkeven dkeven deleted the scheduler/fix/appsvc_stopfast_compat branch March 18, 2026 12:31