Stabilize SMP priority_order selftest #5

Merged: jserv merged 1 commit into main from fix on May 11, 2026

@jserv jserv commented May 11, 2026

The SMP build-and-test job hung on selftest 32 (priority_order) past the 120s watcher budget and was killed.

  • The make watcher's CHECK_*_TIMEOUT used := so the CI workflow's CHECK_SELFTEST_TIMEOUT=240 env override never applied; the effective budget stayed at the 120s default. Switch to ?= so the override propagates.
  • test_priority_order waited for the three pinned worker tasks via a single sleep_ms(50). That wake path arms one callout and depends on a same-priority, same-hart re-enqueue to drag the idle thread out of wfi, which is fragile under QEMU TCG SMP timing and has been observed to stall the test indefinitely. Replace the blind sleep with an atomic done-counter polled via sleep_ms(0) yields and a 2s wall-clock deadline, so the test makes progress on its own and fails cleanly if the workers never run.
  • dl_replenish_cb only re-enqueued the task when it found it in TD_STATE_DL_THROTTLED, missing the case where the task is already sitting in pcpu_dl_runq[cpu] in TD_STATE_READY (sched_dl_pick_next was skipping it because dl_throttled was set). Without a reschedule poke in that case, the owning hart can stay parked in wfi after the throttle flag clears. Set need_resched on the task's CPU (and send a cross-hart IPI when that CPU is a different hart) when we observe READY at replenish time.
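
The done-counter wait described in the second bullet can be sketched roughly as follows. This is a user-space model, not the code from this PR: `wait_for_workers`, the `now_ms`/`yield` hooks, and the `fake_*` simulation helpers are hypothetical names introduced for illustration, with the `yield` hook standing in for `sleep_ms(0)` and a fake clock standing in for the kernel's wall clock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hooks standing in for the kernel clock and sleep_ms(0). */
typedef uint64_t (*now_ms_fn)(void);
typedef void (*yield_fn)(void);

/* Poll `done` until it reaches `expected` or `budget_ms` elapses.
 * Returns true if every worker reported in, false on timeout. */
static bool wait_for_workers(atomic_int *done, int expected,
                             uint64_t budget_ms,
                             now_ms_fn now_ms, yield_fn yield)
{
    uint64_t deadline = now_ms() + budget_ms;

    while (atomic_load_explicit(done, memory_order_acquire) < expected) {
        if (now_ms() >= deadline)
            return false;   /* fail cleanly instead of hanging */
        yield();            /* give the workers a chance to run */
    }
    return true;
}

/* Deterministic simulation: each yield advances a fake clock by 1 ms
 * and lets at most one pending worker finish. */
static uint64_t fake_clock_ms;
static atomic_int fake_done;
static int fake_workers_left;

static uint64_t fake_now_ms(void) { return fake_clock_ms; }

static void fake_yield(void)
{
    fake_clock_ms++;
    if (fake_workers_left > 0) {
        fake_workers_left--;
        atomic_fetch_add_explicit(&fake_done, 1, memory_order_release);
    }
}

/* Wait for 3 workers with a 2000 ms budget when only
 * `workers_that_run` of them ever get scheduled. */
static bool run_sim(int workers_that_run)
{
    fake_clock_ms = 0;
    atomic_store(&fake_done, 0);
    fake_workers_left = workers_that_run;
    return wait_for_workers(&fake_done, 3, 2000, fake_now_ms, fake_yield);
}
```

In this model `run_sim(3)` returns as soon as the third worker bumps the counter, while `run_sim(2)` gives up once the 2000 ms budget is spent. The point of the pattern is that the waiter no longer relies on a single timer wakeup: every yield re-enters the scheduler, and the deadline bounds the worst case.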

Summary by cubic

Stabilizes the SMP priority_order selftest and fixes a deadline scheduler wakeup gap to prevent CI hangs and ensure replenished tasks run promptly. Also makes CI selftest timeout overrides work.

  • Bug Fixes
    • Makefile: switch CHECK_TIMEOUT and CHECK_SELFTEST_TIMEOUT to ?= so CI env overrides apply.
    • Selftest: replace fixed sleep_ms(50) with an atomic done counter polled via sleep_ms(0) yields and a 2s deadline to avoid SMP stalls and fail cleanly if workers never run.
    • Scheduler: in dl_replenish_cb, when a task is READY, set need_resched on its CPU (and send IPI if needed) so it’s picked after replenishment; previously only handled DL_THROTTLED.
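
The scheduler change can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the kernel's implementation: `struct task_model`, `struct cpu_model`, `kick_cpu`, and `replenish_model` are hypothetical, the re-enqueue is reduced to a state change, and it is assumed that the replenish path is where the throttle flag is cleared. Only the state names and the need_resched/IPI behaviour are taken from the description above.

```c
#include <stdbool.h>

/* State names follow the PR text; everything else is a simplified model. */
enum td_state { TD_STATE_READY, TD_STATE_DL_THROTTLED, TD_STATE_RUNNING };

struct task_model {
    enum td_state state;
    bool dl_throttled;   /* while set, the picker skips this task */
    int cpu;             /* CPU whose deadline runqueue holds the task */
};

struct cpu_model {
    bool need_resched;
    int ipis_received;
};

#define NCPU 2
static struct cpu_model cpus[NCPU];

/* Request a reschedule on `target`; a remote hart also needs an IPI,
 * otherwise it may stay parked in wfi. */
static void kick_cpu(int target, int self)
{
    cpus[target].need_resched = true;
    if (target != self)
        cpus[target].ipis_received++;
}

/* Model of the replenish callback running on CPU `self`. */
static void replenish_model(struct task_model *t, int self)
{
    t->dl_throttled = false;   /* assumption: cleared at replenish time */

    switch (t->state) {
    case TD_STATE_DL_THROTTLED:
        /* Previously handled case: put the task back on the runqueue. */
        t->state = TD_STATE_READY;
        kick_cpu(t->cpu, self);
        break;
    case TD_STATE_READY:
        /* Case added by this PR: the task is already queued, so only
         * the owning CPU needs to be told to pick again. */
        kick_cpu(t->cpu, self);
        break;
    default:
        break;
    }
}
```

In this model, replenishing a READY task owned by CPU 1 from CPU 0 sets `need_resched` on CPU 1 and counts one IPI; replenishing a task owned by the local CPU sets the flag without an IPI.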

Written for commit b87ac2b. Summary will update on new commits.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 3 files

@jserv jserv merged commit b3c2000 into main May 11, 2026
7 checks passed
@jserv jserv deleted the fix branch May 11, 2026 14:58