feat(ci): daytime human-access deployment scheduler (#1281)#1587
feat(ci): daytime human-access deployment scheduler (#1281)#1587njhensley wants to merge 1 commit into
Conversation
|
🌿 Preview your docs: https://nvidia-preview-ci-dc8-daytime-human-access.docs.buildwithfern.com/aicr |
📝 WalkthroughWalkthroughThis PR adds a daytime human-access UAT rotation. It introduces Estimated code review effort: 4 (Complex) | ~45 minutes Possibly related PRs
Suggested labels: Suggested reviewers: 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/uat-daytime.yaml:
- Around line 36-39: The cron expressions in the `schedule` block and the
matching `MORNING_UP_CRON`/`EVENING_DOWN_CRON` env values must stay identical to
keep the Resolve-action string comparison working. Update the `uat-daytime`
workflow so the inline comment near `schedule` and the env definitions clearly
tie the two together, and make sure any future change to one cron is mirrored in
the other. Use the `on.schedule` entries and the `MORNING_UP_CRON` /
`EVENING_DOWN_CRON` symbols as the reference points when editing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7151dfff-08f5-459a-9c3a-fa962f1ca642
📒 Files selected for processing (9)
.github/workflows/uat-daytime.yamldocs/contributor/uat.mdinfra/uat/reservations.yamlpkg/uatbroker/model.gopkg/uatbroker/registry.gopkg/uatbroker/registry_test.gotools/uat-broker/README.mdtools/uat-broker/main.gotools/uat-broker/main_test.go
2d56967 to
5189b82
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/uatbroker/registry.go (1)
66-110: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winEnforce the one-daytime-reservation-per-cloud invariant in
Validate(), not just via a test on the committed file.Docs (
docs/contributor/uat.md) state this as a hard invariant, butValidate()only checks thatdaytime-intentis a recognized value — it never checks that at most one reservation per cloud opts in. Today that's caught only byTestCommittedRegistryValidagainst the specific checked-ininfra/uat/reservations.yaml; any other caller ofParseRegistry/LoadRegistryFile(future tooling, alternate registries, other tests) can silently violate it, undermining the lease/scheduler assumption that one cloud never hosts two daytime clusters at once.🛡️ Proposed fix
if res.DaytimeIntent != "" && !validIntents[res.DaytimeIntent] { return errors.New(errors.ErrCodeInvalidRequest, fmt.Sprintf("reservation %s has unknown daytime-intent %q (want %s or %s, or empty to opt out)", res.Name, res.DaytimeIntent, IntentTraining, IntentInference)) } } + daytimeCloud := make(map[string]string, len(r.Reservations)) + for i := range r.Reservations { + res := &r.Reservations[i] + if res.DaytimeIntent == "" { + continue + } + if prev, ok := daytimeCloud[res.Cloud]; ok { + return errors.New(errors.ErrCodeInvalidRequest, + fmt.Sprintf("cloud %s has more than one daytime-intent reservation (%s and %s); at most one is allowed", + res.Cloud, prev, res.Name)) + } + daytimeCloud[res.Cloud] = res.Name + } return nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/uatbroker/registry.go` around lines 66 - 110, `Registry.Validate()` currently checks individual reservation fields but does not enforce the documented one-daytime-reservation-per-cloud invariant. Add a per-cloud check in `Validate()` (alongside the existing `seen`/`validIntents` logic) that tracks reservations with a non-empty `DaytimeIntent` and returns an `errors.ErrCodeInvalidRequest` if more than one reservation on the same `Cloud` opts into daytime rotation. Keep the existing `Reservation` field validation intact, and use the `Registry`, `Reservation`, `CloudAWS`, `CloudGCP`, and `DaytimeIntent` symbols to locate the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/uatbroker/registry.go`:
- Around line 66-110: `Registry.Validate()` currently checks individual
reservation fields but does not enforce the documented
one-daytime-reservation-per-cloud invariant. Add a per-cloud check in
`Validate()` (alongside the existing `seen`/`validIntents` logic) that tracks
reservations with a non-empty `DaytimeIntent` and returns an
`errors.ErrCodeInvalidRequest` if more than one reservation on the same `Cloud`
opts into daytime rotation. Keep the existing `Reservation` field validation
intact, and use the `Registry`, `Reservation`, `CloudAWS`, `CloudGCP`, and
`DaytimeIntent` symbols to locate the logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: a75d749c-90c0-4e1f-a345-5888749010ca
📒 Files selected for processing (9)
.github/workflows/uat-daytime.yamldocs/contributor/uat.mdinfra/uat/reservations.yamlpkg/uatbroker/model.gopkg/uatbroker/registry.gopkg/uatbroker/registry_test.gotools/uat-broker/README.mdtools/uat-broker/main.gotools/uat-broker/main_test.go
Implements DC8 (NVIDIA#1281): stand up one long-lived, human-facing deployment per cloud for the working day, then tear it down before the nightly batch. DC2 (NVIDIA#1586) already owns the daytime-up/daytime-down lifecycle, provision- and-hold, teardown, and pre-batch guard; DC8 adds the orchestration on top — the cloud→flavor data mapping, the scheduler, and the out-of-band access path. Cloud→flavor split (data, not code): add an optional daytime-intent column to the reservation registry (aws-h100=training, gcp-h100=inference at launch; empty = nightly-batch only). pkg/uatbroker gains the DaytimeIntent field, intent constants + validation, and DaytimeAssignments(); uat-broker gains a `reservations --daytime` JSON matrix output. Only one daytime reservation per cloud is allowed (a reservation cannot hold both a held daytime cluster and the nightly batch at once) — enforced by the committed-registry test. Scheduler (uat-daytime.yaml): a thin scheduler over the daytime-up/daytime-down mechanics. A morning cron dispatches daytime-up per rotation reservation, an evening cron (before the 04:00 batch) dispatches daytime-down; a manual workflow_dispatch(action=up|down) covers ad-hoc runs. Each dispatch routes through uat-run.yaml so it takes the same per-reservation lease as the batch, and is watched to completion so a failed handoff/teardown surfaces. The daytime cluster is not a UAT cell — daytime-up stops after deploy, emitting no evidence bundle and no TestGrid column. Docs (docs/contributor/uat.md): document the cloud→flavor split, the scheduler and its cron edges, missed-teardown recovery via DC2's pre-batch guard, and the out-of-band access path — stable cluster names gated by cloud IAM so no credential transits CI, with TrainJob submission (AWS) and the Dynamo OpenAI endpoint port-forward (GCP). The served DynamoGraphDeployment remains DC3's phase_serve; until it lands the served workload is a documented manual apply. Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
5189b82 to
a0e9407
Compare
|
Addressed the 🟠 Major finding on |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/uat.md`:
- Line 79: The docs text in uat.md should not imply the one-per-cloud invariant
is only enforced by the uatbroker committed-registry test; update the wording to
state that Registry.Validate() enforces it as a runtime contract for
ParseRegistry and LoadRegistryFile callers, with the test only serving as
coverage. Keep the mention of the launch split and daytime-intent constraint,
but make the enforcement source explicit using the Registry.Validate symbol.
In `@tools/uat-broker/main.go`:
- Around line 158-165: The daytime JSON path in main's `daytime` branch writes
directly via `json.Encoder.Encode`, which can partially stream to stdout before
an error is returned. Update that branch to build the full payload first
(similar to the line-oriented path below, using a buffered approach in `main`),
then perform a single write so broken pipes or unwritable output surfaces as a
failure before any truncated JSON is emitted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: ead88c8d-7dfd-4e2b-9895-e70fc4804046
📒 Files selected for processing (9)
.github/workflows/uat-daytime.yamldocs/contributor/uat.mdinfra/uat/reservations.yamlpkg/uatbroker/model.gopkg/uatbroker/registry.gopkg/uatbroker/registry_test.gotools/uat-broker/README.mdtools/uat-broker/main.gotools/uat-broker/main_test.go
| | `aws-h100` | AWS | `training` | training stack (Kubeflow `TrainJob`s) | | ||
| | `gcp-h100` | GCP | `inference` | inference stack (Dynamo, OpenAI-compatible endpoint) | | ||
|
|
||
| Re-splitting (or adding a daytime reservation) is a registry edit — no workflow change. Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. The `uatbroker` committed-registry test enforces the one-per-cloud invariant and the launch split. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Doc understates where the one-per-cloud invariant is enforced.
The invariant is enforced in Registry.Validate() itself (applies to every ParseRegistry/LoadRegistryFile caller), not only by the "committed-registry test." As worded, this could mislead readers into thinking the guarantee is test-only rather than a runtime contract of the registry API.
📝 Proposed wording fix
-Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. The `uatbroker` committed-registry test enforces the one-per-cloud invariant and the launch split.
+Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. `Registry.Validate()` enforces the one-per-cloud invariant for every registry parsed or loaded (not just the committed file); the `uatbroker` committed-registry test additionally pins the launch split.As per coding guidelines, **/*.md: "Keep code comments and documentation accurate and helpful, and update API documentation when endpoints change."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Re-splitting (or adding a daytime reservation) is a registry edit — no workflow change. Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. The `uatbroker` committed-registry test enforces the one-per-cloud invariant and the launch split. | |
| Re-splitting (or adding a daytime reservation) is a registry edit — no workflow change. Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. `Registry.Validate()` enforces the one-per-cloud invariant for every registry parsed or loaded (not just the committed file); the `uatbroker` committed-registry test additionally pins the launch split. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/contributor/uat.md` at line 79, The docs text in uat.md should not imply
the one-per-cloud invariant is only enforced by the uatbroker committed-registry
test; update the wording to state that Registry.Validate() enforces it as a
runtime contract for ParseRegistry and LoadRegistryFile callers, with the test
only serving as coverage. Keep the mention of the launch split and
daytime-intent constraint, but make the enforcement source explicit using the
Registry.Validate symbol.
Source: Coding guidelines
| if daytime { | ||
| enc := json.NewEncoder(stdout) | ||
| enc.SetIndent("", " ") | ||
| if err := enc.Encode(reg.DaytimeAssignments()); err != nil { | ||
| return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err) | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Buffer the JSON output before writing, for consistency with the line-oriented path.
enc.Encode streams directly to stdout as it marshals, whereas the line-oriented branch right below explicitly builds output in a strings.Builder first "so a broken pipe or an unwritable $GITHUB_OUTPUT surfaces as a failure instead of a silent exit-0." The --daytime path doesn't get that same guarantee — a write failure partway through encoding could emit truncated JSON before the non-zero exit code is observed.
♻️ Proposed fix using MarshalIndent + single write
if daytime {
- enc := json.NewEncoder(stdout)
- enc.SetIndent("", " ")
- if err := enc.Encode(reg.DaytimeAssignments()); err != nil {
- return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err)
- }
- return nil
+ out, err := json.MarshalIndent(reg.DaytimeAssignments(), "", " ")
+ if err != nil {
+ return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err)
+ }
+ if _, err := stdout.Write(append(out, '\n')); err != nil {
+ return errors.Wrap(errors.ErrCodeInternal, "write daytime assignments", err)
+ }
+ return nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if daytime { | |
| enc := json.NewEncoder(stdout) | |
| enc.SetIndent("", " ") | |
| if err := enc.Encode(reg.DaytimeAssignments()); err != nil { | |
| return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err) | |
| } | |
| return nil | |
| } | |
| if daytime { | |
| out, err := json.MarshalIndent(reg.DaytimeAssignments(), "", " ") | |
| if err != nil { | |
| return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err) | |
| } | |
| if _, err := stdout.Write(append(out, '\n')); err != nil { | |
| return errors.Wrap(errors.ErrCodeInternal, "write daytime assignments", err) | |
| } | |
| return nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/uat-broker/main.go` around lines 158 - 165, The daytime JSON path in
main's `daytime` branch writes directly via `json.Encoder.Encode`, which can
partially stream to stdout before an error is returned. Update that branch to
build the full payload first (similar to the line-oriented path below, using a
buffered approach in `main`), then perform a single write so broken pipes or
unwritable output surfaces as a failure before any truncated JSON is emitted.
Summary
Implements DC8 (#1281): a thin scheduler that stands up one long-lived, human-facing deployment per cloud each working day (AWS=training, GCP=inference), then tears it down before the nightly batch — reusing DC2's
daytime-up/daytime-downlifecycle.Motivation / Context
At the morning handoff the day's scarce reservation time is split between human use (DC8) and the nightly version matrix (DC5). DC2 (#1586) landed the daytime provision-and-hold / teardown / pre-batch-guard mechanics; DC8 adds the orchestration on top — the cloud→flavor data mapping, the morning/evening scheduler, and the out-of-band access path. The daytime cluster is human-facing, not a UAT cell: it emits no evidence bundle and no TestGrid column.
Fixes: #1281
Related: #1264 (epic), #1586 (DC2), #1274 (DC1)
Type of Change
Component(s) Affected
docs/,examples/)pkg/uatbroker,tools/uat-broker), CI workflows (.github/workflows), reservation registry (infra/uat)Implementation Notes
daytime-intentcolumn on each reservation row (infra/uat/reservations.yaml) opts a reservation into the daytime rotation and picks its flavor. Launch default:aws-h100=training,gcp-h100=inference; empty = nightly-batch only. Re-splitting is a registry edit.pkg/uatbrokergainsDaytimeIntent+ intent constants/validation +DaytimeAssignments();uat-broker reservations --daytimeemits a[{reservation,intent}]JSON matrix. Only one daytime reservation per cloud is allowed (a reservation cannot hold both a held daytime cluster and the nightly batch at once) — enforced by the committed-registry test.uat-daytime.yaml). Morning cron0 15 * * *→daytime-up, evening cron0 2 * * *→daytime-down(~2h before the0 4batch), plus manualworkflow_dispatch(action=up|down). Each dispatch routes throughuat-run.yaml, so it takes the same per-reservation lease as the batch and is watched to completion so a failed handoff/teardown surfaces.daytime-downis skipped/fails, DC2's pre-batch guard blocks the next batch rather than racing the held cluster.TrainJobs (AWS) or port-forward the Dynamo OpenAI endpoint (GCP).DynamoGraphDeploymentremains DC3'sphase_serve;daytime-updeploys the inference platform and the served workload is a documented one-command manual apply until DC3 lands.Testing
The scheduler workflow itself is exercised by the manual
workflow_dispatchacceptance path described in the issue (provisions the daytime clusters, holds them, evening teardown releases the lease); it cannot be run in unit tests as it drives real cloud provisioning.Risk Assessment
Rollout notes: The daytime rotation activates once the
uat-daytime.yamlcrons run onmain. Backwards compatible — reservations withoutdaytime-intentare unaffected (nightly batch only).Checklist
make testwith-race)make lint)git commit -S)