feat(ci): daytime human-access deployment scheduler (#1281) #1587

Open
njhensley wants to merge 1 commit into NVIDIA:main from njhensley:ci/dc8-daytime-human-access

Conversation

@njhensley

Member

Summary

Implements DC8 (#1281): a thin scheduler that stands up one long-lived, human-facing deployment per cloud each working day (AWS=training, GCP=inference), then tears it down before the nightly batch — reusing DC2's daytime-up/daytime-down lifecycle.

Motivation / Context

At the morning handoff the day's scarce reservation time is split between human use (DC8) and the nightly version matrix (DC5). DC2 (#1586) landed the daytime provision-and-hold / teardown / pre-batch-guard mechanics; DC8 adds the orchestration on top — the cloud→flavor data mapping, the morning/evening scheduler, and the out-of-band access path. The daytime cluster is human-facing, not a UAT cell: it emits no evidence bundle and no TestGrid column.

Fixes: #1281
Related: #1264 (epic), #1586 (DC2), #1274 (DC1)

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling
  • Documentation update

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: UAT reservation broker (pkg/uatbroker, tools/uat-broker), CI workflows (.github/workflows), reservation registry (infra/uat)

Implementation Notes

  • Cloud→flavor split is data, not code. An optional daytime-intent column on each reservation row (infra/uat/reservations.yaml) opts a reservation into the daytime rotation and picks its flavor. Launch default: aws-h100=training, gcp-h100=inference; empty = nightly-batch only. Re-splitting is a registry edit.
  • Broker. pkg/uatbroker gains DaytimeIntent + intent constants/validation + DaytimeAssignments(); uat-broker reservations --daytime emits a [{reservation,intent}] JSON matrix. Only one daytime reservation per cloud is allowed (a reservation cannot hold both a held daytime cluster and the nightly batch at once) — enforced by the committed-registry test.
  • Scheduler (uat-daytime.yaml). Morning cron 0 15 * * * → daytime-up, evening cron 0 2 * * * → daytime-down (~2h before the 0 4 batch), plus manual workflow_dispatch(action=up|down). Each dispatch routes through uat-run.yaml, so it takes the same per-reservation lease as the batch and is watched to completion so a failed handoff/teardown surfaces.
  • Missed teardown fails closed. If daytime-down is skipped/fails, DC2's pre-batch guard blocks the next batch rather than racing the held cluster.
  • Out-of-band access. No kubeconfig/endpoint transits CI. Stable cluster names are gated by cloud IAM; operators mint their own kubeconfig, submit TrainJobs (AWS) or port-forward the Dynamo OpenAI endpoint (GCP).
  • Served workload scope. The served DynamoGraphDeployment remains DC3's phase_serve; daytime-up deploys the inference platform and the served workload is a documented one-command manual apply until DC3 lands.
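
As a rough illustration of the "data, not code" split and the [{reservation,intent}] matrix described above, the filtering could look something like the sketch below. The `Reservation` and `Assignment` types, the `daytimeAssignments` helper, and the `aws-spare` row are hypothetical stand-ins written for this example; the real `pkg/uatbroker` types and `DaytimeAssignments()` method may be shaped differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Reservation is a hypothetical, trimmed-down registry row. The real
// pkg/uatbroker type is assumed to carry more fields than shown here.
type Reservation struct {
	Name          string
	Cloud         string
	DaytimeIntent string // "training", "inference", or "" for nightly-batch only
}

// Assignment is one entry of the [{reservation,intent}] JSON matrix.
type Assignment struct {
	Reservation string `json:"reservation"`
	Intent      string `json:"intent"`
}

// daytimeAssignments keeps only the rows that opted into the daytime rotation.
func daytimeAssignments(rows []Reservation) []Assignment {
	out := make([]Assignment, 0, len(rows))
	for _, r := range rows {
		if r.DaytimeIntent == "" {
			continue // empty intent: nightly batch only
		}
		out = append(out, Assignment{Reservation: r.Name, Intent: r.DaytimeIntent})
	}
	return out
}

func main() {
	rows := []Reservation{
		{Name: "aws-h100", Cloud: "aws", DaytimeIntent: "training"},
		{Name: "gcp-h100", Cloud: "gcp", DaytimeIntent: "inference"},
		{Name: "aws-spare", Cloud: "aws"}, // hypothetical row with no daytime intent
	}
	b, err := json.Marshal(daytimeAssignments(rows))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because the flavor lives in the row rather than in the scheduler, swapping the two intents in this sketch changes the emitted matrix without touching any code path.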

Testing

go test -race ./pkg/uatbroker/... ./tools/uat-broker/...   # ok — new funcs 100% covered
golangci-lint run -c .golangci.yaml ./pkg/uatbroker/... ./tools/uat-broker/...  # 0 issues
actionlint -shellcheck= .github/workflows/uat-daytime.yaml # ok
yamllint infra/uat/reservations.yaml .github/workflows/uat-daytime.yaml  # ok

The scheduler workflow itself is exercised by the manual workflow_dispatch acceptance path described in the issue (provisions the daytime clusters, holds them, evening teardown releases the lease); it cannot be run in unit tests as it drives real cloud provisioning.

Risk Assessment

  • Low — Additive: a new optional registry column, a new scheduler workflow, and broker output. No existing lifecycle behavior changes; reverting is removing the workflow + column.

Rollout notes: The daytime rotation activates once the uat-daytime.yaml crons run on main. Backwards compatible — reservations without daytime-intent are unaffected (nightly batch only).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@njhensley njhensley requested review from a team as code owners July 2, 2026 01:53
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Walkthrough

This PR adds a daytime human-access UAT rotation. It introduces daytime-intent on reservations, validates and enumerates daytime assignments in pkg/uatbroker, adds a uat-broker reservations --daytime mode, and uses that output in a new uat-daytime.yaml workflow that dispatches uat-run.yaml per reservation. The reservations data and contributor documentation are updated to describe the daytime model and operator flow.

Estimated code review effort: 4 (Complex) | ~45 minutes

Possibly related PRs

  • NVIDIA/aicr#1559: Extends the same pkg/uatbroker and tools/uat-broker reservation plumbing with daytime-intent support and daytime rotation output.
  • NVIDIA/aicr#1583: Also classifies pending-cancelled UAT runs as a benign supersede case and updates the UAT contributor docs around that behavior.
  • NVIDIA/aicr#1586: Adds the intent and lifecycle workflow plumbing that uat-daytime.yaml passes into uat-run.yaml.

Suggested labels: area/tests

Suggested reviewers: mchmarny

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the new daytime human-access deployment scheduler and matches the main change set. |
| Description check | ✅ Passed | The description accurately summarizes the scheduler, registry, broker, workflow, and docs changes in the pull request. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/uat-daytime.yaml:
- Around line 36-39: The cron expressions in the `schedule` block and the
matching `MORNING_UP_CRON`/`EVENING_DOWN_CRON` env values must stay identical to
keep the Resolve-action string comparison working. Update the `uat-daytime`
workflow so the inline comment near `schedule` and the env definitions clearly
tie the two together, and make sure any future change to one cron is mirrored in
the other. Use the `on.schedule` entries and the `MORNING_UP_CRON` /
`EVENING_DOWN_CRON` symbols as the reference points when editing.
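
To make the coupling concrete: the comment above describes an exact string comparison between the cron that fired and the two env values. The sketch below models that comparison in Go purely for illustration; the workflow itself presumably does this in a workflow step, and `resolveAction` plus the two constants are names assumed for this example (the cron values are the ones quoted in the PR description).

```go
package main

import "fmt"

// Assumed values, copied from the PR description. In the workflow they would
// appear twice: under on.schedule and as MORNING_UP_CRON / EVENING_DOWN_CRON.
const (
	morningUpCron   = "0 15 * * *"
	eveningDownCron = "0 2 * * *"
)

// resolveAction maps the cron string that fired to a lifecycle action.
// The match is an exact string comparison, so a schedule that matches
// neither constant is reported as an error instead of picking a default.
func resolveAction(firedCron string) (string, error) {
	switch firedCron {
	case morningUpCron:
		return "daytime-up", nil
	case eveningDownCron:
		return "daytime-down", nil
	default:
		return "", fmt.Errorf("schedule %q matches neither cron constant", firedCron)
	}
}

func main() {
	// The third entry simulates an on.schedule edit that was not mirrored in env.
	for _, c := range []string{"0 15 * * *", "0 2 * * *", "30 15 * * *"} {
		action, err := resolveAction(c)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(c, "->", action)
	}
}
```

Even a whitespace difference between the two copies would fall through to the error branch here, which is why the two definitions have to be edited together.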

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7151dfff-08f5-459a-9c3a-fa962f1ca642

📥 Commits

Reviewing files that changed from the base of the PR and between 55fd14f and 2d56967.

📒 Files selected for processing (9)
  • .github/workflows/uat-daytime.yaml
  • docs/contributor/uat.md
  • infra/uat/reservations.yaml
  • pkg/uatbroker/model.go
  • pkg/uatbroker/registry.go
  • pkg/uatbroker/registry_test.go
  • tools/uat-broker/README.md
  • tools/uat-broker/main.go
  • tools/uat-broker/main_test.go

Comment thread .github/workflows/uat-daytime.yaml
@njhensley njhensley force-pushed the ci/dc8-daytime-human-access branch from 2d56967 to 5189b82 Compare July 2, 2026 02:45

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/uatbroker/registry.go (1)

66-110: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Enforce the one-daytime-reservation-per-cloud invariant in Validate(), not just via a test on the committed file.

Docs (docs/contributor/uat.md) state this as a hard invariant, but Validate() only checks that daytime-intent is a recognized value — it never checks that at most one reservation per cloud opts in. Today that's caught only by TestCommittedRegistryValid against the specific checked-in infra/uat/reservations.yaml; any other caller of ParseRegistry/LoadRegistryFile (future tooling, alternate registries, other tests) can silently violate it, undermining the lease/scheduler assumption that one cloud never hosts two daytime clusters at once.

🛡️ Proposed fix
 		if res.DaytimeIntent != "" && !validIntents[res.DaytimeIntent] {
 			return errors.New(errors.ErrCodeInvalidRequest,
 				fmt.Sprintf("reservation %s has unknown daytime-intent %q (want %s or %s, or empty to opt out)",
 					res.Name, res.DaytimeIntent, IntentTraining, IntentInference))
 		}
 	}
+	daytimeCloud := make(map[string]string, len(r.Reservations))
+	for i := range r.Reservations {
+		res := &r.Reservations[i]
+		if res.DaytimeIntent == "" {
+			continue
+		}
+		if prev, ok := daytimeCloud[res.Cloud]; ok {
+			return errors.New(errors.ErrCodeInvalidRequest,
+				fmt.Sprintf("cloud %s has more than one daytime-intent reservation (%s and %s); at most one is allowed",
+					res.Cloud, prev, res.Name))
+		}
+		daytimeCloud[res.Cloud] = res.Name
+	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/uatbroker/registry.go` around lines 66 - 110, `Registry.Validate()`
currently checks individual reservation fields but does not enforce the
documented one-daytime-reservation-per-cloud invariant. Add a per-cloud check in
`Validate()` (alongside the existing `seen`/`validIntents` logic) that tracks
reservations with a non-empty `DaytimeIntent` and returns an
`errors.ErrCodeInvalidRequest` if more than one reservation on the same `Cloud`
opts into daytime rotation. Keep the existing `Reservation` field validation
intact, and use the `Registry`, `Reservation`, `CloudAWS`, `CloudGCP`, and
`DaytimeIntent` symbols to locate the logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a75d749c-90c0-4e1f-a345-5888749010ca

📥 Commits

Reviewing files that changed from the base of the PR and between 2d56967 and 5189b82.

📒 Files selected for processing (9)
  • .github/workflows/uat-daytime.yaml
  • docs/contributor/uat.md
  • infra/uat/reservations.yaml
  • pkg/uatbroker/model.go
  • pkg/uatbroker/registry.go
  • pkg/uatbroker/registry_test.go
  • tools/uat-broker/README.md
  • tools/uat-broker/main.go
  • tools/uat-broker/main_test.go

Implements DC8 (NVIDIA#1281): stand up one long-lived, human-facing deployment
per cloud for the working day, then tear it down before the nightly batch.
DC2 (NVIDIA#1586) already owns the daytime-up/daytime-down lifecycle, provision-
and-hold, teardown, and pre-batch guard; DC8 adds the orchestration on top —
the cloud→flavor data mapping, the scheduler, and the out-of-band access path.

Cloud→flavor split (data, not code): add an optional daytime-intent column
to the reservation registry (aws-h100=training, gcp-h100=inference at launch;
empty = nightly-batch only). pkg/uatbroker gains the DaytimeIntent field,
intent constants + validation, and DaytimeAssignments(); uat-broker gains a
`reservations --daytime` JSON matrix output. Only one daytime reservation per
cloud is allowed (a reservation cannot hold both a held daytime cluster and
the nightly batch at once) — enforced by the committed-registry test.

Scheduler (uat-daytime.yaml): a thin scheduler over the daytime-up/daytime-down
mechanics. A morning cron dispatches daytime-up per rotation reservation, an
evening cron (before the 04:00 batch) dispatches daytime-down; a manual
workflow_dispatch(action=up|down) covers ad-hoc runs. Each dispatch routes
through uat-run.yaml so it takes the same per-reservation lease as the batch,
and is watched to completion so a failed handoff/teardown surfaces. The daytime
cluster is not a UAT cell — daytime-up stops after deploy, emitting no evidence
bundle and no TestGrid column.

Docs (docs/contributor/uat.md): document the cloud→flavor split, the scheduler
and its cron edges, missed-teardown recovery via DC2's pre-batch guard, and the
out-of-band access path — stable cluster names gated by cloud IAM so no
credential transits CI, with TrainJob submission (AWS) and the Dynamo OpenAI
endpoint port-forward (GCP). The served DynamoGraphDeployment remains DC3's
phase_serve; until it lands the served workload is a documented manual apply.

Signed-off-by: Nathan Hensley <nhensley@nvidia.com>
@njhensley njhensley force-pushed the ci/dc8-daytime-human-access branch from 5189b82 to a0e9407 Compare July 2, 2026 02:57
@njhensley

Member Author

Addressed the 🟠 Major finding on pkg/uatbroker/registry.go (one-daytime-reservation-per-cloud) in a0e9407: moved the invariant into Registry.Validate() — folded into the existing validation loop via a per-cloud daytimeCloud map that rejects a second daytime-intent reservation on the same cloud with ErrCodeInvalidRequest. It now holds for every caller of ParseRegistry/LoadRegistryFile, not just the committed file. Added unit cases for both the reject (two daytime reservations, same cloud) and the allow (daytime reservations across different clouds — the launch topology). go test -race + golangci-lint green.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/contributor/uat.md`:
- Line 79: The docs text in uat.md should not imply the one-per-cloud invariant
is only enforced by the uatbroker committed-registry test; update the wording to
state that Registry.Validate() enforces it as a runtime contract for
ParseRegistry and LoadRegistryFile callers, with the test only serving as
coverage. Keep the mention of the launch split and daytime-intent constraint,
but make the enforcement source explicit using the Registry.Validate symbol.

In `@tools/uat-broker/main.go`:
- Around line 158-165: The daytime JSON path in main's `daytime` branch writes
directly via `json.Encoder.Encode`, which can partially stream to stdout before
an error is returned. Update that branch to build the full payload first
(similar to the line-oriented path below, using a buffered approach in `main`),
then perform a single write so broken pipes or unwritable output surfaces as a
failure before any truncated JSON is emitted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: ead88c8d-7dfd-4e2b-9895-e70fc4804046

📥 Commits

Reviewing files that changed from the base of the PR and between 5189b82 and a0e9407.

📒 Files selected for processing (9)
  • .github/workflows/uat-daytime.yaml
  • docs/contributor/uat.md
  • infra/uat/reservations.yaml
  • pkg/uatbroker/model.go
  • pkg/uatbroker/registry.go
  • pkg/uatbroker/registry_test.go
  • tools/uat-broker/README.md
  • tools/uat-broker/main.go
  • tools/uat-broker/main_test.go

Comment thread docs/contributor/uat.md
| `aws-h100` | AWS | `training` | training stack (Kubeflow `TrainJob`s) |
| `gcp-h100` | GCP | `inference` | inference stack (Dynamo, OpenAI-compatible endpoint) |

Re-splitting (or adding a daytime reservation) is a registry edit — no workflow change. Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. The `uatbroker` committed-registry test enforces the one-per-cloud invariant and the launch split.


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Doc understates where the one-per-cloud invariant is enforced.

The invariant is enforced in Registry.Validate() itself (applies to every ParseRegistry/LoadRegistryFile caller), not only by the "committed-registry test." As worded, this could mislead readers into thinking the guarantee is test-only rather than a runtime contract of the registry API.

📝 Proposed wording fix
-Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. The `uatbroker` committed-registry test enforces the one-per-cloud invariant and the launch split.
+Only **one** reservation per cloud may carry a `daytime-intent` today: a single reservation cannot host both a held daytime cluster and the nightly batch at once, so *both* flavors on one cloud during the day is out of scope until more capacity lands. `Registry.Validate()` enforces the one-per-cloud invariant for every registry parsed or loaded (not just the committed file); the `uatbroker` committed-registry test additionally pins the launch split.

As per coding guidelines, **/*.md: "Keep code comments and documentation accurate and helpful, and update API documentation when endpoints change."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/contributor/uat.md` at line 79, The docs text in uat.md should not imply
the one-per-cloud invariant is only enforced by the uatbroker committed-registry
test; update the wording to state that Registry.Validate() enforces it as a
runtime contract for ParseRegistry and LoadRegistryFile callers, with the test
only serving as coverage. Keep the mention of the launch split and
daytime-intent constraint, but make the enforcement source explicit using the
Registry.Validate symbol.

Source: Coding guidelines

Comment thread tools/uat-broker/main.go
Comment on lines +158 to +165
if daytime {
	enc := json.NewEncoder(stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(reg.DaytimeAssignments()); err != nil {
		return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err)
	}
	return nil
}


📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Buffer the JSON output before writing, for consistency with the line-oriented path.

enc.Encode streams directly to stdout as it marshals, whereas the line-oriented branch right below explicitly builds output in a strings.Builder first "so a broken pipe or an unwritable $GITHUB_OUTPUT surfaces as a failure instead of a silent exit-0." The --daytime path doesn't get that same guarantee — a write failure partway through encoding could emit truncated JSON before the non-zero exit code is observed.

♻️ Proposed fix using MarshalIndent + single write
 	if daytime {
-		enc := json.NewEncoder(stdout)
-		enc.SetIndent("", "  ")
-		if err := enc.Encode(reg.DaytimeAssignments()); err != nil {
-			return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err)
-		}
-		return nil
+		out, err := json.MarshalIndent(reg.DaytimeAssignments(), "", "  ")
+		if err != nil {
+			return errors.Wrap(errors.ErrCodeInternal, "encode daytime assignments", err)
+		}
+		if _, err := stdout.Write(append(out, '\n')); err != nil {
+			return errors.Wrap(errors.ErrCodeInternal, "write daytime assignments", err)
+		}
+		return nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/uat-broker/main.go` around lines 158 - 165, The daytime JSON path in
main's `daytime` branch writes directly via `json.Encoder.Encode`, which can
partially stream to stdout before an error is returned. Update that branch to
build the full payload first (similar to the line-oriented path below, using a
buffered approach in `main`), then perform a single write so broken pipes or
unwritable output surfaces as a failure before any truncated JSON is emitted.


Labels

area/ci area/docs area/infra size/XL theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DC8 — Daytime human-access deployment

1 participant