feat(BA-3435): Implement Rolling Update deployment strategy#9567
feat(BA-3435): Implement Rolling Update deployment strategy#9567jopemachine wants to merge 3 commits intoBA-4821from
Conversation
|
|
||
| total_new_live = len(new_provisioning) + len(new_healthy) | ||
|
|
||
| # ── 2. PROVISIONING: wait for in-flight routes ── |
There was a problem hiding this comment.
[MEDIUM] UNHEALTHY/DEGRADED routes classified as healthy may cause premature completion
At lines 65-66 of the file, the fallthrough elif r.status.is_active(): new_healthy.append(r) classifies UNHEALTHY and DEGRADED new routes into the new_healthy bucket. This means the completion check at line 80 (len(new_healthy) >= desired) can mark the deployment as completed even when all new routes are UNHEALTHY or DEGRADED.
Scenario:
desired=2, 2 new routes both inUNHEALTHYstatus, 0 old routesnew_healthy = [UNHEALTHY_1, UNHEALTHY_2](via theis_active()fallthrough)len(new_healthy) >= desired= True = completed
This is tested and documented (test_unhealthy_new_counts_as_healthy, test_degraded_new_counts_as_healthy), so it appears intentional. However, it could lead to a deployment being marked as successfully completed when no route is actually serving traffic well.
Suggestion: Consider whether the completion check should require RouteStatus.HEALTHY specifically (not just is_active()), or add a comment explaining why UNHEALTHY/DEGRADED routes are acceptable for completion. The variable name new_healthy is also misleading since it contains non-healthy routes -- renaming to new_active or new_live would improve clarity.
| log.warning( | ||
| "deployment {}: all {} new routes failed — rolling back", | ||
| deployment.id, | ||
| len(new_failed), |
There was a problem hiding this comment.
[MEDIUM] Rollback path unreachable in production due to upstream query filter
The rollback check at this line (total_new_live == 0 and new_failed) requires routes with FAILED_TO_START or TERMINATED status to be present. However, the evaluator layer calls fetch_active_routes_by_endpoint_ids() (in evaluator.py line 66), which filters to only active_route_statuses() = {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}.
Since FAILED_TO_START and TERMINATED are excluded by the DB query, new_failed will always be empty in production, making this rollback path dead code at the system level.
Impact: When all new routes fail, the evaluator will see zero new routes and zero failed routes. The FSM falls through to step 5 (PROGRESSING), computes total_new_live = 0, still_needed = desired, and creates new routes -- effectively retrying infinitely instead of rolling back.
Suggestion: This is an integration-level concern (not specific to this PR's pure function). Consider either:
- Expanding
fetch_active_routes_by_endpoint_ids()to includeFAILED_TO_STARTstatus for deployment evaluation - Adding a separate
fetch_failed_routes_by_endpoint_ids()call in the evaluator - Documenting that rollback detection is intentionally deferred to a different mechanism
Note: This same concern applies to the blue-green strategy.
| max_unavailable = spec.max_unavailable | ||
|
|
||
| # Total pods allowed at peak = desired + max_surge | ||
| max_total = desired + max_surge |
There was a problem hiding this comment.
[LOW] No validation prevents max_surge=0, max_unavailable=0 deadlock
RollingUpdateSpec (in models/deployment_policy/row.py) has no validator preventing both max_surge=0 and max_unavailable=0. This configuration creates a deadlock where the FSM returns PROGRESSING with zero route changes every cycle -- no new routes can be created and no old routes can be terminated.
The test test_surge_0_unavailable_0_deadlock documents this as a known issue. However, this will result in a deployment that appears to be actively deploying but makes zero progress indefinitely.
Suggestion: Add a Pydantic model_validator to RollingUpdateSpec that rejects max_surge=0 and max_unavailable=0 simultaneously:
@model_validator(mode='after')
def validate_progress_possible(self) -> RollingUpdateSpec:
if self.max_surge == 0 and self.max_unavailable == 0:
raise ValueError(
'At least one of max_surge or max_unavailable must be > 0 '
'to allow rolling update progress'
)
return selfAlternatively, if this is intentionally allowed (e.g., as a 'pause' mechanism), add a comment in the spec class explaining the design choice.
|
|
||
| # ── 3. Completed: all old replaced, enough new healthy ── | ||
| if not old_active and len(new_healthy) >= desired: | ||
| log.info( |
There was a problem hiding this comment.
[LOW] PROVISIONING check blocks all progress -- consider batching
When any new route is in PROVISIONING state (step 2), the FSM immediately returns without evaluating whether additional new routes should be created or old routes should be terminated. This means the FSM can only have one 'batch' of PROVISIONING routes at a time.
For example, with max_surge=3 and desired=10:
- Cycle 1: Creates 3 new routes (all PROVISIONING)
- Cycle 2: Waits (PROVISIONING)
- Cycle 3 (after 3 become HEALTHY): Creates 3 more, terminates old
- Cycle 4: Waits again...
This serializes the rolling update into sequential waves rather than allowing overlapping operations. This is a safe design choice (prevents over-provisioning and simplifies state management), but it means the rolling update will take significantly more cycles than theoretically necessary.
This is not a bug -- just a design observation worth documenting. If faster rollouts are desired in the future, the FSM could be modified to allow creating additional routes while some are still provisioning, as long as the total stays within the surge budget.
Security & Performance Review SummaryPR #9567: feat(B-3435): Implement Rolling Update deployment strategy Overall AssessmentThe rolling update FSM implementation is well-structured as a pure function with O(n) route classification and no DB access -- an excellent architectural choice. The test coverage is comprehensive (14 test classes, ~50 test cases) and the surge/unavailable budget calculations are mathematically correct for all tested scenarios. No CRITICAL or HIGH severity issues were found. The code does not introduce any security vulnerabilities, injection risks, or performance regressions. Findings
Detailed AnalysisLogic Correctness (Surge/Unavailable Budget): FSM Stuck States: UNHEALTHY/DEGRADED Classification: Missing Edge Case Coverage in Tests:
Positive Observations
|
dbe2396 to
2ca587d
Compare
resolves #7384 (BA-3435)
Overview
Implements the Rolling Update deployment strategy (BEP-1049) — the evaluator + sub-step handler pattern that gradually replaces old-revision routes with new-revision routes.
Architecture
Cycle-by-Cycle Example (
desired=3, max_surge=1, max_unavailable=1)Completion Flow
Key Types
CycleEvaluationResultstrategy/types.pyRouteChangesstrategy/types.pyEvaluationResultstrategy/types.pyRollingUpdateSpecmodels/deployment_policy.pyDeploymentHandlerKeycoordinator.pyDeploymentLifecycleType | (Type, SubStep)Changed Files
strategy/rolling_update.pytest_rolling_update.pyChecklist: (if applicable)
ai.backend.testdocsdirectory📚 Documentation preview 📚: https://sorna--9567.org.readthedocs.build/en/9567/
📚 Documentation preview 📚: https://sorna-ko--9567.org.readthedocs.build/ko/9567/
📚 Documentation preview 📚: https://sorna--9567.org.readthedocs.build/en/9567/
📚 Documentation preview 📚: https://sorna-ko--9567.org.readthedocs.build/ko/9567/