feat(BA-3435): Implement Rolling Update deployment strategy #9567

Draft
jopemachine wants to merge 3 commits into BA-4821 from BA-3838_2

Conversation

@jopemachine commented Mar 2, 2026

resolves #7384 (BA-3435)

Overview

Implements the Rolling Update deployment strategy (BEP-1049): an evaluator plus sub-step handler pattern that gradually replaces old-revision routes with new-revision routes.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  Periodic Scheduler (deploying: 5s / 30s)                                   │
│    → DoDeploymentLifecycleEvent(lifecycle_type="deploying")                  │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentCoordinator                                                      │
│                                                                             │
│  process_deployment_lifecycle("deploying")                                  │
│    evaluator = _deployment_evaluators[DEPLOYING]                            │
│    → _process_with_evaluator(DEPLOYING, evaluator)                          │
│                                                                             │
│  _process_with_evaluator:                                                   │
│    1. Acquire distributed lock (LOCKID_DEPLOYMENT_DEPLOYING)                │
│    2. Query all DEPLOYING-state endpoints                                   │
│    3. evaluator.evaluate(deployments) → EvaluationResult                    │
│    4. For each sub-step group:                                              │
│         handler = handlers[(DEPLOYING, sub_step)]                           │
│         handler.execute(group) → _handle_status_transitions()               │
│    5. handler.post_process() for each group                                 │
│    6. _transition_completed_deployments() for completed                     │
│                                                                             │
│  Handler map (DeploymentHandlerKey):                                        │
│    (DEPLOYING, PROVISIONING) → DeployingProvisioningHandler                 │
│    (DEPLOYING, PROGRESSING)  → DeployingProgressingHandler                  │
│    (DEPLOYING, ROLLED_BACK)  → DeployingRolledBackHandler                   │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentStrategyEvaluator              (strategy/evaluator.py)           │
│                                                                             │
│  evaluate(deployments) → EvaluationResult                                   │
│    1. Bulk-load policies:  fetch_deployment_policies_by_endpoint_ids()      │
│       Bulk-load routes:    fetch_active_routes_by_endpoint_ids()            │
│    2. Per deployment → dispatch by policy.strategy:                         │
│         ROLLING → rolling_update_evaluate(deployment, routes, spec)         │
│    3. Collect route mutations from all CycleEvaluationResults               │
│    4. Apply in one batch:  _apply_route_changes(scale_out, scale_in)       │
│         scale_out → Creator[RoutingRow] (new routes)                        │
│         scale_in  → BatchUpdater (TERMINATING + INACTIVE)                   │
│    5. Group deployments by sub-step → EvaluationResult                      │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  rolling_update_evaluate()                (strategy/rolling_update.py)      │
│                                                                             │
│  Pure function: (DeploymentInfo, routes, RollingUpdateSpec) → CycleResult   │
│                                                                             │
│  FSM:                                                                       │
│    1. Classify routes by revision_id:                                       │
│         old_active:       revision != deploying_revision, is_active()       │
│         new_provisioning: revision == deploying_revision, PROVISIONING      │
│         new_healthy:      revision == deploying_revision, HEALTHY           │
│         new_failed:       revision == deploying_revision, FAILED/TERMINATED │
│                                                                             │
│    2. new_provisioning? ───────────────────→ PROVISIONING (wait)            │
│    3. no old + new_healthy >= desired? ────→ PROGRESSING (completed=True)   │
│    4. all new failed? ────────────────────→ ROLLED_BACK                     │
│    5. Compute surge/unavailable budget:                                     │
│         max_total    = desired + max_surge                                  │
│         min_available = desired - max_unavailable                           │
│         to_create    = min(max_total - current, desired - new_live)         │
│         to_terminate = min(healthy - min_available, old_active)             │
│       ─────────────────────────────────────→ PROGRESSING                    │
│                                              + RouteChanges(scale_out,      │
│                                                             scale_in)       │
└─────────────────────────────────────────────────────────────────────────────┘
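The FSM box above can be read as a small pure function. The following is a minimal sketch under stated assumptions, not the PR's actual code: the `Route`, `CycleResult`, and `evaluate_cycle` names are hypothetical, route changes are reduced to plain counts, and `available` is assumed to mean old active routes plus healthy new routes.

```python
from dataclasses import dataclass
from enum import Enum


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    FAILED_TO_START = "failed_to_start"
    TERMINATED = "terminated"


class SubStep(Enum):
    PROVISIONING = "provisioning"
    PROGRESSING = "progressing"
    ROLLED_BACK = "rolled_back"


@dataclass(frozen=True)
class Route:  # hypothetical stand-in for a routing row
    revision: str
    status: RouteStatus


@dataclass(frozen=True)
class CycleResult:  # hypothetical stand-in for CycleEvaluationResult
    sub_step: SubStep
    completed: bool = False
    to_create: int = 0
    to_terminate: int = 0


FAILED = frozenset({RouteStatus.FAILED_TO_START, RouteStatus.TERMINATED})


def evaluate_cycle(routes, deploying_revision, desired, max_surge, max_unavailable):
    # 1. Classify routes by revision and status.
    old_active = [r for r in routes
                  if r.revision != deploying_revision and r.status not in FAILED]
    new = [r for r in routes if r.revision == deploying_revision]
    new_provisioning = [r for r in new if r.status is RouteStatus.PROVISIONING]
    new_healthy = [r for r in new if r.status is RouteStatus.HEALTHY]
    new_failed = [r for r in new if r.status in FAILED]
    new_live = len(new_provisioning) + len(new_healthy)

    # 2. Any new route still provisioning: wait.
    if new_provisioning:
        return CycleResult(SubStep.PROVISIONING)
    # 3. No old routes left and enough healthy new routes: done.
    if not old_active and len(new_healthy) >= desired:
        return CycleResult(SubStep.PROGRESSING, completed=True)
    # 4. Every new route failed: roll back.
    if new_live == 0 and new_failed:
        return CycleResult(SubStep.ROLLED_BACK)
    # 5. Surge / unavailable budget.
    current = len(old_active) + new_live
    available = len(old_active) + len(new_healthy)  # assumption, see lead-in
    max_total = desired + max_surge
    min_available = desired - max_unavailable
    to_create = max(0, min(max_total - current, desired - new_live))
    to_terminate = max(0, min(available - min_available, len(old_active)))
    return CycleResult(SubStep.PROGRESSING, to_create=to_create, to_terminate=to_terminate)
```

Because the function takes only values and returns a value, each branch can be exercised in isolation without a database.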

Cycle-by-Cycle Example (desired=3, max_surge=1, max_unavailable=1)

Cycle 0 (initial)          Cycle 1 (provisioning)     Cycle 2 (1 new healthy)
Old: [■ ■ ■]               Old: [■ ■]                 Old: [■ ■]
New: []                     New: [◇]                   New: [■]
→ create 1, terminate 1    → wait (PROVISIONING)      → create 1, terminate 1
        │                          │                          │
        ▼                          ▼                          ▼
Cycle 3 (provisioning)     Cycle 4 (2 new healthy)    Cycle 5 (provisioning)
Old: [■]                    Old: [■]                   Old: []
New: [■ ◇]                 New: [■ ■]                 New: [■ ■ ◇]
→ wait (PROVISIONING)      → create 1, terminate 1    → wait (PROVISIONING)
                                   │                          │
                                   ▼                          ▼
                            Cycle 6 (completed)
                            Old: []
                            New: [■ ■ ■]
                            → completed! revision swap + DEPLOYING → READY

Legend: ■ = healthy, ◇ = provisioning
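The cycle table above can be reproduced with a small count-based simulation. This is a sketch only: `simulate_rollout` is a hypothetical helper, all old routes are assumed healthy, and every provisioning route is assumed to become healthy before the next evaluation cycle.

```python
def simulate_rollout(desired, max_surge, max_unavailable, max_cycles=50):
    """Return the per-cycle actions of a rolling update, tracked as plain counts."""
    old, new_prov, new_healthy = desired, 0, 0
    log = []
    for _ in range(max_cycles):
        if new_prov > 0:
            # PROVISIONING sub-step: no route changes this cycle.
            log.append("wait")
            # Assumption: provisioning routes turn healthy before the next cycle.
            new_healthy += new_prov
            new_prov = 0
            continue
        if old == 0 and new_healthy >= desired:
            log.append("completed")
            return log
        current = old + new_healthy
        to_create = max(0, min(desired + max_surge - current, desired - new_healthy))
        to_terminate = max(0, min(current - (desired - max_unavailable), old))
        log.append(f"create {to_create}, terminate {to_terminate}")
        old -= to_terminate
        new_prov += to_create
    return log


for cycle, action in enumerate(simulate_rollout(desired=3, max_surge=1, max_unavailable=1)):
    print(f"Cycle {cycle}: {action}")
```

With desired=3, max_surge=1, max_unavailable=1 the simulation alternates "create 1, terminate 1" and "wait" and reports completion at cycle 6, matching the diagram.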

Completion Flow

rolling_update_evaluate() returns completed=True
    │
    ▼
DeploymentStrategyEvaluator collects into EvaluationResult.completed
    │
    ▼
DeploymentCoordinator._transition_completed_deployments()
  → Atomic transaction:
    1. complete_deployment_revision_swap(endpoint_ids)
       current_revision = deploying_revision, deploying_revision = NULL
    2. Lifecycle update: DEPLOYING → READY
    3. History recording
    4. Notification events
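As an illustration of the revision swap and lifecycle update happening in one transaction, here is a minimal sketch using the standard-library `sqlite3` module. The table layout, column names, and `complete_deployments` helper are assumptions made for the example and do not reflect the project's real schema or repository layer. Notification events are left out of the sketch.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE endpoints (
    id TEXT PRIMARY KEY,
    current_revision TEXT,
    deploying_revision TEXT,
    lifecycle TEXT NOT NULL
);
CREATE TABLE history (endpoint_id TEXT NOT NULL, transition TEXT NOT NULL);
"""


def complete_deployments(conn, endpoint_ids):
    """Swap revisions, move DEPLOYING -> READY, and record history atomically."""
    if not endpoint_ids:
        return
    placeholders = ",".join("?" for _ in endpoint_ids)
    with conn:  # commits on success, rolls back if any statement raises
        # Right-hand sides use pre-update values, so the swap is safe in one UPDATE.
        conn.execute(
            "UPDATE endpoints "
            "SET current_revision = deploying_revision, "
            "    deploying_revision = NULL, "
            "    lifecycle = 'READY' "
            f"WHERE id IN ({placeholders})",
            list(endpoint_ids),
        )
        conn.executemany(
            "INSERT INTO history (endpoint_id, transition) VALUES (?, 'DEPLOYING->READY')",
            [(eid,) for eid in endpoint_ids],
        )
```

Keeping the swap, the lifecycle update, and the history row in one transaction means a failure in any step leaves the endpoint in DEPLOYING, so the next cycle can retry.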

Key Types

| Type | Location | Purpose |
| --- | --- | --- |
| CycleEvaluationResult | strategy/types.py | Single deployment's FSM result (sub_step + completed + route_changes) |
| RouteChanges | strategy/types.py | Route mutations: scale_out (Creator) + scale_in (route IDs) |
| EvaluationResult | strategy/types.py | Aggregate: groups by sub-step + completed + skipped + errors |
| RollingUpdateSpec | models/deployment_policy.py | Config: max_surge, max_unavailable |
| DeploymentHandlerKey | coordinator.py | Handler dispatch: DeploymentLifecycleType \| (Type, SubStep) |
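The DeploymentHandlerKey idea, a key that is either a bare lifecycle type or a (type, sub-step) pair, can be sketched as a dictionary lookup. This is an assumption-laden illustration: the enum definitions, the `resolve_handler` helper, and `HypotheticalReadyHandler` are invented for the example, and handlers are represented by their names rather than real objects. Only the three DEPLOYING handler names come from the description above.

```python
from enum import Enum
from typing import Dict, Optional, Tuple, Union


class LifecycleType(Enum):
    DEPLOYING = "deploying"
    READY = "ready"


class SubStep(Enum):
    PROVISIONING = "provisioning"
    PROGRESSING = "progressing"
    ROLLED_BACK = "rolled_back"


# Either a bare lifecycle type or a (lifecycle type, sub-step) pair.
HandlerKey = Union[LifecycleType, Tuple[LifecycleType, SubStep]]

HANDLERS: Dict[HandlerKey, str] = {
    LifecycleType.READY: "HypotheticalReadyHandler",  # hypothetical entry
    (LifecycleType.DEPLOYING, SubStep.PROVISIONING): "DeployingProvisioningHandler",
    (LifecycleType.DEPLOYING, SubStep.PROGRESSING): "DeployingProgressingHandler",
    (LifecycleType.DEPLOYING, SubStep.ROLLED_BACK): "DeployingRolledBackHandler",
}


def resolve_handler(lifecycle: LifecycleType, sub_step: Optional[SubStep] = None) -> str:
    """Look up a handler name by lifecycle type, optionally refined by sub-step."""
    key: HandlerKey = lifecycle if sub_step is None else (lifecycle, sub_step)
    try:
        return HANDLERS[key]
    except KeyError:
        raise LookupError(f"no handler registered for {key!r}") from None
```

Both enum members and tuples of enum members are hashable, so a single dictionary can hold both kinds of key without a nested lookup.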

Changed Files

| File | Change |
| --- | --- |
| strategy/rolling_update.py | Rolling update FSM (pure function, stub → full implementation) |
| test_rolling_update.py | Unit tests: 14 scenarios covering all FSM branches |

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--9567.org.readthedocs.build/en/9567/


📚 Documentation preview 📚: https://sorna-ko--9567.org.readthedocs.build/ko/9567/

@github-actions bot added the size:XL (500~ LoC) and comp:manager (Related to Manager component) labels Mar 2, 2026
@jopemachine changed the base branch from main to BA-4821 March 2, 2026 01:29
@jopemachine added this to the 26.3 milestone Mar 2, 2026
@github-actions bot added the area:docs (Documentations) label Mar 2, 2026

total_new_live = len(new_provisioning) + len(new_healthy)

# ── 2. PROVISIONING: wait for in-flight routes ──

[MEDIUM] UNHEALTHY/DEGRADED routes classified as healthy may cause premature completion

At lines 65-66 of the file, the fallthrough elif r.status.is_active(): new_healthy.append(r) classifies UNHEALTHY and DEGRADED new routes into the new_healthy bucket. This means the completion check at line 80 (len(new_healthy) >= desired) can mark the deployment as completed even when all new routes are UNHEALTHY or DEGRADED.

Scenario:

  • desired=2, 2 new routes both in UNHEALTHY status, 0 old routes
  • new_healthy = [UNHEALTHY_1, UNHEALTHY_2] (via the is_active() fallthrough)
  • len(new_healthy) >= desired evaluates to True, so the deployment is marked completed

This is tested and documented (test_unhealthy_new_counts_as_healthy, test_degraded_new_counts_as_healthy), so it appears intentional. However, it could lead to a deployment being marked as successfully completed when no route is actually serving traffic well.

Suggestion: Consider whether the completion check should require RouteStatus.HEALTHY specifically (not just is_active()), or add a comment explaining why UNHEALTHY/DEGRADED routes are acceptable for completion. The variable name new_healthy is also misleading since it contains non-healthy routes -- renaming to new_active or new_live would improve clarity.
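To make the difference concrete, the sketch below compares the two completion checks on the scenario above. It is illustrative only: the enum, the `ACTIVE` set, and both function names are assumptions, and the fallthrough version approximates the described behaviour (anything active and not PROVISIONING lands in the "healthy" bucket).

```python
from enum import Enum


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED_TO_START = "failed_to_start"


# Statuses treated as active, per the review comment on the upstream query.
ACTIVE = frozenset({
    RouteStatus.PROVISIONING,
    RouteStatus.HEALTHY,
    RouteStatus.UNHEALTHY,
    RouteStatus.DEGRADED,
})


def is_completed_fallthrough(new_statuses, old_active_count, desired):
    """Completion check where every active, non-provisioning route counts."""
    new_healthy = [s for s in new_statuses
                   if s in ACTIVE and s is not RouteStatus.PROVISIONING]
    return old_active_count == 0 and len(new_healthy) >= desired


def is_completed_strict(new_statuses, old_active_count, desired):
    """Completion check that only counts routes that are actually HEALTHY."""
    new_healthy = [s for s in new_statuses if s is RouteStatus.HEALTHY]
    return old_active_count == 0 and len(new_healthy) >= desired


scenario = [RouteStatus.UNHEALTHY, RouteStatus.UNHEALTHY]
print(is_completed_fallthrough(scenario, old_active_count=0, desired=2))  # True
print(is_completed_strict(scenario, old_active_count=0, desired=2))       # False
```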

log.warning(
"deployment {}: all {} new routes failed — rolling back",
deployment.id,
len(new_failed),

[MEDIUM] Rollback path unreachable in production due to upstream query filter

The rollback check at this line (total_new_live == 0 and new_failed) requires routes with FAILED_TO_START or TERMINATED status to be present. However, the evaluator layer calls fetch_active_routes_by_endpoint_ids() (in evaluator.py line 66), which filters to only active_route_statuses() = {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}.

Since FAILED_TO_START and TERMINATED are excluded by the DB query, new_failed will always be empty in production, making this rollback path dead code at the system level.

Impact: When all new routes fail, the evaluator will see zero new routes and zero failed routes. The FSM falls through to step 5 (PROGRESSING), computes total_new_live = 0, still_needed = desired, and creates new routes -- effectively retrying indefinitely instead of rolling back.

Suggestion: This is an integration-level concern (not specific to this PR's pure function). Consider either:

  1. Expanding fetch_active_routes_by_endpoint_ids() to include FAILED_TO_START status for deployment evaluation
  2. Adding a separate fetch_failed_routes_by_endpoint_ids() call in the evaluator
  3. Documenting that rollback detection is intentionally deferred to a different mechanism

Note: This same concern applies to the blue-green strategy.
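The interaction between the query filter and the rollback check can be shown with a reduced model. This is a sketch under assumptions: routes are (revision, status) tuples, `fetch_active` stands in for the filtering done by fetch_active_routes_by_endpoint_ids(), and `decide` keeps only the rollback branch of the FSM, omitting the PROVISIONING and completion steps.

```python
ACTIVE_STATUSES = {"PROVISIONING", "HEALTHY", "UNHEALTHY", "DEGRADED"}
FAILED_STATUSES = {"FAILED_TO_START", "TERMINATED"}


def fetch_active(routes):
    """Stand-in for the upstream query: drops every non-active route."""
    return [(rev, status) for rev, status in routes if status in ACTIVE_STATUSES]


def decide(routes, deploying_revision):
    """Reduced FSM: only the rollback check, everything else is PROGRESSING."""
    new = [status for rev, status in routes if rev == deploying_revision]
    new_live = sum(1 for s in new if s in ACTIVE_STATUSES)
    new_failed = sum(1 for s in new if s in FAILED_STATUSES)
    if new_live == 0 and new_failed:
        return "ROLLED_BACK"
    return "PROGRESSING"


all_routes = [
    ("v1", "HEALTHY"),
    ("v2", "FAILED_TO_START"),
    ("v2", "FAILED_TO_START"),
]
print(decide(all_routes, "v2"))                # ROLLED_BACK
print(decide(fetch_active(all_routes), "v2"))  # PROGRESSING
```

The pure function behaves correctly when it is given the failed routes; the rollback branch only becomes unreachable once the filtered list is what it receives.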

max_unavailable = spec.max_unavailable

# Total pods allowed at peak = desired + max_surge
max_total = desired + max_surge

[LOW] No validation prevents max_surge=0, max_unavailable=0 deadlock

RollingUpdateSpec (in models/deployment_policy/row.py) has no validator preventing both max_surge=0 and max_unavailable=0. This configuration creates a deadlock where the FSM returns PROGRESSING with zero route changes every cycle -- no new routes can be created and no old routes can be terminated.

The test test_surge_0_unavailable_0_deadlock documents this as a known issue. However, this will result in a deployment that appears to be actively deploying but makes zero progress indefinitely.

Suggestion: Add a Pydantic model_validator to RollingUpdateSpec that rejects max_surge=0 and max_unavailable=0 simultaneously:

# Requires: from pydantic import model_validator
@model_validator(mode='after')
def validate_progress_possible(self) -> 'RollingUpdateSpec':
    # Quoted return type: the class name is not yet bound inside its own body.
    if self.max_surge == 0 and self.max_unavailable == 0:
        raise ValueError(
            'At least one of max_surge or max_unavailable must be > 0 '
            'to allow rolling update progress'
        )
    return self

Alternatively, if this is intentionally allowed (e.g., as a 'pause' mechanism), add a comment in the spec class explaining the design choice.
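The deadlock and the proposed guard can be demonstrated without Pydantic. The sketch below swaps in a standard-library dataclass with `__post_init__` in place of `model_validator`; `RollingUpdateSpecSketch` and `step_budget` are hypothetical names, and the budget formula follows the one quoted in the PR description, assuming all counted routes are available.

```python
from dataclasses import dataclass


def step_budget(desired, max_surge, max_unavailable, old_active, new_healthy):
    """Return (to_create, to_terminate) for one cycle with no provisioning routes."""
    current = old_active + new_healthy
    to_create = max(0, min(desired + max_surge - current, desired - new_healthy))
    to_terminate = max(0, min(current - (desired - max_unavailable), old_active))
    return to_create, to_terminate


@dataclass(frozen=True)
class RollingUpdateSpecSketch:  # stdlib stand-in, not the real RollingUpdateSpec
    max_surge: int
    max_unavailable: int

    def __post_init__(self):
        if self.max_surge < 0 or self.max_unavailable < 0:
            raise ValueError("max_surge and max_unavailable must be non-negative")
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError(
                "At least one of max_surge or max_unavailable must be > 0 "
                "to allow rolling update progress"
            )


# With both budgets at zero, nothing can be created or terminated.
print(step_budget(desired=3, max_surge=0, max_unavailable=0, old_active=3, new_healthy=0))
```

Rejecting the configuration at construction time surfaces the problem when the policy is saved, rather than as a deployment that never leaves DEPLOYING.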


# ── 3. Completed: all old replaced, enough new healthy ──
if not old_active and len(new_healthy) >= desired:
    log.info(

[LOW] PROVISIONING check blocks all progress -- consider batching

When any new route is in PROVISIONING state (step 2), the FSM immediately returns without evaluating whether additional new routes should be created or old routes should be terminated. This means the FSM can only have one 'batch' of PROVISIONING routes at a time.

For example, with max_surge=3 and desired=10:

  • Cycle 1: Creates 3 new routes (all PROVISIONING)
  • Cycle 2: Waits (PROVISIONING)
  • Cycle 3 (after 3 become HEALTHY): Creates 3 more, terminates old
  • Cycle 4: Waits again...

This serializes the rolling update into sequential waves rather than allowing overlapping operations. This is a safe design choice (prevents over-provisioning and simplifies state management), but it means the rolling update will take significantly more cycles than theoretically necessary.

This is not a bug -- just a design observation worth documenting. If faster rollouts are desired in the future, the FSM could be modified to allow creating additional routes while some are still provisioning, as long as the total stays within the surge budget.
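A possible shape for the overlapping variant is sketched below, next to the current blocking behaviour. Both functions are hypothetical and work on counts only; the state used in the example (8 old, 1 provisioning, 2 healthy) is an illustrative snapshot, not one taken from the PR's tests.

```python
def _surge_budget(desired, max_surge, old_active, new_provisioning, new_healthy):
    """Routes that may be created without exceeding desired + max_surge."""
    current = old_active + new_provisioning + new_healthy
    new_live = new_provisioning + new_healthy
    return max(0, min(desired + max_surge - current, desired - new_live))


def to_create_blocking(desired, max_surge, old_active, new_provisioning, new_healthy):
    """Current behaviour: any provisioning route pauses all creation."""
    if new_provisioning > 0:
        return 0
    return _surge_budget(desired, max_surge, old_active, new_provisioning, new_healthy)


def to_create_overlapping(desired, max_surge, old_active, new_provisioning, new_healthy):
    """Alternative: keep creating while the total stays within the surge budget."""
    return _surge_budget(desired, max_surge, old_active, new_provisioning, new_healthy)


state = dict(desired=10, max_surge=3, old_active=8, new_provisioning=1, new_healthy=2)
print(to_create_blocking(**state))     # 0
print(to_create_overlapping(**state))  # 2
```

Provisioning routes are counted in the current total, so the overlapping variant still cannot exceed desired + max_surge.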

@jopemachine

Security & Performance Review Summary

PR #9567: feat(BA-3435): Implement Rolling Update deployment strategy

Overall Assessment

The rolling update FSM implementation is well-structured as a pure function with O(n) route classification and no DB access -- an excellent architectural choice. The test coverage is comprehensive (14 test classes, ~50 test cases) and the surge/unavailable budget calculations are mathematically correct for all tested scenarios.

No CRITICAL or HIGH severity issues were found. The code does not introduce any security vulnerabilities, injection risks, or performance regressions.

Findings

| # | Severity | Finding | File | Status |
| --- | --- | --- | --- | --- |
| 1 | MEDIUM | UNHEALTHY/DEGRADED routes classified as "healthy" -- premature completion possible | rolling_update.py:65-66 | Review comment posted |
| 2 | MEDIUM | Rollback path unreachable in production due to fetch_active_routes_by_endpoint_ids() filter | rolling_update.py:92 | Review comment posted |
| 3 | LOW | No validation prevents max_surge=0, max_unavailable=0 deadlock | rolling_update.py:100-105 | Review comment posted |
| 4 | LOW | PROVISIONING check blocks all concurrent progress (design observation) | rolling_update.py:71-77 | Review comment posted |

Detailed Analysis

Logic Correctness (Surge/Unavailable Budget):
The math is verified correct. can_create = max_total - current_total, still_needed = desired - total_new_live, and to_create = max(0, min(can_create, still_needed)) correctly caps creation at both the surge budget and the actual need. The termination calculation can_terminate = available_count - min_available with to_terminate = max(0, min(can_terminate, len(old_active))) correctly ensures old routes are only terminated when enough capacity exists.

FSM Stuck States:
The FSM can get stuck in exactly one scenario: max_surge=0 and max_unavailable=0 (Finding #3). All other configurations allow progress. The PROVISIONING wait state (Finding #4) is not a stuck state but serializes the rollout into waves.

UNHEALTHY/DEGRADED Classification:
Finding #1 is a design decision that should be explicitly documented. The variable name new_healthy is misleading since it includes UNHEALTHY and DEGRADED routes via the is_active() fallthrough.

Missing Edge Case Coverage in Tests:

  • No test for negative max_surge or max_unavailable values (though Pydantic should reject these at the spec level)
  • No test for TERMINATING new routes (currently not classified into any bucket -- they are silently dropped)
  • The deploying_revision_id=None test uses type: ignore which masks a potential runtime issue

Positive Observations

  • Pure function design enables deterministic, highly testable FSM logic
  • Termination priority ordering (UNHEALTHY before DEGRADED before PROVISIONING before HEALTHY) is well-implemented
  • _build_route_creators correctly propagates all deployment metadata
  • The realistic multi-step scenario test (TestRealisticScenario) validates end-to-end correctness across 5 cycles
  • No parent-relative imports, follows project conventions

@jopemachine force-pushed the BA-4821 branch 3 times, most recently from dbe2396 to 2ca587d March 3, 2026 07:23
@jopemachine jopemachine changed the title feat(B-3435): Implement Rolling Update deployment strategy feat(BA-3435): Implement Rolling Update deployment strategy Mar 3, 2026