feat(BA-3435): Implement Rolling Update deployment strategy by jopemachine · Pull Request #9567 · lablup/backend.ai

jopemachine · 2026-03-02T01:28:48Z

Overview

Implements the Rolling Update deployment strategy (BEP-1049) — the evaluator + sub-step handler pattern that gradually replaces old-revision routes with new-revision routes.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│  Periodic Scheduler (deploying: 5s / 30s)                                   │
│    → DoDeploymentLifecycleEvent(lifecycle_type="deploying")                  │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentCoordinator                                                      │
│                                                                             │
│  process_deployment_lifecycle("deploying")                                  │
│    evaluator = _deployment_evaluators[DEPLOYING]                            │
│    → _process_with_evaluator(DEPLOYING, evaluator)                          │
│                                                                             │
│  _process_with_evaluator:                                                   │
│    1. Acquire distributed lock (LOCKID_DEPLOYMENT_DEPLOYING)                │
│    2. Query all DEPLOYING-state endpoints                                   │
│    3. evaluator.evaluate(deployments) → EvaluationResult                    │
│    4. For each sub-step group:                                              │
│         handler = handlers[(DEPLOYING, sub_step)]                           │
│         handler.execute(group) → _handle_status_transitions()               │
│    5. handler.post_process() for each group                                 │
│    6. _transition_completed_deployments() for completed                     │
│                                                                             │
│  Handler map (DeploymentHandlerKey):                                        │
│    (DEPLOYING, PROVISIONING) → DeployingProvisioningHandler                 │
│    (DEPLOYING, PROGRESSING)  → DeployingProgressingHandler                  │
│    (DEPLOYING, ROLLED_BACK)  → DeployingRolledBackHandler                   │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  DeploymentStrategyEvaluator              (strategy/evaluator.py)           │
│                                                                             │
│  evaluate(deployments) → EvaluationResult                                   │
│    1. Bulk-load policies:  fetch_deployment_policies_by_endpoint_ids()      │
│       Bulk-load routes:    fetch_active_routes_by_endpoint_ids()            │
│    2. Per deployment → dispatch by policy.strategy:                         │
│         ROLLING → rolling_update_evaluate(deployment, routes, spec)         │
│    3. Collect route mutations from all CycleEvaluationResults               │
│    4. Apply in one batch:  _apply_route_changes(scale_out, scale_in)       │
│         scale_out → Creator[RoutingRow] (new routes)                        │
│         scale_in  → BatchUpdater (TERMINATING + INACTIVE)                   │
│    5. Group deployments by sub-step → EvaluationResult                      │
└─────────────┬───────────────────────────────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  rolling_update_evaluate()                (strategy/rolling_update.py)      │
│                                                                             │
│  Pure function: (DeploymentInfo, routes, RollingUpdateSpec) → CycleResult   │
│                                                                             │
│  FSM:                                                                       │
│    1. Classify routes by revision_id:                                       │
│         old_active:       revision != deploying_revision, is_active()       │
│         new_provisioning: revision == deploying_revision, PROVISIONING      │
│         new_healthy:      revision == deploying_revision, HEALTHY           │
│         new_failed:       revision == deploying_revision, FAILED/TERMINATED │
│                                                                             │
│    2. new_provisioning? ───────────────────→ PROVISIONING (wait)            │
│    3. no old + new_healthy >= desired? ────→ PROGRESSING (completed=True)   │
│    4. all new failed? ────────────────────→ ROLLED_BACK                     │
│    5. Compute surge/unavailable budget:                                     │
│         max_total    = desired + max_surge                                  │
│         min_available = desired - max_unavailable                           │
│         to_create    = min(max_total - current, desired - new_live)         │
│         to_terminate = min(healthy - min_available, old_active)             │
│       ─────────────────────────────────────→ PROGRESSING                    │
│                                              + RouteChanges(scale_out,      │
│                                                             scale_in)       │
└─────────────────────────────────────────────────────────────────────────────┘

Cycle-by-Cycle Example (`desired=3, max_surge=1, max_unavailable=1`)

Cycle 0 (initial)          Cycle 1 (provisioning)     Cycle 2 (1 new healthy)
Old: [■ ■ ■]               Old: [■ ■]                 Old: [■ ■]
New: []                     New: [◇]                   New: [■]
→ create 1, terminate 1    → wait (PROVISIONING)      → create 1, terminate 1
        │                          │                          │
        ▼                          ▼                          ▼
Cycle 3 (provisioning)     Cycle 4 (2 new healthy)    Cycle 5 (provisioning)
Old: [■]                    Old: [■]                   Old: []
New: [■ ◇]                 New: [■ ■]                 New: [■ ■ ◇]
→ wait (PROVISIONING)      → create 1, terminate 1    → wait (PROVISIONING)
                                   │                          │
                                   ▼                          ▼
                            Cycle 6 (completed)
                            Old: []
                            New: [■ ■ ■]
                            → completed! revision swap + DEPLOYING → READY

Legend: ■ = healthy, ◇ = provisioning
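
The cycle table above can be reproduced with a small count-based simulation. This is only a sketch: `Counts`, `evaluate`, and `simulate` are hypothetical names and do not match the signature of `rolling_update_evaluate()`. It assumes every old route is healthy and that every provisioning route becomes healthy by the next cycle, and it omits the rollback branch (step 4).

```python
from dataclasses import dataclass


@dataclass
class Counts:
    old_active: int        # old-revision routes (assumed healthy)
    new_provisioning: int  # new-revision routes still starting
    new_healthy: int       # new-revision routes already healthy


def evaluate(c: Counts, desired: int, max_surge: int, max_unavailable: int):
    """Return (label, to_create, to_terminate) for a single cycle."""
    if c.new_provisioning:  # step 2: wait for in-flight routes
        return "wait", 0, 0
    if c.old_active == 0 and c.new_healthy >= desired:  # step 3: done
        return "completed", 0, 0
    # step 5: surge / unavailable budget
    current = c.old_active + c.new_healthy
    max_total = desired + max_surge
    min_available = desired - max_unavailable
    to_create = max(0, min(max_total - current, desired - c.new_healthy))
    to_terminate = max(0, min(current - min_available, c.old_active))
    return "progress", to_create, to_terminate


def simulate(desired=3, max_surge=1, max_unavailable=1, max_cycles=50):
    c = Counts(old_active=desired, new_provisioning=0, new_healthy=0)
    trace = []
    for _ in range(max_cycles):  # guard against configurations that never finish
        label, create, terminate = evaluate(c, desired, max_surge, max_unavailable)
        trace.append((c.old_active, c.new_healthy, c.new_provisioning, label))
        if label == "completed":
            break
        if label == "wait":
            # assumption: all provisioning routes turn healthy before the next cycle
            c.new_healthy += c.new_provisioning
            c.new_provisioning = 0
        else:
            c.new_provisioning += create
            c.old_active -= terminate
    return trace


for cycle, row in enumerate(simulate()):
    print(cycle, row)  # (old_active, new_healthy, new_provisioning, label)
```

With the default arguments the trace alternates between "progress" and "wait" and reaches "completed" at cycle 6, as in the table.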

Completion Flow

rolling_update_evaluate() returns completed=True
    │
    ▼
DeploymentStrategyEvaluator collects into EvaluationResult.completed
    │
    ▼
DeploymentCoordinator._transition_completed_deployments()
  → Atomic transaction:
    1. complete_deployment_revision_swap(endpoint_ids)
       current_revision = deploying_revision, deploying_revision = NULL
    2. Lifecycle update: DEPLOYING → READY
    3. History recording
    4. Notification events
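
As a rough illustration of steps 1 to 3 of the transaction above, the following sketch uses the standard-library `sqlite3` module. The table and column names (`endpoints`, `history`, `lifecycle`, `transition`) and this function body are assumptions made for the example, not the actual Backend.AI schema or repository code. Notification events (step 4) are left out.

```python
import sqlite3


def complete_revision_swap(conn: sqlite3.Connection, endpoint_ids: list[str]) -> None:
    """Swap revisions, mark endpoints READY, and record history atomically."""
    if not endpoint_ids:
        return
    marks = ",".join("?" for _ in endpoint_ids)
    with conn:  # commits on success, rolls back if any statement raises
        # Right-hand sides are evaluated against the pre-update row,
        # so current_revision receives the old deploying_revision value.
        conn.execute(
            "UPDATE endpoints "
            "SET current_revision = deploying_revision, "
            "    deploying_revision = NULL, "
            "    lifecycle = 'READY' "
            f"WHERE id IN ({marks})",
            endpoint_ids,
        )
        conn.executemany(
            "INSERT INTO history (endpoint_id, transition) VALUES (?, ?)",
            [(eid, "DEPLOYING->READY") for eid in endpoint_ids],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE endpoints ("
    "id TEXT PRIMARY KEY, current_revision TEXT, "
    "deploying_revision TEXT, lifecycle TEXT)"
)
conn.execute("CREATE TABLE history (endpoint_id TEXT, transition TEXT)")
conn.execute("INSERT INTO endpoints VALUES ('ep-1', 'rev-old', 'rev-new', 'DEPLOYING')")
conn.commit()

complete_revision_swap(conn, ["ep-1"])
row = conn.execute(
    "SELECT current_revision, deploying_revision, lifecycle FROM endpoints"
).fetchone()
print(row)
```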

Key Types

| Type | Location | Purpose |
| --- | --- | --- |
| `CycleEvaluationResult` | `strategy/types.py` | Single deployment's FSM result (sub_step + completed + route_changes) |
| `RouteChanges` | `strategy/types.py` | Route mutations: scale_out (Creator) + scale_in (route IDs) |
| `EvaluationResult` | `strategy/types.py` | Aggregate: groups by sub-step + completed + skipped + errors |
| `RollingUpdateSpec` | `models/deployment_policy.py` | Config: max_surge, max_unavailable |
| `DeploymentHandlerKey` | `coordinator.py` | Handler dispatch: `DeploymentLifecycleType \| (Type, SubStep)` |

Changed Files

| File | Change |
| --- | --- |
| `strategy/rolling_update.py` | Rolling update FSM (pure function, stub → full implementation) |
| `test_rolling_update.py` | Unit tests: 14 scenarios covering all FSM branches |

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

📚 Documentation preview 📚: https://sorna--9567.org.readthedocs.build/en/9567/

📚 Documentation preview 📚: https://sorna-ko--9567.org.readthedocs.build/ko/9567/


jopemachine · 2026-03-02T03:28:16Z

src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py

+
+    total_new_live = len(new_provisioning) + len(new_healthy)
+
+    # ── 2. PROVISIONING: wait for in-flight routes ──


[MEDIUM] UNHEALTHY/DEGRADED routes classified as healthy may cause premature completion

At lines 65-66 of the file, the fallthrough `elif r.status.is_active(): new_healthy.append(r)` classifies UNHEALTHY and DEGRADED new routes into the `new_healthy` bucket. This means the completion check at line 80 (`len(new_healthy) >= desired`) can mark the deployment as completed even when all new routes are UNHEALTHY or DEGRADED.

Scenario:

desired=2, 2 new routes both in UNHEALTHY status, 0 old routes

new_healthy = [UNHEALTHY_1, UNHEALTHY_2] (via the is_active() fallthrough)

`len(new_healthy) >= desired` evaluates to `True`, so the deployment is marked completed

This is tested and documented (test_unhealthy_new_counts_as_healthy, test_degraded_new_counts_as_healthy), so it appears intentional. However, it could lead to a deployment being marked as successfully completed when no route is actually serving traffic well.

Suggestion: Consider whether the completion check should require RouteStatus.HEALTHY specifically (not just is_active()), or add a comment explaining why UNHEALTHY/DEGRADED routes are acceptable for completion. The variable name new_healthy is also misleading since it contains non-healthy routes -- renaming to new_active or new_live would improve clarity.
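
To make the difference concrete, here is a minimal sketch comparing the current `is_active()`-based completion check with a stricter HEALTHY-only check. The `RouteStatus` enum below is a simplified stand-in written for this comment, and `is_completed` with its `strict` flag is a hypothetical helper, not code from the PR. It assumes there are no old routes left.

```python
from enum import Enum


class RouteStatus(Enum):
    PROVISIONING = "provisioning"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DEGRADED = "degraded"
    FAILED_TO_START = "failed_to_start"
    TERMINATED = "terminated"

    def is_active(self) -> bool:
        # Mirrors the active set quoted in the review:
        # {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}
        return self in {
            RouteStatus.PROVISIONING,
            RouteStatus.HEALTHY,
            RouteStatus.UNHEALTHY,
            RouteStatus.DEGRADED,
        }


def is_completed(new_statuses: list[RouteStatus], desired: int, strict: bool) -> bool:
    """Completion check for new-revision routes when no old routes remain."""
    if strict:
        # Proposed alternative: only HEALTHY routes count.
        counted = [s for s in new_statuses if s is RouteStatus.HEALTHY]
    else:
        # Current behavior: any active, non-provisioning route counts.
        counted = [
            s for s in new_statuses
            if s.is_active() and s is not RouteStatus.PROVISIONING
        ]
    return len(counted) >= desired


new_routes = [RouteStatus.UNHEALTHY, RouteStatus.UNHEALTHY]
lenient_result = is_completed(new_routes, desired=2, strict=False)
strict_result = is_completed(new_routes, desired=2, strict=True)
print(lenient_result, strict_result)
```

For the scenario above the lenient check reports completion while the strict check does not.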

jopemachine · 2026-03-02T03:28:26Z

src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py

+        log.warning(
+            "deployment {}: all {} new routes failed — rolling back",
+            deployment.id,
+            len(new_failed),


[MEDIUM] Rollback path unreachable in production due to upstream query filter

The rollback check at this line (total_new_live == 0 and new_failed) requires routes with FAILED_TO_START or TERMINATED status to be present. However, the evaluator layer calls fetch_active_routes_by_endpoint_ids() (in evaluator.py line 66), which filters to only active_route_statuses() = {PROVISIONING, HEALTHY, UNHEALTHY, DEGRADED}.

Since FAILED_TO_START and TERMINATED are excluded by the DB query, new_failed will always be empty in production, making this rollback path dead code at the system level.

Impact: When all new routes fail, the evaluator will see zero new routes and zero failed routes. The FSM falls through to step 5 (PROGRESSING), computes total_new_live = 0, still_needed = desired, and creates new routes -- effectively retrying infinitely instead of rolling back.

Suggestion: This is an integration-level concern (not specific to this PR's pure function). Consider either:

Expanding fetch_active_routes_by_endpoint_ids() to include FAILED_TO_START status for deployment evaluation

Adding a separate fetch_failed_routes_by_endpoint_ids() call in the evaluator

Documenting that rollback detection is intentionally deferred to a different mechanism

Note: This same concern applies to the blue-green strategy.
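
The effect of the upstream filter can be shown with a small sketch. Statuses are plain strings here, and `fetch_active_only` and `should_roll_back` are hypothetical stand-ins for the repository query and the FSM's rollback condition. Only the status sets are taken from the comment above.

```python
ACTIVE_STATUSES = {"PROVISIONING", "HEALTHY", "UNHEALTHY", "DEGRADED"}
FAILED_STATUSES = {"FAILED_TO_START", "TERMINATED"}


def fetch_active_only(statuses: list[str]) -> list[str]:
    """Stand-in for a query that returns only active routes."""
    return [s for s in statuses if s in ACTIVE_STATUSES]


def should_roll_back(new_route_statuses: list[str]) -> bool:
    """Rollback condition: no live new routes and at least one failed one."""
    total_new_live = sum(1 for s in new_route_statuses if s in ACTIVE_STATUSES)
    new_failed = [s for s in new_route_statuses if s in FAILED_STATUSES]
    return total_new_live == 0 and bool(new_failed)


# All three new routes failed to start.
stored = ["FAILED_TO_START", "FAILED_TO_START", "FAILED_TO_START"]
seen_by_fsm = fetch_active_only(stored)

print(should_roll_back(stored))       # evaluated on unfiltered routes
print(should_roll_back(seen_by_fsm))  # evaluated on what the FSM receives
```

On the unfiltered list the rollback condition holds. On the filtered list it cannot, because the failed routes never reach the FSM.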

jopemachine · 2026-03-02T03:28:43Z

src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py

+    max_unavailable = spec.max_unavailable
+
+    # Total pods allowed at peak = desired + max_surge
+    max_total = desired + max_surge


[LOW] No validation prevents max_surge=0, max_unavailable=0 deadlock

RollingUpdateSpec (in models/deployment_policy/row.py) has no validator preventing both max_surge=0 and max_unavailable=0. This configuration creates a deadlock where the FSM returns PROGRESSING with zero route changes every cycle -- no new routes can be created and no old routes can be terminated.

The test test_surge_0_unavailable_0_deadlock documents this as a known issue. However, this will result in a deployment that appears to be actively deploying but makes zero progress indefinitely.

Suggestion: Add a Pydantic model_validator to RollingUpdateSpec that rejects max_surge=0 and max_unavailable=0 simultaneously:

    @model_validator(mode='after')
    def validate_progress_possible(self) -> RollingUpdateSpec:
        if self.max_surge == 0 and self.max_unavailable == 0:
            raise ValueError(
                'At least one of max_surge or max_unavailable must be > 0 '
                'to allow rolling update progress'
            )
        return self

Alternatively, if this is intentionally allowed (e.g., as a 'pause' mechanism), add a comment in the spec class explaining the design choice.

jopemachine · 2026-03-02T03:28:50Z

src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py

+
+    # ── 3. Completed: all old replaced, enough new healthy ──
+    if not old_active and len(new_healthy) >= desired:
+        log.info(


[LOW] PROVISIONING check blocks all progress -- consider batching

When any new route is in PROVISIONING state (step 2), the FSM immediately returns without evaluating whether additional new routes should be created or old routes should be terminated. This means the FSM can only have one 'batch' of PROVISIONING routes at a time.

For example, with max_surge=3 and desired=10:

Cycle 1: Creates 3 new routes (all PROVISIONING)

Cycle 2: Waits (PROVISIONING)

Cycle 3 (after 3 become HEALTHY): Creates 3 more, terminates old

Cycle 4: Waits again...

This serializes the rolling update into sequential waves rather than allowing overlapping operations. This is a safe design choice (prevents over-provisioning and simplifies state management), but it means the rolling update will take significantly more cycles than theoretically necessary.

This is not a bug -- just a design observation worth documenting. If faster rollouts are desired in the future, the FSM could be modified to allow creating additional routes while some are still provisioning, as long as the total stays within the surge budget.

jopemachine · 2026-03-02T03:29:19Z

Security & Performance Review Summary

PR #9567: feat(BA-3435): Implement Rolling Update deployment strategy

Overall Assessment

The rolling update FSM implementation is well-structured as a pure function with O(n) route classification and no DB access -- an excellent architectural choice. The test coverage is comprehensive (14 test classes, ~50 test cases) and the surge/unavailable budget calculations are mathematically correct for all tested scenarios.

No CRITICAL or HIGH severity issues were found. The code does not introduce any security vulnerabilities, injection risks, or performance regressions.

Findings

| # | Severity | Finding | File | Status |
| --- | --- | --- | --- | --- |
| 1 | MEDIUM | UNHEALTHY/DEGRADED routes classified as "healthy" -- premature completion possible | `rolling_update.py:65-66` | Review comment posted |
| 2 | MEDIUM | Rollback path unreachable in production due to `fetch_active_routes_by_endpoint_ids()` filter | `rolling_update.py:92` | Review comment posted |
| 3 | LOW | No validation prevents `max_surge=0, max_unavailable=0` deadlock | `rolling_update.py:100-105` | Review comment posted |
| 4 | LOW | PROVISIONING check blocks all concurrent progress (design observation) | `rolling_update.py:71-77` | Review comment posted |

Detailed Analysis

Logic Correctness (Surge/Unavailable Budget):
The math is verified correct. `can_create = max_total - current_total`, `still_needed = desired - total_new_live`, and `to_create = max(0, min(can_create, still_needed))` together cap creation at both the surge budget and the actual need. The termination calculation `can_terminate = available_count - min_available` with `to_terminate = max(0, min(can_terminate, len(old_active)))` ensures old routes are only terminated when enough capacity exists.

FSM Stuck States:
The FSM can get stuck in exactly one scenario: max_surge=0 and max_unavailable=0 (Finding #3). All other configurations allow progress. The PROVISIONING wait state (Finding #4) is not a stuck state but serializes the rollout into waves.
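
The budget formulas and the stuck configuration can be checked with a short sketch. `compute_budget` is a hypothetical helper written for this summary. It assumes `current_total = old_active + total_new_live` and takes `available_count` as an input, since neither definition is spelled out here.

```python
def compute_budget(
    desired: int,
    max_surge: int,
    max_unavailable: int,
    old_active: int,
    total_new_live: int,
    available_count: int,
) -> tuple[int, int]:
    """Return (to_create, to_terminate) for one evaluation cycle."""
    max_total = desired + max_surge
    min_available = desired - max_unavailable
    current_total = old_active + total_new_live  # assumption, see lead-in

    can_create = max_total - current_total
    still_needed = desired - total_new_live
    to_create = max(0, min(can_create, still_needed))

    can_terminate = available_count - min_available
    to_terminate = max(0, min(can_terminate, old_active))
    return to_create, to_terminate


# Start of a rollout: 3 healthy old routes, no new routes yet.
normal = compute_budget(3, 1, 1, old_active=3, total_new_live=0, available_count=3)
surge_only = compute_budget(3, 1, 0, old_active=3, total_new_live=0, available_count=3)
deadlock = compute_budget(3, 0, 0, old_active=3, total_new_live=0, available_count=3)
print(normal, surge_only, deadlock)
```

With `max_surge=0` and `max_unavailable=0` both values come out as zero, which is the no-progress case from Finding #3.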

UNHEALTHY/DEGRADED Classification:
Finding #1 is a design decision that should be explicitly documented. The variable name new_healthy is misleading since it includes UNHEALTHY and DEGRADED routes via the is_active() fallthrough.

Missing Edge Case Coverage in Tests:

- No test for negative max_surge or max_unavailable values (though Pydantic should reject these at the spec level)
- No test for TERMINATING new routes (currently not classified into any bucket -- they are silently dropped)
- The deploying_revision_id=None test uses `type: ignore`, which masks a potential runtime issue

Positive Observations

- Pure function design enables deterministic, highly testable FSM logic
- Termination priority ordering (UNHEALTHY before DEGRADED before PROVISIONING before HEALTHY) is well-implemented
- _build_route_creators correctly propagates all deployment metadata
- The realistic multi-step scenario test (TestRealisticScenario) validates end-to-end correctness across 5 cycles
- No parent-relative imports, follows project conventions

github-actions bot assigned jopemachine Mar 2, 2026

github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component labels Mar 2, 2026

jopemachine changed the base branch from main to BA-4821 March 2, 2026 01:29

jopemachine added this to the 26.3 milestone Mar 2, 2026

github-actions bot added the area:docs Documentations label Mar 2, 2026

jopemachine commented Mar 2, 2026


jopemachine mentioned this pull request Mar 2, 2026

feat(BA-4821): Add deployment strategy evaluation framework #9566

Open

7 tasks

jopemachine force-pushed the BA-3838_2 branch from a676b92 to 887b86e Compare March 2, 2026 07:23

jopemachine force-pushed the BA-4821 branch 3 times, most recently from dbe2396 to 2ca587d Compare March 3, 2026 07:23

jopemachine changed the title ~~feat(B-3435): Implement Rolling Update deployment strategy~~ feat(BA-3435): Implement Rolling Update deployment strategy Mar 3, 2026

jopemachine force-pushed the BA-3838_2 branch from c2d4eca to 352cb38 Compare March 4, 2026 02:20

jopemachine force-pushed the BA-4821 branch from cb54845 to 19fe5c6 Compare March 4, 2026 02:48

jopemachine force-pushed the BA-3838_2 branch from a258a19 to ba17a45 Compare March 4, 2026 04:20

jopemachine added 3 commits March 4, 2026 09:52

wip

c0b3738

docs: Add news fragment

4b40475

wip

5b56798

jopemachine force-pushed the BA-3838_2 branch from ba17a45 to 5b56798 Compare March 4, 2026 10:30

jopemachine wants to merge 3 commits into BA-4821 from BA-3838_2
