1 change: 1 addition & 0 deletions changes/9566.feature.md
@@ -0,0 +1 @@
Add the DEPLOYING lifecycle with strategy evaluator framework, sub-step handlers (PROVISIONING, PROGRESSING, ROLLED_BACK), and coordinator integration for BEP-1049.
257 changes: 136 additions & 121 deletions proposals/BEP-1049-deployment-strategy-handler.md

Large diffs are not rendered by default.

58 changes: 27 additions & 31 deletions proposals/BEP-1049/blue-green.md
@@ -77,15 +77,15 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Blue-Green deploym

### Sub-Step Variants

Each cycle evaluation directly returns one of the shared sub-step variants:
Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.

| Sub-Step | Condition | Handler Action |
|----------|-----------|----------------|
| **provisioning** | No Green routes → created all as INACTIVE | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **provisioning** | Green routes are PROVISIONING | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Not all Green healthy (mixed state, no PROVISIONING) | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | All Green healthy, waiting for promotion trigger (manual or delay) | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **completed** | Promotion executed (Green→ACTIVE, Blue→TERMINATING) | DeployingCompletedHandler → DEPLOYING→READY, revision swap |
| **provisioning** | No Green routes → created all as INACTIVE | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
| **provisioning** | Green routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Not all Green healthy (mixed state, no PROVISIONING) | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | All Green healthy, waiting for promotion trigger (manual or delay) | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** (`completed=True`) | Promotion executed (Green→ACTIVE, Blue→TERMINATING) | Coordinator → atomic revision swap + DEPLOYING→READY |
| **rolled_back** | All Green failed → terminate Green | DeployingRolledBackHandler → DEPLOYING→READY, deploying_revision=NULL |
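
As a minimal sketch of how the conditions in the table above could map to a cycle result, the following is an illustration only. The `Route` record, its status strings, the `create_green` field, the `promote_now` flag, and the order of the checks are all assumptions made for this example; only `CycleEvaluationResult`, the sub-step names, `promote_route_ids`, and `drain_route_ids` appear in this proposal, and their real definitions may differ.

```python
from dataclasses import dataclass, field
from enum import Enum


class SubStep(str, Enum):
    PROVISIONING = "provisioning"
    PROGRESSING = "progressing"
    ROLLED_BACK = "rolled_back"


@dataclass(frozen=True)
class Route:
    # Hypothetical, simplified route record (not the real route model).
    route_id: str
    is_green: bool
    status: str  # assumed values: "PROVISIONING", "HEALTHY", "FAILED"


@dataclass
class CycleEvaluationResult:
    sub_step: SubStep
    completed: bool = False
    create_green: int = 0  # assumed field: number of INACTIVE Green routes to create
    promote_route_ids: list = field(default_factory=list)
    drain_route_ids: list = field(default_factory=list)


def blue_green_evaluate(routes, target_count, promote_now):
    """Classify one deployment for one evaluation cycle (illustrative only)."""
    green = [r for r in routes if r.is_green]
    blue = [r for r in routes if not r.is_green]

    if not green:
        # No Green routes yet: request creation of all of them as INACTIVE.
        return CycleEvaluationResult(SubStep.PROVISIONING, create_green=target_count)
    if any(r.status == "PROVISIONING" for r in green):
        return CycleEvaluationResult(SubStep.PROVISIONING)
    if all(r.status == "FAILED" for r in green):
        # Every Green route failed: drain Green and roll back.
        return CycleEvaluationResult(
            SubStep.ROLLED_BACK, drain_route_ids=[r.route_id for r in green]
        )
    if all(r.status == "HEALTHY" for r in green) and promote_now:
        # Promotion: Green becomes ACTIVE, Blue is drained, completion is signaled.
        return CycleEvaluationResult(
            SubStep.PROGRESSING,
            completed=True,
            promote_route_ids=[r.route_id for r in green],
            drain_route_ids=[r.route_id for r in blue],
        )
    # Mixed health, or healthy but still waiting for the promotion trigger.
    return CycleEvaluationResult(SubStep.PROGRESSING)
```

In this sketch the evaluator only returns route changes; applying them and performing the READY transition are left to the coordinator, as described above.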

## promote_delay_seconds Handling
@@ -220,8 +220,8 @@ With `auto_promote=False`:
│ strategy = policy.strategy │
│ 3. Dispatch by strategy: │
│ BLUE_GREEN → blue_green_evaluate(...) │
│ 4. Group by sub_step and return │
│ 5. Apply route changes (scale_out + scale_in) │
│ 4. Aggregate route changes + group by sub_step │
│ Coordinator applies route changes after evaluation │
└──────────────────────────┬───────────────────────────────────┘
@@ -240,39 +240,36 @@ With `auto_promote=False`:
│ │ blue_active: blue + is_active() │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Actions applied: │
│ Route changes returned (applied by coordinator): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ ● Green creation: │ │
│ │ ● Green creation (rollout_specs): │ │
│ │ RouteCreatorSpec( │ │
│ │ revision_id = deploying_revision, │ │
│ │ traffic_status = INACTIVE ← differs from RU │ │
│ │ ) × target_count │ │
│ │ │ │
│ │ ● Promotion (traffic switch): │ │
│ │ Green: RouteBatchUpdaterSpec( │ │
│ │ traffic_status = ACTIVE │ │
│ │ ) │ │
│ │ Blue: RouteBatchUpdaterSpec( │ │
│ │ status = TERMINATING, │ │
│ │ traffic_status = INACTIVE │ │
│ │ ) │ │
│ │ promote_route_ids: Green route IDs │ │
│ │ → traffic_status = ACTIVE │ │
│ │ drain_route_ids: Blue route IDs │ │
│ │ → status = TERMINATING │ │
│ │ │ │
│ │ ● Rollback: │ │
│ │ Green: RouteBatchUpdaterSpec( │ │
│ │ status = TERMINATING │ │
│ │ ) │ │
│ │ drain_route_ids: Green route IDs │ │
│ │ → status = TERMINATING │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Per-Sub-Step Handlers (coordinator generic path) │
│ │
│ PROVISIONING/PROGRESSING → DeployingInProgressHandler │
│ PROVISIONING → DeployingProvisioningHandler │
│ next_status: DEPLOYING → coordinator records history │
│ │
│ COMPLETED → DeployingCompletedHandler │
│ next_status: READY → revision swap + coordinator transit │
│ PROGRESSING → DeployingProgressingHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed=True → coordinator atomic revision swap + READY │
│ │
│ ROLLED_BACK → DeployingRolledBackHandler │
│ next_status: READY → clear dep_rev + coordinator transit │
@@ -287,14 +284,13 @@ When all Green routes become ACTIVE and Blue routes are terminated:
completed determination (evaluator)
DeployingCompletedHandler.execute()
→ complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
Coordinator generic path
→ DEPLOYING → READY history recording + lifecycle transition
Coordinator._transition_completed_deployments()
→ Atomic transaction:
1. complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
2. DEPLOYING → READY lifecycle transition
3. History recording
```
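
The atomic completion step above can be sketched with a single database transaction. This is a hedged illustration using the standard-library `sqlite3` module, not the manager's actual persistence layer: the `endpoints` and `history` tables, their columns, and the function name `transition_completed_deployments` are assumptions made for this example.

```python
import sqlite3


def transition_completed_deployments(conn, endpoint_ids):
    """Swap revisions, move DEPLOYING -> READY, and record history in one transaction."""
    if not endpoint_ids:
        return
    placeholders = ",".join("?" for _ in endpoint_ids)
    with conn:  # commits on success, rolls back if any statement raises
        # Steps 1 and 2: revision swap and lifecycle transition in one UPDATE.
        # SQL evaluates the right-hand sides against the old row values.
        conn.execute(
            "UPDATE endpoints SET current_revision = deploying_revision, "
            "deploying_revision = NULL, lifecycle = 'READY' "
            f"WHERE id IN ({placeholders})",
            endpoint_ids,
        )
        # Step 3: history recording, inside the same transaction.
        conn.executemany(
            "INSERT INTO history (endpoint_id, from_status, to_status) "
            "VALUES (?, 'DEPLOYING', 'READY')",
            [(eid,) for eid in endpoint_ids],
        )


# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE endpoints (id TEXT PRIMARY KEY, lifecycle TEXT, "
    "current_revision TEXT, deploying_revision TEXT)"
)
conn.execute(
    "CREATE TABLE history (endpoint_id TEXT, from_status TEXT, to_status TEXT)"
)
conn.execute("INSERT INTO endpoints VALUES ('ep-1', 'DEPLOYING', 'rev-old', 'rev-new')")

transition_completed_deployments(conn, ["ep-1"])
row = conn.execute(
    "SELECT lifecycle, current_revision, deploying_revision FROM endpoints"
).fetchone()
```

Keeping the swap, the lifecycle transition, and the history row in one transaction means a failure in any step leaves the deployment in DEPLOYING with its `deploying_revision` intact, so the next cycle can retry.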

## Comparison with Rolling Update
44 changes: 21 additions & 23 deletions proposals/BEP-1049/rolling-update.md
@@ -55,13 +55,13 @@ The `DeploymentStrategyEvaluator` periodically evaluates each Rolling Update dep

### Sub-Step Variants

Each cycle evaluation directly returns one of the shared sub-step variants:
Each cycle evaluation directly returns one of the shared sub-step variants. Completion is not a sub-step but a signal on `CycleEvaluationResult(sub_step=PROGRESSING, completed=True)` — the coordinator handles revision swap and READY transition directly.

| Sub-Step | Condition | Handler Action |
|----------|-----------|----------------|
| **provisioning** | New routes are PROVISIONING | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingInProgressHandler → DEPLOYING→DEPLOYING, reschedule |
| **completed** | No Old routes and New healthy >= desired_replicas | DeployingCompletedHandler → DEPLOYING→READY, revision swap |
| **provisioning** | New routes are PROVISIONING | DeployingProvisioningHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** | Calculated surge/unavailable, created/terminated routes | DeployingProgressingHandler → DEPLOYING→DEPLOYING, reschedule |
| **progressing** (`completed=True`) | No Old routes and New healthy >= desired_replicas | Coordinator → atomic revision swap + DEPLOYING→READY |
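
The classification in the table above could be sketched as follows. This is an illustration under stated assumptions, not the evaluator's implementation: `RouteView`, its status strings, the function name `rolling_update_sub_step`, and the order in which the conditions are checked are hypothetical, and the surge/unavailable arithmetic is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class SubStep(str, Enum):
    PROVISIONING = "provisioning"
    PROGRESSING = "progressing"


@dataclass(frozen=True)
class RouteView:
    # Hypothetical, simplified view of a route.
    is_new: bool  # True if the route belongs to the deploying revision
    status: str   # assumed values: "PROVISIONING", "HEALTHY", "TERMINATING"


def rolling_update_sub_step(routes, desired_replicas):
    """Return (sub_step, completed) for one evaluation cycle (illustrative only)."""
    new = [r for r in routes if r.is_new]
    old = [r for r in routes if not r.is_new]

    # New routes still starting up: wait for them before replacing more.
    if any(r.status == "PROVISIONING" for r in new):
        return SubStep.PROVISIONING, False

    # Completion is a flag on PROGRESSING, not a sub-step of its own.
    new_healthy = sum(1 for r in new if r.status == "HEALTHY")
    if not old and new_healthy >= desired_replicas:
        return SubStep.PROGRESSING, True

    # Otherwise keep creating new routes and terminating old ones.
    return SubStep.PROGRESSING, False
```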

## max_surge / max_unavailable Calculation

@@ -190,8 +190,8 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ strategy = policy.strategy │
│ 3. Dispatch by strategy: │
│ ROLLING → rolling_update_evaluate(...) │
│ 4. Group by sub_step and return │
│ 5. Apply route changes (scale_out + scale_in) │
│ 4. Aggregate route changes + group by sub_step │
│ Coordinator applies route changes after evaluation │
└──────────────────────────┬───────────────────────────────────┘
@@ -209,29 +209,28 @@ Example with `desired_replicas = 3`, `max_surge = 1`, `max_unavailable = 1`:
│ │ old_active: old + is_active() │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ Actions applied: │
│ Route changes returned (applied by coordinator): │
│ ┌────────────────────────────────────────────────────┐ │
│ │ scale_out: RouteCreatorSpec( │ │
│ │ rollout_specs: RouteCreatorSpec( │ │
│ │ revision_id = deploying_revision, │ │
│ │ traffic_status = ACTIVE ← differs from BG │ │
│ │ ) │ │
│ │ │ │
│ │ scale_in: RouteBatchUpdaterSpec( │ │
│ │ status = TERMINATING, │ │
│ │ traffic_status = INACTIVE │ │
│ │ ) │ │
│ │ drain_route_ids: old route IDs │ │
│ │ → status = TERMINATING │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Per-Sub-Step Handlers (coordinator generic path) │
│ │
│ PROVISIONING/PROGRESSING → DeployingInProgressHandler │
│ PROVISIONING → DeployingProvisioningHandler │
│ next_status: DEPLOYING → coordinator records history │
│ │
│ COMPLETED → DeployingCompletedHandler │
│ next_status: READY → revision swap + coordinator transit │
│ PROGRESSING → DeployingProgressingHandler │
│ next_status: DEPLOYING → coordinator records history │
│ completed=True → coordinator atomic revision swap + READY │
└──────────────────────────────────────────────────────────────┘
```
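
The "group by sub_step" step in the diagram above might look like the sketch below. The tuple shape of `results`, the function name `group_results`, and the choice to route completed deployments only to the coordinator's completion path (rather than also through the PROGRESSING handler) are assumptions made for this example; the proposal does not spell out that detail here.

```python
from collections import defaultdict

# Handler names taken from the diagram; the mapping itself is illustrative.
HANDLER_BY_SUB_STEP = {
    "provisioning": "DeployingProvisioningHandler",
    "progressing": "DeployingProgressingHandler",
}


def group_results(results):
    """Split evaluation results into per-sub-step groups and completed IDs.

    `results` is an iterable of (deployment_id, sub_step, completed) tuples,
    a hypothetical stand-in for the evaluator's real result objects.
    """
    by_sub_step = defaultdict(list)
    completed_ids = []
    for deployment_id, sub_step, completed in results:
        if completed:
            # Assumed: completion bypasses the generic handler path.
            completed_ids.append(deployment_id)
        else:
            by_sub_step[sub_step].append(deployment_id)
    return dict(by_sub_step), completed_ids


groups, completed = group_results(
    [
        ("d1", "provisioning", False),
        ("d2", "progressing", False),
        ("d3", "progressing", True),
    ]
)
```

Each key in `groups` would then be handed to the handler named in `HANDLER_BY_SUB_STEP`, while `completed` would feed the atomic revision swap.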

@@ -243,12 +242,11 @@ When all Old routes are removed and New routes reach desired_replicas or above a
completed determination (evaluator)
DeployingCompletedHandler.execute()
→ complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
Coordinator generic path
→ DEPLOYING → READY history recording + lifecycle transition
Coordinator._transition_completed_deployments()
→ Atomic transaction:
1. complete_deployment_revision_swap(ids)
current_revision = deploying_revision
deploying_revision = NULL
2. DEPLOYING → READY lifecycle transition
3. History recording
```
14 changes: 14 additions & 0 deletions src/ai/backend/manager/data/deployment/types.py
@@ -157,6 +157,19 @@ class DeploymentSubStatus(enum.StrEnum):
"""


class DeploymentSubStep(DeploymentSubStatus):
    """Sub-steps for the DEPLOYING lifecycle phase.

    - PROVISIONING: New revision routes are being provisioned; waiting for readiness.
    - PROGRESSING: Actively replacing old routes with new routes.
    - ROLLED_BACK: All new routes failed; deployment rolled back to previous revision.
    """

    PROVISIONING = "provisioning"
    PROGRESSING = "progressing"
    ROLLED_BACK = "rolled_back"


@dataclass(frozen=True)
class DeploymentLifecycleStatus:
"""Target lifecycle state for a deployment status transition.
@@ -353,6 +366,7 @@ class DeploymentInfo:
    network: DeploymentNetworkSpec
    model_revisions: list[ModelRevisionSpec]
    current_revision_id: UUID | None = None
    deploying_revision_id: UUID | None = None

    def target_revision(self) -> ModelRevisionSpec | None:
        if self.model_revisions:
1 change: 1 addition & 0 deletions src/ai/backend/manager/defs.py
@@ -110,6 +110,7 @@ class LockID(enum.IntEnum):
    LOCKID_DEPLOYMENT_CHECK_PENDING = 226  # For operations checking PENDING sessions
    LOCKID_DEPLOYMENT_CHECK_REPLICA = 227  # For operations checking REPLICA sessions
    LOCKID_DEPLOYMENT_DESTROYING = 228  # For operations destroying deployments
    LOCKID_DEPLOYMENT_DEPLOYING = 229  # For operations deploying (rolling update) deployments
    # Sokovan target status locks (prevent concurrent operations on same status)
    LOCKID_SOKOVAN_TARGET_PENDING = 230  # For operations targeting PENDING sessions
    LOCKID_SOKOVAN_TARGET_PREPARING = 231  # For operations targeting PREPARING/PULLING sessions
2 changes: 2 additions & 0 deletions src/ai/backend/manager/models/endpoint/row.py
@@ -837,6 +837,7 @@ def _to_deployment_info_from_revision(
                ),
            ],
            current_revision_id=self.current_revision,
            deploying_revision_id=self.deploying_revision,
        )

    def _to_deployment_info_legacy(self) -> DeploymentInfo:
@@ -898,6 +899,7 @@ def _to_deployment_info_legacy(self) -> DeploymentInfo:
                ),
            ],
            current_revision_id=self.current_revision,
            deploying_revision_id=self.deploying_revision,
        )

