feat(BA-4821): Add deployment strategy evaluation framework#9566
feat(BA-4821): Add deployment strategy evaluation framework#9566jopemachine wants to merge 23 commits intomainfrom
Conversation
jopemachine
left a comment
There was a problem hiding this comment.
PR #9566 Security & Performance Review
Reviewer: AI Code Reviewer
Risk Level: MEDIUM
Scope: Framework PR for BEP-1049 deployment strategy evaluation infrastructure
Summary
This PR introduces the DEPLOYING lifecycle with a strategy evaluator framework, sub-step handlers (PROVISIONING, PROGRESSING, ROLLED_BACK), and coordinator integration. The architecture is well-structured with clean separation between the evaluator (which decides route mutations) and handlers (which manage lifecycle transitions). However, there are several issues that should be addressed before or shortly after merging.
Findings Overview
| # | Severity | File | Finding |
|---|---|---|---|
| 1 | HIGH | coordinator.py |
Race condition: revision swap and lifecycle transition use separate DB sessions |
| 2 | HIGH | coordinator.py |
Hardcoded "Rolling update" history message applies to all strategies |
| 3 | MEDIUM | evaluator.py / coordinator.py |
Errored deployments silently dropped -- no retry bound or failure transition |
| 4 | MEDIUM | coordinator.py |
Post-process assumes all handler executions succeeded |
| 5 | MEDIUM | coordinator.py |
Skipped deployments (no policy) are never acted upon |
| 6 | LOW | strategy/types.py |
CycleEvaluationResult docstring says "rolling update" but type is strategy-agnostic |
| 7 | LOW | (all) | Zero test coverage for the entire framework |
Detailed review comments follow on specific lines below.
src/ai/backend/manager/repositories/deployment/db_source/db_source.py
Outdated
Show resolved
Hide resolved
Code Review Summary -- PR #9566Deployment Strategy Evaluation Framework (BEP-1049) Overall AssessmentThe architecture is well-designed. The evaluator pattern (bulk-load, per-deployment dispatch, aggregate mutations, batch apply) is clean and scalable. The handler hierarchy with composite keys ( Findings Addressed (8 total)
RecommendationsBefore merge (HIGH):
Before child PRs merge (MEDIUM):
Backlog (LOW):
|
dbe2396 to
2ca587d
Compare
There was a problem hiding this comment.
Pull request overview
Adds the foundational infrastructure for BEP-1049 deployment strategies by introducing a DEPLOYING lifecycle phase, strategy evaluation scaffolding, and coordinator/repository integrations to support future Rolling Update / Blue-Green implementations.
Changes:
- Introduces
DEPLOYINGlifecycle type plusDeploymentSubStepand new sub-step handlers (PROVISIONING / PROGRESSING / ROLLED_BACK). - Adds a strategy evaluator framework that bulk-loads policies/routes, groups deployments by sub-step, and applies batched route mutations.
- Updates activation flow to set
deploying_revisionand transition endpoints intoDEPLOYING, with repository/db support for revision swap and lifecycle transitions.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai/backend/manager/sokovan/deployment/types.py | Adds DEPLOYING lifecycle type and extends execution result to include completed. |
| src/ai/backend/manager/sokovan/deployment/strategy/types.py | Defines evaluation result types for strategy FSM dispatch and aggregation. |
| src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py | Implements bulk policy/route loading, per-deployment dispatch, and batched route mutation application. |
| src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py | Adds rolling update evaluator stub (currently not implemented). |
| src/ai/backend/manager/sokovan/deployment/strategy/blue_green.py | Adds blue-green evaluator stub (currently not implemented). |
| src/ai/backend/manager/sokovan/deployment/strategy/init.py | Declares the strategy evaluation package. |
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Adds DEPLOYING sub-step handlers and scheduling/provisioning triggers. |
| src/ai/backend/manager/sokovan/deployment/handlers/init.py | Exports new DEPLOYING handlers. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Integrates evaluator path, sub-step routing, completed-transition logic, and periodic DEPLOYING tasks. |
| src/ai/backend/manager/services/deployment/service.py | Changes revision activation to start DEPLOYING flow via deploying_revision. |
| src/ai/backend/manager/repositories/deployment/repository.py | Adds deploying-related repository methods (policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev). |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Implements DB operations for deploying start, policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev. |
| src/ai/backend/manager/models/endpoint/row.py | Includes deploying_revision_id in DeploymentInfo conversion. |
| src/ai/backend/manager/defs.py | Adds a distributed lock ID for DEPLOYING processing. |
| src/ai/backend/manager/data/deployment/types.py | Adds DeploymentSubStep and deploying_revision_id to DeploymentInfo. |
| proposals/BEP-1049/rolling-update.md | Updates proposal to match coordinator-driven completion handling. |
| proposals/BEP-1049/blue-green.md | Updates proposal to match coordinator-driven completion handling. |
| proposals/BEP-1049-deployment-strategy-handler.md | Updates design doc for evaluator + sub-step handler pattern and coordinator completion handling. |
| changes/9566.feature.md | Adds release note fragment for the new DEPLOYING framework. |
Comments suppressed due to low confidence (2)
src/ai/backend/manager/sokovan/deployment/coordinator.py:413
pool.build_all_records()is called before any sub-step handlerexecute()calls.RecordPool.build_all_records()caches the built records (_built=True), so any recorder steps/phases added during handler execution will be missing fromrecordsused by_handle_status_transitions(). To preserve accurate sub-step recording, deferbuild_all_records()until after all handler executions (and then do history writing), or avoid caching by using separate recorder scopes per handler group.
with DeploymentRecorderContext.scope(
lifecycle_type.value, entity_ids=deployment_ids
) as pool:
eval_result = await evaluator.evaluate(deployments)
all_records = pool.build_all_records()
# Process each sub-step group with its handler
for sub_step, group in eval_result.groups.items():
handler_key: DeploymentHandlerKey = (lifecycle_type, sub_step)
handler = self._deployment_handlers.get(handler_key)
if handler is None:
log.warning(
"No handler for sub-step {}/{}", lifecycle_type.value, sub_step.value
)
continue
sub_result = await handler.execute(group.deployments)
sub_results[sub_step] = sub_result
await self._handle_status_transitions(handler, sub_result, all_records)
src/ai/backend/manager/services/deployment/service.py:540
- Test coverage:
activate_revision()changed from updatingcurrent_revision+ triggeringCHECK_REPLICAto settingdeploying_revision+ triggeringDEPLOYING. There are already unit tests forDeploymentServiceintests/unit/manager/services/deployment/, but none exercising this method’s new control flow. Adding a unit test here (mockingstart_deploying_revision()andmark_lifecycle_needed(DEPLOYING)) would prevent regressions and make the API behavior change explicit.
async def activate_revision(
self, action: ActivateRevisionAction
) -> ActivateRevisionActionResult:
"""Activate a specific revision by initiating the deployment strategy.
Sets deploying_revision and transitions the deployment to DEPLOYING state.
The coordinator will execute the configured deployment strategy (rolling update,
blue-green, etc.) and swap deploying_revision → current_revision on completion.
Args:
action: Action containing deployment and revision IDs
Returns:
ActivateRevisionActionResult: Result containing the updated deployment
"""
# 1. Validate revision exists (raises exception if not found)
_revision = await self._deployment_repository.get_revision(action.revision_id)
# 2. Set deploying_revision and transition to DEPLOYING lifecycle
previous_revision_id = await self._deployment_repository.start_deploying_revision(
action.deployment_id, action.revision_id
)
# 3. Trigger DEPLOYING lifecycle to start strategy execution
await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history( | ||
| [], BulkCreator(specs=skipped_history_specs) | ||
| ) |
There was a problem hiding this comment.
Same issue as the evaluation-error history block: calling update_endpoint_lifecycle_bulk_with_history([], ...) will not record SKIPPED history because the DB layer returns early when batch_updaters is empty. Use a history-only insertion path or pass a no-op updater so the transaction writes history rows.
| # 2. Set deploying_revision and transition to DEPLOYING lifecycle | ||
| previous_revision_id = await self._deployment_repository.start_deploying_revision( | ||
| action.deployment_id, action.revision_id | ||
| ) | ||
|
|
||
| # 3. Trigger lifecycle check to update routes with new revision | ||
| await self._deployment_controller.mark_lifecycle_needed( | ||
| DeploymentLifecycleType.CHECK_REPLICA | ||
| ) | ||
| # 3. Trigger DEPLOYING lifecycle to start strategy execution | ||
| await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING) |
There was a problem hiding this comment.
activate_revision() now always transitions the endpoint into DEPLOYING and relies on the strategy evaluator to finish the revision swap. In this PR, both rolling_update_evaluate() and blue_green_evaluate() are stubs that raise NotImplementedError, which means any activation will leave the deployment stuck in DEPLOYING (and repeatedly error/retry). To avoid a user-visible regression, consider gating this new path behind a feature flag or keeping the previous immediate activation behavior until at least one strategy is implemented (or provide a temporary “direct swap” strategy).
| await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history( | ||
| [], BulkCreator(specs=error_history_specs) | ||
| ) |
There was a problem hiding this comment.
In the evaluator path, these calls pass an empty batch_updaters list into update_endpoint_lifecycle_bulk_with_history(). The DB implementation returns early when batch_updaters is empty, so no history rows will be recorded for evaluation errors (and the method returns 0). Consider either (a) adding a dedicated repository method for inserting deployment history without lifecycle updates, or (b) supplying a minimal no-op BatchUpdater that keeps the lifecycle at DEPLOYING so history insertion still runs.
There was a problem hiding this comment.
Please check this comment and resolve if it is false
| all_scale_out.extend(changes.scale_out_specs) | ||
| all_scale_in_ids.extend(changes.scale_in_route_ids) | ||
|
|
||
| # Group by sub-step | ||
| if cycle_result.completed: | ||
| result.completed.append(deployment) | ||
| result.completed_strategies[deployment.id] = policy.strategy | ||
| else: | ||
| group = result.groups.setdefault( | ||
| cycle_result.sub_step, | ||
| EvaluationGroup(sub_step=cycle_result.sub_step), | ||
| ) | ||
| group.deployments.append(deployment) |
There was a problem hiding this comment.
Scale out and scale in should operate independently of deploy, right? Is it correct that scale out and scale in are included?
There was a problem hiding this comment.
@HyeockJinKim This logic is used when calling activate_revision to spin up new routes for the new revision and bring down the existing routes.
Could you clarify what exactly you are expecting?
- Do you prefer not to use the terms scale_in and scale_out?
- Do you want this operation to be delegated to the coordinator?
- Or do you feel the overall structure itself is incorrect?
Implement the blue-green FSM in blue_green_evaluate() with 8-step flow: classify routes → create green (INACTIVE) → wait provisioning → rollback if all failed → wait healthy < desired → manual wait → delay check → atomic promotion (green→ACTIVE + blue→TERMINATING). Key changes: - Add promote_route_ids to RouteChanges for green→ACTIVE promotion - Add status_updated_at to RoutingRow/RouteInfo for promote_delay_seconds - Add fetch_routes_by_endpoint_ids (no status filter) for rollback detection - Extend scale_routes with promote_updater parameter - Auto-set status_updated_at in RouteBatchUpdaterSpec on status change - Add alembic migration for status_updated_at column Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 23732f4.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_deployment_polices`
src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py
Outdated
Show resolved
Hide resolved
| await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history( | ||
| [], BulkCreator(specs=error_history_specs) | ||
| ) |
There was a problem hiding this comment.
Please check this comment and resolve if it is false
| await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history( | ||
| [], BulkCreator(specs=skipped_history_specs) | ||
| ) |
| # Post-process outside recorder scope using actual sub_results (Finding 4) | ||
| for sub_step, group in eval_result.groups.items(): | ||
| handler_key = (lifecycle_type, sub_step) | ||
| handler = self._deployment_handlers.get(handler_key) | ||
| if handler is None: | ||
| continue | ||
| try: | ||
| actual_result = sub_results.get( | ||
| sub_step, | ||
| DeploymentExecutionResult(successes=group.deployments), | ||
| ) | ||
| await handler.post_process(actual_result) | ||
| except Exception as e: | ||
| log.error( | ||
| "Error during post-processing for sub-step {}: {}", | ||
| sub_step.value, | ||
| e, | ||
| ) |
There was a problem hiding this comment.
What's the purpose of this post process?
There was a problem hiding this comment.
I think there appears to be an issue with the current process for handling substeps.
(Substeps should be handled in a more generalized way and should not be coupled with the evaluator implementation.)
I will review and refactor it, then follow up with another reply.
| lifecycle_type: DeploymentLifecycleType, | ||
| completed: list[DeploymentInfo], | ||
| strategies: dict[UUID, DeploymentStrategy], | ||
| records: Mapping[UUID, ExecutionRecord], |
There was a problem hiding this comment.
Those are all necessary to determine status?
resolves #9565 (BA-4821)
Checklist: (if applicable)
ai.backend.testdocsdirectory