
feat(BA-4821): Add deployment strategy evaluation framework #9566

Open
jopemachine wants to merge 23 commits into main from BA-4821

Conversation

Member

@jopemachine jopemachine commented Mar 2, 2026

resolves #9565 (BA-4821)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@github-actions github-actions bot added the size:XL (500~ LoC) and comp:manager (Related to Manager component) labels Mar 2, 2026
@jopemachine jopemachine added this to the 26.3 milestone Mar 2, 2026
Member Author

@jopemachine jopemachine left a comment

PR #9566 Security & Performance Review

Reviewer: AI Code Reviewer
Risk Level: MEDIUM
Scope: Framework PR for BEP-1049 deployment strategy evaluation infrastructure


Summary

This PR introduces the DEPLOYING lifecycle with a strategy evaluator framework, sub-step handlers (PROVISIONING, PROGRESSING, ROLLED_BACK), and coordinator integration. The architecture is well-structured with clean separation between the evaluator (which decides route mutations) and handlers (which manage lifecycle transitions). However, there are several issues that should be addressed before or shortly after merging.

Findings Overview

| # | Severity | File | Finding |
|---|----------|------|---------|
| 1 | HIGH | coordinator.py | Race condition: revision swap and lifecycle transition use separate DB sessions |
| 2 | HIGH | coordinator.py | Hardcoded "Rolling update" history message applies to all strategies |
| 3 | MEDIUM | evaluator.py / coordinator.py | Errored deployments silently dropped -- no retry bound or failure transition |
| 4 | MEDIUM | coordinator.py | Post-process assumes all handler executions succeeded |
| 5 | MEDIUM | coordinator.py | Skipped deployments (no policy) are never acted upon |
| 6 | LOW | strategy/types.py | CycleEvaluationResult docstring says "rolling update" but type is strategy-agnostic |
| 7 | LOW | (all) | Zero test coverage for the entire framework |

Detailed review comments follow on specific lines below.

Member Author

Code Review Summary -- PR #9566

Deployment Strategy Evaluation Framework (BEP-1049)


Overall Assessment

The architecture is well-designed. The evaluator pattern (bulk-load, per-deployment dispatch, aggregate mutations, batch apply) is clean and scalable. The handler hierarchy with composite keys (lifecycle_type, sub_step) is a good extension of the existing handler pattern. The separation between route mutation (evaluator) and lifecycle management (handlers) is well-considered.
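
The evaluator pattern described above (bulk-load, per-deployment dispatch, aggregate mutations, batch apply) can be sketched roughly as follows. This is an illustration only: `CycleResult`, `EvaluationResult`, and `evaluate` are simplified, hypothetical stand-ins and not the types defined in strategy/types.py, and deployments are represented by plain string IDs.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CycleResult:
    """Hypothetical per-deployment result of one strategy evaluation cycle."""
    sub_step: str
    completed: bool = False
    scale_out: list[str] = field(default_factory=list)
    scale_in: list[str] = field(default_factory=list)


@dataclass
class EvaluationResult:
    """Hypothetical aggregate over all deployments evaluated in one cycle."""
    groups: dict[str, list[str]] = field(default_factory=dict)
    completed: list[str] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)
    errors: dict[str, str] = field(default_factory=dict)
    scale_out: list[str] = field(default_factory=list)
    scale_in: list[str] = field(default_factory=list)


def evaluate(
    deployment_ids: list[str],
    policies: dict[str, Optional[str]],
    strategies: dict[str, Callable[[str], CycleResult]],
) -> EvaluationResult:
    # `policies` is assumed to have been bulk-loaded once, before this loop.
    result = EvaluationResult()
    for dep_id in deployment_ids:
        strategy_name = policies.get(dep_id)
        if strategy_name is None:
            result.skipped.append(dep_id)  # no policy: nothing to dispatch to
            continue
        try:
            cycle = strategies[strategy_name](dep_id)  # per-deployment dispatch
        except Exception as e:
            result.errors[dep_id] = repr(e)
            continue
        # Aggregate route mutations so they can be applied in a single batch.
        result.scale_out.extend(cycle.scale_out)
        result.scale_in.extend(cycle.scale_in)
        if cycle.completed:
            result.completed.append(dep_id)
        else:
            result.groups.setdefault(cycle.sub_step, []).append(dep_id)
    return result
```

The caller would then apply `scale_out` and `scale_in` in one batch and hand each entry in `groups` to the handler registered for that sub-step.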

Findings Addressed (8 total)

| # | Severity | Location | Finding |
|---|----------|----------|---------|
| 1 | HIGH | coordinator.py:433 | Race condition: complete_deployment_revision_swap and update_endpoint_lifecycle_bulk_with_history use separate DB sessions. A crash between them leaves deployments in an inconsistent state. |
| 2 | HIGH | coordinator.py:458 | Hardcoded "Rolling update completed successfully" message is used for all strategies, including blue-green. |
| 3 | MEDIUM | evaluator.py:85 | eval_result.errors is populated but never acted upon by the coordinator. Errored deployments remain stuck in DEPLOYING with infinite retry. |
| 4 | MEDIUM | coordinator.py:406 | Post-process unconditionally treats all deployments as successes, discarding the actual sub_result from handler.execute(). |
| 5 | MEDIUM | evaluator.py:77 | Skipped deployments (no policy) are never acted upon by the coordinator. They remain stuck in DEPLOYING forever. |
| 6 | LOW | strategy/types.py:25 | CycleEvaluationResult docstring says "rolling update" but the type is strategy-agnostic. |
| 7 | LOW | (all new files) | Zero test coverage across ~750 lines of new framework code. |
| 8 | MEDIUM | db_source.py:2299 | complete_deployment_revision_swap can set current_revision to NULL if deploying_revision is already NULL (double-call safety). |

Recommendations

Before merge (HIGH):

  • Findings 1 + 8: At minimum, add a WHERE guard (deploying_revision IS NOT NULL) to complete_deployment_revision_swap. Ideally, combine the swap and the lifecycle transition into a single transaction.
  • Finding 2: Parameterize the history message with the actual strategy name.
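
A minimal sketch of the guarded, single-transaction swap recommended for Findings 1 and 8, using the standard-library `sqlite3` module purely for illustration. The `endpoints` table, its columns, the `READY` lifecycle value, and the function name `complete_swap_and_transition` are assumptions; the actual change would live in db_source.py and use the project's own database layer.

```python
import sqlite3


def complete_swap_and_transition(
    conn: sqlite3.Connection, endpoint_id: str, next_lifecycle: str
) -> bool:
    """Swap deploying_revision into current_revision and update the lifecycle.

    Both changes happen in one UPDATE inside one transaction, so a crash
    cannot leave the row half-updated. The IS NOT NULL guard makes a second
    call a no-op instead of overwriting current_revision with NULL.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            """
            UPDATE endpoints
               SET current_revision = deploying_revision,
                   deploying_revision = NULL,
                   lifecycle = ?
             WHERE id = ?
               AND deploying_revision IS NOT NULL
            """,
            (next_lifecycle, endpoint_id),
        )
        return cur.rowcount == 1
```

Returning whether a row was actually updated lets the caller distinguish a real completion from a repeated call and skip writing a duplicate history entry.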

Before child PRs merge (MEDIUM):

  • Findings 3 + 5: Add coordinator handling for eval_result.errors and eval_result.skipped (at minimum, log them; ideally, add bounded retry + error transition).
  • Finding 4: Capture and pass through the actual sub_result to post_process.
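
One way the bounded retry suggested for Findings 3 and 5 might look, as a sketch only. `RetryTracker`, `triage`, and the bound of 3 are hypothetical names and values; the counter here is kept in memory, whereas a real coordinator would probably need to persist it so that the bound survives a restart.

```python
from dataclasses import dataclass, field

MAX_EVAL_RETRIES = 3  # hypothetical bound, not taken from the PR


@dataclass
class RetryTracker:
    """Counts consecutive evaluation failures per deployment ID."""
    max_retries: int = MAX_EVAL_RETRIES
    _counts: dict[str, int] = field(default_factory=dict)

    def record_failure(self, deployment_id: str) -> bool:
        """Record one failure; return True once the retry bound is reached."""
        count = self._counts.get(deployment_id, 0) + 1
        self._counts[deployment_id] = count
        return count >= self.max_retries

    def record_success(self, deployment_id: str) -> None:
        """Reset the counter after a successful evaluation."""
        self._counts.pop(deployment_id, None)


def triage(
    errors: dict[str, str], tracker: RetryTracker
) -> tuple[list[str], list[str]]:
    """Split errored deployments into (retry next cycle, move to an error state)."""
    retry: list[str] = []
    give_up: list[str] = []
    for dep_id in errors:
        if tracker.record_failure(dep_id):
            give_up.append(dep_id)
        else:
            retry.append(dep_id)
    return retry, give_up
```

The coordinator could log the `retry` list and apply a failure transition (for example, the rollback path) to the `give_up` list.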

Backlog (LOW):

  • Finding 6: Fix docstring.
  • Finding 7: Add unit tests for the evaluator, coordinator integration, and handler classes.

@jopemachine jopemachine force-pushed the BA-4821 branch 4 times, most recently from dbe2396 to 2ca587d March 3, 2026 07:23
@jopemachine jopemachine marked this pull request as ready for review March 3, 2026 07:24
@jopemachine jopemachine requested review from a team, HyeockJinKim and Copilot March 3, 2026 07:24
Contributor

Copilot AI left a comment

Pull request overview

Adds the foundational infrastructure for BEP-1049 deployment strategies by introducing a DEPLOYING lifecycle phase, strategy evaluation scaffolding, and coordinator/repository integrations to support future Rolling Update / Blue-Green implementations.

Changes:

  • Introduces DEPLOYING lifecycle type plus DeploymentSubStep and new sub-step handlers (PROVISIONING / PROGRESSING / ROLLED_BACK).
  • Adds a strategy evaluator framework that bulk-loads policies/routes, groups deployments by sub-step, and applies batched route mutations.
  • Updates activation flow to set deploying_revision and transition endpoints into DEPLOYING, with repository/db support for revision swap and lifecycle transitions.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| src/ai/backend/manager/sokovan/deployment/types.py | Adds DEPLOYING lifecycle type and extends execution result to include completed. |
| src/ai/backend/manager/sokovan/deployment/strategy/types.py | Defines evaluation result types for strategy FSM dispatch and aggregation. |
| src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py | Implements bulk policy/route loading, per-deployment dispatch, and batched route mutation application. |
| src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py | Adds rolling update evaluator stub (currently not implemented). |
| src/ai/backend/manager/sokovan/deployment/strategy/blue_green.py | Adds blue-green evaluator stub (currently not implemented). |
| src/ai/backend/manager/sokovan/deployment/strategy/__init__.py | Declares the strategy evaluation package. |
| src/ai/backend/manager/sokovan/deployment/handlers/deploying.py | Adds DEPLOYING sub-step handlers and scheduling/provisioning triggers. |
| src/ai/backend/manager/sokovan/deployment/handlers/__init__.py | Exports new DEPLOYING handlers. |
| src/ai/backend/manager/sokovan/deployment/coordinator.py | Integrates evaluator path, sub-step routing, completed-transition logic, and periodic DEPLOYING tasks. |
| src/ai/backend/manager/services/deployment/service.py | Changes revision activation to start DEPLOYING flow via deploying_revision. |
| src/ai/backend/manager/repositories/deployment/repository.py | Adds deploying-related repository methods (policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev). |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Implements DB operations for deploying start, policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev. |
| src/ai/backend/manager/models/endpoint/row.py | Includes deploying_revision_id in DeploymentInfo conversion. |
| src/ai/backend/manager/defs.py | Adds a distributed lock ID for DEPLOYING processing. |
| src/ai/backend/manager/data/deployment/types.py | Adds DeploymentSubStep and deploying_revision_id to DeploymentInfo. |
| proposals/BEP-1049/rolling-update.md | Updates proposal to match coordinator-driven completion handling. |
| proposals/BEP-1049/blue-green.md | Updates proposal to match coordinator-driven completion handling. |
| proposals/BEP-1049-deployment-strategy-handler.md | Updates design doc for evaluator + sub-step handler pattern and coordinator completion handling. |
| changes/9566.feature.md | Adds release note fragment for the new DEPLOYING framework. |
Comments suppressed due to low confidence (2)

src/ai/backend/manager/sokovan/deployment/coordinator.py:413

  • pool.build_all_records() is called before any sub-step handler execute() calls. RecordPool.build_all_records() caches the built records (_built=True), so any recorder steps/phases added during handler execution will be missing from records used by _handle_status_transitions(). To preserve accurate sub-step recording, defer build_all_records() until after all handler executions (and then do history writing), or avoid caching by using separate recorder scopes per handler group.
            with DeploymentRecorderContext.scope(
                lifecycle_type.value, entity_ids=deployment_ids
            ) as pool:
                eval_result = await evaluator.evaluate(deployments)
                all_records = pool.build_all_records()

                # Process each sub-step group with its handler
                for sub_step, group in eval_result.groups.items():
                    handler_key: DeploymentHandlerKey = (lifecycle_type, sub_step)
                    handler = self._deployment_handlers.get(handler_key)
                    if handler is None:
                        log.warning(
                            "No handler for sub-step {}/{}", lifecycle_type.value, sub_step.value
                        )
                        continue

                    sub_result = await handler.execute(group.deployments)
                    sub_results[sub_step] = sub_result
                    await self._handle_status_transitions(handler, sub_result, all_records)
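
The caching effect described in this comment can be reproduced with a toy model. `ToyRecordPool` and `run_cycle` below are hypothetical stand-ins written for illustration; they only mimic the "build once, then return the cached result" behaviour attributed to RecordPool.build_all_records() and are not the project's classes.

```python
from typing import Callable, Optional


class ToyRecordPool:
    """Stand-in that caches built records, like the reported _built flag."""

    def __init__(self) -> None:
        self._steps: dict[str, list[str]] = {}
        self._built: Optional[dict[str, list[str]]] = None

    def add_step(self, entity_id: str, step: str) -> None:
        self._steps.setdefault(entity_id, []).append(step)

    def build_all_records(self) -> dict[str, list[str]]:
        if self._built is None:  # first call snapshots the steps recorded so far
            self._built = {k: list(v) for k, v in self._steps.items()}
        return self._built


def run_cycle(
    pool: ToyRecordPool,
    handlers: list[Callable[[ToyRecordPool], None]],
    build_early: bool,
) -> dict[str, list[str]]:
    pool.add_step("dep-1", "evaluate")
    # Building here freezes the records before any handler has run.
    records = pool.build_all_records() if build_early else None
    for handler in handlers:
        handler(pool)  # handlers record further steps
    if records is None:
        records = pool.build_all_records()  # deferred build sees every step
    return records
```

With `build_early=True` the returned records contain only the `evaluate` step; deferring the build until after the handler loop captures the handler steps as well.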

src/ai/backend/manager/services/deployment/service.py:540

  • Test coverage: activate_revision() changed from updating current_revision + triggering CHECK_REPLICA to setting deploying_revision + triggering DEPLOYING. There are already unit tests for DeploymentService in tests/unit/manager/services/deployment/, but none exercising this method’s new control flow. Adding a unit test here (mocking start_deploying_revision() and mark_lifecycle_needed(DEPLOYING)) would prevent regressions and make the API behavior change explicit.
    async def activate_revision(
        self, action: ActivateRevisionAction
    ) -> ActivateRevisionActionResult:
        """Activate a specific revision by initiating the deployment strategy.

        Sets deploying_revision and transitions the deployment to DEPLOYING state.
        The coordinator will execute the configured deployment strategy (rolling update,
        blue-green, etc.) and swap deploying_revision → current_revision on completion.

        Args:
            action: Action containing deployment and revision IDs

        Returns:
            ActivateRevisionActionResult: Result containing the updated deployment
        """
        # 1. Validate revision exists (raises exception if not found)
        _revision = await self._deployment_repository.get_revision(action.revision_id)

        # 2. Set deploying_revision and transition to DEPLOYING lifecycle
        previous_revision_id = await self._deployment_repository.start_deploying_revision(
            action.deployment_id, action.revision_id
        )

        # 3. Trigger DEPLOYING lifecycle to start strategy execution
        await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING)
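
A sketch of the kind of unit test suggested above, using `unittest.mock.AsyncMock`. Since the real DeploymentService cannot be imported here, `ServiceUnderTest` is a hypothetical stand-in that mirrors only the three calls quoted in the snippet, and the lifecycle type is represented by the plain string "DEPLOYING". A real test would construct the actual service with mocked dependencies.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock


class ServiceUnderTest:
    """Stand-in mirroring the control flow of the quoted activate_revision()."""

    def __init__(self, repo, controller) -> None:
        self._deployment_repository = repo
        self._deployment_controller = controller

    async def activate_revision(self, action):
        # 1. Validate that the revision exists.
        await self._deployment_repository.get_revision(action.revision_id)
        # 2. Set deploying_revision and move to DEPLOYING.
        previous = await self._deployment_repository.start_deploying_revision(
            action.deployment_id, action.revision_id
        )
        # 3. Trigger the DEPLOYING lifecycle.
        await self._deployment_controller.mark_lifecycle_needed("DEPLOYING")
        return previous


def check_activate_revision_flow() -> str:
    repo = AsyncMock()
    repo.start_deploying_revision.return_value = "rev-old"
    controller = AsyncMock()
    service = ServiceUnderTest(repo, controller)
    action = SimpleNamespace(deployment_id="dep-1", revision_id="rev-new")

    previous = asyncio.run(service.activate_revision(action))

    # The new flow sets deploying_revision and triggers DEPLOYING exactly once.
    repo.get_revision.assert_awaited_once_with("rev-new")
    repo.start_deploying_revision.assert_awaited_once_with("dep-1", "rev-new")
    controller.mark_lifecycle_needed.assert_awaited_once_with("DEPLOYING")
    return previous
```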


Comment on lines +449 to +451
    await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
        [], BulkCreator(specs=skipped_history_specs)
    )
Copilot AI Mar 3, 2026

Same issue as the evaluation-error history block: calling update_endpoint_lifecycle_bulk_with_history([], ...) will not record SKIPPED history because the DB layer returns early when batch_updaters is empty. Use a history-only insertion path or pass a no-op updater so the transaction writes history rows.

Member

Same here

Comment on lines +534 to +540
    # 2. Set deploying_revision and transition to DEPLOYING lifecycle
    previous_revision_id = await self._deployment_repository.start_deploying_revision(
        action.deployment_id, action.revision_id
    )

    # 3. Trigger lifecycle check to update routes with new revision
    await self._deployment_controller.mark_lifecycle_needed(
        DeploymentLifecycleType.CHECK_REPLICA
    )
    # 3. Trigger DEPLOYING lifecycle to start strategy execution
    await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING)
Copilot AI Mar 3, 2026

activate_revision() now always transitions the endpoint into DEPLOYING and relies on the strategy evaluator to finish the revision swap. In this PR, both rolling_update_evaluate() and blue_green_evaluate() are stubs that raise NotImplementedError, which means any activation will leave the deployment stuck in DEPLOYING (and repeatedly error/retry). To avoid a user-visible regression, consider gating this new path behind a feature flag or keeping the previous immediate activation behavior until at least one strategy is implemented (or provide a temporary “direct swap” strategy).
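
A rough sketch of the gating idea raised here: fall back to an immediate swap when the new path is disabled or the selected strategy is still a stub. All names (`direct_swap`, `evaluate_with_fallback`, the `flag_enabled` parameter, the returned strings) are hypothetical; the stub merely imitates an evaluator that raises NotImplementedError.

```python
from typing import Callable


def rolling_update_stub(deployment_id: str) -> str:
    """Imitates an unimplemented strategy evaluator."""
    raise NotImplementedError("rolling update is not implemented yet")


def direct_swap(deployment_id: str) -> str:
    """Temporary strategy: treat the deployment as completed immediately."""
    return "completed"


def evaluate_with_fallback(
    deployment_id: str,
    strategy_fn: Callable[[str], str],
    flag_enabled: bool,
) -> str:
    # With the flag off, keep the previous immediate-activation behaviour.
    if not flag_enabled:
        return direct_swap(deployment_id)
    try:
        return strategy_fn(deployment_id)
    except NotImplementedError:
        # A stubbed strategy must not leave the deployment stuck in DEPLOYING.
        return direct_swap(deployment_id)
```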

Comment on lines +429 to +431
    await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
        [], BulkCreator(specs=error_history_specs)
    )
Copilot AI Mar 3, 2026

In the evaluator path, these calls pass an empty batch_updaters list into update_endpoint_lifecycle_bulk_with_history(). The DB implementation returns early when batch_updaters is empty, so no history rows will be recorded for evaluation errors (and the method returns 0). Consider either (a) adding a dedicated repository method for inserting deployment history without lifecycle updates, or (b) supplying a minimal no-op BatchUpdater that keeps the lifecycle at DEPLOYING so history insertion still runs.
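
The reported behaviour and option (a) can be illustrated with a toy in-memory store. Both functions below are hypothetical: `update_lifecycle_with_history` only imitates the early return described in this comment, and `record_history_only` stands in for a dedicated history-insertion method that does not exist in the PR.

```python
def update_lifecycle_with_history(
    store: dict, updaters: list[tuple[str, str]], history_specs: list[str]
) -> int:
    """Imitates the reported behaviour: nothing is written without updaters."""
    if not updaters:
        return 0  # early return: history_specs are silently dropped
    for endpoint_id, lifecycle in updaters:
        store["lifecycle"][endpoint_id] = lifecycle
    store["history"].extend(history_specs)
    return len(updaters)


def record_history_only(store: dict, history_specs: list[str]) -> int:
    """Option (a): insert history rows without touching lifecycle state."""
    store["history"].extend(history_specs)
    return len(history_specs)
```

Calling the first function with an empty updater list leaves `store["history"]` unchanged, which is the data loss the comment warns about; the second function writes the rows regardless.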

Member

Please check this comment and resolve it if it is a false positive.

Comment on lines +90 to +102
    all_scale_out.extend(changes.scale_out_specs)
    all_scale_in_ids.extend(changes.scale_in_route_ids)

    # Group by sub-step
    if cycle_result.completed:
        result.completed.append(deployment)
        result.completed_strategies[deployment.id] = policy.strategy
    else:
        group = result.groups.setdefault(
            cycle_result.sub_step,
            EvaluationGroup(sub_step=cycle_result.sub_step),
        )
        group.deployments.append(deployment)
Collaborator

Scale-out and scale-in should operate independently of deployment, right? Is it correct to include them here?

Member Author

@jopemachine jopemachine Mar 3, 2026

@HyeockJinKim This logic is used when calling activate_revision to spin up new routes for the new revision and bring down the existing routes.

Could you clarify what exactly you are expecting?

  • Do you prefer not to use the terms scale_in and scale_out?
  • Do you want this operation to be delegated to the coordinator?
  • Or do you feel the overall structure itself is incorrect?

@jopemachine jopemachine requested a review from HyeockJinKim March 3, 2026 07:47
jopemachine and others added 13 commits March 4, 2026 02:48
Implement the blue-green FSM in blue_green_evaluate() with 8-step flow:
classify routes → create green (INACTIVE) → wait provisioning → rollback
if all failed → wait healthy < desired → manual wait → delay check →
atomic promotion (green→ACTIVE + blue→TERMINATING).

Key changes:
- Add promote_route_ids to RouteChanges for green→ACTIVE promotion
- Add status_updated_at to RoutingRow/RouteInfo for promote_delay_seconds
- Add fetch_routes_by_endpoint_ids (no status filter) for rollback detection
- Extend scale_routes with promote_updater parameter
- Auto-set status_updated_at in RouteBatchUpdaterSpec on status change
- Add alembic migration for status_updated_at column
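
The 8-step flow in the commit message above could be read as a decision function along these lines. This is a simplified sketch under assumptions, not the code in blue_green_evaluate(): `GreenState`, `next_action`, `auto_promote`, and the returned action names are hypothetical, route classification is assumed to have already produced the counts, and only `promote_delay_seconds` is a name taken from the commit message.

```python
from dataclasses import dataclass


@dataclass
class GreenState:
    """Hypothetical summary of the green (new revision) routes."""
    total: int
    provisioning: int
    failed: int
    healthy: int
    seconds_since_healthy: float


def next_action(
    green: GreenState,
    desired: int,
    auto_promote: bool,
    promote_delay_seconds: float,
) -> str:
    if green.total == 0:
        return "create_green"        # create green routes as INACTIVE
    if green.provisioning > 0:
        return "wait_provisioning"   # some green routes are still starting
    if green.failed == green.total:
        return "rollback"            # every green route failed
    if green.healthy < desired:
        return "wait_healthy"        # not enough healthy green routes yet
    if not auto_promote:
        return "wait_manual"         # promotion requires a manual trigger
    if green.seconds_since_healthy < promote_delay_seconds:
        return "wait_delay"          # promotion delay has not elapsed
    return "promote"                 # green -> ACTIVE, blue -> TERMINATING
```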

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jopemachine jopemachine requested a review from a team March 4, 2026 04:01
@jopemachine jopemachine requested review from a team and HyeockJinKim March 4, 2026 06:29
Comment on lines +481 to +498
    # Post-process outside recorder scope using actual sub_results (Finding 4)
    for sub_step, group in eval_result.groups.items():
        handler_key = (lifecycle_type, sub_step)
        handler = self._deployment_handlers.get(handler_key)
        if handler is None:
            continue
        try:
            actual_result = sub_results.get(
                sub_step,
                DeploymentExecutionResult(successes=group.deployments),
            )
            await handler.post_process(actual_result)
        except Exception as e:
            log.error(
                "Error during post-processing for sub-step {}: {}",
                sub_step.value,
                e,
            )
Member

What's the purpose of this post process?

Member Author

There appears to be an issue with how sub-steps are currently handled.
(Sub-steps should be handled in a more general way and should not be coupled to the evaluator implementation.)
I will review and refactor this, then follow up with another reply.

Comment on lines +538 to +541
    lifecycle_type: DeploymentLifecycleType,
    completed: list[DeploymentInfo],
    strategies: dict[UUID, DeploymentStrategy],
    records: Mapping[UUID, ExecutionRecord],
Member

Are all of these necessary to determine the status?


Labels

comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add deployment strategy evaluation framework (BEP-1049)

4 participants