feat(BA-4821): Add deployment strategy evaluation framework by jopemachine · Pull Request #9566 · lablup/backend.ai

jopemachine · 2026-03-02T00:05:31Z

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

jopemachine

PR #9566 Security & Performance Review

Reviewer: AI Code Reviewer
Risk Level: MEDIUM
Scope: Framework PR for BEP-1049 deployment strategy evaluation infrastructure

Summary

This PR introduces the DEPLOYING lifecycle with a strategy evaluator framework, sub-step handlers (PROVISIONING, PROGRESSING, ROLLED_BACK), and coordinator integration. The architecture is well-structured with clean separation between the evaluator (which decides route mutations) and handlers (which manage lifecycle transitions). However, there are several issues that should be addressed before or shortly after merging.

Findings Overview

#	Severity	File	Finding
1	HIGH	`coordinator.py`	Race condition: revision swap and lifecycle transition use separate DB sessions
2	HIGH	`coordinator.py`	Hardcoded "Rolling update" history message applies to all strategies
3	MEDIUM	`evaluator.py` / `coordinator.py`	Errored deployments silently dropped -- no retry bound or failure transition
4	MEDIUM	`coordinator.py`	Post-process assumes all handler executions succeeded
5	MEDIUM	`coordinator.py`	Skipped deployments (no policy) are never acted upon
6	LOW	`strategy/types.py`	`CycleEvaluationResult` docstring says "rolling update" but type is strategy-agnostic
7	LOW	(all)	Zero test coverage for the entire framework

Detailed review comments follow on specific lines below.

src/ai/backend/manager/sokovan/deployment/coordinator.py

src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py

src/ai/backend/manager/sokovan/deployment/coordinator.py

src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py

src/ai/backend/manager/sokovan/deployment/strategy/types.py

src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py

src/ai/backend/manager/repositories/deployment/db_source/db_source.py

jopemachine · 2026-03-02T03:53:14Z

Code Review Summary -- PR #9566

Deployment Strategy Evaluation Framework (BEP-1049)

Overall Assessment

The architecture is well-designed. The evaluator pattern (bulk-load, per-deployment dispatch, aggregate mutations, batch apply) is clean and scalable. The handler hierarchy with composite keys (lifecycle_type, sub_step) is a good extension of the existing handler pattern. The separation between route mutation (evaluator) and lifecycle management (handlers) is well-considered.

Findings Addressed (8 total)

#	Severity	Location	Finding
1	HIGH	`coordinator.py:433`	Race condition: `complete_deployment_revision_swap` and `update_endpoint_lifecycle_bulk_with_history` use separate DB sessions. A crash between them leaves deployments in an inconsistent state.
2	HIGH	`coordinator.py:458`	Hardcoded `"Rolling update completed successfully"` message is used for all strategies, including blue-green.
3	MEDIUM	`evaluator.py:85`	`eval_result.errors` is populated but never acted upon by the coordinator. Errored deployments remain stuck in DEPLOYING with infinite retry.
4	MEDIUM	`coordinator.py:406`	Post-process unconditionally treats all deployments as successes, discarding the actual `sub_result` from `handler.execute()`.
5	MEDIUM	`evaluator.py:77`	Skipped deployments (no policy) are never acted upon by the coordinator. They remain stuck in DEPLOYING forever.
6	LOW	`strategy/types.py:25`	`CycleEvaluationResult` docstring says "rolling update" but the type is strategy-agnostic.
7	LOW	(all new files)	Zero test coverage across ~750 lines of new framework code.
8	MEDIUM	`db_source.py:2299`	`complete_deployment_revision_swap` can set `current_revision` to NULL if `deploying_revision` is already NULL (double-call safety).

Recommendations

Before merge (HIGH):

Finding 1 + 8: Add a WHERE guard (deploying_revision IS NOT NULL) to complete_deployment_revision_swap at minimum. Ideally, combine the swap and lifecycle transition into a single transaction.
Finding 2: Parameterize the history message with the actual strategy name.

Before child PRs merge (MEDIUM):

Findings 3 + 5: Add coordinator handling for eval_result.errors and eval_result.skipped (at minimum, log them; ideally, add bounded retry + error transition).
Finding 4: Capture and pass through the actual sub_result to post_process.

Backlog (LOW):

Finding 6: Fix docstring.
Finding 7: Add unit tests for the evaluator, coordinator integration, and handler classes.

src/ai/backend/manager/data/deployment/types.py

src/ai/backend/manager/defs.py

src/ai/backend/manager/models/endpoint/row.py

src/ai/backend/manager/sokovan/deployment/types.py

src/ai/backend/manager/repositories/deployment/db_source/db_source.py

src/ai/backend/manager/sokovan/deployment/coordinator.py

Copilot

Pull request overview

Adds the foundational infrastructure for BEP-1049 deployment strategies by introducing a DEPLOYING lifecycle phase, strategy evaluation scaffolding, and coordinator/repository integrations to support future Rolling Update / Blue-Green implementations.

Changes:

Introduces DEPLOYING lifecycle type plus DeploymentSubStep and new sub-step handlers (PROVISIONING / PROGRESSING / ROLLED_BACK).
Adds a strategy evaluator framework that bulk-loads policies/routes, groups deployments by sub-step, and applies batched route mutations.
Updates activation flow to set deploying_revision and transition endpoints into DEPLOYING, with repository/db support for revision swap and lifecycle transitions.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
src/ai/backend/manager/sokovan/deployment/types.py	Adds `DEPLOYING` lifecycle type and extends execution result to include `completed`.
src/ai/backend/manager/sokovan/deployment/strategy/types.py	Defines evaluation result types for strategy FSM dispatch and aggregation.
src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py	Implements bulk policy/route loading, per-deployment dispatch, and batched route mutation application.
src/ai/backend/manager/sokovan/deployment/strategy/rolling_update.py	Adds rolling update evaluator stub (currently not implemented).
src/ai/backend/manager/sokovan/deployment/strategy/blue_green.py	Adds blue-green evaluator stub (currently not implemented).
src/ai/backend/manager/sokovan/deployment/strategy/init.py	Declares the strategy evaluation package.
src/ai/backend/manager/sokovan/deployment/handlers/deploying.py	Adds DEPLOYING sub-step handlers and scheduling/provisioning triggers.
src/ai/backend/manager/sokovan/deployment/handlers/init.py	Exports new DEPLOYING handlers.
src/ai/backend/manager/sokovan/deployment/coordinator.py	Integrates evaluator path, sub-step routing, completed-transition logic, and periodic DEPLOYING tasks.
src/ai/backend/manager/services/deployment/service.py	Changes revision activation to start DEPLOYING flow via `deploying_revision`.
src/ai/backend/manager/repositories/deployment/repository.py	Adds deploying-related repository methods (policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev).
src/ai/backend/manager/repositories/deployment/db_source/db_source.py	Implements DB operations for deploying start, policy bulk fetch, revision swap, atomic complete+transition, clear deploying rev.
src/ai/backend/manager/models/endpoint/row.py	Includes `deploying_revision_id` in `DeploymentInfo` conversion.
src/ai/backend/manager/defs.py	Adds a distributed lock ID for DEPLOYING processing.
src/ai/backend/manager/data/deployment/types.py	Adds `DeploymentSubStep` and `deploying_revision_id` to `DeploymentInfo`.
proposals/BEP-1049/rolling-update.md	Updates proposal to match coordinator-driven completion handling.
proposals/BEP-1049/blue-green.md	Updates proposal to match coordinator-driven completion handling.
proposals/BEP-1049-deployment-strategy-handler.md	Updates design doc for evaluator + sub-step handler pattern and coordinator completion handling.
changes/9566.feature.md	Adds release note fragment for the new DEPLOYING framework.

Comments suppressed due to low confidence (2)

src/ai/backend/manager/sokovan/deployment/coordinator.py:413

pool.build_all_records() is called before any sub-step handler execute() calls. RecordPool.build_all_records() caches the built records (_built=True), so any recorder steps/phases added during handler execution will be missing from records used by _handle_status_transitions(). To preserve accurate sub-step recording, defer build_all_records() until after all handler executions (and then do history writing), or avoid caching by using separate recorder scopes per handler group.

            with DeploymentRecorderContext.scope(
                lifecycle_type.value, entity_ids=deployment_ids
            ) as pool:
                eval_result = await evaluator.evaluate(deployments)
                all_records = pool.build_all_records()

                # Process each sub-step group with its handler
                for sub_step, group in eval_result.groups.items():
                    handler_key: DeploymentHandlerKey = (lifecycle_type, sub_step)
                    handler = self._deployment_handlers.get(handler_key)
                    if handler is None:
                        log.warning(
                            "No handler for sub-step {}/{}", lifecycle_type.value, sub_step.value
                        )
                        continue

                    sub_result = await handler.execute(group.deployments)
                    sub_results[sub_step] = sub_result
                    await self._handle_status_transitions(handler, sub_result, all_records)

src/ai/backend/manager/services/deployment/service.py:540

Test coverage: activate_revision() changed from updating current_revision + triggering CHECK_REPLICA to setting deploying_revision + triggering DEPLOYING. There are already unit tests for DeploymentService in tests/unit/manager/services/deployment/, but none exercising this method’s new control flow. Adding a unit test here (mocking start_deploying_revision() and mark_lifecycle_needed(DEPLOYING)) would prevent regressions and make the API behavior change explicit.

    async def activate_revision(
        self, action: ActivateRevisionAction
    ) -> ActivateRevisionActionResult:
        """Activate a specific revision by initiating the deployment strategy.

        Sets deploying_revision and transitions the deployment to DEPLOYING state.
        The coordinator will execute the configured deployment strategy (rolling update,
        blue-green, etc.) and swap deploying_revision → current_revision on completion.

        Args:
            action: Action containing deployment and revision IDs

        Returns:
            ActivateRevisionActionResult: Result containing the updated deployment
        """
        # 1. Validate revision exists (raises exception if not found)
        _revision = await self._deployment_repository.get_revision(action.revision_id)

        # 2. Set deploying_revision and transition to DEPLOYING lifecycle
        previous_revision_id = await self._deployment_repository.start_deploying_revision(
            action.deployment_id, action.revision_id
        )

        # 3. Trigger DEPLOYING lifecycle to start strategy execution
        await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-03T07:33:14Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+                await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
+                    [], BulkCreator(specs=skipped_history_specs)
+                )


Same issue as the evaluation-error history block: calling update_endpoint_lifecycle_bulk_with_history([], ...) will not record SKIPPED history because the DB layer returns early when batch_updaters is empty. Use a history-only insertion path or pass a no-op updater so the transaction writes history rows.

src/ai/backend/manager/repositories/deployment/db_source/db_source.py

Copilot · 2026-03-03T07:33:15Z

src/ai/backend/manager/services/deployment/service.py

+        # 2. Set deploying_revision and transition to DEPLOYING lifecycle
+        previous_revision_id = await self._deployment_repository.start_deploying_revision(
            action.deployment_id, action.revision_id
        )

-        # 3. Trigger lifecycle check to update routes with new revision
-        await self._deployment_controller.mark_lifecycle_needed(
-            DeploymentLifecycleType.CHECK_REPLICA
-        )
+        # 3. Trigger DEPLOYING lifecycle to start strategy execution
+        await self._deployment_controller.mark_lifecycle_needed(DeploymentLifecycleType.DEPLOYING)


activate_revision() now always transitions the endpoint into DEPLOYING and relies on the strategy evaluator to finish the revision swap. In this PR, both rolling_update_evaluate() and blue_green_evaluate() are stubs that raise NotImplementedError, which means any activation will leave the deployment stuck in DEPLOYING (and repeatedly error/retry). To avoid a user-visible regression, consider gating this new path behind a feature flag or keeping the previous immediate activation behavior until at least one strategy is implemented (or provide a temporary “direct swap” strategy).

Copilot · 2026-03-03T07:33:15Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+                await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
+                    [], BulkCreator(specs=error_history_specs)
+                )


In the evaluator path, these calls pass an empty batch_updaters list into update_endpoint_lifecycle_bulk_with_history(). The DB implementation returns early when batch_updaters is empty, so no history rows will be recorded for evaluation errors (and the method returns 0). Consider either (a) adding a dedicated repository method for inserting deployment history without lifecycle updates, or (b) supplying a minimal no-op BatchUpdater that keeps the lifecycle at DEPLOYING so history insertion still runs.

Please check this comment and resolve if it is false

HyeockJinKim · 2026-03-03T07:33:36Z

src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py

+            all_scale_out.extend(changes.scale_out_specs)
+            all_scale_in_ids.extend(changes.scale_in_route_ids)
+
+            # Group by sub-step
+            if cycle_result.completed:
+                result.completed.append(deployment)
+                result.completed_strategies[deployment.id] = policy.strategy
+            else:
+                group = result.groups.setdefault(
+                    cycle_result.sub_step,
+                    EvaluationGroup(sub_step=cycle_result.sub_step),
+                )
+                group.deployments.append(deployment)


Scale out and scale in should operate independently of deploy, right? Is it correct that scale out and scale in are included?

@HyeockJinKim This logic is used when calling activate_revision to spin up new routes for the new revision and bring down the existing routes.

Could you clarify what exactly you are expecting?

Do you prefer not to use the terms scale_in and scale_out?

Do you want this operation to be delegated to the coordinator?

Or do you feel the overall structure itself is incorrect?

Implement the blue-green FSM in blue_green_evaluate() with 8-step flow: classify routes → create green (INACTIVE) → wait provisioning → rollback if all failed → wait healthy < desired → manual wait → delay check → atomic promotion (green→ACTIVE + blue→TERMINATING). Key changes: - Add promote_route_ids to RouteChanges for green→ACTIVE promotion - Add status_updated_at to RoutingRow/RouteInfo for promote_delay_seconds - Add fetch_routes_by_endpoint_ids (no status filter) for rollback detection - Extend scale_routes with promote_updater parameter - Auto-set status_updated_at in RouteBatchUpdaterSpec on status change - Add alembic migration for status_updated_at column Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This reverts commit 23732f4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…_deployment_polices`

src/ai/backend/manager/repositories/deployment/db_source/db_source.py

src/ai/backend/manager/sokovan/deployment/strategy/evaluator.py

src/ai/backend/manager/repositories/deployment/db_source/db_source.py

fregataa · 2026-03-04T06:42:28Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+                await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
+                    [], BulkCreator(specs=error_history_specs)
+                )


Please check this comment and resolve if it is false

fregataa · 2026-03-04T06:42:41Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+                await self._deployment_repository.update_endpoint_lifecycle_bulk_with_history(
+                    [], BulkCreator(specs=skipped_history_specs)
+                )


src/ai/backend/manager/repositories/deployment/db_source/db_source.py

fregataa · 2026-03-04T07:01:25Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+            # Post-process outside recorder scope using actual sub_results (Finding 4)
+            for sub_step, group in eval_result.groups.items():
+                handler_key = (lifecycle_type, sub_step)
+                handler = self._deployment_handlers.get(handler_key)
+                if handler is None:
+                    continue
+                try:
+                    actual_result = sub_results.get(
+                        sub_step,
+                        DeploymentExecutionResult(successes=group.deployments),
+                    )
+                    await handler.post_process(actual_result)
+                except Exception as e:
+                    log.error(
+                        "Error during post-processing for sub-step {}: {}",
+                        sub_step.value,
+                        e,
+                    )


What's the purpose of this post process?

I think there appears to be an issue with the current process for handling substeps.
(Substeps should be handled in a more generalized way and should not be coupled with the evaluator implementation.)
I will review and refactor it, then follow up with another reply.

fregataa · 2026-03-04T07:06:31Z

src/ai/backend/manager/sokovan/deployment/coordinator.py

+        lifecycle_type: DeploymentLifecycleType,
+        completed: list[DeploymentInfo],
+        strategies: dict[UUID, DeploymentStrategy],
+        records: Mapping[UUID, ExecutionRecord],


Those are all necessary to determine status?

github-actions bot assigned jopemachine Mar 2, 2026

github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component labels Mar 2, 2026

jopemachine added this to the 26.3 milestone Mar 2, 2026