feat: implement risk-based orchestration engine by srija-pixel · Pull Request #157 · SKfaizan-786/sentinel-devops-agent

srija-pixel · 2026-03-01T18:11:42Z

🚀 Overview

This PR introduces an intelligent risk-based orchestration engine for container monitoring.

Instead of binary health-based decisions, the system now computes a weighted risk score (0–100) based on:

CPU utilization
Memory utilization
Restart frequency
Container health state

🧠 Key Improvements

Multi-factor risk scoring
Adaptive restart logic
Escalation to container recreation
Cooldown protection
Preventive scaling support
Race-condition safe monitoring

📊 Risk Classification

0–39 → Stable
40–69 → Warning
70–100 → Critical

🧪 Tested Scenarios

CPU stress simulation
Permanent unhealthy container
Restart escalation
Cooldown protection

🔄 Impact

Transforms the monitoring layer from reactive rule-based recovery to adaptive intelligent orchestration.

No breaking changes introduced.

Summary by CodeRabbit

Release Notes

New Features
- Automatic detection and recovery of failed containers with self-healing capabilities
- Intelligent CPU-based auto-scaling to optimize resource usage
- Comprehensive container health monitoring with risk assessment system
Improvements
- Enhanced error handling for container lifecycle operations
- More precise health metrics reporting

vercel · 2026-03-01T18:11:46Z

@srija-pixel is attempting to deploy a commit to the SK Faizanuddin's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-03-01T18:12:01Z

Note

`.coderabbit.yaml` has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)

Validation error: Unrecognized key(s) in object: 'version', 'tools', 'instructions', 'ignored_files'

⚙️ Configuration instructions

Please see the configuration documentation for more information.
You can also validate your configuration using the online YAML validator.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

📝 Walkthrough

Walkthrough

Refactored Docker container orchestration from Swarm-based service management to direct container handling. Introduced a new ContainerSupervisor class that monitors container health, detects failures, and orchestrates healing with lifecycle tracking. Enhanced error handling across Docker client modules and rewrote the backend to use dynamic risk-based container management with auto-scaling logic.

Changes

Cohort / File(s)	Summary
Error Handling Improvements `backend/docker/client.js`, `backend/docker/monitor.js`	Enhanced error handling to explicitly detect container removal (404 status/"no such container" cases) and handle them distinctly from unexpected errors. Monitor now returns CPU/memory percentages as numbers instead of strings.
Container Scaling Refactoring `backend/docker/healer.js`	Replaced Docker Swarm service-based scaling with direct container management. Function signature changed from `scaleService(serviceName, replicas)` to `scaleService(containerName, replicas)`. Now lists, filters, and manages container replicas with naming convention `<containerName>-replica-<i>`.
New Container Supervisor `backend/docker/supervisor.js`	Introduced new `ContainerSupervisor` class that periodically monitors container health, detects failures, orchestrates healing with lifecycle metadata tracking, and provides methods for health polling and lifecycle access. Exports singleton instance.
Backend Orchestration Rewrite `backend/index.js`	Replaced static health checks with dynamic container monitoring. Added `autoScaleIfNeeded()` and `computeRiskScore()` functions for automated scaling and governance-based container lifecycle management. Integrated supervisor-driven background monitoring loop with per-container restart and cooldown tracking.

Sequence Diagram(s)

sequenceDiagram
    participant Monitor as Monitor
    participant Supervisor as Supervisor
    participant Client as Docker Client
    participant Healer as Healer
    participant Index as Backend Index

    Note over Monitor,Index: Health Check & Healing Cycle

    Supervisor->>Monitor: Periodic interval (5s)
    Monitor->>Client: getContainerHealth(containerId)
    Client-->>Monitor: {status, metrics}
    
    alt Container Unhealthy or Exited
        Monitor->>Supervisor: Detect failure
        Supervisor->>Supervisor: recordLifecycle(detectedAt)
        Supervisor->>Client: restartContainer(containerId)
        Client-->>Supervisor: Restarted
        
        loop Poll every 3s (timeout: 5min)
            Supervisor->>Client: getContainerHealth(containerId)
            Client-->>Supervisor: Health status
        end
        
        alt Container Healthy
            Supervisor->>Supervisor: recordLifecycle(restoredAt)
            Supervisor-->>Index: Recovery complete
        else Timeout
            Supervisor->>Index: Healing failed
        end
    else Needs Scaling
        Index->>Index: computeRiskScore(metrics)
        Index->>Index: autoScaleIfNeeded(cpuUsage)
        Index->>Healer: scaleService(containerName, targetReplicas)
        Healer->>Client: List & manage containers
        Client-->>Healer: Container operations
        Healer-->>Index: Scaling result
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Added Docker Container Monitoring #39: Introduces initial Docker module functions (getContainerHealth, scaleService, monitor stream handling) that are now enhanced and refactored in this PR.

Suggested labels

enhancement, hard

Suggested reviewers

mdhaarishussain

Poem

🐰 A rabbit's cheer for this orchestration!
Containers now heal with dedication,
Health checks and scaling dance in sync,
Risk scores guide each thoughtful link,
No more manual care—just supervision! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: implement risk-based orchestration engine' accurately and concisely summarizes the main feature introduced by the changeset, reflecting the core implementation of risk-based container orchestration.
Description check	✅ Passed	The PR description covers the overview, key improvements, risk classification, tested scenarios, and impact, though it lacks the structured sections required by the template (Related Issues, Type of Change checkbox, How Has This Been Tested details, and Checklist).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 11

🧹 Nitpick comments (1)

backend/index.js (1)

155-160: Avoid duplicate scale checks in the same iteration.

Scaling is already attempted in the warning branch (Line 159), then re-attempted for high CPU (Line 176). This is redundant and makes behavior harder to reason about.

Proposed refactor

-        if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
+        if (riskScore < 60 && cpuUsage > SCALE_UP_CPU_THRESHOLD) {
           await autoScaleIfNeeded(c, cpuUsage);
         }

Also applies to: 176-178

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 155 - 160, The code currently calls
autoScaleIfNeeded(c, cpuUsage) twice in the same loop (once when riskScore >= 60
and again when CPU is high), causing duplicate scale attempts; change the logic
so autoScaleIfNeeded is invoked at most once per container per iteration—either
by combining the checks into a single boolean (e.g., needScale = riskScore >= 60
|| cpuUsage > HIGH_CPU_THRESHOLD) and calling autoScaleIfNeeded(c, cpuUsage)
only when needScale, or by using an if/else-if so the high-CPU branch skips
calling autoScaleIfNeeded if the riskScore branch already did; update references
to riskScore, cpuUsage, and autoScaleIfNeeded accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/docker/client.js`:
- Around line 48-49: The container health error check in
backend/docker/client.js assumes error.reason exists; make it defensive by
checking multiple error properties (e.g., error.reason, error.message,
error.statusMessage or parsed error body) with optional chaining and lowercasing
before comparing to 'no such container', and keep the existing statusCode ===
404 check; update the conditional in the health-checking function (the block
returning { status: 'removed', failingStreak: 0, log: [] }) to use these
fallbacks so removed containers are detected even when the error shape varies.

In `@backend/docker/healer.js`:
- Around line 57-67: The container-group matching in sameService is too broad
and unsafe: change the filter on containers to ensure c.Names exists and compare
names exactly (trim leading '/' from each name and match equality to
containerName) instead of using includes(containerName), and before using
sameService[0] in SCALE UP add a null/empty guard (e.g., throw or log and abort
scaling if sameService.length === 0) prior to calling docker.getContainer and
baseContainer.inspect to avoid runtime errors or scaling the wrong group.
- Around line 73-77: scaleService() currently clones the base container's config
but omits Labels and copies HostConfig including PortBindings, which causes new
replicas to be excluded from monitoring and to conflict on host ports; when
calling docker.createContainer to make newContainer, include info.Config.Labels
(or at least the sentinel.monitor=true label) in the container spec so replicas
are monitored, and remove or clear HostConfig.PortBindings (e.g., delete
HostConfig.PortBindings or set it to an empty object) before passing HostConfig
to docker.createContainer to avoid duplicating host port bindings.

In `@backend/docker/monitor.js`:
- Around line 27-33: The stream error handling should also detect "no such
container" when the text is in err.message as well as err.reason; update the
conditional around the error handler that currently checks err.statusCode and
err.reason to instead test for statusCode === 404 or a case-insensitive match of
"no such container" against (err.reason || err.message), and then call
this.stopMonitoring(containerId) and return as before (references:
this.stopMonitoring, containerId, err).

In `@backend/docker/supervisor.js`:
- Around line 49-59: The waitForHealthy loop currently waits the full timeout
even when the container enters terminal non-healthy states; update the loop in
waitForHealthy to check the health.status returned by
getContainerHealth(containerId) for terminal states (e.g., 'removed', 'error',
'dead', 'exited' or any non-recoverable status you use) and immediately return
false (or fail fast) when one of those states is seen instead of continuing to
sleep until timeout; keep the existing healthy check that returns true when
status === 'healthy'.
- Line 20: The call to this.handleFailure(c.id) should be awaited and its
rejection handled to avoid unhandled promise rejections; change invocations of
handleFailure to await this.handleFailure(containerId) inside a try/catch and
log or handle any errors. Also wrap the async healing logic that manipulates
healingInProgress (the block around where healingInProgress.add(containerId) and
healingInProgress.delete(containerId) are used) in a try/finally so
healingInProgress.delete(containerId) always runs; in the try handle normal
async steps and in catch log errors, and in finally remove the containerId from
healingInProgress to prevent stale state blocking future healing attempts.

In `@backend/index.js`:
- Around line 62-63: The code sets the cooldown in scaleTracker unconditionally
after calling healer.scaleService, which blocks retries even when scaling
failed; update the call sites that currently do await
healer.scaleService(serviceName, currentReplicas + 1) / await
healer.scaleService(serviceName, currentReplicas - 1) so that you capture the
result (e.g., const res = await healer.scaleService(...)) and only call
scaleTracker.set(serviceName, Date.now()) when res.success is truthy (and
optionally log or handle res.error when false); ensure both the scale-up and
scale-down branches use this success check so failed scale attempts don't start
the cooldown.
- Around line 130-133: The risk routing condition using riskScore currently
treats Critical as >=80 and Warning as >=60; update the conditional logic in
backend/index.js (the blocks checking riskScore, e.g., the if that logs "🚨
CRITICAL risk for ${c.name}" and the corresponding Warning branch around lines
referenced) so Critical is riskScore >= 70 and Warning is when riskScore is
between 40 and 69 inclusive (i.e., >=40 && <=69); ensure any else/other branches
cover <40 as normal/low and apply the same fixes to the second occurrence noted
(also applies to the other riskScore check).
- Around line 110-112: The cooldown is stored under the old container id but
checked against the current container id (cooldownTracker.get(c.id)), so
recreated containers bypass cooldown; fix by keying cooldowns with a stable
identifier used in both places (e.g., container image or container name) instead
of the transient container id: choose a stable key (like const key = c.Image ||
c.Names?.[0]) and replace both cooldownTracker.get(c.id) and the place that sets
the cooldown (the code that currently uses the removed container id) to use
cooldownTracker.get(key) and cooldownTracker.set(key, ...), ensuring the same
symbol (the chosen key) is used for lookup and storage.
- Around line 50-57: The replica count logic uses exact name matches (c.name ===
serviceName) but replicas are created with a "-replica-*" suffix, causing
incorrect currentReplicas; update the filter in the autoscaling block (where
serviceName, listContainers, sameService, currentReplicas, scaleTracker and
SCALE_COOLDOWN_MS are used) to include both the base service name and replica
names (e.g., c.name === serviceName || c.name.startsWith(serviceName +
'-replica-') or use a regex like ^serviceName(-replica-\d+)?$) so
currentReplicas correctly counts all replicas.
- Around line 101-193: The background loop uses setInterval which allows
overlapping runs and races on restartTracker, cooldownTracker, scaleTracker and
concurrent calls to
healer.restartContainer/healer.recreateContainer/autoScaleIfNeeded; replace the
setInterval pattern with a single async loop (e.g., an async function that
awaits a delay each iteration) or implement per-container locks (a Map of
mutexes/Promise-chains) and acquire the lock at the start of processing each
container id, releasing it after all actions (monitor.startMonitoring,
computeRiskScore, healer.restartContainer, healer.recreateContainer,
autoScaleIfNeeded, monitor.stopMonitoring) complete; ensure mutations to
restartTracker/cooldownTracker/scaleTracker and calls to healer.* are performed
only while the lock is held so no two iterations can act on the same container
concurrently.

---

Nitpick comments:
In `@backend/index.js`:
- Around line 155-160: The code currently calls autoScaleIfNeeded(c, cpuUsage)
twice in the same loop (once when riskScore >= 60 and again when CPU is high),
causing duplicate scale attempts; change the logic so autoScaleIfNeeded is
invoked at most once per container per iteration—either by combining the checks
into a single boolean (e.g., needScale = riskScore >= 60 || cpuUsage >
HIGH_CPU_THRESHOLD) and calling autoScaleIfNeeded(c, cpuUsage) only when
needScale, or by using an if/else-if so the high-CPU branch skips calling
autoScaleIfNeeded if the riskScore branch already did; update references to
riskScore, cpuUsage, and autoScaleIfNeeded accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5dc7cd4 and 155fb9b.

⛔ Files ignored due to path filters (1)

backend/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (5)

backend/docker/client.js
backend/docker/healer.js
backend/docker/monitor.js
backend/docker/supervisor.js
backend/index.js

coderabbitai · 2026-03-01T18:19:58Z

+    if (error.statusCode === 404 || error.reason === 'no such container') {
+      return { status: 'removed', failingStreak: 0, log: [] };


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # First, let's check if the file exists and read the relevant section if [ -f "backend/docker/client.js" ]; then echo "=== Lines 40-60 of backend/docker/client.js ===" sed -n '40,60p' backend/docker/client.js | cat -n else # Try to find the file find . -name "client.js" -type f | head -20 fi

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 919

🏁 Script executed:

#!/bin/bash # Also search for error handling patterns in this file echo "=== Searching for error handling patterns ===" rg -n "error\.(reason|message|json)" backend/docker/client.js -A 2 -B 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 545

🏁 Script executed:

#!/bin/bash # Search for tests or documentation about error structures echo "=== Looking for test files or documentation ===" fd -e test.js -e spec.js -e .test.js backend/ | head -10 find backend -name "*.md" -o -name "*.test.js" | head -10

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 480

🏁 Script executed:

#!/bin/bash # Check the imports and initialization of docker client echo "=== Looking at imports and docker initialization ===" head -50 backend/docker/client.js | cat -n

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2015

🏁 Script executed:

#!/bin/bash # Look for Docker library in package.json echo "=== Checking package.json for Docker dependencies ===" rg "docker|moby" package.json -i

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 201

🏁 Script executed:

#!/bin/bash # Search for where getContainerHealth is called to see error context echo "=== Searching for getContainerHealth usage ===" rg "getContainerHealth" backend/ -B 3 -A 3

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1784

🏁 Script executed:

#!/bin/bash # Search for any error structure documentation or examples echo "=== Searching for error handling patterns ===" rg "error\." backend/docker/client.js -B 1 -A 1 | head -40

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 451

🏁 Script executed:

#!/bin/bash # Look for test files more broadly echo "=== Searching for test files ===" find . -type f -name "*.test.js" -o -name "*.spec.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 218

🏁 Script executed:

#!/bin/bash # Look for any error fixtures or mock data echo "=== Searching for error-related test data ===" rg "no such container" . -l

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 178

🏁 Script executed:

#!/bin/bash # Look at the root directory to understand structure echo "=== Checking repository structure ===" ls -la | head -30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1483

🏁 Script executed:

#!/bin/bash # Check if package.json exists elsewhere echo "=== Finding package.json ===" find . -name "package.json" -type f

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 303

🏁 Script executed:

#!/bin/bash # Check backend package.json for dockerode version echo "=== backend/package.json ===" cat backend/package.json

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 760

🏁 Script executed:

#!/bin/bash # Check monitor.js for context about how it uses getContainerHealth echo "=== Checking monitor.js ===" rg "no such container" backend/docker/monitor.js -B 5 -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 440

🏁 Script executed:

#!/bin/bash # Look at one of the test files to see if there are mock errors echo "=== Checking test files for error patterns ===" head -100 cli/tests/heal.test.js | cat -n

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 3947

🌐 Web query:

dockerode error structure "no such container" message reason

💡 Result:

In dockerode, the error you get for a missing container is a normal Docker Engine HTTP 404 (“no such container”) surfaced through docker-modem.

What the error object looks like (fields)

docker-modem builds an Error and adds these extra fields when the HTTP status code is not in the “success” set: reason, statusCode, and json. The human-readable err.message is formatted from those pieces too. [1]

So for “no such container”, you typically see something like:

{ name: 'Error', message: '(HTTP code 404) no such container - No such container: <id> ', reason: 'no such container', statusCode: 404, json: { message: 'No such container: <id>' } }

Where the “reason” comes from

err.reason is not coming from Docker itself; it comes from the statusCodes mapping that dockerode passes into docker-modem (for 404 it’s commonly the string 'no such container'). docker-modem assigns that mapping value to msg.reason. [1]

Where the “message” comes from

Docker Engine’s API error responses include a JSON body with a message field (for example: {"message":"No such container: c2ada9df5af8"}), and docker-modem stores the parsed body on err.json and also uses json.message when composing err.message. [1][2]

Why it’s 404

Docker Engine endpoints that reference a container return 404 – no such container when the container id/name doesn’t exist. [3]

Sources: [1] docker-modem buildPayload implementation (adds reason, statusCode, json and formats message), [2] example of Docker API “No such container” JSON error body, [3] Docker Engine API docs showing 404 “no such container”.

🏁 Script executed:

#!/bin/bash # Let's also check if there are any comments or examples in the codebase about error structures echo "=== Searching for error structure examples ===" rg "error\.(statusCode|reason|message|json)" backend/ -B 2 -A 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 12911

🏁 Script executed:

#!/bin/bash # Check if there's any documentation in the repo about Docker errors echo "=== Checking documentation files ===" grep -r "no such container" docs/ 2>/dev/null || echo "Not found in docs" grep -r "Docker error" docs/ 2>/dev/null || echo "Not found in docs"

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 150

Add defensive fallbacks for error message checking in container health errors.

Line 48 checks error.statusCode === 404 as the primary condition, which correctly catches the "no such container" case. However, the secondary check error.reason === 'no such container' assumes the reason property is always present. While dockerode/docker-modem should set this property, a more defensive approach would check multiple error properties as fallbacks to ensure robustness:

Suggested fix

- if (error.statusCode === 404 || error.reason === 'no such container') { + const reason = `${error.reason || error.message || error.json?.message || ''}`.toLowerCase(); + if (error.statusCode === 404 || reason.includes('no such container')) { return { status: 'removed', failingStreak: 0, log: [] }; }

This handles edge cases where the error structure varies, ensuring removed containers are always correctly identified regardless of which error property contains the message.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/docker/client.js` around lines 48 - 49, The container health error check in backend/docker/client.js assumes error.reason exists; make it defensive by checking multiple error properties (e.g., error.reason, error.message, error.statusMessage or parsed error body) with optional chaining and lowercasing before comparing to 'no such container', and keep the existing statusCode === 404 check; update the conditional in the health-checking function (the block returning { status: 'removed', failingStreak: 0, log: [] }) to use these fallbacks so removed containers are detected even when the error shape varies.

coderabbitai · 2026-03-01T18:19:59Z

+    const sameService = containers.filter(c =>
+      c.Names.some(n => n.includes(containerName))
+    );
+
+    const currentCount = sameService.length;
+
+    // SCALE UP
+    if (replicas > currentCount) {
+
+      const baseContainer = docker.getContainer(sameService[0].Id);
+      const info = await baseContainer.inspect();


⚠️ Potential issue | 🟠 Major

Container-group matching is too broad and not null-safe.

Line 58 uses substring matching (includes(containerName)), which can match unrelated names, and Line 66 assumes at least one match exists. That can scale the wrong group or throw on scale-up.

Proposed fix

- const sameService = containers.filter(c => - c.Names.some(n => n.includes(containerName)) - ); + const sameService = containers.filter(c => + c.Names.some(n => { + const name = n.replace(/^\//, ''); + return name === containerName || name.startsWith(`${containerName}-replica-`); + }) + ); + + if (sameService.length === 0) { + return { success: false, error: `No containers found for ${containerName}` }; + }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/docker/healer.js` around lines 57 - 67, The container-group matching in sameService is too broad and unsafe: change the filter on containers to ensure c.Names exists and compare names exactly (trim leading '/' from each name and match equality to containerName) instead of using includes(containerName), and before using sameService[0] in SCALE UP add a null/empty guard (e.g., throw or log and abort scaling if sameService.length === 0) prior to calling docker.getContainer and baseContainer.inspect to avoid runtime errors or scaling the wrong group.

coderabbitai · 2026-03-01T18:19:59Z

+        const newContainer = await docker.createContainer({
+          Image: info.Config.Image,
+          name: newName,
+          Env: info.Config.Env,
+          HostConfig: info.HostConfig


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the healer.js file around lines 73-77 head -n 85 backend/docker/healer.js | tail -n 20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 617

🏁 Script executed:

# Also check the broader context of the healer.js file to understand the function head -n 95 backend/docker/healer.js | tail -n 30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 927

🏁 Script executed:

# Now let's look at client.js line 12 to understand the listContainers filtering head -n 25 backend/docker/client.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 882

🏁 Script executed:

# Let's also search for where info.Config.Labels is used or referenced rg "Config\.Labels" backend/docker/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60

🏁 Script executed:

# Check if PortBindings is ever explicitly handled elsewhere rg "PortBindings" backend/docker/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60

🏁 Script executed:

# Check if labels are set when creating the initial container rg "sentinel\.monitor" backend/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 144

🏁 Script executed:

# Look for docker creation code to see if labels are set initially rg "createContainer" backend/docker/ -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1281

🏁 Script executed:

# Check Docker API documentation patterns - look for how containers are created elsewhere rg "Labels\s*:" backend/ -A 2 -B 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60

🏁 Script executed:

# Check if there are any imports or usage of docker-compose or config files fd -t f "\.(yml|yaml|json)" backend/ | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 124

🏁 Script executed:

# Let's see the full context of both createContainer calls to understand the difference sed -n '45,105p' backend/docker/healer.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1711

🏁 Script executed:

# Check if there's a pattern for how containers are initially created with labels rg "docker.createContainer" backend/ -B 10 -A 10

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2433

🏁 Script executed:

# Let's understand the Docker API inspect response structure - check if there are any comments or docs head -n 150 backend/docker/healer.js | tail -n 60

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1683

Add Labels and remove PortBindings when creating replicas in scaleService().

The scaleService() function creates replicas without copying Labels from the base container. Since listContainers() filters by sentinel.monitor=true label, new replicas will be skipped by monitoring. Additionally, cloning HostConfig as-is duplicates PortBindings across replicas, causing host-port binding conflicts.

Proposed fix

+ const hostConfig = { ...info.HostConfig }; + delete hostConfig.PortBindings; // avoid duplicate host-port binding conflicts const newContainer = await docker.createContainer({ Image: info.Config.Image, name: newName, Env: info.Config.Env, + Labels: info.Config.Labels, - HostConfig: info.HostConfig + HostConfig: hostConfig });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/docker/healer.js` around lines 73 - 77, scaleService() currently clones the base container's config but omits Labels and copies HostConfig including PortBindings, which causes new replicas to be excluded from monitoring and to conflict on host ports; when calling docker.createContainer to make newContainer, include info.Config.Labels (or at least the sentinel.monitor=true label) in the container spec so replicas are monitored, and remove or clear HostConfig.PortBindings (e.g., delete HostConfig.PortBindings or set it to an empty object) before passing HostConfig to docker.createContainer to avoid duplicating host port bindings.

coderabbitai · 2026-03-01T18:19:59Z

+    if (err.statusCode === 404 || err.reason === 'no such container') {
+        this.stopMonitoring(containerId);
+        return;
+    }
+
+    console.error(`Stream error for ${containerId}:`, err.message);
+    this.stopMonitoring(containerId);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "monitor.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 105

🏁 Script executed:

cat -n ./backend/docker/monitor.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 4819

🏁 Script executed:

# Search for other error handling patterns in the codebase related to Docker rg "err\." --type js -B 2 -A 2 | head -100

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1554

🏁 Script executed:

# Check which Docker library is being used grep -r "require.*docker" --include="*.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 401

🌐 Web query:

dockerode stream error handling removed container error structure

💡 Result:

In dockerode, the “error structure” you see is mostly defined by docker-modem (the HTTP layer used by dockerode).

1) Non-stream API calls (typical “removed container” case)

When you call a normal endpoint (e.g., container.inspect(), container.stop(), container.remove(), container.logs() without follow, etc.) and the container is already gone, Docker Engine usually replies HTTP 404 with a JSON body like:

{ "message": "No such container: <id>" }

docker-modem turns any non-2xx response into a JavaScript Error and adds these properties:

err.statusCode → the HTTP status (e.g., 404)

err.reason → whatever dockerode’s endpoint statusCodes[404] mapping is for that call (commonly something like "no such container")

err.json → the parsed response body (often the object containing "message")

And it builds err.message like:
(HTTP code 404) <reason> - <cause.message || cause.error || cause> (plus spaces). (raw.githubusercontent.com)

So you generally want to handle removed-container as:

if (err?.statusCode === 404) and then read err.json?.message.

2) Stream-based operations (pull/build/etc.) and followProgress

For streaming endpoints that you pass into:

docker.modem.followProgress(stream, onFinished, onProgress)

followProgress collects JSON lines from the stream into output[], and:

if the stream emits an 'error' event, it calls onFinished(err, output)

if the stream ends/closes normally, it calls onFinished(null, output) (raw.githubusercontent.com)

Meaning: if a container is removed and the daemon simply closes an attached/logs stream, you may get end/close (no error) rather than a structured HTTP error; you only get the structured (HTTP code ...) error object when there is an HTTP response for a request that failed (like an inspect/remove call). (raw.githubusercontent.com)

Citations:

1: https://raw.githubusercontent.com/apocas/docker-modem/master/lib/modem.js

2: https://raw.githubusercontent.com/apocas/docker-modem/master/lib/modem.js

3: https://raw.githubusercontent.com/apocas/docker-modem/master/lib/modem.js

🏁 Script executed:

# Look for other places where stream errors might be handled or logged rg "stream\.on\('error'" --type js -B 3 -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 577

Match removed-container errors in both err.reason and err.message.

Line 27 only checks err.statusCode === 404 or err.reason === 'no such container'. Docker stream errors from dockerode can place error details in err.message instead of (or in addition to) err.reason, so removed containers may incorrectly trigger the generic stream error handler.

Proposed fix

- if (err.statusCode === 404 || err.reason === 'no such container') { + const reason = `${err.reason || err.message || err.json?.message || ''}`.toLowerCase(); + if (err.statusCode === 404 || reason.includes('no such container')) { this.stopMonitoring(containerId); return; }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/docker/monitor.js` around lines 27 - 33, The stream error handling should also detect "no such container" when the text is in err.message as well as err.reason; update the conditional around the error handler that currently checks err.statusCode and err.reason to instead test for statusCode === 404 or a case-insensitive match of "no such container" against (err.reason || err.message), and then call this.stopMonitoring(containerId) and return as before (references: this.stopMonitoring, containerId, err).

coderabbitai · 2026-03-01T18:19:59Z

+                (health.status === 'unhealthy' || c.status === 'exited') &&
+                !this.healingInProgress.has(c.id)
+            ) {
+                this.handleFailure(c.id);


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file fd "supervisor.js" --type f

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 106

🏁 Script executed:

# Read the supervisor.js file to see the actual code cat -n backend/docker/supervisor.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2471

Add error handling and try/finally to prevent unhandled rejections and state leaks.

Line 20 calls handleFailure() without awaiting or catching—if the promise rejects, you get an unhandled rejection. Lines 25-44 lack try/finally, so if any async operation throws, healingInProgress.delete(containerId) on line 43 never runs, leaving stale state that blocks future healing attempts for that container.

Proposed fix

- this.handleFailure(c.id); + void this.handleFailure(c.id).catch(err => { + console.error(`Healing flow failed for ${c.id}:`, err.message); + }); async handleFailure(containerId) { - this.healingInProgress.add(containerId); - - this.lifecycle.set(containerId, { - detectedAt: Date.now(), - healingTriggeredAt: Date.now(), - restoredAt: null - }); - - await restartContainer(containerId); - - const restored = await this.waitForHealthy(containerId); - - if (restored) { - const record = this.lifecycle.get(containerId); - record.restoredAt = Date.now(); - } - - this.healingInProgress.delete(containerId); + this.healingInProgress.add(containerId); + try { + this.lifecycle.set(containerId, { + detectedAt: Date.now(), + healingTriggeredAt: Date.now(), + restoredAt: null + }); + + await restartContainer(containerId); + const restored = await this.waitForHealthy(containerId); + + if (restored) { + const record = this.lifecycle.get(containerId); + record.restoredAt = Date.now(); + } + } finally { + this.healingInProgress.delete(containerId); + } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

this.handleFailure(c.id);

void this.handleFailure(c.id).catch(err => {

console.error(`Healing flow failed for ${c.id}:`, err.message);

});

async handleFailure(containerId) {

this.healingInProgress.add(containerId);

try {

this.lifecycle.set(containerId, {

detectedAt: Date.now(),

healingTriggeredAt: Date.now(),

restoredAt: null

});

await restartContainer(containerId);

const restored = await this.waitForHealthy(containerId);

if (restored) {

const record = this.lifecycle.get(containerId);

record.restoredAt = Date.now();

}

} finally {

this.healingInProgress.delete(containerId);

}

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/docker/supervisor.js` at line 20, The call to this.handleFailure(c.id) should be awaited and its rejection handled to avoid unhandled promise rejections; change invocations of handleFailure to await this.handleFailure(containerId) inside a try/catch and log or handle any errors. Also wrap the async healing logic that manipulates healingInProgress (the block around where healingInProgress.add(containerId) and healingInProgress.delete(containerId) are used) in a try/finally so healingInProgress.delete(containerId) always runs; in the try handle normal async steps and in catch log errors, and in finally remove the containerId from healingInProgress to prevent stale state blocking future healing attempts.

coderabbitai · 2026-03-01T18:19:59Z

+  const serviceName = container.name;
+  const lastScale = scaleTracker.get(serviceName) || 0;

-      if (
-        systemStatus.services[service.name].status !== newStatus ||
-        systemStatus.services[service.name].code !== newCode
-      ) {
-        const prevStatus = systemStatus.services[service.name].status;
-
-        // Log Status Changes
-        if (newStatus === 'healthy' && prevStatus !== 'healthy' && prevStatus !== 'unknown') {
-          logActivity('success', `Service ${service.name} recovered to HEALTHY`);
-        } else if (newStatus !== 'healthy' && prevStatus !== newStatus) {
-          const severity = newStatus === 'critical' ? 'alert' : 'warn';
-          logActivity(severity, `Service ${service.name} is ${newStatus.toUpperCase()} (Code: ${newCode})`);
-        }
+  if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;

-        systemStatus.services[service.name] = {
-          status: newStatus,
-          code: newCode,
-          lastUpdated: new Date()
-        };
-        hasChanges = true;
-
-        // Broadcast individual service update
-        wsBroadcaster.broadcast('SERVICE_UPDATE', {
-          name: service.name,
-          ...systemStatus.services[service.name]
-        });
-      }
-    }
+  const allContainers = await listContainers();
+  const sameService = allContainers.filter(c => c.name === serviceName);
+  const currentReplicas = sameService.length;


⚠️ Potential issue | 🟠 Major

Replica grouping in autoscaling is inconsistent with replica naming.

Line 56 counts only exact name matches (c.name === serviceName), while replicas are named with -replica-* in backend/docker/healer.js Line 71. This can send incorrect target replica counts and cause scale oscillation.

Proposed fix

async function autoScaleIfNeeded(container, cpuUsage) { - const serviceName = container.name; + const serviceName = container.name.split('-replica-')[0]; const lastScale = scaleTracker.get(serviceName) || 0; if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return; const allContainers = await listContainers(); - const sameService = allContainers.filter(c => c.name === serviceName); + const sameService = allContainers.filter(c => c.name.split('-replica-')[0] === serviceName); const currentReplicas = sameService.length;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const serviceName = container.name;

const lastScale = scaleTracker.get(serviceName) || 0;

if (

systemStatus.services[service.name].status !== newStatus ||

systemStatus.services[service.name].code !== newCode

) {

const prevStatus = systemStatus.services[service.name].status;

// Log Status Changes

if (newStatus === 'healthy' && prevStatus !== 'healthy' && prevStatus !== 'unknown') {

logActivity('success', `Service ${service.name} recovered to HEALTHY`);

} else if (newStatus !== 'healthy' && prevStatus !== newStatus) {

const severity = newStatus === 'critical' ? 'alert' : 'warn';

logActivity(severity, `Service ${service.name} is ${newStatus.toUpperCase()} (Code: ${newCode})`);

}

if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;

systemStatus.services[service.name] = {

status: newStatus,

code: newCode,

lastUpdated: new Date()

};

hasChanges = true;

// Broadcast individual service update

wsBroadcaster.broadcast('SERVICE_UPDATE', {

name: service.name,

...systemStatus.services[service.name]

});

}

}

const allContainers = await listContainers();

const sameService = allContainers.filter(c => c.name === serviceName);

const currentReplicas = sameService.length;

const serviceName = container.name.split('-replica-')[0];

const lastScale = scaleTracker.get(serviceName) || 0;

if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;

const allContainers = await listContainers();

const sameService = allContainers.filter(c => c.name.split('-replica-')[0] === serviceName);

const currentReplicas = sameService.length;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/index.js` around lines 50 - 57, The replica count logic uses exact name matches (c.name === serviceName) but replicas are created with a "-replica-*" suffix, causing incorrect currentReplicas; update the filter in the autoscaling block (where serviceName, listContainers, sameService, currentReplicas, scaleTracker and SCALE_COOLDOWN_MS are used) to include both the base service name and replica names (e.g., c.name === serviceName || c.name.startsWith(serviceName + '-replica-') or use a regex like ^serviceName(-replica-\d+)?$) so currentReplicas correctly counts all replicas.

coderabbitai · 2026-03-01T18:19:59Z

+    await healer.scaleService(serviceName, currentReplicas + 1);
+    scaleTracker.set(serviceName, Date.now());


⚠️ Potential issue | 🟠 Major

Only apply scale cooldown after successful scaling.

Lines 63 and 71 set cooldown unconditionally. healer.scaleService() returns { success: false } on errors, so failed actions still block retries for the cooldown window.

Proposed fix

- await healer.scaleService(serviceName, currentReplicas + 1); - scaleTracker.set(serviceName, Date.now()); + const upResult = await healer.scaleService(serviceName, currentReplicas + 1); + if (upResult.success) { + scaleTracker.set(serviceName, Date.now()); + } else { + console.error(`Scale UP failed for ${serviceName}:`, upResult.error); + } return; } // SCALE DOWN if (cpuUsage < SCALE_DOWN_CPU_THRESHOLD && currentReplicas > MIN_REPLICAS) { console.log(`📉 Scaling DOWN ${serviceName} (CPU ${cpuUsage}%)`); - await healer.scaleService(serviceName, currentReplicas - 1); - scaleTracker.set(serviceName, Date.now()); + const downResult = await healer.scaleService(serviceName, currentReplicas - 1); + if (downResult.success) { + scaleTracker.set(serviceName, Date.now()); + } else { + console.error(`Scale DOWN failed for ${serviceName}:`, downResult.error); + } }

Also applies to: 70-71

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/index.js` around lines 62 - 63, The code sets the cooldown in scaleTracker unconditionally after calling healer.scaleService, which blocks retries even when scaling failed; update the call sites that currently do await healer.scaleService(serviceName, currentReplicas + 1) / await healer.scaleService(serviceName, currentReplicas - 1) so that you capture the result (e.g., const res = await healer.scaleService(...)) and only call scaleTracker.set(serviceName, Date.now()) when res.success is truthy (and optionally log or handle res.error when false); ensure both the scale-up and scale-down branches use this success check so failed scale attempts don't start the cooldown.

coderabbitai · 2026-03-01T18:19:59Z

+setInterval(async () => {
+  try {
+    const containers = await listContainers();
+
+    for (const c of containers) {
+
+      try {
+
+        // Skip if in recreate cooldown
+        const cooldownUntil = cooldownTracker.get(c.id);
+        if (cooldownUntil && Date.now() < cooldownUntil) {
+          continue;
        }

-        systemStatus.services[serviceName].status = newStatus;
-        systemStatus.services[serviceName].lastUpdated = new Date();
-      }
-    });
-    // Broadcast metrics update from webhook
-    wsBroadcaster.broadcast('METRICS', systemStatus);
-  }
-  res.json({ success: true });
-});
+        // Ensure monitoring is active
+        await monitor.startMonitoring(c.id);

-app.post('/api/action/:service/:type', async (req, res) => {
-  const { service, type } = req.params;
-  const serviceMap = { 'auth': 3001, 'payment': 3002, 'notification': 3003 };
-  const port = serviceMap[service];
+        const metrics = monitor.getMetrics(c.id);
+        if (!metrics) continue;

-  logActivity('info', `Triggering action '${type}' on service '${service}'`);
+        const cpuUsage = parseFloat(metrics.cpu) || 0;

-  if (!port) {
-    logActivity('warn', `Failed action '${type}': Invalid service '${service}'`);
-    return res.status(400).json({ success: false, error: 'Invalid service' });
-  }
+        const id = c.id;
+        const now = Date.now();
+        let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
+        const riskScore = computeRiskScore(c, metrics, tracker.attempts);

-  try {
-    let mode = 'healthy';
-    if (type === 'crash' || type === 'down') mode = 'down';
-    if (type === 'degraded') mode = 'degraded';
-    if (type === 'slow') mode = 'slow';
-
-    await axios.post(`http://localhost:${port}/simulate/${mode}`, {}, { timeout: 5000 });
-    // Force a health check to update status immediately
-    await checkServiceHealth();
-
-    logActivity('success', `Successfully executed '${type}' on ${service}`);
-    res.json({ success: true, message: `${type} executed on ${service}` });
-  } catch (error) {
-    logActivity('error', `Action '${type}' on ${service} failed: ${error.message}`);
-    res.status(500).json({ success: false, error: error.message });
-  }
-});
+console.log(`📊 ${c.name} Risk Score: ${riskScore}`);

-// --- DOCKER ENDPOINTS ---
+if (riskScore >= 80) {

-// Middleware for ID/Service validation (mock auth for docker endpoints)
-const requireDockerAuth = (req, res, next) => {
-  // In a real app, check 'Authorization' header
-  // For now, assume authenticated if internal or trusted
-  next();
-};
+  console.log(`🚨 CRITICAL risk for ${c.name}`);

-const validateId = (req, res, next) => {
-  if (!req.params.id || typeof req.params.id !== 'string' || req.params.id.length < 1) {
-    return res.status(400).json({ error: 'Invalid ID provided' });
-  }
-  next();
-};
+  if (tracker.attempts < MAX_RESTARTS) {

-const validateScaleParams = (req, res, next) => {
-  const replicas = parseInt(req.params.replicas, 10);
-  if (!req.params.service || isNaN(replicas) || replicas < 0 || replicas > 100) {
-    return res.status(400).json({ error: 'Invalid scale parameters' });
-  }
-  next();
-};
+    tracker.attempts++;
+    tracker.lastAttempt = now;
+    restartTracker.set(id, tracker);

-app.get('/api/docker/containers', async (req, res) => {
-  try {
-    const containers = await listContainers();
-    // Use Promise.allSettled to handle monitoring setup concurrently without crashing
-    await Promise.allSettled(containers.map(c => monitor.startMonitoring(c.id)));
-
-    // Enrich with smart restart meta
-    const enrichedContainers = containers.map(c => {
-      const tracker = restartTracker.get(c.id) || { attempts: 0, lastAttempt: 0 };
-      return {
-        ...c,
-        metrics: monitor.getMetrics(c.id), // Include current metrics snapshot
-        restartCount: tracker.attempts,
-        lastRestart: tracker.lastAttempt
-      };
-    });
-
-    res.json({ containers: enrichedContainers });
-  } catch (error) {
-    res.status(500).json({ error: error.message });
-  }
-});
+    await healer.restartContainer(id);

-app.get('/api/docker/health/:id', validateId, async (req, res) => {
-  try {
-    const health = await getContainerHealth(req.params.id);
-    res.json(health);
-  } catch (error) {
-    res.status(500).json({ error: error.message });
-  }
-});
+  } else {

-app.get('/api/docker/metrics/:id', validateId, (req, res) => {
-  const metrics = monitor.getMetrics(req.params.id);
-  res.json(metrics || { error: 'No metrics available' });
-});
+    monitor.stopMonitoring(id);
+    restartTracker.delete(id);

-app.post('/api/docker/try-restart/:id', requireDockerAuth, validateId, async (req, res) => {
-  const id = req.params.id;
-  const now = Date.now();
-  let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
+    await healer.recreateContainer(id);

-  // Reset attempts if outside grace period
-  if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
-    tracker.attempts = 0;
+    cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);
  }

-  if (tracker.attempts >= MAX_RESTARTS) {
-    return res.status(429).json({
-      allowed: false,
-      reason: 'Max restart attempts exceeded',
-      nextRetry: new Date(tracker.lastAttempt + GRACE_PERIOD_MS)
-    });
+  continue;
+}
+
+if (riskScore >= 60) {
+
+  console.log(`⚠️ WARNING risk for ${c.name}`);
+
+  await autoScaleIfNeeded(c, cpuUsage);
+}
+        // Reset restart counter after grace period
+        if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
+          tracker.attempts = 0;
+        }
+
+        /* ===============================
+           1️⃣ UNHEALTHY → RESTART / RECREATE
+        ================================= */
+
+
+
+        /* ===============================
+           2️⃣ HIGH CPU → SCALE
+        ================================= */
+
+        if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
+          await autoScaleIfNeeded(c, cpuUsage);
+        }
+
+      } catch (innerErr) {
+
+        if (innerErr.statusCode === 404) continue;
+
+        console.error("Container loop error:", innerErr.message);
+      }
+    }
+
+  } catch (err) {
+    console.error("Background monitor error:", err.message);
  }

-  tracker.attempts++;
-  tracker.lastAttempt = now;
-  restartTracker.set(id, tracker);
+}, 5000);



⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "backend" -o -name "index.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 231

🏁 Script executed:

wc -l ./backend/index.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 100

🏁 Script executed:

cat -n ./backend/index.js | sed -n '101,193p'

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2892

🏁 Script executed:

cat -n ./backend/index.js | head -100

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 3631

🏁 Script executed:

cat -n ./backend/index.js | tail -30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 992

Overlapping monitor cycles create race conditions on container restart/recreate and scaling operations.

When the async callback takes longer than 5000ms, setInterval fires the next iteration before the previous completes. This allows concurrent mutations to restartTracker, cooldownTracker, and scaleTracker, and multiple healer.restartContainer() or healer.recreateContainer() calls on the same container.

Proposed fix

+let monitorLoopInFlight = false; setInterval(async () => { + if (monitorLoopInFlight) return; + monitorLoopInFlight = true; try { const containers = await listContainers(); ... } catch (err) { console.error("Background monitor error:", err.message); + } finally { + monitorLoopInFlight = false; } }, 5000);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

setInterval(async () => {

try {

const containers = await listContainers();

for (const c of containers) {

try {

// Skip if in recreate cooldown

const cooldownUntil = cooldownTracker.get(c.id);

if (cooldownUntil && Date.now() < cooldownUntil) {

continue;

}

systemStatus.services[serviceName].status = newStatus;

systemStatus.services[serviceName].lastUpdated = new Date();

}

});

// Broadcast metrics update from webhook

wsBroadcaster.broadcast('METRICS', systemStatus);

}

res.json({ success: true });

});

// Ensure monitoring is active

await monitor.startMonitoring(c.id);

app.post('/api/action/:service/:type', async (req, res) => {

const { service, type } = req.params;

const serviceMap = { 'auth': 3001, 'payment': 3002, 'notification': 3003 };

const port = serviceMap[service];

const metrics = monitor.getMetrics(c.id);

if (!metrics) continue;

logActivity('info', `Triggering action '${type}' on service '${service}'`);

const cpuUsage = parseFloat(metrics.cpu) || 0;

if (!port) {

logActivity('warn', `Failed action '${type}': Invalid service '${service}'`);

return res.status(400).json({ success: false, error: 'Invalid service' });

}

const id = c.id;

const now = Date.now();

let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };

const riskScore = computeRiskScore(c, metrics, tracker.attempts);

try {

let mode = 'healthy';

if (type === 'crash' || type === 'down') mode = 'down';

if (type === 'degraded') mode = 'degraded';

if (type === 'slow') mode = 'slow';

await axios.post(`http://localhost:${port}/simulate/${mode}`, {}, { timeout: 5000 });

// Force a health check to update status immediately

await checkServiceHealth();

logActivity('success', `Successfully executed '${type}' on ${service}`);

res.json({ success: true, message: `${type} executed on ${service}` });

} catch (error) {

logActivity('error', `Action '${type}' on ${service} failed: ${error.message}`);

res.status(500).json({ success: false, error: error.message });

}

});

console.log(`📊 ${c.name} Risk Score: ${riskScore}`);

// --- DOCKER ENDPOINTS ---

if (riskScore >= 80) {

// Middleware for ID/Service validation (mock auth for docker endpoints)

const requireDockerAuth = (req, res, next) => {

// In a real app, check 'Authorization' header

// For now, assume authenticated if internal or trusted

next();

};

console.log(`🚨 CRITICAL risk for ${c.name}`);

const validateId = (req, res, next) => {

if (!req.params.id || typeof req.params.id !== 'string' || req.params.id.length < 1) {

return res.status(400).json({ error: 'Invalid ID provided' });

}

next();

};

if (tracker.attempts < MAX_RESTARTS) {

const validateScaleParams = (req, res, next) => {

const replicas = parseInt(req.params.replicas, 10);

if (!req.params.service || isNaN(replicas) || replicas < 0 || replicas > 100) {

return res.status(400).json({ error: 'Invalid scale parameters' });

}

next();

};

tracker.attempts++;

tracker.lastAttempt = now;

restartTracker.set(id, tracker);

app.get('/api/docker/containers', async (req, res) => {

try {

const containers = await listContainers();

// Use Promise.allSettled to handle monitoring setup concurrently without crashing

await Promise.allSettled(containers.map(c => monitor.startMonitoring(c.id)));

// Enrich with smart restart meta

const enrichedContainers = containers.map(c => {

const tracker = restartTracker.get(c.id) || { attempts: 0, lastAttempt: 0 };

return {

...c,

metrics: monitor.getMetrics(c.id), // Include current metrics snapshot

restartCount: tracker.attempts,

lastRestart: tracker.lastAttempt

};

});

res.json({ containers: enrichedContainers });

} catch (error) {

res.status(500).json({ error: error.message });

}

});

await healer.restartContainer(id);

app.get('/api/docker/health/:id', validateId, async (req, res) => {

try {

const health = await getContainerHealth(req.params.id);

res.json(health);

} catch (error) {

res.status(500).json({ error: error.message });

}

});

} else {

app.get('/api/docker/metrics/:id', validateId, (req, res) => {

const metrics = monitor.getMetrics(req.params.id);

res.json(metrics || { error: 'No metrics available' });

});

monitor.stopMonitoring(id);

restartTracker.delete(id);

app.post('/api/docker/try-restart/:id', requireDockerAuth, validateId, async (req, res) => {

const id = req.params.id;

const now = Date.now();

let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };

await healer.recreateContainer(id);

// Reset attempts if outside grace period

if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {

tracker.attempts = 0;

cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);

}

if (tracker.attempts >= MAX_RESTARTS) {

return res.status(429).json({

allowed: false,

reason: 'Max restart attempts exceeded',

nextRetry: new Date(tracker.lastAttempt + GRACE_PERIOD_MS)

});

continue;

}

if (riskScore >= 60) {

console.log(`⚠️ WARNING risk for ${c.name}`);

await autoScaleIfNeeded(c, cpuUsage);

}

// Reset restart counter after grace period

if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {

tracker.attempts = 0;

}

/* ===============================

1️⃣ UNHEALTHY → RESTART / RECREATE

================================= */

/* ===============================

2️⃣ HIGH CPU → SCALE

================================= */

if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {

await autoScaleIfNeeded(c, cpuUsage);

}

} catch (innerErr) {

if (innerErr.statusCode === 404) continue;

console.error("Container loop error:", innerErr.message);

}

}

} catch (err) {

console.error("Background monitor error:", err.message);

}

tracker.attempts++;

tracker.lastAttempt = now;

restartTracker.set(id, tracker);

}, 5000);

let monitorLoopInFlight = false;

setInterval(async () => {

if (monitorLoopInFlight) return;

monitorLoopInFlight = true;

try {

const containers = await listContainers();

for (const c of containers) {

try {

// Skip if in recreate cooldown

const cooldownUntil = cooldownTracker.get(c.id);

if (cooldownUntil && Date.now() < cooldownUntil) {

continue;

}

// Ensure monitoring is active

await monitor.startMonitoring(c.id);

const metrics = monitor.getMetrics(c.id);

if (!metrics) continue;

const cpuUsage = parseFloat(metrics.cpu) || 0;

const id = c.id;

const now = Date.now();

let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };

const riskScore = computeRiskScore(c, metrics, tracker.attempts);

console.log(`📊 ${c.name} Risk Score: ${riskScore}`);

if (riskScore >= 80) {

console.log(`🚨 CRITICAL risk for ${c.name}`);

if (tracker.attempts < MAX_RESTARTS) {

tracker.attempts++;

tracker.lastAttempt = now;

restartTracker.set(id, tracker);

await healer.restartContainer(id);

} else {

monitor.stopMonitoring(id);

restartTracker.delete(id);

await healer.recreateContainer(id);

cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);

}

continue;

}

if (riskScore >= 60) {

console.log(`⚠️ WARNING risk for ${c.name}`);

await autoScaleIfNeeded(c, cpuUsage);

}

// Reset restart counter after grace period

if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {

tracker.attempts = 0;

}

/* ===============================

1️⃣ UNHEALTHY → RESTART / RECREATE

================================= */

/* ===============================

2️⃣ HIGH CPU → SCALE

================================= */

if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {

await autoScaleIfNeeded(c, cpuUsage);

}

} catch (innerErr) {

if (innerErr.statusCode === 404) continue;

console.error("Container loop error:", innerErr.message);

}

}

} catch (err) {

console.error("Background monitor error:", err.message);

} finally {

monitorLoopInFlight = false;

}

}, 5000);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/index.js` around lines 101 - 193, The background loop uses setInterval which allows overlapping runs and races on restartTracker, cooldownTracker, scaleTracker and concurrent calls to healer.restartContainer/healer.recreateContainer/autoScaleIfNeeded; replace the setInterval pattern with a single async loop (e.g., an async function that awaits a delay each iteration) or implement per-container locks (a Map of mutexes/Promise-chains) and acquire the lock at the start of processing each container id, releasing it after all actions (monitor.startMonitoring, computeRiskScore, healer.restartContainer, healer.recreateContainer, autoScaleIfNeeded, monitor.stopMonitoring) complete; ensure mutations to restartTracker/cooldownTracker/scaleTracker and calls to healer.* are performed only while the lock is held so no two iterations can act on the same container concurrently.

coderabbitai · 2026-03-01T18:19:59Z

+        const cooldownUntil = cooldownTracker.get(c.id);
+        if (cooldownUntil && Date.now() < cooldownUntil) {
+          continue;


⚠️ Potential issue | 🟠 Major

Recreate cooldown is tracked on the old container id.

Line 149 stores cooldown using the id being removed; subsequent loops read cooldown by current container id (Line 110), so the recreated container bypasses cooldown immediately.

Proposed fix

- const cooldownUntil = cooldownTracker.get(c.id); + const cooldownKey = c.name.split('-replica-')[0]; + const cooldownUntil = cooldownTracker.get(cooldownKey); ... - cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS); + cooldownTracker.set(c.name.split('-replica-')[0], Date.now() + COOLDOWN_AFTER_RECREATE_MS);

Also applies to: 149-149

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/index.js` around lines 110 - 112, The cooldown is stored under the old container id but checked against the current container id (cooldownTracker.get(c.id)), so recreated containers bypass cooldown; fix by keying cooldowns with a stable identifier used in both places (e.g., container image or container name) instead of the transient container id: choose a stable key (like const key = c.Image || c.Names?.[0]) and replace both cooldownTracker.get(c.id) and the place that sets the cooldown (the code that currently uses the removed container id) to use cooldownTracker.get(key) and cooldownTracker.set(key, ...), ensuring the same symbol (the chosen key) is used for lookup and storage.

coderabbitai · 2026-03-01T18:19:59Z

+if (riskScore >= 80) {

-// Middleware for ID/Service validation (mock auth for docker endpoints)
-const requireDockerAuth = (req, res, next) => {
-  // In a real app, check 'Authorization' header
-  // For now, assume authenticated if internal or trusted
-  next();
-};
+  console.log(`🚨 CRITICAL risk for ${c.name}`);



⚠️ Potential issue | 🟠 Major

Risk action thresholds don’t match the documented risk bands.

Current routing uses Critical >=80 and Warning >=60, but the stated policy is Warning 40–69 and Critical 70–100. This materially changes escalation behavior.

Proposed fix

+const WARNING_RISK_THRESHOLD = 40; +const CRITICAL_RISK_THRESHOLD = 70; ... -if (riskScore >= 80) { +if (riskScore >= CRITICAL_RISK_THRESHOLD) { ... -if (riskScore >= 60) { +if (riskScore >= WARNING_RISK_THRESHOLD) {

Also applies to: 155-158

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/index.js` around lines 130 - 133, The risk routing condition using riskScore currently treats Critical as >=80 and Warning as >=60; update the conditional logic in backend/index.js (the blocks checking riskScore, e.g., the if that logs "🚨 CRITICAL risk for ${c.name}" and the corresponding Warning branch around lines referenced) so Critical is riskScore >= 70 and Warning is when riskScore is between 40 and 69 inclusive (i.e., >=40 && <=69); ensure any else/other branches cover <40 as normal/low and apply the same fixes to the second occurrence noted (also applies to the other riskScore check).

mdhaarishussain

Review: feat: implement risk-based orchestration engine

CHANGES REQUESTED — This PR has multiple critical and major bugs that would cause production failures. The mergeable_state is also dirty (merge conflicts must be resolved before this can land). Please address all issues below before re-requesting review.

🔴 Critical Issues

1. setInterval race condition — overlapping monitor loops mutate shared state concurrently
index.js: The async callback in setInterval(..., 5000) has no in-flight guard. When Docker operations (restart, recreate, list containers) take longer than 5 s, the next tick fires before the previous resolves. Multiple iterations can call healer.restartContainer(id) and healer.recreateContainer(id) on the same container simultaneously, and restartTracker / cooldownTracker / scaleTracker can be mutated by two iterations at once. (See inline comment on line 193.)

2. Risk thresholds in index.js contradict the documented risk bands
PR docs: 40–69 Warning, 70–100 Critical. Code uses >= 80 / >= 60. Containers with score 70–79 are never restarted/recreated; scores 40–59 receive no action at all. (See inline comment on line 133.)

3. Cooldown tracker keyed on transient container ID — cooldown bypassed after every recreation
healer.recreateContainer() produces a container with a new ID. cooldownTracker.get(c.id) on the next tick finds nothing → recreate fires again immediately, looping indefinitely. (See inline comment on line 112.)

4. Duplicate healing — supervisor.start() AND setInterval in index.js both independently restart the same containers
Both poll every 5 s and call healer.restartContainer() without any coordination. An unhealthy container will receive simultaneous restart calls from both systems, doubling attempts and making all tracker state unreliable.

🟠 Major Issues

5. supervisor.js: handleFailure() not awaited — unhandled promise rejections + stale healingInProgress state
Calling this.handleFailure(c.id) without await/.catch() causes unhandled rejections (process exit in Node 15+). If any await inside throws, healingInProgress.delete(containerId) never runs → permanent stale state blocking all future healing for that container. (See inline comment on supervisor.js line 20.)

6. autoScaleIfNeeded() always sees currentReplicas = 1
Replicas are named <name>-replica-<i> but the filter uses c.name === serviceName (exact). Replicas never match, so the count is always 1 regardless of reality. (See inline comment on index.js line 57.)

7. scaleService() broad substring matching can scale/remove wrong containers
n.includes(containerName) with containerName = "web" matches my-web, web-proxy, etc. The SCALE DOWN path calls container.stop() + container.remove() on those unrelated containers. (See inline comment on healer.js line 67.)

8. Replicas not assigned sentinel.monitor=true label + duplicate PortBindings
New replicas are invisible to monitoring (label missing) and creation fails for replica 2+ due to port binding conflicts (HostConfig cloned as-is). (See inline comment on healer.js line 77.)

9. Scale cooldown set unconditionally on scale failure
scaleTracker.set(...) fires even when healer.scaleService() returns { success: false }, blocking retries for the full 2-minute cooldown.

10. Breaking API changes not disclosed
The PR claims "no breaking changes" but removes: /api/status, /api/activity, /api/insights, /api/kestra-webhook (breaks AI pipeline), all /api/docker/* management endpoints, and all RBAC routes (/auth, /api/users, /api/roles). The frontend depends on all of these.

🟡 Should Fix / Nitpick

11. waitForHealthy() in supervisor.js burns the full 5-minute timeout for terminal states (removed, error). Add an early-exit when a non-recoverable status is detected.

12. autoScaleIfNeeded(c, cpuUsage) is called twice per iteration when both riskScore >= 60 AND cpuUsage > SCALE_UP_CPU_THRESHOLD. Use else if or a combined guard.

13. package-lock.json includes bcrypt, jsonwebtoken, pg, dotenv, uuid — none are used by this PR's code changes. Remove these extraneous lock-file entries.

14. Docstring coverage is 40% against a required 80% threshold (flagged by CodeRabbit pre-merge check). All new public functions need JSDoc.

Merge Conflicts

The PR's mergeable_state is dirty — conflicts with main must be resolved before this can be reviewed for merge.

mdhaarishussain · 2026-03-01T20:32:29Z

-  tracker.lastAttempt = now;
-  restartTracker.set(id, tracker);
+}, 5000);



Critical: setInterval allows overlapping async loops — concurrent mutations to shared state

When Docker ops (listContainers, restartContainer, recreateContainer) take longer than 5 s, the next tick fires before the previous resolves. Multiple iterations can race on restartTracker, cooldownTracker, scaleTracker, and call healer.restartContainer(id) or healer.recreateContainer(id) on the same container simultaneously.

let monitorLoopInFlight = false; setInterval(async () => { if (monitorLoopInFlight) return; monitorLoopInFlight = true; try { // ...existing body... } catch (err) { console.error("Background monitor error:", err.message); } finally { monitorLoopInFlight = false; } }, 5000);

mdhaarishussain · 2026-03-01T20:32:59Z

+        // Skip if in recreate cooldown
+        const cooldownUntil = cooldownTracker.get(c.id);
+        if (cooldownUntil && Date.now() < cooldownUntil) {
+          continue;


Critical: Cooldown stored under old container ID — bypassed after every recreation

healer.recreateContainer(id) removes the old container and starts a new one with a new ID. On the next loop iteration, cooldownTracker.get(c.id) looks up the new ID and finds nothing — the cooldown is permanently bypassed, triggering another recreation immediately.

Fix: Key cooldowns on container name (stable across recreation):

const cooldownKey = c.name; const cooldownUntil = cooldownTracker.get(cooldownKey); if (cooldownUntil && Date.now() < cooldownUntil) continue; // ...later when setting: cooldownTracker.set(cooldownKey, Date.now() + COOLDOWN_AFTER_RECREATE_MS);

mdhaarishussain · 2026-03-01T20:33:58Z

-  next();
-};
+  console.log(`🚨 CRITICAL risk for ${c.name}`);



Major: Risk thresholds (80/60) contradict documented bands (70/40)

PR documentation: 40–69 Warning, 70–100 Critical.
Code uses >= 80 Critical and >= 60 Warning.

Containers with risk score 70–79 only hit the warning branch (auto-scale), and are never restarted or recreated. Containers with risk score 40–59 fall through all branches entirely — no action taken.

// Use named constants matching the documented spec const CRITICAL_RISK_THRESHOLD = 70; // not 80 const WARNING_RISK_THRESHOLD = 40; // not 60 if (riskScore >= CRITICAL_RISK_THRESHOLD) { ... } else if (riskScore >= WARNING_RISK_THRESHOLD) { ... }

mdhaarishussain · 2026-03-01T20:35:21Z

-    }
+  const allContainers = await listContainers();
+  const sameService = allContainers.filter(c => c.name === serviceName);
+  const currentReplicas = sameService.length;


Major: Exact name match misses all replicas — currentReplicas is always 1

scaleService() in healer.js creates replicas named <containerName>-replica-<i>, but this filter uses c.name === serviceName (exact match). Replicas never match, so currentReplicas is always 1 regardless of how many are running.

Example: 4 replicas running → currentReplicas = 1 → code calls scaleService(name, 2) → creates a fifth container instead of doing nothing.

const sameService = allContainers.filter(c => c.name === serviceName || c.name.startsWith(`${serviceName}-replica-`) );

mdhaarishussain · 2026-03-01T20:36:02Z

+                (health.status === 'unhealthy' || c.status === 'exited') &&
+                !this.healingInProgress.has(c.id)
+            ) {
+                this.handleFailure(c.id);


Critical: handleFailure() not awaited — unhandled rejection crashes Node.js process

Fire-and-forget without .catch(). If restartContainer() or waitForHealthy() rejects, it becomes an unhandled promise rejection, which in Node.js 15+ terminates the process.

Additionally, if any await inside handleFailure() throws, healingInProgress.delete(containerId) on the last line never runs → that container ID stays in the Set permanently → future failures for this container are silently ignored forever.

// call site fix: void this.handleFailure(c.id).catch(err => { console.error(`Healing flow failed for ${c.id}:`, err.message); }); // handleFailure fix: async handleFailure(containerId) { this.healingInProgress.add(containerId); try { this.lifecycle.set(containerId, { detectedAt: Date.now(), healingTriggeredAt: Date.now(), restoredAt: null }); await restartContainer(containerId); const restored = await this.waitForHealthy(containerId); if (restored) { const record = this.lifecycle.get(containerId); record.restoredAt = Date.now(); } } finally { this.healingInProgress.delete(containerId); } }

mdhaarishussain · 2026-03-01T20:39:37Z

+    if (replicas > currentCount) {
+
+      const baseContainer = docker.getContainer(sameService[0].Id);
+      const info = await baseContainer.inspect();


Major: Broad substring matching — wrong containers can be stopped and removed

n.includes(containerName) is too permissive. A containerName of "web" matches web-app, my-web-server, web-proxy, etc. The SCALE DOWN path will then container.stop() and container.remove() those unrelated containers, causing outages.

Also no null guard: sameService[0] on line 66 throws TypeError if sameService is empty.

const sameService = containers.filter(c => c.Names && c.Names.some(n => { const name = n.replace(/^\//, ''); return name === containerName || name.startsWith(`${containerName}-replica-`); }) ); if (sameService.length === 0) { return { success: false, error: `No containers found for ${containerName}` }; }

mdhaarishussain · 2026-03-01T20:41:22Z

+          Image: info.Config.Image,
+          name: newName,
+          Env: info.Config.Env,
+          HostConfig: info.HostConfig


Major: Replicas created without sentinel.monitor=true label and with conflicting PortBindings

Two bugs in this createContainer call:

Labels from info.Config.Labels is omitted. Since listContainers() filters by { "sentinel.monitor": "true" }, new replicas are invisible to monitoring — they run unsupervised indefinitely.

HostConfig: info.HostConfig copies PortBindings as-is. Every replica tries to bind the same host port(s) → container creation fails for replica 2+, or causes port conflicts if the OS allows it.

const hostConfig = { ...info.HostConfig, PortBindings: {} }; const newContainer = await docker.createContainer({ Image: info.Config.Image, name: newName, Env: info.Config.Env, Labels: info.Config.Labels, // inherit sentinel.monitor=true label HostConfig: hostConfig // cleared PortBindings });

mdhaarishussain · 2026-03-04T14:01:33Z

ANy Updates @srija-pixel

mdhaarishussain · 2026-03-05T22:11:40Z

@srija-pixel if you do not respond within today i will have to close this PR and re assign to someone else due to time constraints

mdhaarishussain

Changes requested.

The orchestration engine concept (risk scoring, CPU-based auto-scaling, the new supervisor.js with lifecycle tracking) is interesting, but this PR removes RBAC routes and other existing functionality from index.js to implement it:

app.use('/auth', authRoutes), app.use('/api/users', ...), app.use('/api/roles', ...) — RBAC route registrations are deleted
checkServiceHealth() and the in-memory system status are removed
Activity log system (logActivity, activityLog) is removed

Removing working core features to replace them with an orchestration engine is not acceptable in a single PR. The risk scoring engine and supervisor should be added alongside existing functionality, not as a replacement.

Additionally, supervisor.js's start() method calls this.checkContainers() inside setInterval without .catch(), which will produce unhandled promise rejections on any error in Node.js.

backend/package-lock.json is also committed. Please scope this PR to only adding the orchestration engine without removing existing routes and systems.

feat: implement risk-based orchestration engine

155fb9b

coderabbitai Bot reviewed Mar 1, 2026

View reviewed changes

mdhaarishussain assigned srija-pixel Mar 1, 2026

mdhaarishussain linked an issue Mar 1, 2026 that may be closed by this pull request

🧪 Chaos Engineering & Self-Validation Framework #72

Open

12 tasks

mdhaarishussain added enhancement New feature or request apertre3.0 medium labels Mar 1, 2026

mdhaarishussain requested changes Mar 1, 2026

View reviewed changes

mdhaarishussain requested changes Mar 8, 2026

View reviewed changes

		if (error.statusCode === 404 \|\| error.reason === 'no such container') {
		return { status: 'removed', failingStreak: 0, log: [] };

-                this.handleFailure(c.id);
+                void this.handleFailure(c.id).catch(err => {
+                    console.error(`Healing flow failed for ${c.id}:`, err.message);
+                });
+     async handleFailure(containerId) {
+         this.healingInProgress.add(containerId);
+         try {
+             this.lifecycle.set(containerId, {
+                 detectedAt: Date.now(),
+                 healingTriggeredAt: Date.now(),
+                 restoredAt: null
+             });
+             await restartContainer(containerId);
+             const restored = await this.waitForHealthy(containerId);
+             if (restored) {
+                 const record = this.lifecycle.get(containerId);
+                 record.restoredAt = Date.now();
+             }
+         } finally {
+             this.healingInProgress.delete(containerId);
+         }
+     }

		await healer.scaleService(serviceName, currentReplicas + 1);
		scaleTracker.set(serviceName, Date.now());

Conversation

srija-pixel commented Mar 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Overview

🧠 Key Improvements

📊 Risk Classification

🧪 Tested Scenarios

🔄 Impact

Summary by CodeRabbit

Release Notes

Uh oh!

vercel Bot commented Mar 1, 2026

Uh oh!

coderabbitai Bot commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

.coderabbit.yaml has unrecognized properties

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

What the error object looks like (fields)

Where the “reason” comes from

Where the “message” comes from

Why it’s 404

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

1) Non-stream API calls (typical “removed container” case)

2) Stream-based operations (pull/build/etc.) and followProgress

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

mdhaarishussain left a comment

Choose a reason for hiding this comment

Review: feat: implement risk-based orchestration engine

🔴 Critical Issues

🟠 Major Issues

🟡 Should Fix / Nitpick

Merge Conflicts

Uh oh!

mdhaarishussain Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

mdhaarishussain Mar 1, 2026

Choose a reason for hiding this comment

Uh oh!

mdhaarishussain Mar 1, 2026

srija-pixel commented Mar 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 1, 2026 •

edited

Loading

`.coderabbit.yaml` has unrecognized properties

2) Stream-based operations (pull/build/etc.) and `followProgress`