Skip to content

feat: implement risk-based orchestration engine#157

Open
srija-pixel wants to merge 1 commit into
SKfaizan-786:mainfrom
srija-pixel:feature/phase-7-risk-engine
Open

feat: implement risk-based orchestration engine#157
srija-pixel wants to merge 1 commit into
SKfaizan-786:mainfrom
srija-pixel:feature/phase-7-risk-engine

Conversation

@srija-pixel

@srija-pixel srija-pixel commented Mar 1, 2026

Copy link
Copy Markdown
Contributor

🚀 Overview

This PR introduces an intelligent risk-based orchestration engine for container monitoring.

Instead of binary health-based decisions, the system now computes a weighted risk score (0–100) based on:

  • CPU utilization
  • Memory utilization
  • Restart frequency
  • Container health state

🧠 Key Improvements

  • Multi-factor risk scoring
  • Adaptive restart logic
  • Escalation to container recreation
  • Cooldown protection
  • Preventive scaling support
  • Race-condition safe monitoring

📊 Risk Classification

  • 0–39 → Stable
  • 40–69 → Warning
  • 70–100 → Critical

🧪 Tested Scenarios

  • CPU stress simulation
  • Permanent unhealthy container
  • Restart escalation
  • Cooldown protection

🔄 Impact

Transforms the monitoring layer from reactive rule-based recovery to adaptive intelligent orchestration.

No breaking changes introduced.

Summary by CodeRabbit

Release Notes

  • New Features

    • Automatic detection and recovery of failed containers with self-healing capabilities
    • Intelligent CPU-based auto-scaling to optimize resource usage
    • Comprehensive container health monitoring with risk assessment system
  • Improvements

    • Enhanced error handling for container lifecycle operations
    • More precise health metrics reporting

@vercel

vercel Bot commented Mar 1, 2026

Copy link
Copy Markdown

@srija-pixel is attempting to deploy a commit to the SK Faizanuddin's projects Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai

coderabbitai Bot commented Mar 1, 2026

Copy link
Copy Markdown
Contributor

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'version', 'tools', 'instructions', 'ignored_files'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Refactored Docker container orchestration from Swarm-based service management to direct container handling. Introduced a new ContainerSupervisor class that monitors container health, detects failures, and orchestrates healing with lifecycle tracking. Enhanced error handling across Docker client modules and rewrote the backend to use dynamic risk-based container management with auto-scaling logic.

Changes

Cohort / File(s) Summary
Error Handling Improvements
backend/docker/client.js, backend/docker/monitor.js
Enhanced error handling to explicitly detect container removal (404 status/"no such container" cases) and handle them distinctly from unexpected errors. Monitor now returns CPU/memory percentages as numbers instead of strings.
Container Scaling Refactoring
backend/docker/healer.js
Replaced Docker Swarm service-based scaling with direct container management. Function signature changed from scaleService(serviceName, replicas) to scaleService(containerName, replicas). Now lists, filters, and manages container replicas with naming convention <containerName>-replica-<i>.
New Container Supervisor
backend/docker/supervisor.js
Introduced new ContainerSupervisor class that periodically monitors container health, detects failures, orchestrates healing with lifecycle metadata tracking, and provides methods for health polling and lifecycle access. Exports singleton instance.
Backend Orchestration Rewrite
backend/index.js
Replaced static health checks with dynamic container monitoring. Added autoScaleIfNeeded() and computeRiskScore() functions for automated scaling and governance-based container lifecycle management. Integrated supervisor-driven background monitoring loop with per-container restart and cooldown tracking.

Sequence Diagram(s)

sequenceDiagram
    participant Monitor as Monitor
    participant Supervisor as Supervisor
    participant Client as Docker Client
    participant Healer as Healer
    participant Index as Backend Index

    Note over Monitor,Index: Health Check & Healing Cycle

    Supervisor->>Monitor: Periodic interval (5s)
    Monitor->>Client: getContainerHealth(containerId)
    Client-->>Monitor: {status, metrics}
    
    alt Container Unhealthy or Exited
        Monitor->>Supervisor: Detect failure
        Supervisor->>Supervisor: recordLifecycle(detectedAt)
        Supervisor->>Client: restartContainer(containerId)
        Client-->>Supervisor: Restarted
        
        loop Poll every 3s (timeout: 5min)
            Supervisor->>Client: getContainerHealth(containerId)
            Client-->>Supervisor: Health status
        end
        
        alt Container Healthy
            Supervisor->>Supervisor: recordLifecycle(restoredAt)
            Supervisor-->>Index: Recovery complete
        else Timeout
            Supervisor->>Index: Healing failed
        end
    else Needs Scaling
        Index->>Index: computeRiskScore(metrics)
        Index->>Index: autoScaleIfNeeded(cpuUsage)
        Index->>Healer: scaleService(containerName, targetReplicas)
        Healer->>Client: List & manage containers
        Client-->>Healer: Container operations
        Healer-->>Index: Scaling result
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • Added Docker Container Monitoring #39: Introduces initial Docker module functions (getContainerHealth, scaleService, monitor stream handling) that are now enhanced and refactored in this PR.

Suggested labels

enhancement, hard

Suggested reviewers

  • mdhaarishussain

Poem

🐰 A rabbit's cheer for this orchestration!
Containers now heal with dedication,
Health checks and scaling dance in sync,
Risk scores guide each thoughtful link,
No more manual care—just supervision! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: implement risk-based orchestration engine' accurately and concisely summarizes the main feature introduced by the changeset, reflecting the core implementation of risk-based container orchestration.
Description check ✅ Passed The PR description covers the overview, key improvements, risk classification, tested scenarios, and impact, though it lacks the structured sections required by the template (Related Issues, Type of Change checkbox, How Has This Been Tested details, and Checklist).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (1)
backend/index.js (1)

155-160: Avoid duplicate scale checks in the same iteration.

Scaling is already attempted in the warning branch (Line 159), then re-attempted for high CPU (Line 176). This is redundant and makes behavior harder to reason about.

Proposed refactor
-        if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
+        if (riskScore < 60 && cpuUsage > SCALE_UP_CPU_THRESHOLD) {
           await autoScaleIfNeeded(c, cpuUsage);
         }

Also applies to: 176-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 155 - 160, The code currently calls
autoScaleIfNeeded(c, cpuUsage) twice in the same loop (once when riskScore >= 60
and again when CPU is high), causing duplicate scale attempts; change the logic
so autoScaleIfNeeded is invoked at most once per container per iteration—either
by combining the checks into a single boolean (e.g., needScale = riskScore >= 60
|| cpuUsage > HIGH_CPU_THRESHOLD) and calling autoScaleIfNeeded(c, cpuUsage)
only when needScale, or by using an if/else-if so the high-CPU branch skips
calling autoScaleIfNeeded if the riskScore branch already did; update references
to riskScore, cpuUsage, and autoScaleIfNeeded accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/docker/client.js`:
- Around line 48-49: The container health error check in
backend/docker/client.js assumes error.reason exists; make it defensive by
checking multiple error properties (e.g., error.reason, error.message,
error.statusMessage or parsed error body) with optional chaining and lowercasing
before comparing to 'no such container', and keep the existing statusCode ===
404 check; update the conditional in the health-checking function (the block
returning { status: 'removed', failingStreak: 0, log: [] }) to use these
fallbacks so removed containers are detected even when the error shape varies.

In `@backend/docker/healer.js`:
- Around line 57-67: The container-group matching in sameService is too broad
and unsafe: change the filter on containers to ensure c.Names exists and compare
names exactly (trim leading '/' from each name and match equality to
containerName) instead of using includes(containerName), and before using
sameService[0] in SCALE UP add a null/empty guard (e.g., throw or log and abort
scaling if sameService.length === 0) prior to calling docker.getContainer and
baseContainer.inspect to avoid runtime errors or scaling the wrong group.
- Around line 73-77: scaleService() currently clones the base container's config
but omits Labels and copies HostConfig including PortBindings, which causes new
replicas to be excluded from monitoring and to conflict on host ports; when
calling docker.createContainer to make newContainer, include info.Config.Labels
(or at least the sentinel.monitor=true label) in the container spec so replicas
are monitored, and remove or clear HostConfig.PortBindings (e.g., delete
HostConfig.PortBindings or set it to an empty object) before passing HostConfig
to docker.createContainer to avoid duplicating host port bindings.

In `@backend/docker/monitor.js`:
- Around line 27-33: The stream error handling should also detect "no such
container" when the text is in err.message as well as err.reason; update the
conditional around the error handler that currently checks err.statusCode and
err.reason to instead test for statusCode === 404 or a case-insensitive match of
"no such container" against (err.reason || err.message), and then call
this.stopMonitoring(containerId) and return as before (references:
this.stopMonitoring, containerId, err).

In `@backend/docker/supervisor.js`:
- Around line 49-59: The waitForHealthy loop currently waits the full timeout
even when the container enters terminal non-healthy states; update the loop in
waitForHealthy to check the health.status returned by
getContainerHealth(containerId) for terminal states (e.g., 'removed', 'error',
'dead', 'exited' or any non-recoverable status you use) and immediately return
false (or fail fast) when one of those states is seen instead of continuing to
sleep until timeout; keep the existing healthy check that returns true when
status === 'healthy'.
- Line 20: The call to this.handleFailure(c.id) should be awaited and its
rejection handled to avoid unhandled promise rejections; change invocations of
handleFailure to await this.handleFailure(containerId) inside a try/catch and
log or handle any errors. Also wrap the async healing logic that manipulates
healingInProgress (the block around where healingInProgress.add(containerId) and
healingInProgress.delete(containerId) are used) in a try/finally so
healingInProgress.delete(containerId) always runs; in the try handle normal
async steps and in catch log errors, and in finally remove the containerId from
healingInProgress to prevent stale state blocking future healing attempts.

In `@backend/index.js`:
- Around line 62-63: The code sets the cooldown in scaleTracker unconditionally
after calling healer.scaleService, which blocks retries even when scaling
failed; update the call sites that currently do await
healer.scaleService(serviceName, currentReplicas + 1) / await
healer.scaleService(serviceName, currentReplicas - 1) so that you capture the
result (e.g., const res = await healer.scaleService(...)) and only call
scaleTracker.set(serviceName, Date.now()) when res.success is truthy (and
optionally log or handle res.error when false); ensure both the scale-up and
scale-down branches use this success check so failed scale attempts don't start
the cooldown.
- Around line 130-133: The risk routing condition using riskScore currently
treats Critical as >=80 and Warning as >=60; update the conditional logic in
backend/index.js (the blocks checking riskScore, e.g., the if that logs "🚨
CRITICAL risk for ${c.name}" and the corresponding Warning branch around lines
referenced) so Critical is riskScore >= 70 and Warning is when riskScore is
between 40 and 69 inclusive (i.e., >=40 && <=69); ensure any else/other branches
cover <40 as normal/low and apply the same fixes to the second occurrence noted
(also applies to the other riskScore check).
- Around line 110-112: The cooldown is stored under the old container id but
checked against the current container id (cooldownTracker.get(c.id)), so
recreated containers bypass cooldown; fix by keying cooldowns with a stable
identifier used in both places (e.g., container image or container name) instead
of the transient container id: choose a stable key (like const key = c.Image ||
c.Names?.[0]) and replace both cooldownTracker.get(c.id) and the place that sets
the cooldown (the code that currently uses the removed container id) to use
cooldownTracker.get(key) and cooldownTracker.set(key, ...), ensuring the same
symbol (the chosen key) is used for lookup and storage.
- Around line 50-57: The replica count logic uses exact name matches (c.name ===
serviceName) but replicas are created with a "-replica-*" suffix, causing
incorrect currentReplicas; update the filter in the autoscaling block (where
serviceName, listContainers, sameService, currentReplicas, scaleTracker and
SCALE_COOLDOWN_MS are used) to include both the base service name and replica
names (e.g., c.name === serviceName || c.name.startsWith(serviceName +
'-replica-') or use a regex like ^serviceName(-replica-\d+)?$) so
currentReplicas correctly counts all replicas.
- Around line 101-193: The background loop uses setInterval which allows
overlapping runs and races on restartTracker, cooldownTracker, scaleTracker and
concurrent calls to
healer.restartContainer/healer.recreateContainer/autoScaleIfNeeded; replace the
setInterval pattern with a single async loop (e.g., an async function that
awaits a delay each iteration) or implement per-container locks (a Map of
mutexes/Promise-chains) and acquire the lock at the start of processing each
container id, releasing it after all actions (monitor.startMonitoring,
computeRiskScore, healer.restartContainer, healer.recreateContainer,
autoScaleIfNeeded, monitor.stopMonitoring) complete; ensure mutations to
restartTracker/cooldownTracker/scaleTracker and calls to healer.* are performed
only while the lock is held so no two iterations can act on the same container
concurrently.

---

Nitpick comments:
In `@backend/index.js`:
- Around line 155-160: The code currently calls autoScaleIfNeeded(c, cpuUsage)
twice in the same loop (once when riskScore >= 60 and again when CPU is high),
causing duplicate scale attempts; change the logic so autoScaleIfNeeded is
invoked at most once per container per iteration—either by combining the checks
into a single boolean (e.g., needScale = riskScore >= 60 || cpuUsage >
HIGH_CPU_THRESHOLD) and calling autoScaleIfNeeded(c, cpuUsage) only when
needScale, or by using an if/else-if so the high-CPU branch skips calling
autoScaleIfNeeded if the riskScore branch already did; update references to
riskScore, cpuUsage, and autoScaleIfNeeded accordingly.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5dc7cd4 and 155fb9b.

⛔ Files ignored due to path filters (1)
  • backend/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (5)
  • backend/docker/client.js
  • backend/docker/healer.js
  • backend/docker/monitor.js
  • backend/docker/supervisor.js
  • backend/index.js

Comment thread backend/docker/client.js
Comment on lines +48 to +49
if (error.statusCode === 404 || error.reason === 'no such container') {
return { status: 'removed', failingStreak: 0, log: [] };

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's check if the file exists and read the relevant section
if [ -f "backend/docker/client.js" ]; then
  echo "=== Lines 40-60 of backend/docker/client.js ==="
  sed -n '40,60p' backend/docker/client.js | cat -n
else
  # Try to find the file
  find . -name "client.js" -type f | head -20
fi

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 919


🏁 Script executed:

#!/bin/bash
# Also search for error handling patterns in this file
echo "=== Searching for error handling patterns ==="
rg -n "error\.(reason|message|json)" backend/docker/client.js -A 2 -B 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 545


🏁 Script executed:

#!/bin/bash
# Search for tests or documentation about error structures
echo "=== Looking for test files or documentation ==="
fd -e test.js -e spec.js -e .test.js backend/ | head -10
find backend -name "*.md" -o -name "*.test.js" | head -10

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 480


🏁 Script executed:

#!/bin/bash
# Check the imports and initialization of docker client
echo "=== Looking at imports and docker initialization ==="
head -50 backend/docker/client.js | cat -n

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2015


🏁 Script executed:

#!/bin/bash
# Look for Docker library in package.json
echo "=== Checking package.json for Docker dependencies ==="
rg "docker|moby" package.json -i

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 201


🏁 Script executed:

#!/bin/bash
# Search for where getContainerHealth is called to see error context
echo "=== Searching for getContainerHealth usage ==="
rg "getContainerHealth" backend/ -B 3 -A 3

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1784


🏁 Script executed:

#!/bin/bash
# Search for any error structure documentation or examples
echo "=== Searching for error handling patterns ==="
rg "error\." backend/docker/client.js -B 1 -A 1 | head -40

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 451


🏁 Script executed:

#!/bin/bash
# Look for test files more broadly
echo "=== Searching for test files ==="
find . -type f -name "*.test.js" -o -name "*.spec.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 218


🏁 Script executed:

#!/bin/bash
# Look for any error fixtures or mock data
echo "=== Searching for error-related test data ==="
rg "no such container" . -l

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 178


🏁 Script executed:

#!/bin/bash
# Look at the root directory to understand structure
echo "=== Checking repository structure ==="
ls -la | head -30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1483


🏁 Script executed:

#!/bin/bash
# Check if package.json exists elsewhere
echo "=== Finding package.json ==="
find . -name "package.json" -type f

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 303


🏁 Script executed:

#!/bin/bash
# Check backend package.json for dockerode version
echo "=== backend/package.json ==="
cat backend/package.json

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 760


🏁 Script executed:

#!/bin/bash
# Check monitor.js for context about how it uses getContainerHealth
echo "=== Checking monitor.js ==="
rg "no such container" backend/docker/monitor.js -B 5 -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 440


🏁 Script executed:

#!/bin/bash
# Look at one of the test files to see if there are mock errors
echo "=== Checking test files for error patterns ==="
head -100 cli/tests/heal.test.js | cat -n

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 3947


🌐 Web query:

dockerode error structure "no such container" message reason

💡 Result:

In dockerode, the error you get for a missing container is a normal Docker Engine HTTP 404 (“no such container”) surfaced through docker-modem.

What the error object looks like (fields)

docker-modem builds an Error and adds these extra fields when the HTTP status code is not in the “success” set: reason, statusCode, and json. The human-readable err.message is formatted from those pieces too. [1]

So for “no such container”, you typically see something like:

{
  name: 'Error',
  message: '(HTTP code 404) no such container - No such container: <id> ',
  reason: 'no such container',
  statusCode: 404,
  json: { message: 'No such container: <id>' }
}

Where the “reason” comes from

err.reason is not coming from Docker itself; it comes from the statusCodes mapping that dockerode passes into docker-modem (for 404 it’s commonly the string 'no such container'). docker-modem assigns that mapping value to msg.reason. [1]

Where the “message” comes from

Docker Engine’s API error responses include a JSON body with a message field (for example: {"message":"No such container: c2ada9df5af8"}), and docker-modem stores the parsed body on err.json and also uses json.message when composing err.message. [1][2]

Why it’s 404

Docker Engine endpoints that reference a container return 404 – no such container when the container id/name doesn’t exist. [3]

Sources: [1] docker-modem buildPayload implementation (adds reason, statusCode, json and formats message), [2] example of Docker API “No such container” JSON error body, [3] Docker Engine API docs showing 404 “no such container”.


🏁 Script executed:

#!/bin/bash
# Let's also check if there are any comments or examples in the codebase about error structures
echo "=== Searching for error structure examples ==="
rg "error\.(statusCode|reason|message|json)" backend/ -B 2 -A 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 12911


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation in the repo about Docker errors
echo "=== Checking documentation files ==="
grep -r "no such container" docs/ 2>/dev/null || echo "Not found in docs"
grep -r "Docker error" docs/ 2>/dev/null || echo "Not found in docs"

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 150


Add defensive fallbacks for error message checking in container health errors.

Line 48 checks error.statusCode === 404 as the primary condition, which correctly catches the "no such container" case. However, the secondary check error.reason === 'no such container' assumes the reason property is always present. While dockerode/docker-modem should set this property, a more defensive approach would check multiple error properties as fallbacks to ensure robustness:

Suggested fix
-    if (error.statusCode === 404 || error.reason === 'no such container') {
+    const reason = `${error.reason || error.message || error.json?.message || ''}`.toLowerCase();
+    if (error.statusCode === 404 || reason.includes('no such container')) {
       return { status: 'removed', failingStreak: 0, log: [] };
     }

This handles edge cases where the error structure varies, ensuring removed containers are always correctly identified regardless of which error property contains the message.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/docker/client.js` around lines 48 - 49, The container health error
check in backend/docker/client.js assumes error.reason exists; make it defensive
by checking multiple error properties (e.g., error.reason, error.message,
error.statusMessage or parsed error body) with optional chaining and lowercasing
before comparing to 'no such container', and keep the existing statusCode ===
404 check; update the conditional in the health-checking function (the block
returning { status: 'removed', failingStreak: 0, log: [] }) to use these
fallbacks so removed containers are detected even when the error shape varies.

Comment thread backend/docker/healer.js
Comment on lines +57 to +67
const sameService = containers.filter(c =>
c.Names.some(n => n.includes(containerName))
);

const currentCount = sameService.length;

// SCALE UP
if (replicas > currentCount) {

const baseContainer = docker.getContainer(sameService[0].Id);
const info = await baseContainer.inspect();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Container-group matching is too broad and not null-safe.

Line 58 uses substring matching (includes(containerName)), which can match unrelated names, and Line 66 assumes at least one match exists. That can scale the wrong group or throw on scale-up.

Proposed fix
-    const sameService = containers.filter(c =>
-      c.Names.some(n => n.includes(containerName))
-    );
+    const sameService = containers.filter(c =>
+      c.Names.some(n => {
+        const name = n.replace(/^\//, '');
+        return name === containerName || name.startsWith(`${containerName}-replica-`);
+      })
+    );
+
+    if (sameService.length === 0) {
+      return { success: false, error: `No containers found for ${containerName}` };
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/docker/healer.js` around lines 57 - 67, The container-group matching
in sameService is too broad and unsafe: change the filter on containers to
ensure c.Names exists and compare names exactly (trim leading '/' from each name
and match equality to containerName) instead of using includes(containerName),
and before using sameService[0] in SCALE UP add a null/empty guard (e.g., throw
or log and abort scaling if sameService.length === 0) prior to calling
docker.getContainer and baseContainer.inspect to avoid runtime errors or scaling
the wrong group.

Comment thread backend/docker/healer.js
Comment on lines +73 to +77
const newContainer = await docker.createContainer({
Image: info.Config.Image,
name: newName,
Env: info.Config.Env,
HostConfig: info.HostConfig

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the healer.js file around lines 73-77
head -n 85 backend/docker/healer.js | tail -n 20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 617


🏁 Script executed:

# Also check the broader context of the healer.js file to understand the function
head -n 95 backend/docker/healer.js | tail -n 30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 927


🏁 Script executed:

# Now let's look at client.js line 12 to understand the listContainers filtering
head -n 25 backend/docker/client.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 882


🏁 Script executed:

# Let's also search for where info.Config.Labels is used or referenced
rg "Config\.Labels" backend/docker/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60


🏁 Script executed:

# Check if PortBindings is ever explicitly handled elsewhere
rg "PortBindings" backend/docker/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60


🏁 Script executed:

# Check if labels are set when creating the initial container
rg "sentinel\.monitor" backend/

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 144


🏁 Script executed:

# Look for docker creation code to see if labels are set initially
rg "createContainer" backend/docker/ -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1281


🏁 Script executed:

# Check Docker API documentation patterns - look for how containers are created elsewhere
rg "Labels\s*:" backend/ -A 2 -B 2

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 60


🏁 Script executed:

# Check if there are any imports or usage of docker-compose or config files
fd -t f "\.(yml|yaml|json)" backend/ | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 124


🏁 Script executed:

# Let's see the full context of both createContainer calls to understand the difference
sed -n '45,105p' backend/docker/healer.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1711


🏁 Script executed:

# Check if there's a pattern for how containers are initially created with labels
rg "docker.createContainer" backend/ -B 10 -A 10

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2433


🏁 Script executed:

# Let's understand the Docker API inspect response structure - check if there are any comments or docs
head -n 150 backend/docker/healer.js | tail -n 60

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1683


Add Labels and remove PortBindings when creating replicas in scaleService().

The scaleService() function creates replicas without copying Labels from the base container. Since listContainers() filters by sentinel.monitor=true label, new replicas will be skipped by monitoring. Additionally, cloning HostConfig as-is duplicates PortBindings across replicas, causing host-port binding conflicts.

Proposed fix
+        const hostConfig = { ...info.HostConfig };
+        delete hostConfig.PortBindings; // avoid duplicate host-port binding conflicts

         const newContainer = await docker.createContainer({
           Image: info.Config.Image,
           name: newName,
           Env: info.Config.Env,
+          Labels: info.Config.Labels,
-          HostConfig: info.HostConfig
+          HostConfig: hostConfig
         });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/docker/healer.js` around lines 73 - 77, scaleService() currently
clones the base container's config but omits Labels and copies HostConfig
including PortBindings, which causes new replicas to be excluded from monitoring
and to conflict on host ports; when calling docker.createContainer to make
newContainer, include info.Config.Labels (or at least the sentinel.monitor=true
label) in the container spec so replicas are monitored, and remove or clear
HostConfig.PortBindings (e.g., delete HostConfig.PortBindings or set it to an
empty object) before passing HostConfig to docker.createContainer to avoid
duplicating host port bindings.

Comment thread backend/docker/monitor.js
Comment on lines +27 to +33
if (err.statusCode === 404 || err.reason === 'no such container') {
this.stopMonitoring(containerId);
return;
}

console.error(`Stream error for ${containerId}:`, err.message);
this.stopMonitoring(containerId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "monitor.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 105


🏁 Script executed:

cat -n ./backend/docker/monitor.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 4819


🏁 Script executed:

# Search for other error handling patterns in the codebase related to Docker
rg "err\." --type js -B 2 -A 2 | head -100

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 1554


🏁 Script executed:

# Check which Docker library is being used
grep -r "require.*docker" --include="*.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 401


🌐 Web query:

dockerode stream error handling removed container error structure

💡 Result:

In dockerode, the “error structure” you see is mostly defined by docker-modem (the HTTP layer used by dockerode).

1) Non-stream API calls (typical “removed container” case)

When you call a normal endpoint (e.g., container.inspect(), container.stop(), container.remove(), container.logs() without follow, etc.) and the container is already gone, Docker Engine usually replies HTTP 404 with a JSON body like:

{ "message": "No such container: <id>" }

docker-modem turns any non-2xx response into a JavaScript Error and adds these properties:

  • err.statusCode → the HTTP status (e.g., 404)
  • err.reason → whatever dockerode’s endpoint statusCodes[404] mapping is for that call (commonly something like "no such container")
  • err.json → the parsed response body (often the object containing "message")

And it builds err.message like:
(HTTP code 404) <reason> - <cause.message || cause.error || cause> (plus spaces). (raw.githubusercontent.com)

So you generally want to handle removed-container as:

  • if (err?.statusCode === 404) and then read err.json?.message.

2) Stream-based operations (pull/build/etc.) and followProgress

For streaming endpoints that you pass into:

docker.modem.followProgress(stream, onFinished, onProgress)

followProgress collects JSON lines from the stream into output[], and:

  • if the stream emits an 'error' event, it calls onFinished(err, output)
  • if the stream ends/closes normally, it calls onFinished(null, output) (raw.githubusercontent.com)

Meaning: if a container is removed and the daemon simply closes an attached/logs stream, you may get end/close (no error) rather than a structured HTTP error; you only get the structured (HTTP code ...) error object when there is an HTTP response for a request that failed (like an inspect/remove call). (raw.githubusercontent.com)

Citations:


🏁 Script executed:

# Look for other places where stream errors might be handled or logged
rg "stream\.on\('error'" --type js -B 3 -A 5

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 577


Match removed-container errors in both err.reason and err.message.

Line 27 only checks err.statusCode === 404 or err.reason === 'no such container'. Docker stream errors from dockerode can place error details in err.message instead of (or in addition to) err.reason, so removed containers may incorrectly trigger the generic stream error handler.

Proposed fix
-    if (err.statusCode === 404 || err.reason === 'no such container') {
+    const reason = `${err.reason || err.message || err.json?.message || ''}`.toLowerCase();
+    if (err.statusCode === 404 || reason.includes('no such container')) {
         this.stopMonitoring(containerId);
         return;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/docker/monitor.js` around lines 27 - 33, The stream error handling
should also detect "no such container" when the text is in err.message as well
as err.reason; update the conditional around the error handler that currently
checks err.statusCode and err.reason to instead test for statusCode === 404 or a
case-insensitive match of "no such container" against (err.reason ||
err.message), and then call this.stopMonitoring(containerId) and return as
before (references: this.stopMonitoring, containerId, err).

(health.status === 'unhealthy' || c.status === 'exited') &&
!this.healingInProgress.has(c.id)
) {
this.handleFailure(c.id);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the file
fd "supervisor.js" --type f

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 106


🏁 Script executed:

# Read the supervisor.js file to see the actual code
cat -n backend/docker/supervisor.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2471


Add error handling and try/finally to prevent unhandled rejections and state leaks.

Line 20 calls handleFailure() without awaiting or catching—if the promise rejects, you get an unhandled rejection. Lines 25-44 lack try/finally, so if any async operation throws, healingInProgress.delete(containerId) on line 43 never runs, leaving stale state that blocks future healing attempts for that container.

Proposed fix
-                this.handleFailure(c.id);
+                void this.handleFailure(c.id).catch(err => {
+                    console.error(`Healing flow failed for ${c.id}:`, err.message);
+                });

     async handleFailure(containerId) {
-        this.healingInProgress.add(containerId);
-
-        this.lifecycle.set(containerId, {
-            detectedAt: Date.now(),
-            healingTriggeredAt: Date.now(),
-            restoredAt: null
-        });
-
-        await restartContainer(containerId);
-
-        const restored = await this.waitForHealthy(containerId);
-
-        if (restored) {
-            const record = this.lifecycle.get(containerId);
-            record.restoredAt = Date.now();
-        }
-
-        this.healingInProgress.delete(containerId);
+        this.healingInProgress.add(containerId);
+        try {
+            this.lifecycle.set(containerId, {
+                detectedAt: Date.now(),
+                healingTriggeredAt: Date.now(),
+                restoredAt: null
+            });
+
+            await restartContainer(containerId);
+            const restored = await this.waitForHealthy(containerId);
+
+            if (restored) {
+                const record = this.lifecycle.get(containerId);
+                record.restoredAt = Date.now();
+            }
+        } finally {
+            this.healingInProgress.delete(containerId);
+        }
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
this.handleFailure(c.id);
void this.handleFailure(c.id).catch(err => {
console.error(`Healing flow failed for ${c.id}:`, err.message);
});
async handleFailure(containerId) {
this.healingInProgress.add(containerId);
try {
this.lifecycle.set(containerId, {
detectedAt: Date.now(),
healingTriggeredAt: Date.now(),
restoredAt: null
});
await restartContainer(containerId);
const restored = await this.waitForHealthy(containerId);
if (restored) {
const record = this.lifecycle.get(containerId);
record.restoredAt = Date.now();
}
} finally {
this.healingInProgress.delete(containerId);
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/docker/supervisor.js` at line 20, The call to
this.handleFailure(c.id) should be awaited and its rejection handled to avoid
unhandled promise rejections; change invocations of handleFailure to await
this.handleFailure(containerId) inside a try/catch and log or handle any errors.
Also wrap the async healing logic that manipulates healingInProgress (the block
around where healingInProgress.add(containerId) and
healingInProgress.delete(containerId) are used) in a try/finally so
healingInProgress.delete(containerId) always runs; in the try handle normal
async steps and in catch log errors, and in finally remove the containerId from
healingInProgress to prevent stale state blocking future healing attempts.

Comment thread backend/index.js
Comment on lines +50 to +57
const serviceName = container.name;
const lastScale = scaleTracker.get(serviceName) || 0;

if (
systemStatus.services[service.name].status !== newStatus ||
systemStatus.services[service.name].code !== newCode
) {
const prevStatus = systemStatus.services[service.name].status;

// Log Status Changes
if (newStatus === 'healthy' && prevStatus !== 'healthy' && prevStatus !== 'unknown') {
logActivity('success', `Service ${service.name} recovered to HEALTHY`);
} else if (newStatus !== 'healthy' && prevStatus !== newStatus) {
const severity = newStatus === 'critical' ? 'alert' : 'warn';
logActivity(severity, `Service ${service.name} is ${newStatus.toUpperCase()} (Code: ${newCode})`);
}
if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;

systemStatus.services[service.name] = {
status: newStatus,
code: newCode,
lastUpdated: new Date()
};
hasChanges = true;

// Broadcast individual service update
wsBroadcaster.broadcast('SERVICE_UPDATE', {
name: service.name,
...systemStatus.services[service.name]
});
}
}
const allContainers = await listContainers();
const sameService = allContainers.filter(c => c.name === serviceName);
const currentReplicas = sameService.length;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Replica grouping in autoscaling is inconsistent with replica naming.

Line 56 counts only exact name matches (c.name === serviceName), while replicas are named with -replica-* in backend/docker/healer.js Line 71. This can send incorrect target replica counts and cause scale oscillation.

Proposed fix
 async function autoScaleIfNeeded(container, cpuUsage) {
-  const serviceName = container.name;
+  const serviceName = container.name.split('-replica-')[0];
   const lastScale = scaleTracker.get(serviceName) || 0;

   if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;

   const allContainers = await listContainers();
-  const sameService = allContainers.filter(c => c.name === serviceName);
+  const sameService = allContainers.filter(c => c.name.split('-replica-')[0] === serviceName);
   const currentReplicas = sameService.length;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const serviceName = container.name;
const lastScale = scaleTracker.get(serviceName) || 0;
if (
systemStatus.services[service.name].status !== newStatus ||
systemStatus.services[service.name].code !== newCode
) {
const prevStatus = systemStatus.services[service.name].status;
// Log Status Changes
if (newStatus === 'healthy' && prevStatus !== 'healthy' && prevStatus !== 'unknown') {
logActivity('success', `Service ${service.name} recovered to HEALTHY`);
} else if (newStatus !== 'healthy' && prevStatus !== newStatus) {
const severity = newStatus === 'critical' ? 'alert' : 'warn';
logActivity(severity, `Service ${service.name} is ${newStatus.toUpperCase()} (Code: ${newCode})`);
}
if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;
systemStatus.services[service.name] = {
status: newStatus,
code: newCode,
lastUpdated: new Date()
};
hasChanges = true;
// Broadcast individual service update
wsBroadcaster.broadcast('SERVICE_UPDATE', {
name: service.name,
...systemStatus.services[service.name]
});
}
}
const allContainers = await listContainers();
const sameService = allContainers.filter(c => c.name === serviceName);
const currentReplicas = sameService.length;
const serviceName = container.name.split('-replica-')[0];
const lastScale = scaleTracker.get(serviceName) || 0;
if (Date.now() - lastScale < SCALE_COOLDOWN_MS) return;
const allContainers = await listContainers();
const sameService = allContainers.filter(c => c.name.split('-replica-')[0] === serviceName);
const currentReplicas = sameService.length;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 50 - 57, The replica count logic uses exact
name matches (c.name === serviceName) but replicas are created with a
"-replica-*" suffix, causing incorrect currentReplicas; update the filter in the
autoscaling block (where serviceName, listContainers, sameService,
currentReplicas, scaleTracker and SCALE_COOLDOWN_MS are used) to include both
the base service name and replica names (e.g., c.name === serviceName ||
c.name.startsWith(serviceName + '-replica-') or use a regex like
^serviceName(-replica-\d+)?$) so currentReplicas correctly counts all replicas.

Comment thread backend/index.js
Comment on lines +62 to +63
await healer.scaleService(serviceName, currentReplicas + 1);
scaleTracker.set(serviceName, Date.now());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Only apply scale cooldown after successful scaling.

Lines 63 and 71 set cooldown unconditionally. healer.scaleService() returns { success: false } on errors, so failed actions still block retries for the cooldown window.

Proposed fix
-    await healer.scaleService(serviceName, currentReplicas + 1);
-    scaleTracker.set(serviceName, Date.now());
+    const upResult = await healer.scaleService(serviceName, currentReplicas + 1);
+    if (upResult.success) {
+      scaleTracker.set(serviceName, Date.now());
+    } else {
+      console.error(`Scale UP failed for ${serviceName}:`, upResult.error);
+    }
     return;
   }

   // SCALE DOWN
   if (cpuUsage < SCALE_DOWN_CPU_THRESHOLD && currentReplicas > MIN_REPLICAS) {
     console.log(`📉 Scaling DOWN ${serviceName} (CPU ${cpuUsage}%)`);
-    await healer.scaleService(serviceName, currentReplicas - 1);
-    scaleTracker.set(serviceName, Date.now());
+    const downResult = await healer.scaleService(serviceName, currentReplicas - 1);
+    if (downResult.success) {
+      scaleTracker.set(serviceName, Date.now());
+    } else {
+      console.error(`Scale DOWN failed for ${serviceName}:`, downResult.error);
+    }
   }

Also applies to: 70-71

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 62 - 63, The code sets the cooldown in
scaleTracker unconditionally after calling healer.scaleService, which blocks
retries even when scaling failed; update the call sites that currently do await
healer.scaleService(serviceName, currentReplicas + 1) / await
healer.scaleService(serviceName, currentReplicas - 1) so that you capture the
result (e.g., const res = await healer.scaleService(...)) and only call
scaleTracker.set(serviceName, Date.now()) when res.success is truthy (and
optionally log or handle res.error when false); ensure both the scale-up and
scale-down branches use this success check so failed scale attempts don't start
the cooldown.

Comment thread backend/index.js
Comment on lines +101 to 193
setInterval(async () => {
try {
const containers = await listContainers();

for (const c of containers) {

try {

// Skip if in recreate cooldown
const cooldownUntil = cooldownTracker.get(c.id);
if (cooldownUntil && Date.now() < cooldownUntil) {
continue;
}

systemStatus.services[serviceName].status = newStatus;
systemStatus.services[serviceName].lastUpdated = new Date();
}
});
// Broadcast metrics update from webhook
wsBroadcaster.broadcast('METRICS', systemStatus);
}
res.json({ success: true });
});
// Ensure monitoring is active
await monitor.startMonitoring(c.id);

app.post('/api/action/:service/:type', async (req, res) => {
const { service, type } = req.params;
const serviceMap = { 'auth': 3001, 'payment': 3002, 'notification': 3003 };
const port = serviceMap[service];
const metrics = monitor.getMetrics(c.id);
if (!metrics) continue;

logActivity('info', `Triggering action '${type}' on service '${service}'`);
const cpuUsage = parseFloat(metrics.cpu) || 0;

if (!port) {
logActivity('warn', `Failed action '${type}': Invalid service '${service}'`);
return res.status(400).json({ success: false, error: 'Invalid service' });
}
const id = c.id;
const now = Date.now();
let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
const riskScore = computeRiskScore(c, metrics, tracker.attempts);

try {
let mode = 'healthy';
if (type === 'crash' || type === 'down') mode = 'down';
if (type === 'degraded') mode = 'degraded';
if (type === 'slow') mode = 'slow';

await axios.post(`http://localhost:${port}/simulate/${mode}`, {}, { timeout: 5000 });
// Force a health check to update status immediately
await checkServiceHealth();

logActivity('success', `Successfully executed '${type}' on ${service}`);
res.json({ success: true, message: `${type} executed on ${service}` });
} catch (error) {
logActivity('error', `Action '${type}' on ${service} failed: ${error.message}`);
res.status(500).json({ success: false, error: error.message });
}
});
console.log(`📊 ${c.name} Risk Score: ${riskScore}`);

// --- DOCKER ENDPOINTS ---
if (riskScore >= 80) {

// Middleware for ID/Service validation (mock auth for docker endpoints)
const requireDockerAuth = (req, res, next) => {
// In a real app, check 'Authorization' header
// For now, assume authenticated if internal or trusted
next();
};
console.log(`🚨 CRITICAL risk for ${c.name}`);

const validateId = (req, res, next) => {
if (!req.params.id || typeof req.params.id !== 'string' || req.params.id.length < 1) {
return res.status(400).json({ error: 'Invalid ID provided' });
}
next();
};
if (tracker.attempts < MAX_RESTARTS) {

const validateScaleParams = (req, res, next) => {
const replicas = parseInt(req.params.replicas, 10);
if (!req.params.service || isNaN(replicas) || replicas < 0 || replicas > 100) {
return res.status(400).json({ error: 'Invalid scale parameters' });
}
next();
};
tracker.attempts++;
tracker.lastAttempt = now;
restartTracker.set(id, tracker);

app.get('/api/docker/containers', async (req, res) => {
try {
const containers = await listContainers();
// Use Promise.allSettled to handle monitoring setup concurrently without crashing
await Promise.allSettled(containers.map(c => monitor.startMonitoring(c.id)));

// Enrich with smart restart meta
const enrichedContainers = containers.map(c => {
const tracker = restartTracker.get(c.id) || { attempts: 0, lastAttempt: 0 };
return {
...c,
metrics: monitor.getMetrics(c.id), // Include current metrics snapshot
restartCount: tracker.attempts,
lastRestart: tracker.lastAttempt
};
});

res.json({ containers: enrichedContainers });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
await healer.restartContainer(id);

app.get('/api/docker/health/:id', validateId, async (req, res) => {
try {
const health = await getContainerHealth(req.params.id);
res.json(health);
} catch (error) {
res.status(500).json({ error: error.message });
}
});
} else {

app.get('/api/docker/metrics/:id', validateId, (req, res) => {
const metrics = monitor.getMetrics(req.params.id);
res.json(metrics || { error: 'No metrics available' });
});
monitor.stopMonitoring(id);
restartTracker.delete(id);

app.post('/api/docker/try-restart/:id', requireDockerAuth, validateId, async (req, res) => {
const id = req.params.id;
const now = Date.now();
let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
await healer.recreateContainer(id);

// Reset attempts if outside grace period
if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
tracker.attempts = 0;
cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);
}

if (tracker.attempts >= MAX_RESTARTS) {
return res.status(429).json({
allowed: false,
reason: 'Max restart attempts exceeded',
nextRetry: new Date(tracker.lastAttempt + GRACE_PERIOD_MS)
});
continue;
}

if (riskScore >= 60) {

console.log(`⚠️ WARNING risk for ${c.name}`);

await autoScaleIfNeeded(c, cpuUsage);
}
// Reset restart counter after grace period
if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
tracker.attempts = 0;
}

/* ===============================
1️⃣ UNHEALTHY → RESTART / RECREATE
================================= */



/* ===============================
2️⃣ HIGH CPU → SCALE
================================= */

if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
await autoScaleIfNeeded(c, cpuUsage);
}

} catch (innerErr) {

if (innerErr.statusCode === 404) continue;

console.error("Container loop error:", innerErr.message);
}
}

} catch (err) {
console.error("Background monitor error:", err.message);
}

tracker.attempts++;
tracker.lastAttempt = now;
restartTracker.set(id, tracker);
}, 5000);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "backend" -o -name "index.js" | head -20

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 231


🏁 Script executed:

wc -l ./backend/index.js

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 100


🏁 Script executed:

cat -n ./backend/index.js | sed -n '101,193p'

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 2892


🏁 Script executed:

cat -n ./backend/index.js | head -100

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 3631


🏁 Script executed:

cat -n ./backend/index.js | tail -30

Repository: SKfaizan-786/sentinel-devops-agent

Length of output: 992


Overlapping monitor cycles create race conditions on container restart/recreate and scaling operations.

When the async callback takes longer than 5000ms, setInterval fires the next iteration before the previous completes. This allows concurrent mutations to restartTracker, cooldownTracker, and scaleTracker, and multiple healer.restartContainer() or healer.recreateContainer() calls on the same container.

Proposed fix
+let monitorLoopInFlight = false;
 setInterval(async () => {
+  if (monitorLoopInFlight) return;
+  monitorLoopInFlight = true;
   try {
     const containers = await listContainers();
     ...
   } catch (err) {
     console.error("Background monitor error:", err.message);
+  } finally {
+    monitorLoopInFlight = false;
   }
 
 }, 5000);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
setInterval(async () => {
try {
const containers = await listContainers();
for (const c of containers) {
try {
// Skip if in recreate cooldown
const cooldownUntil = cooldownTracker.get(c.id);
if (cooldownUntil && Date.now() < cooldownUntil) {
continue;
}
systemStatus.services[serviceName].status = newStatus;
systemStatus.services[serviceName].lastUpdated = new Date();
}
});
// Broadcast metrics update from webhook
wsBroadcaster.broadcast('METRICS', systemStatus);
}
res.json({ success: true });
});
// Ensure monitoring is active
await monitor.startMonitoring(c.id);
app.post('/api/action/:service/:type', async (req, res) => {
const { service, type } = req.params;
const serviceMap = { 'auth': 3001, 'payment': 3002, 'notification': 3003 };
const port = serviceMap[service];
const metrics = monitor.getMetrics(c.id);
if (!metrics) continue;
logActivity('info', `Triggering action '${type}' on service '${service}'`);
const cpuUsage = parseFloat(metrics.cpu) || 0;
if (!port) {
logActivity('warn', `Failed action '${type}': Invalid service '${service}'`);
return res.status(400).json({ success: false, error: 'Invalid service' });
}
const id = c.id;
const now = Date.now();
let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
const riskScore = computeRiskScore(c, metrics, tracker.attempts);
try {
let mode = 'healthy';
if (type === 'crash' || type === 'down') mode = 'down';
if (type === 'degraded') mode = 'degraded';
if (type === 'slow') mode = 'slow';
await axios.post(`http://localhost:${port}/simulate/${mode}`, {}, { timeout: 5000 });
// Force a health check to update status immediately
await checkServiceHealth();
logActivity('success', `Successfully executed '${type}' on ${service}`);
res.json({ success: true, message: `${type} executed on ${service}` });
} catch (error) {
logActivity('error', `Action '${type}' on ${service} failed: ${error.message}`);
res.status(500).json({ success: false, error: error.message });
}
});
console.log(`📊 ${c.name} Risk Score: ${riskScore}`);
// --- DOCKER ENDPOINTS ---
if (riskScore >= 80) {
// Middleware for ID/Service validation (mock auth for docker endpoints)
const requireDockerAuth = (req, res, next) => {
// In a real app, check 'Authorization' header
// For now, assume authenticated if internal or trusted
next();
};
console.log(`🚨 CRITICAL risk for ${c.name}`);
const validateId = (req, res, next) => {
if (!req.params.id || typeof req.params.id !== 'string' || req.params.id.length < 1) {
return res.status(400).json({ error: 'Invalid ID provided' });
}
next();
};
if (tracker.attempts < MAX_RESTARTS) {
const validateScaleParams = (req, res, next) => {
const replicas = parseInt(req.params.replicas, 10);
if (!req.params.service || isNaN(replicas) || replicas < 0 || replicas > 100) {
return res.status(400).json({ error: 'Invalid scale parameters' });
}
next();
};
tracker.attempts++;
tracker.lastAttempt = now;
restartTracker.set(id, tracker);
app.get('/api/docker/containers', async (req, res) => {
try {
const containers = await listContainers();
// Use Promise.allSettled to handle monitoring setup concurrently without crashing
await Promise.allSettled(containers.map(c => monitor.startMonitoring(c.id)));
// Enrich with smart restart meta
const enrichedContainers = containers.map(c => {
const tracker = restartTracker.get(c.id) || { attempts: 0, lastAttempt: 0 };
return {
...c,
metrics: monitor.getMetrics(c.id), // Include current metrics snapshot
restartCount: tracker.attempts,
lastRestart: tracker.lastAttempt
};
});
res.json({ containers: enrichedContainers });
} catch (error) {
res.status(500).json({ error: error.message });
}
});
await healer.restartContainer(id);
app.get('/api/docker/health/:id', validateId, async (req, res) => {
try {
const health = await getContainerHealth(req.params.id);
res.json(health);
} catch (error) {
res.status(500).json({ error: error.message });
}
});
} else {
app.get('/api/docker/metrics/:id', validateId, (req, res) => {
const metrics = monitor.getMetrics(req.params.id);
res.json(metrics || { error: 'No metrics available' });
});
monitor.stopMonitoring(id);
restartTracker.delete(id);
app.post('/api/docker/try-restart/:id', requireDockerAuth, validateId, async (req, res) => {
const id = req.params.id;
const now = Date.now();
let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
await healer.recreateContainer(id);
// Reset attempts if outside grace period
if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
tracker.attempts = 0;
cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);
}
if (tracker.attempts >= MAX_RESTARTS) {
return res.status(429).json({
allowed: false,
reason: 'Max restart attempts exceeded',
nextRetry: new Date(tracker.lastAttempt + GRACE_PERIOD_MS)
});
continue;
}
if (riskScore >= 60) {
console.log(`⚠️ WARNING risk for ${c.name}`);
await autoScaleIfNeeded(c, cpuUsage);
}
// Reset restart counter after grace period
if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
tracker.attempts = 0;
}
/* ===============================
1️⃣ UNHEALTHY RESTART / RECREATE
================================= */
/* ===============================
2️⃣ HIGH CPU SCALE
================================= */
if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
await autoScaleIfNeeded(c, cpuUsage);
}
} catch (innerErr) {
if (innerErr.statusCode === 404) continue;
console.error("Container loop error:", innerErr.message);
}
}
} catch (err) {
console.error("Background monitor error:", err.message);
}
tracker.attempts++;
tracker.lastAttempt = now;
restartTracker.set(id, tracker);
}, 5000);
let monitorLoopInFlight = false;
setInterval(async () => {
if (monitorLoopInFlight) return;
monitorLoopInFlight = true;
try {
const containers = await listContainers();
for (const c of containers) {
try {
// Skip if in recreate cooldown
const cooldownUntil = cooldownTracker.get(c.id);
if (cooldownUntil && Date.now() < cooldownUntil) {
continue;
}
// Ensure monitoring is active
await monitor.startMonitoring(c.id);
const metrics = monitor.getMetrics(c.id);
if (!metrics) continue;
const cpuUsage = parseFloat(metrics.cpu) || 0;
const id = c.id;
const now = Date.now();
let tracker = restartTracker.get(id) || { attempts: 0, lastAttempt: 0 };
const riskScore = computeRiskScore(c, metrics, tracker.attempts);
console.log(`📊 ${c.name} Risk Score: ${riskScore}`);
if (riskScore >= 80) {
console.log(`🚨 CRITICAL risk for ${c.name}`);
if (tracker.attempts < MAX_RESTARTS) {
tracker.attempts++;
tracker.lastAttempt = now;
restartTracker.set(id, tracker);
await healer.restartContainer(id);
} else {
monitor.stopMonitoring(id);
restartTracker.delete(id);
await healer.recreateContainer(id);
cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);
}
continue;
}
if (riskScore >= 60) {
console.log(`⚠️ WARNING risk for ${c.name}`);
await autoScaleIfNeeded(c, cpuUsage);
}
// Reset restart counter after grace period
if (now - tracker.lastAttempt > GRACE_PERIOD_MS) {
tracker.attempts = 0;
}
/* ===============================
1️⃣ UNHEALTHY RESTART / RECREATE
================================= */
/* ===============================
2️⃣ HIGH CPU SCALE
================================= */
if (cpuUsage > SCALE_UP_CPU_THRESHOLD) {
await autoScaleIfNeeded(c, cpuUsage);
}
} catch (innerErr) {
if (innerErr.statusCode === 404) continue;
console.error("Container loop error:", innerErr.message);
}
}
} catch (err) {
console.error("Background monitor error:", err.message);
} finally {
monitorLoopInFlight = false;
}
}, 5000);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 101 - 193, The background loop uses
setInterval which allows overlapping runs and races on restartTracker,
cooldownTracker, scaleTracker and concurrent calls to
healer.restartContainer/healer.recreateContainer/autoScaleIfNeeded; replace the
setInterval pattern with a single async loop (e.g., an async function that
awaits a delay each iteration) or implement per-container locks (a Map of
mutexes/Promise-chains) and acquire the lock at the start of processing each
container id, releasing it after all actions (monitor.startMonitoring,
computeRiskScore, healer.restartContainer, healer.recreateContainer,
autoScaleIfNeeded, monitor.stopMonitoring) complete; ensure mutations to
restartTracker/cooldownTracker/scaleTracker and calls to healer.* are performed
only while the lock is held so no two iterations can act on the same container
concurrently.

Comment thread backend/index.js
Comment on lines +110 to +112
const cooldownUntil = cooldownTracker.get(c.id);
if (cooldownUntil && Date.now() < cooldownUntil) {
continue;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Recreate cooldown is tracked on the old container id.

Line 149 stores cooldown using the id being removed; subsequent loops read cooldown by current container id (Line 110), so the recreated container bypasses cooldown immediately.

Proposed fix
-        const cooldownUntil = cooldownTracker.get(c.id);
+        const cooldownKey = c.name.split('-replica-')[0];
+        const cooldownUntil = cooldownTracker.get(cooldownKey);
...
-    cooldownTracker.set(id, Date.now() + COOLDOWN_AFTER_RECREATE_MS);
+    cooldownTracker.set(c.name.split('-replica-')[0], Date.now() + COOLDOWN_AFTER_RECREATE_MS);

Also applies to: 149-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 110 - 112, The cooldown is stored under the
old container id but checked against the current container id
(cooldownTracker.get(c.id)), so recreated containers bypass cooldown; fix by
keying cooldowns with a stable identifier used in both places (e.g., container
image or container name) instead of the transient container id: choose a stable
key (like const key = c.Image || c.Names?.[0]) and replace both
cooldownTracker.get(c.id) and the place that sets the cooldown (the code that
currently uses the removed container id) to use cooldownTracker.get(key) and
cooldownTracker.set(key, ...), ensuring the same symbol (the chosen key) is used
for lookup and storage.

Comment thread backend/index.js
Comment on lines +130 to 133
if (riskScore >= 80) {

// Middleware for ID/Service validation (mock auth for docker endpoints)
const requireDockerAuth = (req, res, next) => {
// In a real app, check 'Authorization' header
// For now, assume authenticated if internal or trusted
next();
};
console.log(`🚨 CRITICAL risk for ${c.name}`);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Risk action thresholds don’t match the documented risk bands.

Current routing uses Critical >=80 and Warning >=60, but the stated policy is Warning 40–69 and Critical 70–100. This materially changes escalation behavior.

Proposed fix
+const WARNING_RISK_THRESHOLD = 40;
+const CRITICAL_RISK_THRESHOLD = 70;
...
-if (riskScore >= 80) {
+if (riskScore >= CRITICAL_RISK_THRESHOLD) {
...
-if (riskScore >= 60) {
+if (riskScore >= WARNING_RISK_THRESHOLD) {

Also applies to: 155-158

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/index.js` around lines 130 - 133, The risk routing condition using
riskScore currently treats Critical as >=80 and Warning as >=60; update the
conditional logic in backend/index.js (the blocks checking riskScore, e.g., the
if that logs "🚨 CRITICAL risk for ${c.name}" and the corresponding Warning
branch around lines referenced) so Critical is riskScore >= 70 and Warning is
when riskScore is between 40 and 69 inclusive (i.e., >=40 && <=69); ensure any
else/other branches cover <40 as normal/low and apply the same fixes to the
second occurrence noted (also applies to the other riskScore check).

@mdhaarishussain mdhaarishussain linked an issue Mar 1, 2026 that may be closed by this pull request
12 tasks

@mdhaarishussain mdhaarishussain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat: implement risk-based orchestration engine

CHANGES REQUESTED — This PR has multiple critical and major bugs that would cause production failures. The mergeable_state is also dirty (merge conflicts must be resolved before this can land). Please address all issues below before re-requesting review.


🔴 Critical Issues

1. setInterval race condition — overlapping monitor loops mutate shared state concurrently
index.js: The async callback in setInterval(..., 5000) has no in-flight guard. When Docker operations (restart, recreate, list containers) take longer than 5 s, the next tick fires before the previous resolves. Multiple iterations can call healer.restartContainer(id) and healer.recreateContainer(id) on the same container simultaneously, and restartTracker / cooldownTracker / scaleTracker can be mutated by two iterations at once. (See inline comment on line 193.)

2. Risk thresholds in index.js contradict the documented risk bands
PR docs: 40–69 Warning, 70–100 Critical. Code uses >= 80 / >= 60. Containers with score 70–79 are never restarted/recreated; scores 40–59 receive no action at all. (See inline comment on line 133.)

3. Cooldown tracker keyed on transient container ID — cooldown bypassed after every recreation
healer.recreateContainer() produces a container with a new ID. cooldownTracker.get(c.id) on the next tick finds nothing → recreate fires again immediately, looping indefinitely. (See inline comment on line 112.)

4. Duplicate healing — supervisor.start() AND setInterval in index.js both independently restart the same containers
Both poll every 5 s and call healer.restartContainer() without any coordination. An unhealthy container will receive simultaneous restart calls from both systems, doubling attempts and making all tracker state unreliable.


🟠 Major Issues

5. supervisor.js: handleFailure() not awaited — unhandled promise rejections + stale healingInProgress state
Calling this.handleFailure(c.id) without await/.catch() causes unhandled rejections (process exit in Node 15+). If any await inside throws, healingInProgress.delete(containerId) never runs → permanent stale state blocking all future healing for that container. (See inline comment on supervisor.js line 20.)

6. autoScaleIfNeeded() always sees currentReplicas = 1
Replicas are named <name>-replica-<i> but the filter uses c.name === serviceName (exact). Replicas never match, so the count is always 1 regardless of reality. (See inline comment on index.js line 57.)

7. scaleService() broad substring matching can scale/remove wrong containers
n.includes(containerName) with containerName = "web" matches my-web, web-proxy, etc. The SCALE DOWN path calls container.stop() + container.remove() on those unrelated containers. (See inline comment on healer.js line 67.)

8. Replicas not assigned sentinel.monitor=true label + duplicate PortBindings
New replicas are invisible to monitoring (label missing) and creation fails for replica 2+ due to port binding conflicts (HostConfig cloned as-is). (See inline comment on healer.js line 77.)

9. Scale cooldown set unconditionally on scale failure
scaleTracker.set(...) fires even when healer.scaleService() returns { success: false }, blocking retries for the full 2-minute cooldown.

10. Breaking API changes not disclosed
The PR claims "no breaking changes" but removes: /api/status, /api/activity, /api/insights, /api/kestra-webhook (breaks AI pipeline), all /api/docker/* management endpoints, and all RBAC routes (/auth, /api/users, /api/roles). The frontend depends on all of these.


🟡 Should Fix / Nitpick

11. waitForHealthy() in supervisor.js burns the full 5-minute timeout for terminal states (removed, error). Add an early-exit when a non-recoverable status is detected.

12. autoScaleIfNeeded(c, cpuUsage) is called twice per iteration when both riskScore >= 60 AND cpuUsage > SCALE_UP_CPU_THRESHOLD. Use else if or a combined guard.

13. package-lock.json includes bcrypt, jsonwebtoken, pg, dotenv, uuid — none are used by this PR's code changes. Remove these extraneous lock-file entries.

14. Docstring coverage is 40% against a required 80% threshold (flagged by CodeRabbit pre-merge check). All new public functions need JSDoc.


Merge Conflicts

The PR's mergeable_state is dirty — conflicts with main must be resolved before this can be reviewed for merge.

Comment thread backend/index.js
tracker.lastAttempt = now;
restartTracker.set(id, tracker);
}, 5000);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: setInterval allows overlapping async loops — concurrent mutations to shared state

When Docker ops (listContainers, restartContainer, recreateContainer) take longer than 5 s, the next tick fires before the previous resolves. Multiple iterations can race on restartTracker, cooldownTracker, scaleTracker, and call healer.restartContainer(id) or healer.recreateContainer(id) on the same container simultaneously.

let monitorLoopInFlight = false;
setInterval(async () => {
  if (monitorLoopInFlight) return;
  monitorLoopInFlight = true;
  try {
    // ...existing body...
  } catch (err) {
    console.error("Background monitor error:", err.message);
  } finally {
    monitorLoopInFlight = false;
  }
}, 5000);

Comment thread backend/index.js
// Skip if in recreate cooldown
const cooldownUntil = cooldownTracker.get(c.id);
if (cooldownUntil && Date.now() < cooldownUntil) {
continue;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: Cooldown stored under old container ID — bypassed after every recreation

healer.recreateContainer(id) removes the old container and starts a new one with a new ID. On the next loop iteration, cooldownTracker.get(c.id) looks up the new ID and finds nothing — the cooldown is permanently bypassed, triggering another recreation immediately.

Fix: Key cooldowns on container name (stable across recreation):

const cooldownKey = c.name;
const cooldownUntil = cooldownTracker.get(cooldownKey);
if (cooldownUntil && Date.now() < cooldownUntil) continue;
// ...later when setting:
cooldownTracker.set(cooldownKey, Date.now() + COOLDOWN_AFTER_RECREATE_MS);

Comment thread backend/index.js
next();
};
console.log(`🚨 CRITICAL risk for ${c.name}`);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major: Risk thresholds (80/60) contradict documented bands (70/40)

PR documentation: 40–69 Warning, 70–100 Critical.
Code uses >= 80 Critical and >= 60 Warning.

Containers with risk score 70–79 only hit the warning branch (auto-scale), and are never restarted or recreated. Containers with risk score 40–59 fall through all branches entirely — no action taken.

// Use named constants matching the documented spec
const CRITICAL_RISK_THRESHOLD = 70;  // not 80
const WARNING_RISK_THRESHOLD = 40;   // not 60

if (riskScore >= CRITICAL_RISK_THRESHOLD) { ... }
else if (riskScore >= WARNING_RISK_THRESHOLD) { ... }

Comment thread backend/index.js
}
const allContainers = await listContainers();
const sameService = allContainers.filter(c => c.name === serviceName);
const currentReplicas = sameService.length;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major: Exact name match misses all replicas — currentReplicas is always 1

scaleService() in healer.js creates replicas named <containerName>-replica-<i>, but this filter uses c.name === serviceName (exact match). Replicas never match, so currentReplicas is always 1 regardless of how many are running.

Example: 4 replicas running → currentReplicas = 1 → code calls scaleService(name, 2) → creates a fifth container instead of doing nothing.

const sameService = allContainers.filter(c =>
  c.name === serviceName || c.name.startsWith(`${serviceName}-replica-`)
);

(health.status === 'unhealthy' || c.status === 'exited') &&
!this.healingInProgress.has(c.id)
) {
this.handleFailure(c.id);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical: handleFailure() not awaited — unhandled rejection crashes Node.js process

Fire-and-forget without .catch(). If restartContainer() or waitForHealthy() rejects, it becomes an unhandled promise rejection, which in Node.js 15+ terminates the process.

Additionally, if any await inside handleFailure() throws, healingInProgress.delete(containerId) on the last line never runs → that container ID stays in the Set permanently → future failures for this container are silently ignored forever.

// call site fix:
void this.handleFailure(c.id).catch(err => {
    console.error(`Healing flow failed for ${c.id}:`, err.message);
});

// handleFailure fix:
async handleFailure(containerId) {
    this.healingInProgress.add(containerId);
    try {
        this.lifecycle.set(containerId, { detectedAt: Date.now(), healingTriggeredAt: Date.now(), restoredAt: null });
        await restartContainer(containerId);
        const restored = await this.waitForHealthy(containerId);
        if (restored) {
            const record = this.lifecycle.get(containerId);
            record.restoredAt = Date.now();
        }
    } finally {
        this.healingInProgress.delete(containerId);
    }
}

Comment thread backend/docker/healer.js
if (replicas > currentCount) {

const baseContainer = docker.getContainer(sameService[0].Id);
const info = await baseContainer.inspect();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major: Broad substring matching — wrong containers can be stopped and removed

n.includes(containerName) is too permissive. A containerName of "web" matches web-app, my-web-server, web-proxy, etc. The SCALE DOWN path will then container.stop() and container.remove() those unrelated containers, causing outages.

Also no null guard: sameService[0] on line 66 throws TypeError if sameService is empty.

const sameService = containers.filter(c =>
  c.Names && c.Names.some(n => {
    const name = n.replace(/^\//, '');
    return name === containerName || name.startsWith(`${containerName}-replica-`);
  })
);
if (sameService.length === 0) {
  return { success: false, error: `No containers found for ${containerName}` };
}

Comment thread backend/docker/healer.js
Image: info.Config.Image,
name: newName,
Env: info.Config.Env,
HostConfig: info.HostConfig

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major: Replicas created without sentinel.monitor=true label and with conflicting PortBindings

Two bugs in this createContainer call:

  1. Labels from info.Config.Labels is omitted. Since listContainers() filters by { "sentinel.monitor": "true" }, new replicas are invisible to monitoring — they run unsupervised indefinitely.

  2. HostConfig: info.HostConfig copies PortBindings as-is. Every replica tries to bind the same host port(s) → container creation fails for replica 2+, or causes port conflicts if the OS allows it.

const hostConfig = { ...info.HostConfig, PortBindings: {} };
const newContainer = await docker.createContainer({
  Image: info.Config.Image,
  name: newName,
  Env: info.Config.Env,
  Labels: info.Config.Labels,   // inherit sentinel.monitor=true label
  HostConfig: hostConfig        // cleared PortBindings
});

@mdhaarishussain

Copy link
Copy Markdown
Collaborator

ANy Updates @srija-pixel

@mdhaarishussain

Copy link
Copy Markdown
Collaborator

@srija-pixel if you do not respond within today i will have to close this PR and re assign to someone else due to time constraints

@mdhaarishussain mdhaarishussain left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested.

The orchestration engine concept (risk scoring, CPU-based auto-scaling, the new supervisor.js with lifecycle tracking) is interesting, but this PR removes RBAC routes and other existing functionality from index.js to implement it:

  • app.use('/auth', authRoutes), app.use('/api/users', ...), app.use('/api/roles', ...) — RBAC route registrations are deleted
  • checkServiceHealth() and the in-memory system status are removed
  • Activity log system (logActivity, activityLog) is removed

Removing working core features to replace them with an orchestration engine is not acceptable in a single PR. The risk scoring engine and supervisor should be added alongside existing functionality, not as a replacement.

Additionally, supervisor.js's start() method calls this.checkContainers() inside setInterval without .catch(), which will produce unhandled promise rejections on any error in Node.js.

backend/package-lock.json is also committed. Please scope this PR to only adding the orchestration engine without removing existing routes and systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🧪 Chaos Engineering & Self-Validation Framework

2 participants