Skip to content

Add retry/backoff and circuit breaker to health checks#17

Open
zhangsiqiang519 wants to merge 1 commit into
thanhle74:mainfrom
zhangsiqiang519:codex/health-check-retry-circuit-15
Open

Add retry/backoff and circuit breaker to health checks#17
zhangsiqiang519 wants to merge 1 commit into
thanhle74:mainfrom
zhangsiqiang519:codex/health-check-retry-circuit-15

Conversation

@zhangsiqiang519

@zhangsiqiang519 zhangsiqiang519 commented Jun 20, 2026

Copy link
Copy Markdown

Summary

Adds retry/backoff and circuit-breaker support to HTTP health probes in tools/health_check.py, plus result summary counts for operators.

Changes

  • Adds --max-retries, --backoff-factor, --circuit-threshold, and --circuit-cooldown CLI flags.
  • Retries transient HTTP probe failures with exponential backoff.
  • Tracks per-endpoint circuit breakers and returns a clear open-circuit result instead of hammering failed services.
  • Adds WARNING-level logging for retries, open circuits, and degraded services.
  • Adds JSON/text summary counts for OK, WARNING, CRITICAL, and open circuits.
  • Adds 6 focused unit tests for retry, backoff, circuit behavior, summary aggregation, and CLI parsing.
  • Documents the retry/circuit workflow in docs/OPERATIONS.md.

Testing

  • python3 -m unittest tests/test_health_check_retry_circuit.py (6 tests)
  • python3 -m py_compile tools/health_check.py tests/test_health_check_retry_circuit.py
  • python3 tools/health_check.py --service backend --max-retries 1 --backoff-factor 0 --circuit-threshold 2 --json (smoke run exits 0 and reports degraded local backend)
  • git diff --check
  • python3 build.py generated real diagnostics. It exited 1 because this runner is missing cargo, go, luac, and ghc, and the existing frailbox C build fails on this platform. Frontend, engine, compliance, and v2 market stream passed.

Diagnostic artifacts:

  • diagnostic/build-94e0fb08.logd
  • diagnostic/build-94e0fb08.json

Decrypt command:

encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated, if applicable
  • Configuration or schema changes are documented, if applicable
  • No generated build artifacts are committed, except the required diagnostic build log
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

  • I would like to request that my diagnostic build log is removed before merging

Closes #15

Summary by CodeRabbit

  • New Features

    • Health probes now automatically retry failed checks with configurable exponential backoff
    • Prevents repeated checks to failing services after reaching a failure threshold
    • Enhanced health reports with aggregated status summaries and summary counts
  • Documentation

    • Added operational guide for configuring health probe retry behavior and failure thresholds
  • Tests

    • Added comprehensive test coverage for health probe retry and failure management features

@coderabbitai

coderabbitai Bot commented Jun 20, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

tools/health_check.py is extended with a CircuitBreaker class, retry/exponential-backoff logic, and result aggregation via summarize_results. New CLI flags (--max-retries, --backoff-factor, --circuit-threshold, --circuit-cooldown) are added and wired through run_health_checks. A new test module and an operations documentation subsection are added, plus a generated diagnostic build JSON artifact.

Changes

HTTP Probe Retry and Circuit Breaker

Layer / File(s) Summary
CircuitBreaker abstraction and HTTP probe refactor
tools/health_check.py
Adds CircuitBreaker with allow_request, record_success, record_failure, and state; adds helpers endpoint_key, get_circuit_breaker, perform_http_request, classify_http_status, retry_delay; rewrites check_http_service to retry with exponential backoff and short-circuit when the breaker is open. Also expands imports and reformats INFRASTRUCTURE config dicts.
Result aggregation, reporting, and CLI wiring
tools/health_check.py
Adds summarize_results to count OK/WARNING/CRITICAL and open circuits; updates run_health_checks to accept and pass retry/circuit params, attach circuit state per service, and store results["summary"]; updates print_health_report to display summary counts and icon-driven (✓/⚠/✗) status output; adds --max-retries, --backoff-factor, --circuit-threshold, --circuit-cooldown to parse_args and wires them through main().
Unit tests for retry, backoff, circuit breaker, and aggregation
tests/test_health_check_retry_circuit.py
Tests probe retry-until-success with attempt count in detail; exponential backoff sleep durations with CRITICAL on all-timeout; circuit breaker blocking after threshold; cooldown reset via injectable clock; summarize_results aggregation counts; and CLI flag parsing.
Operations documentation
docs/OPERATIONS.md
Adds subsection documenting retry/backoff/circuit-breaker CLI flags, the backoff delay formula, open-circuit behavior, and output summary counts.

Diagnostic Build Artifact

Layer / File(s) Summary
Build diagnostic JSON
diagnostic/build-94e0fb08.json
Generated JSON for commit 94e0fb08 with per-module PASS/FAIL status, timing, captured outputs, and a pr_note referencing the encrypted .logd artifact.

Sequence Diagram(s)

sequenceDiagram
  participant main
  participant run_health_checks
  participant get_circuit_breaker
  participant CircuitBreaker
  participant check_http_service
  participant perform_http_request
  participant summarize_results

  main->>run_health_checks: max_retries, backoff_factor, circuit_threshold, circuit_cooldown
  run_health_checks->>get_circuit_breaker: endpoint_key(host, port, path)
  get_circuit_breaker-->>run_health_checks: CircuitBreaker instance
  run_health_checks->>check_http_service: ..., circuit_breaker=CB
  check_http_service->>CircuitBreaker: allow_request()
  alt circuit open
    CircuitBreaker-->>check_http_service: False → CRITICAL "circuit open"
  else circuit closed
    loop attempt in range(max_retries+1)
      check_http_service->>perform_http_request: host, port, path, timeout
      alt response received
        perform_http_request-->>check_http_service: status code + body
        check_http_service->>CircuitBreaker: record_success()
        check_http_service-->>run_health_checks: OK/WARNING + detail
      else timeout/error
        check_http_service->>CircuitBreaker: record_failure()
        check_http_service->>check_http_service: sleep(retry_delay(attempt))
      end
    end
    check_http_service-->>run_health_checks: CRITICAL + detail
  end
  run_health_checks->>summarize_results: results dict
  summarize_results-->>run_health_checks: {ok, warning, critical, open_circuits}
  run_health_checks-->>main: results + summary
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #15 [$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes — This PR directly implements all acceptance criteria from that issue: --max-retries, --backoff-factor, --circuit-threshold flags; exponential backoff; circuit breaker with cooldown; summarize_results aggregation; at least 5 new unit tests; and the required diagnostic/build-*.json artifact.

Poem

🐇 Hoppity hop through the network I go,
When servers go dark I retry, nice and slow.
Exponential backoff — the carrot must wait,
The circuit trips open if failures grow great.
After cooldown I peek: is the burrow back bright?
✓ OK, the warren is humming — just right! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely describes the main change: adding retry/backoff and circuit breaker functionality to health checks, which aligns with the primary objective of the changeset.
Description check ✅ Passed The PR description comprehensively covers all required template sections: Summary, Changes, Testing (with test execution and diagnostic details), and completed Checklist with all items marked.
Linked Issues check ✅ Passed All coding requirements from issue #15 are met: retry/backoff with exponential backoff formula implemented, circuit breaker pattern with threshold/cooldown implemented, CLI flags added, result summary aggregation added, WARNING-level logging added, 6 unit tests added exceeding the 5-test requirement, and diagnostic artifacts generated.
Out of Scope Changes check ✅ Passed All changes are directly scoped to PR objectives: core retry/backoff/circuit-breaker implementation, supporting tests, documentation updates, and diagnostic artifacts. No unrelated cleanup or extraneous changes present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@diagnostic/build-94e0fb08.json`:
- Around line 26-27: The diagnostic JSON file contains absolute file paths in
the artifact field (and potentially in build.log and build-summary.txt fields)
that expose the builder's username and local workspace structure. Locate the
code responsible for generating this diagnostic JSON file and modify it to
convert these absolute paths to repository-relative paths before serializing and
writing the JSON. The conversion should strip the machine-specific prefix and
maintain only the path relative to the repository root, since these fields are
used for display and logging purposes only. Apply this path conversion
consistently to the artifact field on lines 26 and 47 and 54, as well as to any
build.log and build-summary.txt fields.
- Around line 9-10: Remove the plaintext password value from the "password"
field in the diagnostic/build-94e0fb08.json metadata file and redact or remove
the embedded password from the "decrypt_command" field to prevent the secret
from persisting in git history and being visible to reviewers. Additionally,
review and replace any absolute filesystem paths (such as those containing
usernames like "/Users/zhangsiqiang/...") with relative paths or truncated
versions that do not expose system-specific details. Coordinate with maintainers
to modify the CI validation workflow to pass decryption credentials through
GitHub encrypted secrets or environment variables instead of committing them in
the JSON metadata.

In `@tests/test_health_check_retry_circuit.py`:
- Around line 13-14: The HealthCheckRetryCircuitTest class does not isolate its
tests from the global HTTP_CIRCUIT_BREAKERS state that is shared across test
runs. Add a setUp method and/or tearDown method to the
HealthCheckRetryCircuitTest class to clear the HTTP_CIRCUIT_BREAKERS global
state before and/or after each test method executes. This ensures that each test
method in the class starts with a clean circuit breaker state, making test
outcomes independent of execution order when tests are reordered or sharded.

In `@tools/health_check.py`:
- Around line 213-230: The circuit breaker's protection is not being enforced
during retries because the code continues to sleep and retry even after
record_failure() opens the breaker. After each call to breaker.record_failure()
in both the HTTP status check and exception handler blocks, check if the breaker
has opened using an appropriate method (likely something like breaker.is_open()
or similar). If the breaker is open, immediately return with a CRITICAL status
and an appropriate error message instead of allowing the code to continue to the
delay and sleep logic, ensuring the breaker stops the retry attempts once it
opens.
- Around line 483-496: The current if/else logic in the status rendering code
creates a mutual exclusion where parent service checks with a "status" field are
printed but their nested sub-checks are skipped entirely. To fix this,
restructure the code to check for and print the parent status first (when the
check is a dict with "status"), then independently iterate through any nested
sub-checks to print them as well. Replace the else branch that only handles
nested checks with logic that always processes nested items (sub_checks)
regardless of whether the parent check has a status field, ensuring that both
parent statuses and nested sub-checks like certificates are rendered in the
output.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0f1d543-98ca-4580-8224-5fa6bd17f3cf

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 9019542.

📒 Files selected for processing (5)
  • diagnostic/build-94e0fb08.json
  • diagnostic/build-94e0fb08.logd
  • docs/OPERATIONS.md
  • tests/test_health_check_retry_circuit.py
  • tools/health_check.py

Comment on lines +9 to +10
"password": "efcd3c4a127e4b6e294d",
"decrypt_command": "encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for the diagnostic file and related build configuration
git ls-files | grep -E "(diagnostic|build)" | head -20

Repository: thanhle74/kickama

Length of output: 267


🏁 Script executed:

# Check if the diagnostic directory exists and list its contents
ls -la diagnostic/ 2>/dev/null || echo "diagnostic directory not found"

Repository: thanhle74/kickama

Length of output: 482


🏁 Script executed:

# Search for build.py or similar build configuration files
fd "build\.py" -o fd "\.github" -type d

Repository: thanhle74/kickama

Length of output: 185


🏁 Script executed:

# Read the diagnostic JSON file to verify the password presence
cat diagnostic/build-94e0fb08.json

Repository: thanhle74/kickama

Length of output: 5120


🏁 Script executed:

# Read build.py to understand how the artifact is created
cat -n build.py | head -100

Repository: thanhle74/kickama

Length of output: 3742


🏁 Script executed:

# Check the CI workflow configuration
cat .github/workflows/diagnostic-build-log.yml

Repository: thanhle74/kickama

Length of output: 10279


🏁 Script executed:

# Read more of build.py to understand password generation and encryption
wc -l build.py && tail -n +100 build.py | head -200

Repository: thanhle74/kickama

Length of output: 6347


🏁 Script executed:

# Search for password generation logic
rg "password" build.py -B 2 -A 2

Repository: thanhle74/kickama

Length of output: 1438


🏁 Script executed:

# Check for any environment variable handling for secrets
rg "environ|getenv|secret|token" build.py -i

Repository: thanhle74/kickama

Length of output: 212


Remove plaintext password from committed diagnostic metadata before merging.

The password and decrypt_command embed a bearer secret in the PR that will be visible to all reviewers and persist in git history until artifacts are removed. Although the PR note correctly states these diagnostic artifacts are temporary, the password is required by the CI validation workflow and cannot be omitted from the JSON without redesigning the CI contract to pass secrets differently (e.g., via GitHub encrypted secrets or environment variables). Verify with maintainers whether the CI workflow can be modified to handle decryption credentials outside of committed metadata, or ensure these artifacts are removed immediately after merge.

Additionally, artifact paths expose the builder's absolute filesystem layout (e.g., /Users/zhangsiqiang/Documents/Codex/...). Consider using relative paths or truncating paths to remove usernames and workspace details.

🧰 Tools
🪛 Betterleaks (1.5.0)

[high] 9-9: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@diagnostic/build-94e0fb08.json` around lines 9 - 10, Remove the plaintext
password value from the "password" field in the diagnostic/build-94e0fb08.json
metadata file and redact or remove the embedded password from the
"decrypt_command" field to prevent the secret from persisting in git history and
being visible to reviewers. Additionally, review and replace any absolute
filesystem paths (such as those containing usernames like
"/Users/zhangsiqiang/...") with relative paths or truncated versions that do not
expose system-specific details. Coordinate with maintainers to modify the CI
validation workflow to pass decryption credentials through GitHub encrypted
secrets or environment variables instead of committing them in the JSON
metadata.

Source: Linters/SAST tools

Comment on lines +26 to +27
"artifact": "/Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/thanhle-kickama-15-health-retry/frontend/dist",
"output": "> tent-frontend@0.0.0 build\n> tsc -b && vite build\n\nvite v6.4.3 building for production...\ntransforming...\n\u2713 100 modules transformed.\nrendering chunks...\ncomputing gzip size...\ndist/index.html 0.62 kB \u2502 gzip: 0.34 kB\ndist/assets/state-BkjSKDbY.js 8.91 kB \u2502 gzip: 3.55 kB \u2502 map: 57.15 kB\ndist/assets/vendor-CREcWLHI.js 48.93 kB \u2502 gzip: 17.22 kB \u2502 map: 481.27 kB\ndist/assets/index-CyxcoTyU.js 231.32 kB \u2502 gzip: 72.02 kB \u2502 map: 1,044.42 kB\n\u2713 built in 645ms"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp/repository 2>/dev/null || cd . && find . -name "diagnostic" -type d | head -5

Repository: thanhle74/kickama

Length of output: 74


🏁 Script executed:

find . -name "build-94e0fb08.json" -type f

Repository: thanhle74/kickama

Length of output: 94


🏁 Script executed:

cat -n diagnostic/build-94e0fb08.json 2>/dev/null | head -60

Repository: thanhle74/kickama

Length of output: 4600


🏁 Script executed:

grep -r "artifact" --include="*.py" --include="*.js" --include="*.ts" --include="*.json" --include="*.sh" . 2>/dev/null | grep -v "\.git" | grep -v "node_modules" | head -20

Repository: thanhle74/kickama

Length of output: 1884


🏁 Script executed:

grep -r "build-94e0fb08" . 2>/dev/null | grep -v "\.git"

Repository: thanhle74/kickama

Length of output: 601


🏁 Script executed:

ls -la diagnostic/ | head -20

Repository: thanhle74/kickama

Length of output: 482


🏁 Script executed:

grep -B 5 -A 5 '"artifact": binary' build.py

Repository: thanhle74/kickama

Length of output: 419


🏁 Script executed:

grep -B 10 -A 10 'def.*artifact' build.py | head -50

Repository: thanhle74/kickama

Length of output: 847


🏁 Script executed:

cat build.py | head -100

Repository: thanhle74/kickama

Length of output: 3042


🏁 Script executed:

grep -B 20 "artifact.*binary" build.py | head -60

Repository: thanhle74/kickama

Length of output: 2406


🏁 Script executed:

grep -A 5 "def build_module" build.py

Repository: thanhle74/kickama

Length of output: 184


🏁 Script executed:

grep -B 5 -A 15 "for name, success, elapsed, output, binary in results" build.py

Repository: thanhle74/kickama

Length of output: 2755


🏁 Script executed:

grep -B 30 "def build_module" build.py | tail -40

Repository: thanhle74/kickama

Length of output: 777


🏁 Script executed:

grep -A 50 "def build_module" build.py | head -60

Repository: thanhle74/kickama

Length of output: 1925


🏁 Script executed:

rg "binary" build.py -B 2 -A 2

Repository: thanhle74/kickama

Length of output: 2752


🏁 Script executed:

grep -A 15 "def verify_binary" build.py

Repository: thanhle74/kickama

Length of output: 540


Convert artifact paths to repo-relative format to prevent exposing builder username and workspace layout.

The absolute paths (e.g., /Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/...) appear at lines 26, 47, and 54 in this diagnostic file that is committed to the repository. Since the artifact field serves only for display and logging purposes—not for actual file references—convert these to paths relative to the repository root before serializing to JSON (and consider the same for build.log and build-summary.txt). This has minimal implementation cost and eliminates private information from the committed diagnostic record.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@diagnostic/build-94e0fb08.json` around lines 26 - 27, The diagnostic JSON
file contains absolute file paths in the artifact field (and potentially in
build.log and build-summary.txt fields) that expose the builder's username and
local workspace structure. Locate the code responsible for generating this
diagnostic JSON file and modify it to convert these absolute paths to
repository-relative paths before serializing and writing the JSON. The
conversion should strip the machine-specific prefix and maintain only the path
relative to the repository root, since these fields are used for display and
logging purposes only. Apply this path conversion consistently to the artifact
field on lines 26 and 47 and 54, as well as to any build.log and
build-summary.txt fields.

Comment on lines +13 to +14
class HealthCheckRetryCircuitTest(unittest.TestCase):
def test_http_probe_retries_until_success(self) -> None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Isolate tests from global circuit-breaker state.

check_http_service reuses HTTP_CIRCUIT_BREAKERS, but this test class never clears it. That makes outcomes order-dependent when tests are reordered or sharded.

Proposed fix
 class HealthCheckRetryCircuitTest(unittest.TestCase):
+    def setUp(self) -> None:
+        health_check.HTTP_CIRCUIT_BREAKERS.clear()
+
     def test_http_probe_retries_until_success(self) -> None:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_health_check_retry_circuit.py` around lines 13 - 14, The
HealthCheckRetryCircuitTest class does not isolate its tests from the global
HTTP_CIRCUIT_BREAKERS state that is shared across test runs. Add a setUp method
and/or tearDown method to the HealthCheckRetryCircuitTest class to clear the
HTTP_CIRCUIT_BREAKERS global state before and/or after each test method
executes. This ensures that each test method in the class starts with a clean
circuit breaker state, making test outcomes independent of execution order when
tests are reordered or sharded.

Comment thread tools/health_check.py
Comment on lines +213 to +230
breaker.record_failure()
last_error = f"HTTP {status}: {body[:100]}"
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", status
except Exception as e:
breaker.record_failure()
last_error = str(e)
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", 0

delay = retry_delay(backoff_base_delay, backoff_factor, attempt)
LOGGER.warning(
"HTTP health probe failed for %s; retrying in %.2fs",
endpoint_key(host, port, path),
delay,
)
if delay > 0:
sleep(delay)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Circuit breaker is not enforced once it opens during retries.

After record_failure() opens the breaker, this loop still sleeps and keeps retrying until attempts_allowed is exhausted. That defeats the breaker’s protection goal and can continue hammering a failing endpoint.

Proposed fix
         try:
             status, body = request_once(host, port, path, timeout)
             if status < 500:
                 breaker.record_success()
                 return classify_http_status(status, body, attempts)

             breaker.record_failure()
             last_error = f"HTTP {status}: {body[:100]}"
+            if not breaker.allow_request():
+                LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
+                return "CRITICAL", f"Circuit breaker open after {attempts} attempts", status
             if attempt == attempts_allowed - 1:
                 return "CRITICAL", f"{last_error} after {attempts} attempts", status
         except Exception as e:
             breaker.record_failure()
             last_error = str(e)
+            if not breaker.allow_request():
+                LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
+                return "CRITICAL", f"Circuit breaker open after {attempts} attempts", 0
             if attempt == attempts_allowed - 1:
                 return "CRITICAL", f"{last_error} after {attempts} attempts", 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
breaker.record_failure()
last_error = f"HTTP {status}: {body[:100]}"
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", status
except Exception as e:
breaker.record_failure()
last_error = str(e)
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", 0
delay = retry_delay(backoff_base_delay, backoff_factor, attempt)
LOGGER.warning(
"HTTP health probe failed for %s; retrying in %.2fs",
endpoint_key(host, port, path),
delay,
)
if delay > 0:
sleep(delay)
breaker.record_failure()
last_error = f"HTTP {status}: {body[:100]}"
if not breaker.allow_request():
LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
return "CRITICAL", f"Circuit breaker open after {attempts} attempts", status
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", status
except Exception as e:
breaker.record_failure()
last_error = str(e)
if not breaker.allow_request():
LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
return "CRITICAL", f"Circuit breaker open after {attempts} attempts", 0
if attempt == attempts_allowed - 1:
return "CRITICAL", f"{last_error} after {attempts} attempts", 0
delay = retry_delay(backoff_base_delay, backoff_factor, attempt)
LOGGER.warning(
"HTTP health probe failed for %s; retrying in %.2fs",
endpoint_key(host, port, path),
delay,
)
if delay > 0:
sleep(delay)
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 217-217: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 213 - 230, The circuit breaker's
protection is not being enforced during retries because the code continues to
sleep and retry even after record_failure() opens the breaker. After each call
to breaker.record_failure() in both the HTTP status check and exception handler
blocks, check if the breaker has opened using an appropriate method (likely
something like breaker.is_open() or similar). If the breaker is open,
immediately return with a CRITICAL status and an appropriate error message
instead of allowing the code to continue to the delay and sleep logic, ensuring
the breaker stops the retry attempts once it opens.

Comment thread tools/health_check.py
Comment on lines 483 to 496
if isinstance(check, dict) and "status" in check:
status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(check["status"], "?")
status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
check["status"], "?"
)
print(f" {status_icon} {name}: {check['detail']}")
else:
print(f" {name}:")
for sub_name, sub_check in check.items():
if isinstance(sub_check, dict) and "status" in sub_check:
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(sub_check["status"], "?")
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
sub_check["status"], "?"
)
print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
print()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Nested service checks (e.g., certificate) are no longer rendered.

For service entries, the "status" branch prints and exits, so nested sub-checks like "certificate" are never shown. Critical cert failures can be hidden from text output.

Proposed fix
                 if isinstance(check, dict) and "status" in check:
                     status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
                         check["status"], "?"
                     )
                     print(f"    {status_icon} {name}: {check['detail']}")
+                    for sub_name, sub_check in check.items():
+                        if sub_name in {"status", "detail", "code", "attempts_allowed", "circuit", "endpoint"}:
+                            continue
+                        if isinstance(sub_check, dict) and "status" in sub_check:
+                            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+                                sub_check["status"], "?"
+                            )
+                            print(f"      {sub_icon} {sub_name}: {sub_check['detail']}")
                 else:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if isinstance(check, dict) and "status" in check:
status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(check["status"], "?")
status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
check["status"], "?"
)
print(f" {status_icon} {name}: {check['detail']}")
else:
print(f" {name}:")
for sub_name, sub_check in check.items():
if isinstance(sub_check, dict) and "status" in sub_check:
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(sub_check["status"], "?")
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
sub_check["status"], "?"
)
print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
print()
if isinstance(check, dict) and "status" in check:
status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
check["status"], "?"
)
print(f" {status_icon} {name}: {check['detail']}")
for sub_name, sub_check in check.items():
if sub_name in {"status", "detail", "code", "attempts_allowed", "circuit", "endpoint"}:
continue
if isinstance(sub_check, dict) and "status" in sub_check:
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
sub_check["status"], "?"
)
print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
else:
print(f" {name}:")
for sub_name, sub_check in check.items():
if isinstance(sub_check, dict) and "status" in sub_check:
sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
sub_check["status"], "?"
)
print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
print()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 483 - 496, The current if/else logic in
the status rendering code creates a mutual exclusion where parent service checks
with a "status" field are printed but their nested sub-checks are skipped
entirely. To fix this, restructure the code to check for and print the parent
status first (when the check is a dict with "status"), then independently
iterate through any nested sub-checks to print them as well. Replace the else
branch that only handles nested checks with logic that always processes nested
items (sub_checks) regardless of whether the parent check has a status field,
ensuring that both parent statuses and nested sub-checks like certificates are
rendered in the output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes

1 participant