Add retry/backoff and circuit breaker to health checks by zhangsiqiang519 · Pull Request #17 · thanhle74/kickama

zhangsiqiang519 · 2026-06-20T17:08:47Z

Summary

Adds retry/backoff and circuit-breaker support to HTTP health probes in tools/health_check.py, plus result summary counts for operators.

Changes

Adds --max-retries, --backoff-factor, --circuit-threshold, and --circuit-cooldown CLI flags.
Retries transient HTTP probe failures with exponential backoff.
Tracks per-endpoint circuit breakers and returns a clear open-circuit result instead of hammering failed services.
Adds WARNING-level logging for retries, open circuits, and degraded services.
Adds JSON/text summary counts for OK, WARNING, CRITICAL, and open circuits.
Adds 6 focused unit tests for retry, backoff, circuit behavior, summary aggregation, and CLI parsing.
Documents the retry/circuit workflow in docs/OPERATIONS.md.

Testing

python3 -m unittest tests/test_health_check_retry_circuit.py (6 tests)
python3 -m py_compile tools/health_check.py tests/test_health_check_retry_circuit.py
python3 tools/health_check.py --service backend --max-retries 1 --backoff-factor 0 --circuit-threshold 2 --json (smoke run exits 0 and reports degraded local backend)
git diff --check
python3 build.py generated real diagnostics. It exited 1 because this runner is missing cargo, go, luac, and ghc, and the existing frailbox C build fails on this platform. Frontend, engine, compliance, and v2 market stream passed.

Diagnostic artifacts:

diagnostic/build-94e0fb08.logd
diagnostic/build-94e0fb08.json

Decrypt command:

encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d

Checklist

Relevant modules affected by these changes build locally
Tests pass locally
Diagnostic build log is committed in this PR
Documentation has been updated, if applicable
Configuration or schema changes are documented, if applicable
No generated build artifacts are committed, except the required diagnostic build log
Changes are scoped to the PR purpose and avoid unrelated cleanup
Security, privacy, and error-handling implications have been considered

I would like to request that my diagnostic build log is removed before merging

Closes #15

Summary by CodeRabbit

New Features
- Health probes now automatically retry failed checks with configurable exponential backoff
- Prevents repeated checks to failing services after reaching a failure threshold
- Enhanced health reports with aggregated status summaries and summary counts
Documentation
- Added operational guide for configuring health probe retry behavior and failure thresholds
Tests
- Added comprehensive test coverage for health probe retry and failure management features

coderabbitai · 2026-06-20T17:09:00Z

📝 Walkthrough

Walkthrough

tools/health_check.py is extended with a CircuitBreaker class, retry/exponential-backoff logic, and result aggregation via summarize_results. New CLI flags (--max-retries, --backoff-factor, --circuit-threshold, --circuit-cooldown) are added and wired through run_health_checks. A new test module and an operations documentation subsection are added, plus a generated diagnostic build JSON artifact.

Changes

HTTP Probe Retry and Circuit Breaker

Layer / File(s)	Summary
`CircuitBreaker` abstraction and HTTP probe refactor `tools/health_check.py`	Adds `CircuitBreaker` with `allow_request`, `record_success`, `record_failure`, and `state`; adds helpers `endpoint_key`, `get_circuit_breaker`, `perform_http_request`, `classify_http_status`, `retry_delay`; rewrites `check_http_service` to retry with exponential backoff and short-circuit when the breaker is open. Also expands imports and reformats `INFRASTRUCTURE` config dicts.
Result aggregation, reporting, and CLI wiring `tools/health_check.py`	Adds `summarize_results` to count OK/WARNING/CRITICAL and open circuits; updates `run_health_checks` to accept and pass retry/circuit params, attach circuit state per service, and store `results["summary"]`; updates `print_health_report` to display summary counts and icon-driven (✓/⚠/✗) status output; adds `--max-retries`, `--backoff-factor`, `--circuit-threshold`, `--circuit-cooldown` to `parse_args` and wires them through `main()`.
Unit tests for retry, backoff, circuit breaker, and aggregation `tests/test_health_check_retry_circuit.py`	Tests probe retry-until-success with attempt count in detail; exponential backoff sleep durations with CRITICAL on all-timeout; circuit breaker blocking after threshold; cooldown reset via injectable clock; `summarize_results` aggregation counts; and CLI flag parsing.
Operations documentation `docs/OPERATIONS.md`	Adds subsection documenting retry/backoff/circuit-breaker CLI flags, the backoff delay formula, open-circuit behavior, and output summary counts.

Diagnostic Build Artifact

Layer / File(s)	Summary
Build diagnostic JSON `diagnostic/build-94e0fb08.json`	Generated JSON for commit `94e0fb08` with per-module PASS/FAIL status, timing, captured outputs, and a `pr_note` referencing the encrypted `.logd` artifact.

Sequence Diagram(s)

sequenceDiagram
  participant main
  participant run_health_checks
  participant get_circuit_breaker
  participant CircuitBreaker
  participant check_http_service
  participant perform_http_request
  participant summarize_results

  main->>run_health_checks: max_retries, backoff_factor, circuit_threshold, circuit_cooldown
  run_health_checks->>get_circuit_breaker: endpoint_key(host, port, path)
  get_circuit_breaker-->>run_health_checks: CircuitBreaker instance
  run_health_checks->>check_http_service: ..., circuit_breaker=CB
  check_http_service->>CircuitBreaker: allow_request()
  alt circuit open
    CircuitBreaker-->>check_http_service: False → CRITICAL "circuit open"
  else circuit closed
    loop attempt in range(max_retries+1)
      check_http_service->>perform_http_request: host, port, path, timeout
      alt response received
        perform_http_request-->>check_http_service: status code + body
        check_http_service->>CircuitBreaker: record_success()
        check_http_service-->>run_health_checks: OK/WARNING + detail
      else timeout/error
        check_http_service->>CircuitBreaker: record_failure()
        check_http_service->>check_http_service: sleep(retry_delay(attempt))
      end
    end
    check_http_service-->>run_health_checks: CRITICAL + detail
  end
  run_health_checks->>summarize_results: results dict
  summarize_results-->>run_health_checks: {ok, warning, critical, open_circuits}
  run_health_checks-->>main: results + summary

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

#15 [$35 BOUNTY] Add retry/backoff and circuit breaker to health_check HTTP probes — This PR directly implements all acceptance criteria from that issue: --max-retries, --backoff-factor, --circuit-threshold flags; exponential backoff; circuit breaker with cooldown; summarize_results aggregation; at least 5 new unit tests; and the required diagnostic/build-*.json artifact.

Poem

🐇 Hoppity hop through the network I go,
When servers go dark I retry, nice and slow.
Exponential backoff — the carrot must wait,
The circuit trips open if failures grow great.
After cooldown I peek: is the burrow back bright?
✓ OK, the warren is humming — just right! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly and concisely describes the main change: adding retry/backoff and circuit breaker functionality to health checks, which aligns with the primary objective of the changeset.
Description check	✅ Passed	The PR description comprehensively covers all required template sections: Summary, Changes, Testing (with test execution and diagnostic details), and completed Checklist with all items marked.
Linked Issues check	✅ Passed	All coding requirements from issue `#15` are met: retry/backoff with exponential backoff formula implemented, circuit breaker pattern with threshold/cooldown implemented, CLI flags added, result summary aggregation added, WARNING-level logging added, 6 unit tests added exceeding the 5-test requirement, and diagnostic artifacts generated.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to PR objectives: core retry/backoff/circuit-breaker implementation, supporting tests, documentation updates, and diagnostic artifacts. No unrelated cleanup or extraneous changes present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@diagnostic/build-94e0fb08.json`:
- Around line 26-27: The diagnostic JSON file contains absolute file paths in
the artifact field (and potentially in build.log and build-summary.txt fields)
that expose the builder's username and local workspace structure. Locate the
code responsible for generating this diagnostic JSON file and modify it to
convert these absolute paths to repository-relative paths before serializing and
writing the JSON. The conversion should strip the machine-specific prefix and
maintain only the path relative to the repository root, since these fields are
used for display and logging purposes only. Apply this path conversion
consistently to the artifact field on lines 26 and 47 and 54, as well as to any
build.log and build-summary.txt fields.
- Around line 9-10: Remove the plaintext password value from the "password"
field in the diagnostic/build-94e0fb08.json metadata file and redact or remove
the embedded password from the "decrypt_command" field to prevent the secret
from persisting in git history and being visible to reviewers. Additionally,
review and replace any absolute filesystem paths (such as those containing
usernames like "/Users/zhangsiqiang/...") with relative paths or truncated
versions that do not expose system-specific details. Coordinate with maintainers
to modify the CI validation workflow to pass decryption credentials through
GitHub encrypted secrets or environment variables instead of committing them in
the JSON metadata.

In `@tests/test_health_check_retry_circuit.py`:
- Around line 13-14: The HealthCheckRetryCircuitTest class does not isolate its
tests from the global HTTP_CIRCUIT_BREAKERS state that is shared across test
runs. Add a setUp method and/or tearDown method to the
HealthCheckRetryCircuitTest class to clear the HTTP_CIRCUIT_BREAKERS global
state before and/or after each test method executes. This ensures that each test
method in the class starts with a clean circuit breaker state, making test
outcomes independent of execution order when tests are reordered or sharded.

In `@tools/health_check.py`:
- Around line 213-230: The circuit breaker's protection is not being enforced
during retries because the code continues to sleep and retry even after
record_failure() opens the breaker. After each call to breaker.record_failure()
in both the HTTP status check and exception handler blocks, check if the breaker
has opened using an appropriate method (likely something like breaker.is_open()
or similar). If the breaker is open, immediately return with a CRITICAL status
and an appropriate error message instead of allowing the code to continue to the
delay and sleep logic, ensuring the breaker stops the retry attempts once it
opens.
- Around line 483-496: The current if/else logic in the status rendering code
creates a mutual exclusion where parent service checks with a "status" field are
printed but their nested sub-checks are skipped entirely. To fix this,
restructure the code to check for and print the parent status first (when the
check is a dict with "status"), then independently iterate through any nested
sub-checks to print them as well. Replace the else branch that only handles
nested checks with logic that always processes nested items (sub_checks)
regardless of whether the parent check has a status field, ensuring that both
parent statuses and nested sub-checks like certificates are rendered in the
output.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0f1d543-98ca-4580-8224-5fa6bd17f3cf

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 9019542.

📒 Files selected for processing (5)

diagnostic/build-94e0fb08.json
diagnostic/build-94e0fb08.logd
docs/OPERATIONS.md
tests/test_health_check_retry_circuit.py
tools/health_check.py

coderabbitai · 2026-06-20T17:13:19Z

+  "password": "efcd3c4a127e4b6e294d",
+  "decrypt_command": "encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d",


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for the diagnostic file and related build configuration git ls-files | grep -E "(diagnostic|build)" | head -20

Repository: thanhle74/kickama

Length of output: 267

🏁 Script executed:

# Check if the diagnostic directory exists and list its contents ls -la diagnostic/ 2>/dev/null || echo "diagnostic directory not found"

Repository: thanhle74/kickama

Length of output: 482

🏁 Script executed:

# Search for build.py or similar build configuration files fd "build\.py" -o fd "\.github" -type d

Repository: thanhle74/kickama

Length of output: 185

🏁 Script executed:

# Read the diagnostic JSON file to verify the password presence cat diagnostic/build-94e0fb08.json

Repository: thanhle74/kickama

Length of output: 5120

🏁 Script executed:

# Read build.py to understand how the artifact is created cat -n build.py | head -100

Repository: thanhle74/kickama

Length of output: 3742

🏁 Script executed:

# Check the CI workflow configuration cat .github/workflows/diagnostic-build-log.yml

Repository: thanhle74/kickama

Length of output: 10279

🏁 Script executed:

# Read more of build.py to understand password generation and encryption wc -l build.py && tail -n +100 build.py | head -200

Repository: thanhle74/kickama

Length of output: 6347

🏁 Script executed:

# Search for password generation logic rg "password" build.py -B 2 -A 2

Repository: thanhle74/kickama

Length of output: 1438

🏁 Script executed:

# Check for any environment variable handling for secrets rg "environ|getenv|secret|token" build.py -i

Repository: thanhle74/kickama

Length of output: 212

Remove plaintext password from committed diagnostic metadata before merging.

The password and decrypt_command embed a bearer secret in the PR that will be visible to all reviewers and persist in git history until artifacts are removed. Although the PR note correctly states these diagnostic artifacts are temporary, the password is required by the CI validation workflow and cannot be omitted from the JSON without redesigning the CI contract to pass secrets differently (e.g., via GitHub encrypted secrets or environment variables). Verify with maintainers whether the CI workflow can be modified to handle decryption credentials outside of committed metadata, or ensure these artifacts are removed immediately after merge.

Additionally, artifact paths expose the builder's absolute filesystem layout (e.g., /Users/zhangsiqiang/Documents/Codex/...). Consider using relative paths or truncating paths to remove usernames and workspace details.

🧰 Tools

🪛 Betterleaks (1.5.0)

[high] 9-9: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@diagnostic/build-94e0fb08.json` around lines 9 - 10, Remove the plaintext password value from the "password" field in the diagnostic/build-94e0fb08.json metadata file and redact or remove the embedded password from the "decrypt_command" field to prevent the secret from persisting in git history and being visible to reviewers. Additionally, review and replace any absolute filesystem paths (such as those containing usernames like "/Users/zhangsiqiang/...") with relative paths or truncated versions that do not expose system-specific details. Coordinate with maintainers to modify the CI validation workflow to pass decryption credentials through GitHub encrypted secrets or environment variables instead of committing them in the JSON metadata.

Source: Linters/SAST tools

coderabbitai · 2026-06-20T17:13:19Z

+      "artifact": "/Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/thanhle-kickama-15-health-retry/frontend/dist",
+      "output": "> tent-frontend@0.0.0 build\n> tsc -b && vite build\n\nvite v6.4.3 building for production...\ntransforming...\n\u2713 100 modules transformed.\nrendering chunks...\ncomputing gzip size...\ndist/index.html                   0.62 kB \u2502 gzip:  0.34 kB\ndist/assets/state-BkjSKDbY.js     8.91 kB \u2502 gzip:  3.55 kB \u2502 map:    57.15 kB\ndist/assets/vendor-CREcWLHI.js   48.93 kB \u2502 gzip: 17.22 kB \u2502 map:   481.27 kB\ndist/assets/index-CyxcoTyU.js   231.32 kB \u2502 gzip: 72.02 kB \u2502 map: 1,044.42 kB\n\u2713 built in 645ms"


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp/repository 2>/dev/null || cd . && find . -name "diagnostic" -type d | head -5

Repository: thanhle74/kickama

Length of output: 74

🏁 Script executed:

find . -name "build-94e0fb08.json" -type f

Repository: thanhle74/kickama

Length of output: 94

🏁 Script executed:

cat -n diagnostic/build-94e0fb08.json 2>/dev/null | head -60

Repository: thanhle74/kickama

Length of output: 4600

🏁 Script executed:

grep -r "artifact" --include="*.py" --include="*.js" --include="*.ts" --include="*.json" --include="*.sh" . 2>/dev/null | grep -v "\.git" | grep -v "node_modules" | head -20

Repository: thanhle74/kickama

Length of output: 1884

🏁 Script executed:

grep -r "build-94e0fb08" . 2>/dev/null | grep -v "\.git"

Repository: thanhle74/kickama

Length of output: 601

🏁 Script executed:

ls -la diagnostic/ | head -20

Repository: thanhle74/kickama

Length of output: 482

🏁 Script executed:

grep -B 5 -A 5 '"artifact": binary' build.py

Repository: thanhle74/kickama

Length of output: 419

🏁 Script executed:

grep -B 10 -A 10 'def.*artifact' build.py | head -50

Repository: thanhle74/kickama

Length of output: 847

🏁 Script executed:

cat build.py | head -100

Repository: thanhle74/kickama

Length of output: 3042

🏁 Script executed:

grep -B 20 "artifact.*binary" build.py | head -60

Repository: thanhle74/kickama

Length of output: 2406

🏁 Script executed:

grep -A 5 "def build_module" build.py

Repository: thanhle74/kickama

Length of output: 184

🏁 Script executed:

grep -B 5 -A 15 "for name, success, elapsed, output, binary in results" build.py

Repository: thanhle74/kickama

Length of output: 2755

🏁 Script executed:

grep -B 30 "def build_module" build.py | tail -40

Repository: thanhle74/kickama

Length of output: 777

🏁 Script executed:

grep -A 50 "def build_module" build.py | head -60

Repository: thanhle74/kickama

Length of output: 1925

🏁 Script executed:

rg "binary" build.py -B 2 -A 2

Repository: thanhle74/kickama

Length of output: 2752

🏁 Script executed:

grep -A 15 "def verify_binary" build.py

Repository: thanhle74/kickama

Length of output: 540

Convert artifact paths to repo-relative format to prevent exposing builder username and workspace layout.

The absolute paths (e.g., /Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/...) appear at lines 26, 47, and 54 in this diagnostic file that is committed to the repository. Since the artifact field serves only for display and logging purposes—not for actual file references—convert these to paths relative to the repository root before serializing to JSON (and consider the same for build.log and build-summary.txt). This has minimal implementation cost and eliminates private information from the committed diagnostic record.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@diagnostic/build-94e0fb08.json` around lines 26 - 27, The diagnostic JSON file contains absolute file paths in the artifact field (and potentially in build.log and build-summary.txt fields) that expose the builder's username and local workspace structure. Locate the code responsible for generating this diagnostic JSON file and modify it to convert these absolute paths to repository-relative paths before serializing and writing the JSON. The conversion should strip the machine-specific prefix and maintain only the path relative to the repository root, since these fields are used for display and logging purposes only. Apply this path conversion consistently to the artifact field on lines 26 and 47 and 54, as well as to any build.log and build-summary.txt fields.

coderabbitai · 2026-06-20T17:13:19Z

+class HealthCheckRetryCircuitTest(unittest.TestCase):
+    def test_http_probe_retries_until_success(self) -> None:


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Isolate tests from global circuit-breaker state.

check_http_service reuses HTTP_CIRCUIT_BREAKERS, but this test class never clears it. That makes outcomes order-dependent when tests are reordered or sharded.

Proposed fix

class HealthCheckRetryCircuitTest(unittest.TestCase): + def setUp(self) -> None: + health_check.HTTP_CIRCUIT_BREAKERS.clear() + def test_http_probe_retries_until_success(self) -> None:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_health_check_retry_circuit.py` around lines 13 - 14, The HealthCheckRetryCircuitTest class does not isolate its tests from the global HTTP_CIRCUIT_BREAKERS state that is shared across test runs. Add a setUp method and/or tearDown method to the HealthCheckRetryCircuitTest class to clear the HTTP_CIRCUIT_BREAKERS global state before and/or after each test method executes. This ensures that each test method in the class starts with a clean circuit breaker state, making test outcomes independent of execution order when tests are reordered or sharded.

coderabbitai · 2026-06-20T17:13:19Z

+            breaker.record_failure()
+            last_error = f"HTTP {status}: {body[:100]}"
+            if attempt == attempts_allowed - 1:
+                return "CRITICAL", f"{last_error} after {attempts} attempts", status
+        except Exception as e:
+            breaker.record_failure()
+            last_error = str(e)
+            if attempt == attempts_allowed - 1:
+                return "CRITICAL", f"{last_error} after {attempts} attempts", 0
+
+        delay = retry_delay(backoff_base_delay, backoff_factor, attempt)
+        LOGGER.warning(
+            "HTTP health probe failed for %s; retrying in %.2fs",
+            endpoint_key(host, port, path),
+            delay,
+        )
+        if delay > 0:
+            sleep(delay)


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Circuit breaker is not enforced once it opens during retries.

After record_failure() opens the breaker, this loop still sleeps and keeps retrying until attempts_allowed is exhausted. That defeats the breaker’s protection goal and can continue hammering a failing endpoint.

Proposed fix

try: status, body = request_once(host, port, path, timeout) if status < 500: breaker.record_success() return classify_http_status(status, body, attempts) breaker.record_failure() last_error = f"HTTP {status}: {body[:100]}" + if not breaker.allow_request(): + LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path)) + return "CRITICAL", f"Circuit breaker open after {attempts} attempts", status if attempt == attempts_allowed - 1: return "CRITICAL", f"{last_error} after {attempts} attempts", status except Exception as e: breaker.record_failure() last_error = str(e) + if not breaker.allow_request(): + LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path)) + return "CRITICAL", f"Circuit breaker open after {attempts} attempts", 0 if attempt == attempts_allowed - 1: return "CRITICAL", f"{last_error} after {attempts} attempts", 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

breaker.record_failure()

last_error = f"HTTP {status}: {body[:100]}"

if attempt == attempts_allowed - 1:

return "CRITICAL", f"{last_error} after {attempts} attempts", status

except Exception as e:

breaker.record_failure()

last_error = str(e)

if attempt == attempts_allowed - 1:

return "CRITICAL", f"{last_error} after {attempts} attempts", 0

delay = retry_delay(backoff_base_delay, backoff_factor, attempt)

LOGGER.warning(

"HTTP health probe failed for %s; retrying in %.2fs",

endpoint_key(host, port, path),

delay,

)

if delay > 0:

sleep(delay)

breaker.record_failure()

last_error = f"HTTP {status}: {body[:100]}"

if not breaker.allow_request():

LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))

return "CRITICAL", f"Circuit breaker open after {attempts} attempts", status

if attempt == attempts_allowed - 1:

return "CRITICAL", f"{last_error} after {attempts} attempts", status

except Exception as e:

breaker.record_failure()

last_error = str(e)

if not breaker.allow_request():

LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))

return "CRITICAL", f"Circuit breaker open after {attempts} attempts", 0

if attempt == attempts_allowed - 1:

return "CRITICAL", f"{last_error} after {attempts} attempts", 0

delay = retry_delay(backoff_base_delay, backoff_factor, attempt)

LOGGER.warning(

"HTTP health probe failed for %s; retrying in %.2fs",

endpoint_key(host, port, path),

delay,

)

if delay > 0:

sleep(delay)

🧰 Tools

🪛 Ruff (0.15.17)

[warning] 217-217: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` around lines 213 - 230, The circuit breaker's protection is not being enforced during retries because the code continues to sleep and retry even after record_failure() opens the breaker. After each call to breaker.record_failure() in both the HTTP status check and exception handler blocks, check if the breaker has opened using an appropriate method (likely something like breaker.is_open() or similar). If the breaker is open, immediately return with a CRITICAL status and an appropriate error message instead of allowing the code to continue to the delay and sleep logic, ensuring the breaker stops the retry attempts once it opens.

coderabbitai · 2026-06-20T17:13:19Z

                if isinstance(check, dict) and "status" in check:
-                    status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(check["status"], "?")
+                    status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+                        check["status"], "?"
+                    )
                    print(f"    {status_icon} {name}: {check['detail']}")
                else:
                    print(f"    {name}:")
                    for sub_name, sub_check in check.items():
                        if isinstance(sub_check, dict) and "status" in sub_check:
-                            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(sub_check["status"], "?")
+                            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+                                sub_check["status"], "?"
+                            )
                            print(f"      {sub_icon} {sub_name}: {sub_check['detail']}")
    print()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Nested service checks (e.g., certificate) are no longer rendered.

For service entries, the "status" branch prints and exits, so nested sub-checks like "certificate" are never shown. Critical cert failures can be hidden from text output.

Proposed fix

if isinstance(check, dict) and "status" in check: status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get( check["status"], "?" ) print(f" {status_icon} {name}: {check['detail']}") + for sub_name, sub_check in check.items(): + if sub_name in {"status", "detail", "code", "attempts_allowed", "circuit", "endpoint"}: + continue + if isinstance(sub_check, dict) and "status" in sub_check: + sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get( + sub_check["status"], "?" + ) + print(f" {sub_icon} {sub_name}: {sub_check['detail']}") else:

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if isinstance(check, dict) and "status" in check:

status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(check["status"], "?")

status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(

check["status"], "?"

)

print(f" {status_icon} {name}: {check['detail']}")

else:

print(f" {name}:")

for sub_name, sub_check in check.items():

if isinstance(sub_check, dict) and "status" in sub_check:

sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(sub_check["status"], "?")

sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(

sub_check["status"], "?"

)

print(f" {sub_icon} {sub_name}: {sub_check['detail']}")

print()

if isinstance(check, dict) and "status" in check:

status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(

check["status"], "?"

)

print(f" {status_icon} {name}: {check['detail']}")

for sub_name, sub_check in check.items():

if sub_name in {"status", "detail", "code", "attempts_allowed", "circuit", "endpoint"}:

continue

if isinstance(sub_check, dict) and "status" in sub_check:

sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(

sub_check["status"], "?"

)

print(f" {sub_icon} {sub_name}: {sub_check['detail']}")

else:

print(f" {name}:")

for sub_name, sub_check in check.items():

if isinstance(sub_check, dict) and "status" in sub_check:

sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(

sub_check["status"], "?"

)

print(f" {sub_icon} {sub_name}: {sub_check['detail']}")

print()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` around lines 483 - 496, The current if/else logic in the status rendering code creates a mutual exclusion where parent service checks with a "status" field are printed but their nested sub-checks are skipped entirely. To fix this, restructure the code to check for and print the parent status first (when the check is a dict with "status"), then independently iterate through any nested sub-checks to print them as well. Replace the else branch that only handles nested checks with logic that always processes nested items (sub_checks) regardless of whether the parent check has a status field, ensuring that both parent statuses and nested sub-checks like certificates are rendered in the output.

添加健康检查重试和熔断

9019542

coderabbitai Bot reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add retry/backoff and circuit breaker to health checks#17

Add retry/backoff and circuit breaker to health checks#17
zhangsiqiang519 wants to merge 1 commit into
thanhle74:mainfrom
zhangsiqiang519:codex/health-check-retry-circuit-15

zhangsiqiang519 commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

coderabbitai Bot Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		"password": "efcd3c4a127e4b6e294d",
		"decrypt_command": "encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d",

		"artifact": "/Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/thanhle-kickama-15-health-retry/frontend/dist",
		"output": "> tent-frontend@0.0.0 build\n> tsc -b && vite build\n\nvite v6.4.3 building for production...\ntransforming...\n\u2713 100 modules transformed.\nrendering chunks...\ncomputing gzip size...\ndist/index.html 0.62 kB \u2502 gzip: 0.34 kB\ndist/assets/state-BkjSKDbY.js 8.91 kB \u2502 gzip: 3.55 kB \u2502 map: 57.15 kB\ndist/assets/vendor-CREcWLHI.js 48.93 kB \u2502 gzip: 17.22 kB \u2502 map: 481.27 kB\ndist/assets/index-CyxcoTyU.js 231.32 kB \u2502 gzip: 72.02 kB \u2502 map: 1,044.42 kB\n\u2713 built in 645ms"

		class HealthCheckRetryCircuitTest(unittest.TestCase):
		def test_http_probe_retries_until_success(self) -> None:

Conversation

zhangsiqiang519 commented Jun 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhangsiqiang519 commented Jun 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 20, 2026 •

edited

Loading