Skip to content

feat(health_check): add retry/backoff/circuit-breaker to HTTP probes (#15)#19

Open
leo202000 wants to merge 2 commits into
thanhle74:mainfrom
leo202000:feat/health-retry-circuit
Open

feat(health_check): add retry/backoff/circuit-breaker to HTTP probes (#15)#19
leo202000 wants to merge 2 commits into
thanhle74:mainfrom
leo202000:feat/health-retry-circuit

Conversation

@leo202000

@leo202000 leo202000 commented Jun 21, 2026

Copy link
Copy Markdown

Summary

Adds configurable retry logic with exponential backoff, a circuit breaker, result aggregation, and WARNING-level logging to the health_check HTTP probes, addressing bounty issue #15.

Changes

  • tools/health_check.py:
    • New CircuitBreaker class: opens after N consecutive failures, resets to a half-open trial state after a cooldown period.
    • New check_http_service_with_retry() with exponential backoff (delay = base_delay * backoff_factor ** attempt), wired into run_health_checks for HTTP service probes.
    • New CLI flags: --max-retries, --backoff-factor, --backoff-base-delay, --circuit-threshold, --circuit-cooldown. Defaults preserve the existing single-attempt behavior.
    • Per-service circuit breakers, a summary block in the results (total/ok/warning/critical/circuit_open), and logger.warning entries for WARNING/CRITICAL checks.
  • tools/test_health_check_retry_circuit.py: 8 unit tests covering retry-until-success, exponential backoff delay sequence, no-retry-on-OK, open-breaker short-circuit, threshold + cooldown reset, and summary aggregation with warning logging.

Testing

  • python3 tools/test_health_check_retry_circuit.py -> 8 tests pass.
  • python3 build.py -> diagnostic artifacts generated (frontend, frailbox, and engine build; remaining modules fail only due to missing toolchains in this build environment).
  • Encrypted diagnostic logd committed at diagnostic/build-d5fa7f48.logd.

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated, if applicable (docstrings updated)
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

Closes #15. Please let me know the process for claiming the $35 (LT) bounty once this is merged.

Bounty payout address (USDC on Base): 0x1d03bef9aa379ed932cece384f9f1e72095d5b9e

Summary by CodeRabbit

  • New Features

    • Health check tool now supports automatic retries with exponential backoff for HTTP probes
    • Added circuit breaker pattern to prevent cascading failures during service outages
    • CLI options added to configure retry attempts and circuit breaker behavior
  • Tests

    • Added comprehensive test coverage for retry/backoff and circuit breaker functionality

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

tools/health_check.py gains a CircuitBreaker class, a check_http_service_with_retry function with exponential backoff, an expanded run_health_checks signature with per-service circuit-breaker instantiation, new CLI flags, and a watch-mode wiring fix. A new test file covers all new behaviors. A diagnostic JSON artifact for commit d5fa7f48 is also added.

Changes

Health Check Retry and Circuit Breaker

Layer / File(s) Summary
CircuitBreaker class and retry constants
tools/health_check.py
Adds import logging, default constants for retry/backoff/circuit configuration, and the CircuitBreaker class implementing threshold-based opening, success reset, and cooldown-based half-open state transition.
check_http_service_with_retry
tools/health_check.py
Adds check_http_service_with_retry wrapping check_http_service with exponential-backoff retries, circuit-breaker short-circuit when open, breaker updates on success/failure, and sleeps between attempts.
run_health_checks refactor and summary
tools/health_check.py
Expands run_health_checks signature with retry/backoff/circuit parameters, initializes per-service CircuitBreaker instances, replaces single-attempt HTTP probe with check_http_service_with_retry, unifies tally across service/infrastructure/system/certificate checks, and adds circuit_open to the returned summary.
CLI flags and watch-mode wiring
tools/health_check.py
Adds --max-retries, --backoff-factor, --backoff-base-delay, --circuit-threshold, --circuit-cooldown flags to parse_args, and refactors main's watch loop via a _run() helper to pass all new parameters into run_health_checks.
Unit tests
tools/test_health_check_retry_circuit.py
Adds CircuitBreakerTests, RetryBackoffTests, and AggregationTests covering state transitions, exponential sleep growth, short-circuit behavior, summary field correctness, WARNING log content, and probe call counts.

Diagnostic Build Artifact

Layer / File(s) Summary
Diagnostic JSON
diagnostic/build-d5fa7f48.json
Adds the full diagnostic JSON for commit d5fa7f48 with aggregate pass/fail counts (3 passed, 7 failed), per-module build output and error text, encrypted log artifact path, decryption command with embedded password, and a pr_note.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / main()
    participant RHC as run_health_checks
    participant CB as CircuitBreaker
    participant Retry as check_http_service_with_retry
    participant Probe as check_http_service

    CLI->>RHC: run_health_checks(max_retries, backoff_factor, circuit_threshold, circuit_cooldown)
    RHC->>CB: CircuitBreaker(threshold, cooldown) per service
    loop each service
        RHC->>Retry: check_http_service_with_retry(host, port, path, breaker)
        Retry->>CB: is_open()?
        alt open
            CB-->>Retry: True
            Retry-->>RHC: CRITICAL, "circuit breaker open"
        else closed / half-open
            loop up to max_retries
                Retry->>Probe: check_http_service(host, port, path, timeout)
                Probe-->>Retry: status, detail, code
                alt OK
                    Retry->>CB: record_success()
                    Retry-->>RHC: OK result
                else non-OK
                    Retry->>CB: record_failure()
                    Retry->>Retry: sleep(base_delay * backoff_factor^attempt)
                end
            end
            Retry-->>RHC: last result
        end
        RHC->>RHC: tally(ok/warning/critical)
    end
    RHC-->>CLI: results with summary.circuit_open
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #15 [$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes — This PR directly implements all acceptance criteria: --max-retries, --backoff-factor, --circuit-threshold CLI flags, exponential backoff formula, circuit-breaker open/cooldown logic, a passing diagnostic artifact, and at least 5 unit tests covering retry/backoff/circuit-breaker scenarios.

🐇 A rabbit who checks every hop,
Now backs off when services drop!
The circuit swings open — no fuss,
Retries with delay, earning trust.
With logs full of WARNING and grace,
Health checks now run at a sensible pace! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the primary change: adding retry/backoff/circuit-breaker functionality to HTTP probes in health_check.
Description check ✅ Passed The description is comprehensive and covers all required template sections: summary, changes, testing, and a complete checklist with items marked as done.
Linked Issues check ✅ Passed All acceptance criteria from issue #15 are met: retry/backoff/circuit-breaker implementation, exponential backoff formula, CLI flags, at least 5 unit tests, and diagnostic artifacts committed.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #15 requirements: health_check.py modifications, new test file, and diagnostic artifacts with no unrelated cleanup.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Warning

⚠️ This pull request has been flagged as potential spam (promotional) by CodeRabbit slop detection and should be reviewed carefully.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
tools/health_check.py (1)

354-354: 💤 Low value

Prefix unused unpacked variables with underscore.

The variables disk_pct, mem_pct, and load_val are unpacked but never used. Prefix them with _ to signal intent and silence the linter.

Suggested fix
-    disk_status, disk_detail, disk_pct = check_disk_usage()
+    disk_status, disk_detail, _disk_pct = check_disk_usage()
     # ...
-    mem_status, mem_detail, mem_pct = check_memory_usage()
+    mem_status, mem_detail, _mem_pct = check_memory_usage()
     # ...
-    load_status, load_detail, load_val = check_load_average()
+    load_status, load_detail, _load_val = check_load_average()

Also applies to: 362-362, 370-370

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` at line 354, The variables disk_pct, mem_pct, and
load_val are unpacked in the check_disk_usage(), check_memory_usage(), and
check_load_average() function calls respectively but are never used in the code.
Prefix each of these unused variables with an underscore to signal intent and
satisfy linter requirements. Replace disk_pct with _disk_pct, mem_pct with
_mem_pct, and load_val with _load_val in their respective unpacking statements.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health_check.py`:
- Line 298: The circuit breaker state is lost between watch loop iterations
because the breakers dictionary is instantiated fresh on every call to
run_health_checks() at line 298. To preserve circuit breaker state across
iterations, modify the run_health_checks() function to accept an optional
breakers parameter that defaults to an empty dict if not provided, then ensure
the breakers dict is created once in the main() or _run() function outside the
watch loop and passed into run_health_checks() on each iteration. This way the
circuit breaker state (consecutive failures, open/half-open states) will persist
across multiple health check cycles in watch mode.

In `@tools/test_health_check_retry_circuit.py`:
- Around line 45-54: In the test_retry_until_success method, the variable
`detail` is unpacked from the return value of hc.check_http_service_with_retry()
but is never used in the test. Replace the unused variable name `detail` with
`_` in the unpacking assignment to follow Python convention and signal that this
value is intentionally discarded.
- Line 1: The file test_health_check_retry_circuit.py has a Byte Order Mark
(BOM) character before the shebang line, which prevents Unix/Linux systems from
properly recognizing the shebang directive since the kernel expects `#!` to be
at the very first byte. Re-save the file using UTF-8 encoding without BOM to
ensure the shebang starts at byte 0. Most text editors have an option to save as
"UTF-8" or "UTF-8 without BOM" which will remove the BOM character and allow the
script to be executed directly on Unix/Linux systems.
- Around line 74-83: In the test_open_breaker_short_circuits method, the
variable assignment unpacks three values from the
hc.check_http_service_with_retry call but the code variable is never used
afterward. Replace the code variable name with a single underscore (_) in the
tuple unpacking to follow Python convention and signal that this value is
intentionally discarded.

---

Nitpick comments:
In `@tools/health_check.py`:
- Line 354: The variables disk_pct, mem_pct, and load_val are unpacked in the
check_disk_usage(), check_memory_usage(), and check_load_average() function
calls respectively but are never used in the code. Prefix each of these unused
variables with an underscore to signal intent and satisfy linter requirements.
Replace disk_pct with _disk_pct, mem_pct with _mem_pct, and load_val with
_load_val in their respective unpacking statements.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41e80687-9bc2-4e22-892e-32e600cafa42

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and d98bbd5.

📒 Files selected for processing (4)
  • diagnostic/build-d5fa7f48.json
  • diagnostic/build-d5fa7f48.logd
  • tools/health_check.py
  • tools/test_health_check_retry_circuit.py

Comment thread tools/health_check.py
}

all_ok = True
breakers: Dict[str, CircuitBreaker] = {}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Circuit breaker state is lost between watch iterations.

In watch mode, _run() is called repeatedly in a loop, and each call to run_health_checks() creates a fresh breakers dict (line 298). This means circuit breaker state (consecutive failures, open/half-open state) doesn't persist across iterations, defeating the purpose of the circuit breaker pattern.

To preserve state across iterations, the breakers dict should be instantiated once outside the watch loop and passed into run_health_checks(), or stored at module level.

Proposed fix

One approach is to accept an optional breakers parameter:

 def run_health_checks(service: Optional[str] = None, json_output: bool = False,
                       max_retries: int = DEFAULT_MAX_RETRIES,
                       backoff_factor: float = DEFAULT_BACKOFF_FACTOR,
                       base_delay: float = DEFAULT_BACKOFF_BASE_DELAY,
                       circuit_threshold: int = DEFAULT_CIRCUIT_THRESHOLD,
-                      circuit_cooldown: float = DEFAULT_CIRCUIT_COOLDOWN) -> Dict[str, Any]:
+                      circuit_cooldown: float = DEFAULT_CIRCUIT_COOLDOWN,
+                      breakers: Optional[Dict[str, CircuitBreaker]] = None) -> Dict[str, Any]:
     # ...
     all_ok = True
-    breakers: Dict[str, CircuitBreaker] = {}
+    if breakers is None:
+        breakers = {}

Then in main():

+    breakers: Dict[str, hc.CircuitBreaker] = {}
+
     def _run() -> Dict[str, Any]:
         return run_health_checks(
             args.service, args.json,
             max_retries=args.max_retries,
             backoff_factor=args.backoff_factor,
             base_delay=args.backoff_base_delay,
             circuit_threshold=args.circuit_threshold,
             circuit_cooldown=args.circuit_cooldown,
+            breakers=breakers,
         )

Also applies to: 469-473

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` at line 298, The circuit breaker state is lost between
watch loop iterations because the breakers dictionary is instantiated fresh on
every call to run_health_checks() at line 298. To preserve circuit breaker state
across iterations, modify the run_health_checks() function to accept an optional
breakers parameter that defaults to an empty dict if not provided, then ensure
the breakers dict is created once in the main() or _run() function outside the
watch loop and passed into run_health_checks() on each iteration. This way the
circuit breaker state (consecutive failures, open/half-open states) will persist
across multiple health check cycles in watch mode.

@@ -0,0 +1,112 @@
#!/usr/bin/env python3

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the BOM character before the shebang.

A Byte Order Mark (U+FEFF) character precedes the shebang, which can cause the script to fail on Unix/Linux systems where the kernel looks for #! at the very first byte. This prevents the file from being executed directly.

🐛 Proposed fix

Re-save the file with UTF-8 encoding without BOM, ensuring the shebang starts at byte 0:

-#!/usr/bin/env python3
+#!/usr/bin/env python3

Most editors have an option to save as "UTF-8" or "UTF-8 without BOM".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/usr/bin/env python3
#!/usr/bin/env python3
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 1-1: Shebang should be at the beginning of the file

(EXE005)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check_retry_circuit.py` at line 1, The file
test_health_check_retry_circuit.py has a Byte Order Mark (BOM) character before
the shebang line, which prevents Unix/Linux systems from properly recognizing
the shebang directive since the kernel expects `#!` to be at the very first
byte. Re-save the file using UTF-8 encoding without BOM to ensure the shebang
starts at byte 0. Most text editors have an option to save as "UTF-8" or "UTF-8
without BOM" which will remove the BOM character and allow the script to be
executed directly on Unix/Linux systems.

Source: Linters/SAST tools

Comment on lines +45 to +54
def test_retry_until_success(self):
calls = [("CRITICAL", "down", 0), ("OK", "up", 200)]
with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
mock.patch("health_check.time.sleep") as sleep:
status, detail, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
self.assertEqual(status, "OK")
self.assertEqual(code, 200)
self.assertEqual(probe.call_count, 2)
sleep.assert_called_once_with(0.1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use _ for unused unpacked variable.

The detail variable is unpacked but never referenced. Use _ to signal intentional discard.

♻️ Proposed fix
         with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
              mock.patch("health_check.time.sleep") as sleep:
             status, detail, code = hc.check_http_service_with_retry(
                 "h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
         self.assertEqual(status, "OK")
         self.assertEqual(code, 200)

should be:

         with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
              mock.patch("health_check.time.sleep") as sleep:
-            status, detail, code = hc.check_http_service_with_retry(
+            status, _, code = hc.check_http_service_with_retry(
                 "h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
         self.assertEqual(status, "OK")
         self.assertEqual(code, 200)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_retry_until_success(self):
calls = [("CRITICAL", "down", 0), ("OK", "up", 200)]
with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
mock.patch("health_check.time.sleep") as sleep:
status, detail, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
self.assertEqual(status, "OK")
self.assertEqual(code, 200)
self.assertEqual(probe.call_count, 2)
sleep.assert_called_once_with(0.1)
def test_retry_until_success(self):
calls = [("CRITICAL", "down", 0), ("OK", "up", 200)]
with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
mock.patch("health_check.time.sleep") as sleep:
status, _, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
self.assertEqual(status, "OK")
self.assertEqual(code, 200)
self.assertEqual(probe.call_count, 2)
sleep.assert_called_once_with(0.1)
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 49-49: Unpacked variable detail is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check_retry_circuit.py` around lines 45 - 54, In the
test_retry_until_success method, the variable `detail` is unpacked from the
return value of hc.check_http_service_with_retry() but is never used in the
test. Replace the unused variable name `detail` with `_` in the unpacking
assignment to follow Python convention and signal that this value is
intentionally discarded.

Source: Linters/SAST tools

Comment on lines +74 to +83
def test_open_breaker_short_circuits(self):
cb = hc.CircuitBreaker(threshold=1, cooldown=1000.0)
cb.record_failure()
self.assertTrue(cb.is_open())
with mock.patch("health_check.check_http_service") as probe:
status, detail, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, breaker=cb)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker", detail)
self.assertEqual(probe.call_count, 0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use _ for unused unpacked variable.

The code variable is unpacked but never referenced. Use _ to signal intentional discard.

♻️ Proposed fix
         with mock.patch("health_check.check_http_service") as probe:
-            status, detail, code = hc.check_http_service_with_retry(
+            status, detail, _ = hc.check_http_service_with_retry(
                 "h", 80, "/", 5, max_retries=2, breaker=cb)
         self.assertEqual(status, "CRITICAL")
         self.assertIn("Circuit breaker", detail)
🧰 Tools
🪛 Ruff (0.15.17)

[warning] 79-79: Unpacked variable code is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/test_health_check_retry_circuit.py` around lines 74 - 83, In the
test_open_breaker_short_circuits method, the variable assignment unpacks three
values from the hc.check_http_service_with_retry call but the code variable is
never used afterward. Replace the code variable name with a single underscore
(_) in the tuple unpacking to follow Python convention and signal that this
value is intentionally discarded.

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes

1 participant