feat(health_check): add retry/backoff/circuit-breaker to HTTP probes (#15)#19
feat(health_check): add retry/backoff/circuit-breaker to HTTP probes (#15)#19leo202000 wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthrough
ChangesHealth Check Retry and Circuit Breaker
Diagnostic Build Artifact
Sequence DiagramsequenceDiagram
participant CLI as CLI / main()
participant RHC as run_health_checks
participant CB as CircuitBreaker
participant Retry as check_http_service_with_retry
participant Probe as check_http_service
CLI->>RHC: run_health_checks(max_retries, backoff_factor, circuit_threshold, circuit_cooldown)
RHC->>CB: CircuitBreaker(threshold, cooldown) per service
loop each service
RHC->>Retry: check_http_service_with_retry(host, port, path, breaker)
Retry->>CB: is_open()?
alt open
CB-->>Retry: True
Retry-->>RHC: CRITICAL, "circuit breaker open"
else closed / half-open
loop up to max_retries
Retry->>Probe: check_http_service(host, port, path, timeout)
Probe-->>Retry: status, detail, code
alt OK
Retry->>CB: record_success()
Retry-->>RHC: OK result
else non-OK
Retry->>CB: record_failure()
Retry->>Retry: sleep(base_delay * backoff_factor^attempt)
end
end
Retry-->>RHC: last result
end
RHC->>RHC: tally(ok/warning/critical)
end
RHC-->>CLI: results with summary.circuit_open
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Warning |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
tools/health_check.py (1)
354-354: 💤 Low valuePrefix unused unpacked variables with underscore.
The variables
disk_pct,mem_pct, andload_valare unpacked but never used. Prefix them with_to signal intent and silence the linter.Suggested fix
- disk_status, disk_detail, disk_pct = check_disk_usage() + disk_status, disk_detail, _disk_pct = check_disk_usage() # ... - mem_status, mem_detail, mem_pct = check_memory_usage() + mem_status, mem_detail, _mem_pct = check_memory_usage() # ... - load_status, load_detail, load_val = check_load_average() + load_status, load_detail, _load_val = check_load_average()Also applies to: 362-362, 370-370
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/health_check.py` at line 354, The variables disk_pct, mem_pct, and load_val are unpacked in the check_disk_usage(), check_memory_usage(), and check_load_average() function calls respectively but are never used in the code. Prefix each of these unused variables with an underscore to signal intent and satisfy linter requirements. Replace disk_pct with _disk_pct, mem_pct with _mem_pct, and load_val with _load_val in their respective unpacking statements.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/health_check.py`:
- Line 298: The circuit breaker state is lost between watch loop iterations
because the breakers dictionary is instantiated fresh on every call to
run_health_checks() at line 298. To preserve circuit breaker state across
iterations, modify the run_health_checks() function to accept an optional
breakers parameter that defaults to an empty dict if not provided, then ensure
the breakers dict is created once in the main() or _run() function outside the
watch loop and passed into run_health_checks() on each iteration. This way the
circuit breaker state (consecutive failures, open/half-open states) will persist
across multiple health check cycles in watch mode.
In `@tools/test_health_check_retry_circuit.py`:
- Around line 45-54: In the test_retry_until_success method, the variable
`detail` is unpacked from the return value of hc.check_http_service_with_retry()
but is never used in the test. Replace the unused variable name `detail` with
`_` in the unpacking assignment to follow Python convention and signal that this
value is intentionally discarded.
- Line 1: The file test_health_check_retry_circuit.py has a Byte Order Mark
(BOM) character before the shebang line, which prevents Unix/Linux systems from
properly recognizing the shebang directive since the kernel expects `#!` to be
at the very first byte. Re-save the file using UTF-8 encoding without BOM to
ensure the shebang starts at byte 0. Most text editors have an option to save as
"UTF-8" or "UTF-8 without BOM" which will remove the BOM character and allow the
script to be executed directly on Unix/Linux systems.
- Around line 74-83: In the test_open_breaker_short_circuits method, the
variable assignment unpacks three values from the
hc.check_http_service_with_retry call but the code variable is never used
afterward. Replace the code variable name with a single underscore (_) in the
tuple unpacking to follow Python convention and signal that this value is
intentionally discarded.
---
Nitpick comments:
In `@tools/health_check.py`:
- Line 354: The variables disk_pct, mem_pct, and load_val are unpacked in the
check_disk_usage(), check_memory_usage(), and check_load_average() function
calls respectively but are never used in the code. Prefix each of these unused
variables with an underscore to signal intent and satisfy linter requirements.
Replace disk_pct with _disk_pct, mem_pct with _mem_pct, and load_val with
_load_val in their respective unpacking statements.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 41e80687-9bc2-4e22-892e-32e600cafa42
📒 Files selected for processing (4)
diagnostic/build-d5fa7f48.jsondiagnostic/build-d5fa7f48.logdtools/health_check.pytools/test_health_check_retry_circuit.py
| } | ||
|
|
||
| all_ok = True | ||
| breakers: Dict[str, CircuitBreaker] = {} |
There was a problem hiding this comment.
Circuit breaker state is lost between watch iterations.
In watch mode, _run() is called repeatedly in a loop, and each call to run_health_checks() creates a fresh breakers dict (line 298). This means circuit breaker state (consecutive failures, open/half-open state) doesn't persist across iterations, defeating the purpose of the circuit breaker pattern.
To preserve state across iterations, the breakers dict should be instantiated once outside the watch loop and passed into run_health_checks(), or stored at module level.
Proposed fix
One approach is to accept an optional breakers parameter:
def run_health_checks(service: Optional[str] = None, json_output: bool = False,
max_retries: int = DEFAULT_MAX_RETRIES,
backoff_factor: float = DEFAULT_BACKOFF_FACTOR,
base_delay: float = DEFAULT_BACKOFF_BASE_DELAY,
circuit_threshold: int = DEFAULT_CIRCUIT_THRESHOLD,
- circuit_cooldown: float = DEFAULT_CIRCUIT_COOLDOWN) -> Dict[str, Any]:
+ circuit_cooldown: float = DEFAULT_CIRCUIT_COOLDOWN,
+ breakers: Optional[Dict[str, CircuitBreaker]] = None) -> Dict[str, Any]:
# ...
all_ok = True
- breakers: Dict[str, CircuitBreaker] = {}
+ if breakers is None:
+ breakers = {}Then in main():
+ breakers: Dict[str, hc.CircuitBreaker] = {}
+
def _run() -> Dict[str, Any]:
return run_health_checks(
args.service, args.json,
max_retries=args.max_retries,
backoff_factor=args.backoff_factor,
base_delay=args.backoff_base_delay,
circuit_threshold=args.circuit_threshold,
circuit_cooldown=args.circuit_cooldown,
+ breakers=breakers,
)Also applies to: 469-473
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/health_check.py` at line 298, The circuit breaker state is lost between
watch loop iterations because the breakers dictionary is instantiated fresh on
every call to run_health_checks() at line 298. To preserve circuit breaker state
across iterations, modify the run_health_checks() function to accept an optional
breakers parameter that defaults to an empty dict if not provided, then ensure
the breakers dict is created once in the main() or _run() function outside the
watch loop and passed into run_health_checks() on each iteration. This way the
circuit breaker state (consecutive failures, open/half-open states) will persist
across multiple health check cycles in watch mode.
| @@ -0,0 +1,112 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
Remove the BOM character before the shebang.
A Byte Order Mark (U+FEFF) character precedes the shebang, which can cause the script to fail on Unix/Linux systems where the kernel looks for #! at the very first byte. This prevents the file from being executed directly.
🐛 Proposed fix
Re-save the file with UTF-8 encoding without BOM, ensuring the shebang starts at byte 0:
-#!/usr/bin/env python3
+#!/usr/bin/env python3Most editors have an option to save as "UTF-8" or "UTF-8 without BOM".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/usr/bin/env python3 | |
| #!/usr/bin/env python3 |
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 1-1: Shebang should be at the beginning of the file
(EXE005)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/test_health_check_retry_circuit.py` at line 1, The file
test_health_check_retry_circuit.py has a Byte Order Mark (BOM) character before
the shebang line, which prevents Unix/Linux systems from properly recognizing
the shebang directive since the kernel expects `#!` to be at the very first
byte. Re-save the file using UTF-8 encoding without BOM to ensure the shebang
starts at byte 0. Most text editors have an option to save as "UTF-8" or "UTF-8
without BOM" which will remove the BOM character and allow the script to be
executed directly on Unix/Linux systems.
Source: Linters/SAST tools
| def test_retry_until_success(self): | ||
| calls = [("CRITICAL", "down", 0), ("OK", "up", 200)] | ||
| with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \ | ||
| mock.patch("health_check.time.sleep") as sleep: | ||
| status, detail, code = hc.check_http_service_with_retry( | ||
| "h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1) | ||
| self.assertEqual(status, "OK") | ||
| self.assertEqual(code, 200) | ||
| self.assertEqual(probe.call_count, 2) | ||
| sleep.assert_called_once_with(0.1) |
There was a problem hiding this comment.
Use _ for unused unpacked variable.
The detail variable is unpacked but never referenced. Use _ to signal intentional discard.
♻️ Proposed fix
with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
mock.patch("health_check.time.sleep") as sleep:
status, detail, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
self.assertEqual(status, "OK")
self.assertEqual(code, 200)should be:
with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \
mock.patch("health_check.time.sleep") as sleep:
- status, detail, code = hc.check_http_service_with_retry(
+ status, _, code = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1)
self.assertEqual(status, "OK")
self.assertEqual(code, 200)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_retry_until_success(self): | |
| calls = [("CRITICAL", "down", 0), ("OK", "up", 200)] | |
| with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \ | |
| mock.patch("health_check.time.sleep") as sleep: | |
| status, detail, code = hc.check_http_service_with_retry( | |
| "h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1) | |
| self.assertEqual(status, "OK") | |
| self.assertEqual(code, 200) | |
| self.assertEqual(probe.call_count, 2) | |
| sleep.assert_called_once_with(0.1) | |
| def test_retry_until_success(self): | |
| calls = [("CRITICAL", "down", 0), ("OK", "up", 200)] | |
| with mock.patch("health_check.check_http_service", side_effect=calls) as probe, \ | |
| mock.patch("health_check.time.sleep") as sleep: | |
| status, _, code = hc.check_http_service_with_retry( | |
| "h", 80, "/", 5, max_retries=2, backoff_factor=2.0, base_delay=0.1) | |
| self.assertEqual(status, "OK") | |
| self.assertEqual(code, 200) | |
| self.assertEqual(probe.call_count, 2) | |
| sleep.assert_called_once_with(0.1) |
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 49-49: Unpacked variable detail is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/test_health_check_retry_circuit.py` around lines 45 - 54, In the
test_retry_until_success method, the variable `detail` is unpacked from the
return value of hc.check_http_service_with_retry() but is never used in the
test. Replace the unused variable name `detail` with `_` in the unpacking
assignment to follow Python convention and signal that this value is
intentionally discarded.
Source: Linters/SAST tools
| def test_open_breaker_short_circuits(self): | ||
| cb = hc.CircuitBreaker(threshold=1, cooldown=1000.0) | ||
| cb.record_failure() | ||
| self.assertTrue(cb.is_open()) | ||
| with mock.patch("health_check.check_http_service") as probe: | ||
| status, detail, code = hc.check_http_service_with_retry( | ||
| "h", 80, "/", 5, max_retries=2, breaker=cb) | ||
| self.assertEqual(status, "CRITICAL") | ||
| self.assertIn("Circuit breaker", detail) | ||
| self.assertEqual(probe.call_count, 0) |
There was a problem hiding this comment.
Use _ for unused unpacked variable.
The code variable is unpacked but never referenced. Use _ to signal intentional discard.
♻️ Proposed fix
with mock.patch("health_check.check_http_service") as probe:
- status, detail, code = hc.check_http_service_with_retry(
+ status, detail, _ = hc.check_http_service_with_retry(
"h", 80, "/", 5, max_retries=2, breaker=cb)
self.assertEqual(status, "CRITICAL")
self.assertIn("Circuit breaker", detail)🧰 Tools
🪛 Ruff (0.15.17)
[warning] 79-79: Unpacked variable code is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/test_health_check_retry_circuit.py` around lines 74 - 83, In the
test_open_breaker_short_circuits method, the variable assignment unpacks three
values from the hc.check_http_service_with_retry call but the code variable is
never used afterward. Replace the code variable name with a single underscore
(_) in the tuple unpacking to follow Python convention and signal that this
value is intentionally discarded.
Source: Linters/SAST tools
Summary
Adds configurable retry logic with exponential backoff, a circuit breaker, result aggregation, and WARNING-level logging to the
health_checkHTTP probes, addressing bounty issue #15.Changes
tools/health_check.py:CircuitBreakerclass: opens after N consecutive failures, resets to a half-open trial state after a cooldown period.check_http_service_with_retry()with exponential backoff (delay = base_delay * backoff_factor ** attempt), wired intorun_health_checksfor HTTP service probes.--max-retries,--backoff-factor,--backoff-base-delay,--circuit-threshold,--circuit-cooldown. Defaults preserve the existing single-attempt behavior.summaryblock in the results (total/ok/warning/critical/circuit_open), andlogger.warningentries for WARNING/CRITICAL checks.tools/test_health_check_retry_circuit.py: 8 unit tests covering retry-until-success, exponential backoff delay sequence, no-retry-on-OK, open-breaker short-circuit, threshold + cooldown reset, and summary aggregation with warning logging.Testing
python3 tools/test_health_check_retry_circuit.py-> 8 tests pass.python3 build.py-> diagnostic artifacts generated (frontend, frailbox, and engine build; remaining modules fail only due to missing toolchains in this build environment).diagnostic/build-d5fa7f48.logd.Checklist
Closes #15. Please let me know the process for claiming the $35 (LT) bounty once this is merged.
Bounty payout address (USDC on Base):
0x1d03bef9aa379ed932cece384f9f1e72095d5b9eSummary by CodeRabbit
New Features
Tests