Add health check retry circuit breaker #16

Open
9904099 wants to merge 3 commits into thanhle74:main from 9904099:codex/health-retry-circuit-15

Conversation

9904099 commented Jun 20, 2026

Summary

Fixes #15.

Adds opt-in retry, exponential backoff, circuit breaker protection, and health-check summary aggregation for HTTP service probes in tools/health_check.py. The default remains the legacy single-attempt behavior, so existing probe usage is unchanged unless operators pass the new flags.

Changes

  • Added --max-retries, --backoff-factor, --retry-base-delay, --circuit-threshold, and --circuit-cooldown flags.
  • Added retry handling for CRITICAL HTTP probe failures with exponential backoff.
  • Added a per-service circuit breaker for HTTP probes, including watch-mode state preservation.
  • Added WARNING-level logs for degraded retry and circuit-open paths.
  • Added summary aggregation of OK / WARNING / CRITICAL counts, per category and in total, to the text and JSON output.
  • Added tools/test_health_check_retry_circuit.py with deterministic unit tests for retry, backoff, circuit breaker behavior, cooldown reset, and summary counts.
  • Documented the health-check retry and circuit breaker workflow in docs/OPERATIONS.md.
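The summary aggregation listed above can be pictured with a minimal sketch. The function name `summarize_by_status` and the result shape (a list of dicts with `category` and `status` keys) are assumptions made for illustration; the PR's actual `summarize_results` may use a different structure.

```python
from collections import Counter
from typing import Dict, List

STATUSES = ("OK", "WARNING", "CRITICAL")


def summarize_by_status(results: List[dict]) -> Dict[str, dict]:
    """Count statuses in total and per category (hypothetical result shape)."""
    total: Counter = Counter()
    by_category: Dict[str, Counter] = {}
    for item in results:
        status = item["status"]
        # "uncategorized" is an assumed fallback, not taken from the PR.
        category = item.get("category", "uncategorized")
        total[status] += 1
        by_category.setdefault(category, Counter())[status] += 1
    return {
        # Always emit all three keys so consumers see explicit zeros.
        "total": {s: total.get(s, 0) for s in STATUSES},
        "by_category": {
            cat: {s: counts.get(s, 0) for s in STATUSES}
            for cat, counts in by_category.items()
        },
    }
```

Emitting explicit zero counts keeps the JSON output stable for dashboards that expect all three status keys.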

Testing

Ran locally:

python3 -m py_compile tools/health_check.py tools/test_health_check_retry_circuit.py
python3 tools/test_health_check_retry_circuit.py -v
python3 tools/health_check.py --service backend --json --max-retries 1 --retry-base-delay 0 --circuit-threshold 2
git diff --check HEAD~2..HEAD
python3 build.py

Results:

  • py_compile: passed
  • unittest: 6 tests passed
  • CLI JSON smoke: passed, returned overall_status: OK in this local environment
  • git diff --check: passed
  • python3 build.py: 10/10 modules passed

Diagnostic artifacts committed in this PR:

  • diagnostic/build-9cd9aec9.json
  • diagnostic/build-9cd9aec9-part001.logd
  • diagnostic/build-9cd9aec9-part002.logd
  • diagnostic/build-9cd9aec9-part003.logd
  • diagnostic/build-9cd9aec9-part004.logd

Diagnostic password:

c839929b1a3e67df65df

Reassemble command from the build output:

cat diagnostic/build-9cd9aec9-part001.logd diagnostic/build-9cd9aec9-part002.logd diagnostic/build-9cd9aec9-part003.logd diagnostic/build-9cd9aec9-part004.logd > diagnostic/build-9cd9aec9.logd
encryptly unpack diagnostic/build-9cd9aec9.logd <outdir> --password c839929b1a3e67df65df

Checklist

  • Relevant modules affected by these changes build locally
  • Tests pass locally
  • Diagnostic build log is committed in this PR
  • Documentation has been updated, if applicable
  • Configuration or schema changes are documented, if applicable
  • No generated build artifacts are committed, except the required diagnostic build log
  • Changes are scoped to the PR purpose and avoid unrelated cleanup
  • Security, privacy, and error-handling implications have been considered

  • I would like to request that my diagnostic build log is removed before merging

Summary by CodeRabbit

  • New Features

    • Health check tool now supports configurable retry mechanisms and automatic failure recovery to improve service monitoring reliability under transient conditions.
  • Documentation

    • Added documentation for new health check configuration options, including retry parameters and failure recovery settings with practical examples.

coderabbitai Bot commented Jun 20, 2026

Review Change Stack

Warning

Review limit reached

@9904099, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 39 minutes and 44 seconds. Learn how PR review limits work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a253e22-7d3b-454e-b2f0-602c64b3fa45

📥 Commits

Reviewing files that changed from the base of the PR and between 6d4547d and 86a360d.

📒 Files selected for processing (2)
  • tools/health_check.py
  • tools/test_health_check_retry_circuit.py
📝 Walkthrough

Adds retry/backoff and circuit-breaker resiliency to health_check.py HTTP probes via a new CircuitBreaker dataclass and probe_http_once helper. Extends check_http_service with configurable retries and exponential backoff, adds summarize_results aggregation, exposes five new CLI flags, updates print_health_report, adds a full unittest suite, documents the feature in OPERATIONS.md, and includes a build diagnostic JSON artifact.

Changes

Health Check Retry, Backoff & Circuit Breaker

| Layer / File(s) | Summary |
|---|---|
| CircuitBreaker dataclass and probe_http_once helper (tools/health_check.py) | Reorganizes http.client import to module scope, introduces CircuitBreaker dataclass with is_open, record_success, and record_failure methods (cooldown-based), and adds probe_http_once single-attempt helper with an injectable connection_factory. |
| check_http_service retry/backoff/circuit-breaker logic (tools/health_check.py) | Replaces single-shot check_http_service with a retry loop using exponential backoff, circuit-open skipping, per-attempt probe_http_once calls, success/failure recording, and detail strings reflecting attempt counts and circuit state. |
| Aggregation, reporting, and CLI wiring (tools/health_check.py) | Adds summarize_results for per-category and total status counts; extends run_health_checks with retry/circuit parameters and a persistent circuit_breakers map; attaches summary to results; updates print_health_report to display it; wires --max-retries, --backoff-factor, --retry-base-delay, --circuit-threshold, --circuit-cooldown through both watch-mode and single-shot paths. |
| Unit tests for retry, backoff, circuit breaker, and aggregation (tools/test_health_check_retry_circuit.py) | Introduces FakeResponse and SequencedConnection test doubles; verifies retry-until-success with expected delay list, exponential backoff formula, circuit opening after N failures, probe skipping during cooldown, circuit reset after cooldown, and summarize_results aggregation correctness. |
| OPERATIONS.md docs and build diagnostic artifact (docs/OPERATIONS.md, diagnostic/build-9cd9aec9.json) | Adds a Monitoring subsection documenting retry/circuit-breaker CLI flags, the delay formula, and watch-mode invocation examples. Includes the build-9cd9aec9.json diagnostic artifact with per-module build results and encrypted logd references for PR validation. |
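To make the cooldown-based breaker described in the walkthrough concrete, here is a minimal sketch. Only the method names `is_open`, `record_success`, and `record_failure` come from the review; the field names, default values, and the injectable `clock` are assumptions, and the PR's dataclass may differ.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CircuitBreaker:
    threshold: int = 3            # assumed default: failures before opening
    cooldown: float = 30.0        # assumed default: seconds the circuit stays open
    clock: Callable[[], float] = time.monotonic  # injectable for deterministic tests
    failures: int = 0
    opened_at: Optional[float] = None

    def is_open(self) -> bool:
        """Return True while the circuit is open and the cooldown has not elapsed."""
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and start counting afresh.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Passing a fake clock lets tests step time forward without sleeping, which is one way to get the deterministic cooldown-reset test the PR describes.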

Sequence Diagram(s)

sequenceDiagram
    participant CLI as main() / CLI
    participant RHC as run_health_checks
    participant CHS as check_http_service
    participant CB as CircuitBreaker
    participant PHO as probe_http_once

    CLI->>RHC: max_retries, backoff_factor, circuit_breakers, sleep_func
    RHC->>CB: init (threshold, cooldown) per service
    RHC->>CHS: host, port, path, timeout, retries, backoff, circuit_breaker
    CHS->>CB: is_open()
    alt open
        CB-->>CHS: True → CRITICAL "Circuit open"
    else closed
        loop attempt 0..max_retries
            CHS->>PHO: host, port, path, timeout, connection_factory
            PHO-->>CHS: status, detail, code
            alt success
                CHS->>CB: record_success()
                CHS-->>RHC: OK/WARNING result
            else failure
                CHS->>CB: record_failure()
                CHS->>CHS: sleep(backoff delay)
            end
        end
    end
    RHC->>RHC: summarize_results(results)
    RHC-->>CLI: results with ["summary"]
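The control flow in the sequence diagram can be sketched roughly as follows. The helper name `check_with_retries`, its parameters, and the choice to treat any non-CRITICAL result as a success are assumptions for illustration; this is not the PR's `check_http_service`.

```python
import time
from typing import Callable, Tuple

ProbeResult = Tuple[str, str, int]  # (status, detail, HTTP code)


def check_with_retries(
    probe: Callable[[], ProbeResult],
    max_retries: int = 0,
    base_delay: float = 1.0,
    backoff_factor: float = 2.0,
    breaker=None,  # any object with is_open/record_success/record_failure
    sleep_func: Callable[[float], None] = time.sleep,
) -> ProbeResult:
    # Skip the probe entirely while the circuit is open.
    if breaker is not None and breaker.is_open():
        return "CRITICAL", "Circuit open", 0

    result: ProbeResult = ("CRITICAL", "no attempt made", 0)
    for attempt in range(max_retries + 1):
        result = probe()
        if result[0] != "CRITICAL":
            # Assumption: OK and WARNING both count as a successful probe.
            if breaker is not None:
                breaker.record_success()
            return result
        if breaker is not None:
            breaker.record_failure()
        if attempt < max_retries:
            # Exponential backoff: base_delay * backoff_factor ** attempt.
            sleep_func(base_delay * backoff_factor ** attempt)
    return result
```

With `max_retries=0` the loop runs exactly once and never sleeps, which matches the stated default of legacy single-attempt behavior.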

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #15 — This PR directly implements the bounty requirements: adds --max-retries, --backoff-factor, --retry-base-delay, --circuit-threshold, --circuit-cooldown CLI flags; implements exponential backoff (base_delay * backoff_factor ^ attempt); adds CircuitBreaker that opens after N consecutive failures and resets after cooldown; includes six new unit tests covering retry, backoff, and circuit-breaker scenarios; and provides the required diagnostic/build-9cd9aec9.json artifact.
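As a worked example of the delay formula quoted above (base_delay * backoff_factor ^ attempt), the following sketch lists the waits between attempts. The helper name `backoff_delays` is hypothetical, and it assumes the attempt index starts at 0, as the sequence diagram suggests.

```python
from typing import List


def backoff_delays(base_delay: float, backoff_factor: float, max_retries: int) -> List[float]:
    """Delay before each retry, with the attempt index starting at 0 (assumed)."""
    return [base_delay * backoff_factor ** attempt for attempt in range(max_retries)]


print(backoff_delays(0.5, 2.0, 4))  # [0.5, 1.0, 2.0, 4.0]
```

Setting the base delay to 0, as the smoke-test command in this PR does with --retry-base-delay 0, makes every delay zero regardless of the factor.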

Poem

🐇 Hop, hop, retry — the circuit may open wide,
But cooldown expires and failures subside.
Backoff grows exponential, patient as dew,
The breaker resets and the probe makes it through.
With summaries tallied and watch-mode alive,
This rabbit checks health — and keeps services thriving! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding health check retry and circuit breaker functionality to the health_check module. |
| Description check | ✅ Passed | The description comprehensively covers all required template sections with specific details about changes, testing validation, and diagnostic artifacts. |
| Linked Issues check | ✅ Passed | All acceptance criteria from issue #15 are met: retry/backoff/circuit-breaker flags implemented, exponential backoff formula applied, circuit breaker with cooldown, build passes, 6 unit tests included, and diagnostic artifacts committed. |
| Out of Scope Changes check | ✅ Passed | All changes directly relate to implementing retry, backoff, and circuit breaker functionality for HTTP probes as specified in issue #15; no unrelated modifications detected. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tools/health_check.py (1)

105-132: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Close HTTP connections on all probe outcomes

conn.close() only runs on the success path. If request()/getresponse()/read() raises, the socket can leak across retries/watch loops.

Suggested fix
 def probe_http_once(
@@
 ) -> Tuple[str, str, int]:
+    conn = None
     try:
-        conn = connection_factory(host, port, timeout=timeout)
+        conn = connection_factory(host, port, timeout=timeout)
         conn.request("GET", path)
         resp = conn.getresponse()
@@
-        conn.close()
@@
         return result, detail, status
     except Exception as e:
         return "CRITICAL", str(e), 0
+    finally:
+        if conn is not None:
+            try:
+                conn.close()
+            except Exception:
+                pass
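Applied in full, the suggested change might look like the sketch below. The parameter list follows the sequence diagram (host, port, path, timeout, connection_factory); the default values and the status mapping other than 4xx to WARNING are assumptions, since the diff elides that part of the function.

```python
import http.client
from typing import Callable, Tuple


def probe_http_once(
    host: str,
    port: int,
    path: str = "/",            # assumed default
    timeout: float = 5.0,       # assumed default
    connection_factory: Callable[..., http.client.HTTPConnection] = http.client.HTTPConnection,
) -> Tuple[str, str, int]:
    conn = None
    try:
        conn = connection_factory(host, port, timeout=timeout)
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        status = resp.status
        # Assumed mapping: 2xx/3xx -> OK, 4xx -> WARNING, everything else -> CRITICAL.
        if 200 <= status < 400:
            result = "OK"
        elif 400 <= status < 500:
            result = "WARNING"
        else:
            result = "CRITICAL"
        return result, f"HTTP {status}", status
    except Exception as exc:
        return "CRITICAL", str(exc), 0
    finally:
        # Runs on success, on early return, and on exceptions alike.
        if conn is not None:
            try:
                conn.close()
            except Exception:
                pass
```

Initializing `conn = None` before the `try` matters because `connection_factory` itself can raise, in which case there is nothing to close.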
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/health_check.py` around lines 105 - 132, In the probe_http_once
function, the conn.close() call is only executed on the success path, which
means if an exception occurs during conn.request(), conn.getresponse(), or
resp.read() calls, the socket connection leaks. Refactor the code to use a
try-finally block where conn.close() is placed in the finally block to ensure
the connection is always closed regardless of whether an exception occurs or the
function returns early with a result.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/health_check.py`:
- Around line 173-180: The circuit-breaker failure is being recorded for any
non-OK probe result, including WARNING status responses like 4xx errors, which
can incorrectly trigger the circuit to open. Move the breaker.record_failure()
call to only execute when the last_status is CRITICAL, ensuring that only actual
CRITICAL failures increment the failure count. This prevents WARNING responses
from accumulating failures and incorrectly opening the circuit breaker.

In `@tools/test_health_check_retry_circuit.py`:
- Around line 18-19: The mutable class attributes `outcomes` and `calls` are
missing explicit type annotations which triggers Ruff linting rule RUF012 and
risks test-state leakage. Import `ClassVar` from the typing module at the top of
the file, then annotate both `outcomes = []` and `calls = 0` with explicit
`ClassVar` type hints (e.g., `outcomes: ClassVar[list] = []` and `calls:
ClassVar[int] = 0`) to make the shared state intent explicit and prevent future
test coupling issues.

---

Outside diff comments:
In `@tools/health_check.py`:
- Around line 105-132: In the probe_http_once function, the conn.close() call is
only executed on the success path, which means if an exception occurs during
conn.request(), conn.getresponse(), or resp.read() calls, the socket connection
leaks. Refactor the code to use a try-finally block where conn.close() is placed
in the finally block to ensure the connection is always closed regardless of
whether an exception occurs or the function returns early with a result.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7a267f4-e4c0-4b71-b9a0-4575a7b63361

📥 Commits

Reviewing files that changed from the base of the PR and between 94e0fb0 and 6d4547d.

📒 Files selected for processing (8)
  • diagnostic/build-9cd9aec9-part001.logd
  • diagnostic/build-9cd9aec9-part002.logd
  • diagnostic/build-9cd9aec9-part003.logd
  • diagnostic/build-9cd9aec9-part004.logd
  • diagnostic/build-9cd9aec9.json
  • docs/OPERATIONS.md
  • tools/health_check.py
  • tools/test_health_check_retry_circuit.py

Comment thread tools/health_check.py Outdated
Comment thread tools/test_health_check_retry_circuit.py Outdated
9904099 commented Jun 20, 2026

Author

I addressed the actionable review items in commit 86a360d:

  • probe_http_once now closes the HTTP connection in a finally block, including exception paths.
  • The circuit breaker now records failures only for CRITICAL probe results, so 4xx WARNING responses do not open the circuit.
  • The test double class variables are annotated with ClassVar, and I added a regression test for the warning-response circuit behavior.
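For readers unfamiliar with the `ClassVar` annotation mentioned above, a test double along these lines would satisfy Ruff's RUF012. The class names `FakeResponse` and `SequencedConnection` come from the review walkthrough, but the bodies, the `reset` helper, and the outcome encoding (an int status code or an exception to raise) are assumptions, not the PR's test code.

```python
from typing import ClassVar, List, Union

Outcome = Union[int, Exception]  # HTTP status to return, or an exception to raise


class FakeResponse:
    def __init__(self, status: int) -> None:
        self.status = status

    def read(self) -> bytes:
        return b""


class SequencedConnection:
    # Shared across instances on purpose: each new connection consumes the next outcome.
    outcomes: ClassVar[List[Outcome]] = []
    calls: ClassVar[int] = 0

    def __init__(self, host: str, port: int, timeout: float = 0.0) -> None:
        self._status = 0

    @classmethod
    def reset(cls, outcomes: List[Outcome]) -> None:
        """Hypothetical helper: call at the start of each test to avoid state leakage."""
        cls.outcomes = list(outcomes)
        cls.calls = 0

    def request(self, method: str, path: str) -> None:
        cls = type(self)
        cls.calls += 1
        outcome = cls.outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        self._status = outcome

    def getresponse(self) -> FakeResponse:
        return FakeResponse(self._status)

    def close(self) -> None:
        pass
```

Because the state lives on the class, a test that forgets to reset it would inherit leftovers from the previous test; the annotation documents that sharing but does not prevent it.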

Validation rerun:

  • python3 -m py_compile tools/health_check.py tools/test_health_check_retry_circuit.py
  • python3 tools/test_health_check_retry_circuit.py -v (7 tests)
  • python3 tools/health_check.py --json --service backend --max-retries 1 --retry-base-delay 0 --circuit-threshold 2
  • git diff --check

Development

Successfully merging this pull request may close these issues.

[$35 BOUNTY] [Python] Add retry/backoff and circuit breaker to health_check HTTP probes

1 participant