Add retry/backoff and circuit breaker to health checks #17
zhangsiqiang519 wants to merge 1 commit into
📝 Walkthrough
Changes
- HTTP Probe Retry and Circuit Breaker
- Diagnostic Build Artifact
Sequence Diagram(s)
sequenceDiagram
    participant main
    participant run_health_checks
    participant get_circuit_breaker
    participant CircuitBreaker
    participant check_http_service
    participant perform_http_request
    participant summarize_results
    main->>run_health_checks: max_retries, backoff_factor, circuit_threshold, circuit_cooldown
    run_health_checks->>get_circuit_breaker: endpoint_key(host, port, path)
    get_circuit_breaker-->>run_health_checks: CircuitBreaker instance
    run_health_checks->>check_http_service: ..., circuit_breaker=CB
    check_http_service->>CircuitBreaker: allow_request()
    alt circuit open
        CircuitBreaker-->>check_http_service: False → CRITICAL "circuit open"
    else circuit closed
        loop attempt in range(max_retries+1)
            check_http_service->>perform_http_request: host, port, path, timeout
            alt response received
                perform_http_request-->>check_http_service: status code + body
                check_http_service->>CircuitBreaker: record_success()
                check_http_service-->>run_health_checks: OK/WARNING + detail
            else timeout/error
                check_http_service->>CircuitBreaker: record_failure()
                check_http_service->>check_http_service: sleep(retry_delay(attempt))
            end
        end
        check_http_service-->>run_health_checks: CRITICAL + detail
    end
    run_health_checks->>summarize_results: results dict
    summarize_results-->>run_health_checks: {ok, warning, critical, open_circuits}
    run_health_checks-->>main: results + summary
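For readers less familiar with the pattern, the breaker lifecycle in the diagram can be sketched roughly as follows. This is a minimal illustration and not the code in tools/health_check.py. The names allow_request, record_success, record_failure, get_circuit_breaker, and endpoint_key come from the diagram; the constructor arguments, the key format, the injectable clock, and the behavior after the cooldown are all assumptions.

```python
import time
from typing import Callable, Dict, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int, cooldown: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock  # injectable so tests need not wait
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, requests are allowed again;
        # a further failure re-opens the circuit with a new timestamp.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()


# One breaker per endpoint, mirroring get_circuit_breaker in the diagram.
_BREAKERS: Dict[str, CircuitBreaker] = {}


def endpoint_key(host: str, port: int, path: str) -> str:
    # Assumed key format; the real one is not shown in this review.
    return f"{host}:{port}{path}"


def get_circuit_breaker(key: str, threshold: int, cooldown: float) -> CircuitBreaker:
    if key not in _BREAKERS:
        _BREAKERS[key] = CircuitBreaker(threshold, cooldown)
    return _BREAKERS[key]
```

Because the registry is module-level state, every caller that uses the same key shares one breaker, which is also why tests need to reset it between cases.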
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@diagnostic/build-94e0fb08.json`:
- Around line 26-27: The diagnostic JSON file contains absolute file paths in
the artifact field (and potentially in build.log and build-summary.txt fields)
that expose the builder's username and local workspace structure. Locate the
code responsible for generating this diagnostic JSON file and modify it to
convert these absolute paths to repository-relative paths before serializing and
writing the JSON. The conversion should strip the machine-specific prefix and
maintain only the path relative to the repository root, since these fields are
used for display and logging purposes only. Apply this path conversion
consistently to the artifact field on lines 26, 47, and 54, as well as to any
build.log and build-summary.txt fields.
- Around line 9-10: Remove the plaintext password value from the "password"
field in the diagnostic/build-94e0fb08.json metadata file and redact or remove
the embedded password from the "decrypt_command" field to prevent the secret
from persisting in git history and being visible to reviewers. Additionally,
review and replace any absolute filesystem paths (such as those containing
usernames like "/Users/zhangsiqiang/...") with relative paths or truncated
versions that do not expose system-specific details. Coordinate with maintainers
to modify the CI validation workflow to pass decryption credentials through
GitHub encrypted secrets or environment variables instead of committing them in
the JSON metadata.
In `@tests/test_health_check_retry_circuit.py`:
- Around line 13-14: The HealthCheckRetryCircuitTest class does not isolate its
tests from the global HTTP_CIRCUIT_BREAKERS state that is shared across test
runs. Add a setUp method and/or tearDown method to the
HealthCheckRetryCircuitTest class to clear the HTTP_CIRCUIT_BREAKERS global
state before and/or after each test method executes. This ensures that each test
method in the class starts with a clean circuit breaker state, making test
outcomes independent of execution order when tests are reordered or sharded.
In `@tools/health_check.py`:
- Around line 213-230: The circuit breaker's protection is not being enforced
during retries because the code continues to sleep and retry even after
record_failure() opens the breaker. After each call to breaker.record_failure()
in both the HTTP status check and exception handler blocks, check if the breaker
has opened using an appropriate method (likely something like breaker.is_open()
or similar). If the breaker is open, immediately return with a CRITICAL status
and an appropriate error message instead of allowing the code to continue to the
delay and sleep logic, ensuring the breaker stops the retry attempts once it
opens.
- Around line 483-496: The current if/else logic in the status rendering code
creates a mutual exclusion where parent service checks with a "status" field are
printed but their nested sub-checks are skipped entirely. To fix this,
restructure the code to check for and print the parent status first (when the
check is a dict with "status"), then independently iterate through any nested
sub-checks to print them as well. Replace the else branch that only handles
nested checks with logic that always processes nested items (sub_checks)
regardless of whether the parent check has a status field, ensuring that both
parent statuses and nested sub-checks like certificates are rendered in the
output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d0f1d543-98ca-4580-8224-5fa6bd17f3cf
📒 Files selected for processing (5)
- diagnostic/build-94e0fb08.json
- diagnostic/build-94e0fb08.logd
- docs/OPERATIONS.md
- tests/test_health_check_retry_circuit.py
- tools/health_check.py
| "password": "efcd3c4a127e4b6e294d", | ||
| "decrypt_command": "encryptly unpack diagnostic/build-94e0fb08.logd <outdir> --password efcd3c4a127e4b6e294d", |
🧩 Analysis chain
🏁 Script executed:
# Search for the diagnostic file and related build configuration
git ls-files | grep -E "(diagnostic|build)" | head -20
Repository: thanhle74/kickama
Length of output: 267
🏁 Script executed:
# Check if the diagnostic directory exists and list its contents
ls -la diagnostic/ 2>/dev/null || echo "diagnostic directory not found"
Repository: thanhle74/kickama
Length of output: 482
🏁 Script executed:
# Search for build.py or similar build configuration files
fd "build\.py" -o fd "\.github" -type d
Repository: thanhle74/kickama
Length of output: 185
🏁 Script executed:
# Read the diagnostic JSON file to verify the password presence
cat diagnostic/build-94e0fb08.json
Repository: thanhle74/kickama
Length of output: 5120
🏁 Script executed:
# Read build.py to understand how the artifact is created
cat -n build.py | head -100
Repository: thanhle74/kickama
Length of output: 3742
🏁 Script executed:
# Check the CI workflow configuration
cat .github/workflows/diagnostic-build-log.yml
Repository: thanhle74/kickama
Length of output: 10279
🏁 Script executed:
# Read more of build.py to understand password generation and encryption
wc -l build.py && tail -n +100 build.py | head -200
Repository: thanhle74/kickama
Length of output: 6347
🏁 Script executed:
# Search for password generation logic
rg "password" build.py -B 2 -A 2
Repository: thanhle74/kickama
Length of output: 1438
🏁 Script executed:
# Check for any environment variable handling for secrets
rg "environ|getenv|secret|token" build.py -i
Repository: thanhle74/kickama
Length of output: 212
Remove plaintext password from committed diagnostic metadata before merging.
The password and decrypt_command embed a bearer secret in the PR that will be visible to all reviewers and persist in git history until artifacts are removed. Although the PR note correctly states these diagnostic artifacts are temporary, the password is required by the CI validation workflow and cannot be omitted from the JSON without redesigning the CI contract to pass secrets differently (e.g., via GitHub encrypted secrets or environment variables). Verify with maintainers whether the CI workflow can be modified to handle decryption credentials outside of committed metadata, or ensure these artifacts are removed immediately after merge.
Additionally, artifact paths expose the builder's absolute filesystem layout (e.g., /Users/zhangsiqiang/Documents/Codex/...). Consider using relative paths or truncating paths to remove usernames and workspace details.
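One way to keep the secret out of committed metadata is to commit only the name of an environment variable and resolve the value at run time. The sketch below is hypothetical throughout: the function names, the password_env field, and the DIAGNOSTIC_LOG_PASSWORD variable are illustrative assumptions, and nothing here reflects how build.py or the CI workflow actually works.

```python
import os
from typing import Dict, Mapping


def build_metadata(artifact_name: str) -> Dict[str, str]:
    """Metadata that is safe to commit: it names the env var, not the secret."""
    return {
        "artifact": artifact_name,
        "password_env": "DIAGNOSTIC_LOG_PASSWORD",  # hypothetical variable name
    }


def resolve_password(metadata: Mapping[str, str],
                     env: Mapping[str, str] = os.environ) -> str:
    """Look the secret up at run time; fail loudly if CI did not provide it."""
    name = metadata["password_env"]
    try:
        return env[name]
    except KeyError:
        raise RuntimeError(f"environment variable {name} is not set") from None
```

In a GitHub Actions job the variable would typically be populated from an encrypted repository secret, so the value never appears in the diff or in git history.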
🧰 Tools
🪛 Betterleaks (1.5.0)
[high] 9-9: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
Source: Linters/SAST tools
| "artifact": "/Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/thanhle-kickama-15-health-retry/frontend/dist", | ||
| "output": "> tent-frontend@0.0.0 build\n> tsc -b && vite build\n\nvite v6.4.3 building for production...\ntransforming...\n\u2713 100 modules transformed.\nrendering chunks...\ncomputing gzip size...\ndist/index.html 0.62 kB \u2502 gzip: 0.34 kB\ndist/assets/state-BkjSKDbY.js 8.91 kB \u2502 gzip: 3.55 kB \u2502 map: 57.15 kB\ndist/assets/vendor-CREcWLHI.js 48.93 kB \u2502 gzip: 17.22 kB \u2502 map: 481.27 kB\ndist/assets/index-CyxcoTyU.js 231.32 kB \u2502 gzip: 72.02 kB \u2502 map: 1,044.42 kB\n\u2713 built in 645ms" |
🧩 Analysis chain
🏁 Script executed:
cd /tmp/repository 2>/dev/null || cd . && find . -name "diagnostic" -type d | head -5
Repository: thanhle74/kickama
Length of output: 74
🏁 Script executed:
find . -name "build-94e0fb08.json" -type f
Repository: thanhle74/kickama
Length of output: 94
🏁 Script executed:
cat -n diagnostic/build-94e0fb08.json 2>/dev/null | head -60
Repository: thanhle74/kickama
Length of output: 4600
🏁 Script executed:
grep -r "artifact" --include="*.py" --include="*.js" --include="*.ts" --include="*.json" --include="*.sh" . 2>/dev/null | grep -v "\.git" | grep -v "node_modules" | head -20
Repository: thanhle74/kickama
Length of output: 1884
🏁 Script executed:
grep -r "build-94e0fb08" . 2>/dev/null | grep -v "\.git"
Repository: thanhle74/kickama
Length of output: 601
🏁 Script executed:
ls -la diagnostic/ | head -20
Repository: thanhle74/kickama
Length of output: 482
🏁 Script executed:
grep -B 5 -A 5 '"artifact": binary' build.py
Repository: thanhle74/kickama
Length of output: 419
🏁 Script executed:
grep -B 10 -A 10 'def.*artifact' build.py | head -50
Repository: thanhle74/kickama
Length of output: 847
🏁 Script executed:
cat build.py | head -100
Repository: thanhle74/kickama
Length of output: 3042
🏁 Script executed:
grep -B 20 "artifact.*binary" build.py | head -60
Repository: thanhle74/kickama
Length of output: 2406
🏁 Script executed:
grep -A 5 "def build_module" build.py
Repository: thanhle74/kickama
Length of output: 184
🏁 Script executed:
grep -B 5 -A 15 "for name, success, elapsed, output, binary in results" build.py
Repository: thanhle74/kickama
Length of output: 2755
🏁 Script executed:
grep -B 30 "def build_module" build.py | tail -40
Repository: thanhle74/kickama
Length of output: 777
🏁 Script executed:
grep -A 50 "def build_module" build.py | head -60
Repository: thanhle74/kickama
Length of output: 1925
🏁 Script executed:
rg "binary" build.py -B 2 -A 2
Repository: thanhle74/kickama
Length of output: 2752
🏁 Script executed:
grep -A 15 "def verify_binary" build.py
Repository: thanhle74/kickama
Length of output: 540
Convert artifact paths to repo-relative format to prevent exposing builder username and workspace layout.
The absolute paths (e.g., /Users/zhangsiqiang/Documents/Codex/2026-06-19/new-chat/outputs/...) appear at lines 26, 47, and 54 in this diagnostic file that is committed to the repository. Since the artifact field serves only for display and logging purposes—not for actual file references—convert these to paths relative to the repository root before serializing to JSON (and consider the same for build.log and build-summary.txt). This has minimal implementation cost and eliminates private information from the committed diagnostic record.
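The conversion described above could look something like this. It is a sketch under assumptions: to_repo_relative is a hypothetical helper name, and falling back to the basename for paths outside the repository is one possible policy, not something build.py is known to do.

```python
import os


def to_repo_relative(path: str, repo_root: str) -> str:
    """Return `path` relative to `repo_root`, or just its basename if outside."""
    abs_path = os.path.abspath(path)
    abs_root = os.path.abspath(repo_root)
    try:
        inside = os.path.commonpath([abs_path, abs_root]) == abs_root
    except ValueError:  # e.g. paths on different drives on Windows
        inside = False
    if inside:
        return os.path.relpath(abs_path, abs_root)
    # Outside the repository: keep only the final component so the
    # builder's directory layout and username are not recorded.
    return os.path.basename(abs_path)
```

Applying such a helper at the point where the artifact field is assembled, just before the JSON is serialized, would keep the on-disk build logic unchanged while cleaning the committed record.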
class HealthCheckRetryCircuitTest(unittest.TestCase):
    def test_http_probe_retries_until_success(self) -> None:
Isolate tests from global circuit-breaker state.
check_http_service reuses HTTP_CIRCUIT_BREAKERS, but this test class never clears it. That makes outcomes order-dependent when tests are reordered or sharded.
Proposed fix
 class HealthCheckRetryCircuitTest(unittest.TestCase):
+    def setUp(self) -> None:
+        health_check.HTTP_CIRCUIT_BREAKERS.clear()
+
     def test_http_probe_retries_until_success(self) -> None:
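To make the isolation pattern concrete outside this repository, here is a self-contained toy version. The HTTP_CIRCUIT_BREAKERS name is taken from the comment above, but the dict of counters and the record_failure helper are stand-ins invented for this illustration; the real registry holds breaker objects.

```python
import unittest
from typing import Dict

# Stand-in for the module-level registry named in the review comment.
HTTP_CIRCUIT_BREAKERS: Dict[str, int] = {}


def record_failure(key: str) -> int:
    """Toy stand-in for a probe: counts failures per endpoint key."""
    HTTP_CIRCUIT_BREAKERS[key] = HTTP_CIRCUIT_BREAKERS.get(key, 0) + 1
    return HTTP_CIRCUIT_BREAKERS[key]


class IsolatedRegistryTest(unittest.TestCase):
    def setUp(self) -> None:
        # Start every test from an empty registry.
        HTTP_CIRCUIT_BREAKERS.clear()
        # Also clear afterwards so later test classes start clean.
        self.addCleanup(HTTP_CIRCUIT_BREAKERS.clear)

    def test_first_failure_counts_one(self) -> None:
        self.assertEqual(record_failure("svc"), 1)

    def test_other_test_sees_fresh_state(self) -> None:
        # Passes in any execution order because setUp reset the registry.
        self.assertEqual(record_failure("svc"), 1)
```

Without the setUp reset, whichever of the two tests ran second would see a count of 2 and fail, which is the order dependence the comment warns about.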
    breaker.record_failure()
    last_error = f"HTTP {status}: {body[:100]}"
    if attempt == attempts_allowed - 1:
        return "CRITICAL", f"{last_error} after {attempts} attempts", status
except Exception as e:
    breaker.record_failure()
    last_error = str(e)
    if attempt == attempts_allowed - 1:
        return "CRITICAL", f"{last_error} after {attempts} attempts", 0

delay = retry_delay(backoff_base_delay, backoff_factor, attempt)
LOGGER.warning(
    "HTTP health probe failed for %s; retrying in %.2fs",
    endpoint_key(host, port, path),
    delay,
)
if delay > 0:
    sleep(delay)
Circuit breaker is not enforced once it opens during retries.
After record_failure() opens the breaker, this loop still sleeps and keeps retrying until attempts_allowed is exhausted. That defeats the breaker’s protection goal and can continue hammering a failing endpoint.
Proposed fix
 try:
     status, body = request_once(host, port, path, timeout)
     if status < 500:
         breaker.record_success()
         return classify_http_status(status, body, attempts)
     breaker.record_failure()
     last_error = f"HTTP {status}: {body[:100]}"
+    if not breaker.allow_request():
+        LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
+        return "CRITICAL", f"Circuit breaker open after {attempts} attempts", status
     if attempt == attempts_allowed - 1:
         return "CRITICAL", f"{last_error} after {attempts} attempts", status
 except Exception as e:
     breaker.record_failure()
     last_error = str(e)
+    if not breaker.allow_request():
+        LOGGER.warning("HTTP health circuit open for %s", endpoint_key(host, port, path))
+        return "CRITICAL", f"Circuit breaker open after {attempts} attempts", 0
     if attempt == attempts_allowed - 1:
         return "CRITICAL", f"{last_error} after {attempts} attempts", 0
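The intended behavior, stopping as soon as the breaker opens rather than exhausting the retry budget, can be exercised in isolation with a sketch like the one below. Everything in it is a simplified stand-in: CountingBreaker, probe_with_retries, the zero-argument request_once callable, and the exponential retry_delay formula are assumptions made for illustration, since the real helpers in tools/health_check.py are not shown in full here.

```python
from typing import Callable, Tuple


class CountingBreaker:
    """Hypothetical breaker that opens after `threshold` failures."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.failures = 0

    def allow_request(self) -> bool:
        return self.failures < self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


def retry_delay(base: float, factor: float, attempt: int) -> float:
    # Assumed formula (exponential backoff); the PR's real one is not shown.
    return base * (factor ** attempt)


def probe_with_retries(
    request_once: Callable[[], int],
    breaker: CountingBreaker,
    max_retries: int,
    base: float = 0.1,
    factor: float = 2.0,
    sleep: Callable[[float], None] = lambda _seconds: None,
) -> Tuple[str, str]:
    attempts_allowed = max_retries + 1
    for attempt in range(attempts_allowed):
        attempts = attempt + 1
        status = request_once()
        if status < 500:
            breaker.record_success()
            return "OK", f"HTTP {status} after {attempts} attempts"
        breaker.record_failure()
        # Stop immediately once the breaker has opened.
        if not breaker.allow_request():
            return "CRITICAL", f"Circuit breaker open after {attempts} attempts"
        if attempt == attempts_allowed - 1:
            return "CRITICAL", f"HTTP {status} after {attempts} attempts"
        sleep(retry_delay(base, factor, attempt))
    # Only reachable when max_retries is negative.
    return "CRITICAL", "no attempts made"
```

The default sleep is a no-op so the sketch runs instantly; a real caller would pass time.sleep. One design point worth checking in the actual CircuitBreaker is whether allow_request() has side effects, for example consuming a trial slot after the cooldown. If it does, a read-only check would be a safer way to ask whether the circuit has opened.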
🧰 Tools
🪛 Ruff (0.15.17)
[warning] 217-217: Do not catch blind exception: Exception
(BLE001)
 if isinstance(check, dict) and "status" in check:
-    status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(check["status"], "?")
+    status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+        check["status"], "?"
+    )
     print(f" {status_icon} {name}: {check['detail']}")
 else:
     print(f" {name}:")
     for sub_name, sub_check in check.items():
         if isinstance(sub_check, dict) and "status" in sub_check:
-            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(sub_check["status"], "?")
+            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+                sub_check["status"], "?"
+            )
             print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
 print()
Nested service checks (e.g., certificate) are no longer rendered.
For service entries, the "status" branch prints and exits, so nested sub-checks like "certificate" are never shown. Critical cert failures can be hidden from text output.
Proposed fix
 if isinstance(check, dict) and "status" in check:
     status_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
         check["status"], "?"
     )
     print(f" {status_icon} {name}: {check['detail']}")
+    for sub_name, sub_check in check.items():
+        if sub_name in {"status", "detail", "code", "attempts_allowed", "circuit", "endpoint"}:
+            continue
+        if isinstance(sub_check, dict) and "status" in sub_check:
+            sub_icon = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}.get(
+                sub_check["status"], "?"
+            )
+            print(f" {sub_icon} {sub_name}: {sub_check['detail']}")
 else:
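The same idea can be expressed as a single helper that always walks nested entries, which also removes the duplicated icon lookup. This is an illustrative sketch only: render_check, ICONS, and the skip_keys parameter are hypothetical names, the helper returns lines instead of printing so it is easy to test, and the two-level indentation in the output is an assumption.

```python
from typing import AbstractSet, Any, Dict, List

ICONS = {"OK": "✓", "WARNING": "⚠", "CRITICAL": "✗"}


def render_check(name: str, check: Dict[str, Any],
                 skip_keys: AbstractSet[str] = frozenset()) -> List[str]:
    """Return output lines for one check: parent status first, then sub-checks."""
    lines: List[str] = []
    if "status" in check:
        icon = ICONS.get(check["status"], "?")
        lines.append(f"  {icon} {name}: {check['detail']}")
    else:
        lines.append(f"  {name}:")
    # Always walk nested entries. Scalar fields such as "detail" drop out
    # naturally because only dicts carrying their own "status" are rendered.
    for sub_name, sub_check in check.items():
        if sub_name in skip_keys:
            continue
        if isinstance(sub_check, dict) and "status" in sub_check:
            sub_icon = ICONS.get(sub_check["status"], "?")
            lines.append(f"    {sub_icon} {sub_name}: {sub_check['detail']}")
    return lines
```

The skip_keys parameter plays the role of the explicit key set in the proposed fix: it matters only if a metadata field such as "circuit" is itself a dict with a "status" key and should not be shown as a sub-check.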
Summary
Adds retry/backoff and circuit-breaker support to HTTP health probes in tools/health_check.py, plus result summary counts for operators.
Changes
- --max-retries, --backoff-factor, --circuit-threshold, and --circuit-cooldown CLI flags.
- docs/OPERATIONS.md.
Testing
- python3 -m unittest tests/test_health_check_retry_circuit.py (6 tests)
- python3 -m py_compile tools/health_check.py tests/test_health_check_retry_circuit.py
- python3 tools/health_check.py --service backend --max-retries 1 --backoff-factor 0 --circuit-threshold 2 --json (smoke run exits 0 and reports degraded local backend)
- git diff --check
- python3 build.py generated real diagnostics. It exited 1 because this runner is missing cargo, go, luac, and ghc, and the existing frailbox C build fails on this platform. Frontend, engine, compliance, and v2 market stream passed.
Diagnostic artifacts:
- diagnostic/build-94e0fb08.logd
- diagnostic/build-94e0fb08.json
Decrypt command:
Checklist
Closes #15
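For orientation, the four flags named in the description could be declared with argparse along these lines. Only the flag names come from the PR; the types, the default values, and the validation are assumptions, and other options used in the smoke run (--service, --json) are left out of this sketch.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="HTTP health checks (sketch)")
    # Flag names come from the PR description; defaults and types are assumed.
    parser.add_argument("--max-retries", type=int, default=2)
    parser.add_argument("--backoff-factor", type=float, default=2.0)
    parser.add_argument("--circuit-threshold", type=int, default=3)
    parser.add_argument("--circuit-cooldown", type=float, default=30.0)
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # Hypothetical sanity check: a negative retry count makes no sense.
    if args.max_retries < 0:
        raise SystemExit("--max-retries must be >= 0")
    return args
```

argparse converts the dashes in each flag to underscores, so the parsed values are read as args.max_retries, args.backoff_factor, args.circuit_threshold, and args.circuit_cooldown.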
Summary by CodeRabbit
New Features
Documentation
Tests