Feat/sys diagnostics endpoint #1090
Open
madsysharma wants to merge 1 commit into
57d8002 to f829315
imDarshanGK
requested changes
Jun 21, 2026
imDarshanGK
left a comment
Owner
@madsysharma
Please add a video demo of the feature.
f829315 to a68c42b
Contributor
Author
Hi @imDarshanGK, I have added a video of the demo. Please review. Thank you.
05a6ae6 to 1625218
Summary
Adds a GET /diag system diagnostics endpoint that returns a minimal, non-sensitive JSON snapshot of process/system memory, CPU, and queue depth to support quick troubleshooting without shelling into a container.
Related Issue
Closes #628
What & why
The issue asks for a limited diagnostics endpoint exposing non-sensitive info (memory, CPU, queue depth), guarded behind admin auth or an IP allowlist, with minimal output that avoids leaking credentials. This PR implements exactly that, following the conventions already established by the existing operational endpoints (/metrics, /healthz/*).
Safety model
The endpoint is built to be safe by default:
- Served only when DIAG_ENABLED=true; otherwise it returns 404 so its existence is not advertised (same approach as /metrics).
- Returns 403 unless at least one access control is configured.
- X-Forwarded-For is ignored for the allowlist check unless DIAG_TRUST_FORWARDED_FOR=true (set only behind a trusted proxy).
- The token comparison is constant-time (hmac.compare_digest).
Config
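The access policy described under Safety model, with each DIAG_* flag read from the environment on every call, can be sketched roughly as follows. This is an illustration only, not the code in backend/app/routers/diagnostics.py: the function names (diag_status, _flag, _ip_allowed) are hypothetical, and it assumes that either a valid token or an allowlisted IP is sufficient on its own, and that a flag counts as set only when its value is "true".

```python
import hmac
import ipaddress
import os


def _flag(name: str) -> bool:
    """Read a boolean env flag at call time (assumption: only "true" enables it)."""
    return os.environ.get(name, "false").strip().lower() == "true"


def _ip_allowed(client_ip: str, allowlist: str) -> bool:
    """Return True if client_ip matches any comma-separated IP or CIDR entry."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False
    for entry in allowlist.split(","):
        entry = entry.strip()
        if not entry:
            continue
        try:
            if addr in ipaddress.ip_network(entry, strict=False):
                return True
        except ValueError:
            continue  # skip malformed entries instead of failing open
    return False


def diag_status(auth_header, peer_ip, forwarded_for=None) -> int:
    """Return the HTTP status the endpoint would answer with (hypothetical helper)."""
    if not _flag("DIAG_ENABLED"):
        return 404  # disabled: do not advertise the route
    token = os.environ.get("DIAG_AUTH_TOKEN", "")
    allowlist = os.environ.get("DIAG_IP_ALLOWLIST", "")
    if not token and not allowlist:
        return 403  # enabled, but no access control configured
    if token and auth_header:
        expected = f"Bearer {token}"
        # Constant-time comparison to avoid leaking the token through timing.
        if hmac.compare_digest(auth_header.encode(), expected.encode()):
            return 200
    if allowlist:
        client_ip = peer_ip
        # Only honour X-Forwarded-For when explicitly trusted.
        if forwarded_for and _flag("DIAG_TRUST_FORWARDED_FOR"):
            client_ip = forwarded_for.split(",")[0].strip()
        if _ip_allowed(client_ip, allowlist):
            return 200
    return 403
```

Because the environment is consulted inside the call rather than at import time, changing a flag takes effect on the next request, which matches the "no restart" behaviour described for the flags.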
All flags are read at request time (consistent with /metrics), so operators can flip them without a restart.
- DIAG_ENABLED (default false): the endpoint returns 404 while disabled.
- DIAG_AUTH_TOKEN: sending Authorization: Bearer <token> grants access.
- DIAG_IP_ALLOWLIST: comma-separated IPs/CIDRs allowed access (e.g. 10.0.0.0/8,127.0.0.1).
- DIAG_TRUST_FORWARDED_FOR (default false): trust X-Forwarded-For for the allowlist check.
Optional dependency
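One way an optional psutil import with a stdlib fallback could be structured is sketched below. It is a simplified illustration, not the PR's actual stat collector: collect_process_stats and _rss_from_proc_status are hypothetical names, and only a few of the response fields are populated.

```python
import os

try:
    import psutil  # optional dependency; richer stats when installed
except ImportError:
    psutil = None


def _rss_from_proc_status():
    """Read VmRSS (reported in kB) from /proc/self/status; None if unavailable."""
    try:
        with open("/proc/self/status", encoding="ascii") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) * 1024
    except (OSError, ValueError, IndexError):
        pass
    return None


def collect_process_stats() -> dict:
    """Collect a few process stats, preferring psutil and degrading to stdlib."""
    stats = {"pid": os.getpid(), "psutil_available": psutil is not None}
    if psutil is not None:
        proc = psutil.Process()
        cpu = proc.cpu_times()
        stats["memory_rss_bytes"] = proc.memory_info().rss
        stats["cpu_user_seconds"] = cpu.user
        stats["cpu_system_seconds"] = cpu.system
        return stats
    # Fallback path: stdlib only.
    stats["memory_rss_bytes"] = _rss_from_proc_status()
    try:
        import resource  # Unix-only module

        usage = resource.getrusage(resource.RUSAGE_SELF)
        stats["cpu_user_seconds"] = usage.ru_utime
        stats["cpu_system_seconds"] = usage.ru_stime
    except ImportError:
        stats["cpu_user_seconds"] = None
        stats["cpu_system_seconds"] = None
    return stats
```

Reporting psutil_available alongside the numbers lets a reader of the payload tell which path produced them, and fields the fallback cannot fill are returned as None rather than raising.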
psutil is added as an optional but recommended dependency. When present, the endpoint returns rich process/system stats; when absent, it degrades gracefully to stdlib-only metrics (os.getloadavg, resource.getrusage, and VmRSS from /proc/self/status on Linux) and reports runtime.psutil_available: false. Both paths are exercised.
Changes
- backend/app/routers/diagnostics.py (new): the /diag route, authorization logic (token + IP/CIDR allowlist), and stat collection (process / system / queue / runtime) with a psutil-optional + stdlib fallback.
- backend/tests/test_diagnostics.py (new): 9 tests covering the access policy and payload shape.
- backend/app/main.py: register the diagnostics router alongside the other operational endpoints.
- backend/app/observability.py: add inflight_request_count() helper (sums the in-progress request gauge) used as the queue-depth signal.
- backend/app/schemas.py: add the DiagnosticsResponse model (kept in schemas.py per repo convention).
- backend/requirements.txt: add psutil>=5.9.0 (optional; documented as gracefully degradable).
- docs/admin.md, README.md, docs/CHANGELOG.md: document the endpoint, its env vars, response shape, and access-restriction guidance.
Response structure
{
  "status": "ok",
  "timestamp": "2026-05-30T12:00:00+00:00",
  "uptime_seconds": 1342.51,
  "process": {
    "pid": 42,
    "memory_rss_bytes": 78643200,
    "memory_rss_mb": 75.0,
    "memory_percent": 1.83,
    "num_threads": 9,
    "cpu_user_seconds": 4.21,
    "cpu_system_seconds": 1.07,
    "num_fds": 23
  },
  "system": {
    "cpu_count": 4,
    "load_average": [0.31, 0.27, 0.22],
    "cpu_percent": 6.0,
    "memory_total_bytes": 8323039232,
    "memory_available_bytes": 5123440640,
    "memory_percent": 38.4
  },
  "queue": {
    "inflight_requests": 1.0,
    "scheduled_jobs": 1,
    "rate_limited_clients": 0
  },
  "runtime": {
    "python_version": "3.12.3",
    "platform": "linux",
    "psutil_available": true,
    "gc_objects": 51234
  }
}
Testing
Run the tests:
cd backend
pytest tests/test_diagnostics.py -v
Video of demo
Screen.Recording.2026-06-21.120247.mp4
Checklist
- pytest -v passes (all tests green: 420 total, 9 new)
- docs/CHANGELOG.md updated
- black / isort formatting clean on changed files
- ruff check backend/app --select E,F,W --ignore E501 clean
- … main
Notes for the reviewer
- The queue section includes scheduled_jobs (background scheduler) and rate_limited_clients as additional, cheap queue/load signals. Happy to trim these if you'd prefer the payload even more minimal.
- The route is excluded from the OpenAPI schema (include_in_schema=False), matching /metrics, so it won't appear in /docs or /redoc.