Skip to content

Feat/sys diagnostics endpoint#1090

Open
madsysharma wants to merge 1 commit into
imDarshanGK:mainfrom
madsysharma:feat/sys-diagnostics-endpoint
Open

Feat/sys diagnostics endpoint#1090
madsysharma wants to merge 1 commit into
imDarshanGK:mainfrom
madsysharma:feat/sys-diagnostics-endpoint

Conversation

@madsysharma

@madsysharma madsysharma commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds a protected GET /diag system diagnostics endpoint that returns a minimal, non-sensitive JSON snapshot of process/system memory, CPU, and queue depth to support quick troubleshooting without shelling into a container.
  • Endpoint is locked down by default and exposes no secrets.

Related Issue

Closes #628

What & why

The issue asks for a limited diagnostics endpoint exposing non-sensitive info (memory, CPU, queue depth), guarded behind admin auth or an IP allowlist, with minimal output that avoids leaking credentials. This PR implements exactly that, following the conventions already established by the existing operational endpoints (/metrics, /healthz/*).

Safety model

The endpoint is built to be safe by default:

  • Disabled by default: only serves when DIAG_ENABLED=true; otherwise returns 404 so its existence is not advertised (same approach as /metrics).
  • Never unguarded: even when enabled, it returns 403 unless at least one access control is configured.
  • Two ways in: a request is authorised by a matching admin bearer token or an allowlisted client IP/CIDR (satisfying the issue's "admin auth or IP allowlist").
  • Spoof-proof: X-Forwarded-For is ignored for the allowlist check unless DIAG_TRUST_FORWARDED_FOR=true (set only behind a trusted proxy). The token comparison is constant-time (hmac.compare_digest).
  • Minimal output: process/system stats only. No environment variables, secrets, connection strings, tokens, request bodies, or hostnames. A test asserts the configured token is never echoed and that no secret-ish keys appear in the payload.

Config

All flags are read at request time (consistent with /metrics), so operators can flip them without a restart.

Variable Default Description
DIAG_ENABLED false Master switch. Returns 404 while disabled.
DIAG_AUTH_TOKEN Admin bearer token. Authorization: Bearer <token> grants access.
DIAG_IP_ALLOWLIST Comma-separated IPs/CIDRs allowed (e.g. 10.0.0.0/8,127.0.0.1).
DIAG_TRUST_FORWARDED_FOR false Trust left-most X-Forwarded-For for the allowlist check.

Optional dependency

psutil is added as an optional but recommended dependency. When present, the endpoint returns rich process/system stats; when absent, it degrades gracefully to stdlib-only metrics (os.getloadavg, resource.getrusage, and VmRSS from /proc/self/status on Linux) and reports runtime.psutil_available: false. Both paths are exercised.

Changes

  • backend/app/routers/diagnostics.py (new): the /diag route, authorization logic (token + IP/CIDR allowlist), and stat collection (process / system / queue / runtime) with a psutil-optional + stdlib fallback.
  • backend/tests/test_diagnostics.py (new): 9 tests covering the access policy and payload shape.
  • backend/app/main.py:register the diagnostics router alongside the other operational endpoints.
  • backend/app/observability.py: add inflight_request_count() helper (sums the in-progress request gauge) used as the queue-depth signal.
  • backend/app/schemas.py: add the DiagnosticsResponse model (kept in schemas.py per repo convention).
  • backend/requirements.txt: add psutil>=5.9.0 (optional; documented as gracefully degradable).
  • docs/admin.md, README.md, docs/CHANGELOG.md: document the endpoint, its env vars, response shape, and access-restriction guidance.

Note: black/isort (the versions pinned in CI) touched a few pre-existing blank lines in observability.py and schemas.py. The CI format job runs black --check on every file in the PR diff (all-or-nothing per file), so these formatter-only adjustments are required for the check to pass: no logic changed.

Response structure

{
  "status": "ok",
  "timestamp": "2026-05-30T12:00:00+00:00",
  "uptime_seconds": 1342.51,
  "process": {
    "pid": 42,
    "memory_rss_bytes": 78643200,
    "memory_rss_mb": 75.0,
    "memory_percent": 1.83,
    "num_threads": 9,
    "cpu_user_seconds": 4.21,
    "cpu_system_seconds": 1.07,
    "num_fds": 23
  },
  "system": {
    "cpu_count": 4,
    "load_average": [0.31, 0.27, 0.22],
    "cpu_percent": 6.0,
    "memory_total_bytes": 8323039232,
    "memory_available_bytes": 5123440640,
    "memory_percent": 38.4
  },
  "queue": {
    "inflight_requests": 1.0,
    "scheduled_jobs": 1,
    "rate_limited_clients": 0
  },
  "runtime": {
    "python_version": "3.12.3",
    "platform": "linux",
    "psutil_available": true,
    "gc_objects": 51234
  }
}

inflight_requests includes the diagnostics request currently being served, so 1 under no other load is expected.

Testing

# 1. Run the backend
cd backend
uvicorn app.main:app --reload

# 2. Disabled by default -> 404
curl -i http://localhost:8000/diag

# 3. Enable with an admin token
DIAG_ENABLED=true DIAG_AUTH_TOKEN=secret uvicorn app.main:app --reload
curl -s -H "Authorization: Bearer secret" http://localhost:8000/diag | jq

# 4. Or gate by IP allowlist instead
DIAG_ENABLED=true DIAG_IP_ALLOWLIST=127.0.0.1 uvicorn app.main:app --reload
curl -s http://localhost:8000/diag | jq

Run the tests:

cd backend
pytest tests/test_diagnostics.py -v

Video of demo

Screen.Recording.2026-06-21.120247.mp4

Checklist

  • pytest -v passes (all tests green — 420 total, 9 new)
  • New feature has at least one test
  • Code has type hints and docstrings
  • README updated (behavior changed)
  • docs/CHANGELOG.md updated
  • black / isort formatting clean on changed files
  • ruff check backend/app --select E,F,W --ignore E501 clean
  • Branch is up-to-date with main
  • PR description explains what and why

Notes for the reviewer

  • I considered modeling "queue depth" purely as in-flight HTTP requests, but added scheduled_jobs (background scheduler) and rate_limited_clients as additional, cheap queue/load signals. Happy to trim these if you'd prefer the payload even more minimal.
  • Endpoint is excluded from the OpenAPI schema (include_in_schema=False), matching /metrics, so it won't appear in /docs or /redoc.

@madsysharma madsysharma requested a review from imDarshanGK as a code owner June 21, 2026 15:58
@madsysharma madsysharma force-pushed the feat/sys-diagnostics-endpoint branch 3 times, most recently from 57d8002 to f829315 Compare June 21, 2026 18:24

@imDarshanGK imDarshanGK left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@madsysharma
Please add a video demo of the feature.

Image

@madsysharma madsysharma force-pushed the feat/sys-diagnostics-endpoint branch from f829315 to a68c42b Compare June 21, 2026 18:52
@madsysharma

Copy link
Copy Markdown
Contributor Author

Hi @imDarshanGK , have added a video of the demo. Please review. Thank you.

@madsysharma madsysharma requested a review from imDarshanGK June 21, 2026 19:05
@madsysharma madsysharma force-pushed the feat/sys-diagnostics-endpoint branch from 05a6ae6 to 1625218 Compare June 22, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add system diagnostics endpoint for debugging

2 participants