feat: Wire openclaw:prompt-error events into alerts stream #797

Open
vivekchand wants to merge 1 commit into main from fix/gh-clawmetry-601-prompt-error-alerts

Conversation

@vivekchand
Owner

Closes #601

What

Adds visibility for provider-level errors (rate limits, auth issues, context overflow, model-not-found) that were previously ignored in the ClawMetry dashboard.

How

  • New /api/prompt-errors endpoint in routes/overview.py that scans session JSONL files for openclaw:prompt-error custom events
  • Red alert banner on Overview tab showing recent prompt errors with: provider, model, error type, timestamp
  • Auto-refreshes every 30s when on Overview tab
  • Deduplicates errors using timestamp tracking

Changes

  • routes/overview.py: New api_prompt_errors() endpoint
  • dashboard.py: Added loadPromptErrors() JS function + polling timer
  • clawmetry/static/js/app.js: Added loadPromptErrors() function + timer
  • clawmetry/templates/tabs/overview.html: Added prompt error banner UI

Acceptance

  • /api/prompt-errors endpoint filters for openclaw:prompt-error events
  • Banner/alert section on Overview tab showing recent prompt errors
  • Displays: provider, model, error type, timestamp

Owner Author

@vivekchand vivekchand left a comment

Test plan & review notes

What changed

  • Adds a /api/prompt-errors endpoint that scans session JSONL files for openclaw:prompt-error custom events, plus a red alert banner on the Overview tab that polls the endpoint every 30 s and renders provider/model/error-type/timestamp rows.

Smoke commands

  • make test or make test-api
  • python3 dashboard.py --port 8900
  • Trigger a prompt-error event (or mock one) and verify an alert fires: inject a JSONL line like {"type":"custom","customType":"openclaw:prompt-error","timestamp":"2026-05-05T12:00:00.000Z","data":{"provider":"anthropic","model":"claude-3-opus","error":"rate_limit","timestamp":1777982400000}} into ~/.openclaw/agents/main/sessions/test.jsonl, then hit /api/prompt-errors and reload the Overview tab to confirm the banner appears. (The inner data.timestamp is the same instant as the outer ISO string, expressed in epoch milliseconds.)
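
The injection step above can be scripted. This is a minimal sketch, not part of the PR: append_prompt_error and iso_to_ms are hypothetical helper names, and the event shape is copied from the sample line in this comment. The inner data.timestamp is derived from the outer ISO string so the two fields cannot disagree.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_ms(ts_iso: str) -> int:
    """Convert an ISO-8601 string with 'Z' or an explicit offset to epoch milliseconds."""
    dt = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    return (dt - EPOCH) // timedelta(milliseconds=1)


def append_prompt_error(session_file: Path, provider: str, model: str,
                        error: str, ts_iso: str) -> dict:
    """Append one synthetic openclaw:prompt-error line (hypothetical test helper)."""
    event = {
        "type": "custom",
        "customType": "openclaw:prompt-error",
        "timestamp": ts_iso,
        "data": {
            "provider": provider,
            "model": model,
            "error": error,
            "timestamp": iso_to_ms(ts_iso),  # same instant as the outer field
        },
    }
    session_file.parent.mkdir(parents=True, exist_ok=True)
    with session_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return event
```

For a real smoke test, call it with Path.home() / ".openclaw/agents/main/sessions/test.jsonl" as the target; for a dry run, point it at a temporary directory.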

Likely failure modes from the diff

  • Alert storm / no server-side deduplication: The since filter relies on the client tracking _promptErrorLastTs, but that state resets on every page load. If a session file accumulates many errors they will all re-appear on each fresh load. Consider storing dismissed/seen IDs server-side (or in localStorage) and/or enforcing a default since window (e.g. last 1 h) in the endpoint.
  • Timestamp comparison mismatch: The since filter compares pdata.get("timestamp", 0) (an integer in ms from the inner data dict) against since_ms, while the returned payload carries the outer obj.get("timestamp", "") ISO string. If the inner data.timestamp is absent, ts defaults to 0, so the check ts <= since_ms becomes 0 <= since_ms, which is always true regardless of when the event occurred; the filter then stops distinguishing old events from new ones. The two timestamp fields need to be kept consistent.
  • Full-dir scan on every poll: Every 30 s the endpoint calls os.listdir + opens every .jsonl file up to 512 KB each. On a workspace with hundreds of long-running sessions this adds up. A lightweight index (mtime cache, or scanning only files modified in the last N minutes) would help.
  • Gateway WS drop: The feature polls the filesystem rather than listening on the gateway WebSocket, so it will still work if the WS drops — but it also means errors that are only emitted live (not written to JSONL) would be missed. Worth confirming that OpenClaw always persists openclaw:prompt-error events to the session file.
  • XSS via unsanitised fields: provider, model, and error values from JSONL are injected directly into innerHTML strings in loadPromptErrors(). A crafted session file could execute arbitrary JS. Use textContent assignment or escape the values before rendering.
  • app.js / dashboard.py duplication: loadPromptErrors() and _promptErrorLastTs are defined in both clawmetry/static/js/app.js and the embedded JS in dashboard.py. They will be declared twice in any page that loads both, causing the _promptErrorLastTs state to be split. Confirm only one definition ends up in the rendered page, or de-duplicate.
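
One way to act on the XSS note is to escape the untrusted fields before they leave the endpoint. The sketch below is an assumption about where such a step could live, not code from the diff: sanitize_error_row is a hypothetical helper, and the field names follow the keys listed in these review comments. Switching the client to textContent remains the more robust fix, since server-side escaping protects only consumers that treat the values as HTML.

```python
import html

# Fields that come straight from session JSONL and end up in the banner markup.
UNTRUSTED_FIELDS = ("provider", "model", "error", "api")


def sanitize_error_row(row: dict) -> dict:
    """Return a copy of row with the untrusted string fields HTML-escaped."""
    clean = dict(row)
    for key in UNTRUSTED_FIELDS:
        if key in clean and clean[key] is not None:
            # quote=True also escapes " and ', which matters inside attributes.
            clean[key] = html.escape(str(clean[key]), quote=True)
    return clean
```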

Issue link

  • Closes #601 — confirmed in PR body.

Generated by Claude Code

Owner Author

@vivekchand vivekchand left a comment

Test plan & review notes

What changed

  • Adds a red alert banner to the Overview tab that surfaces openclaw:prompt-error custom events (rate limits, auth failures, context overflow, model-not-found) previously invisible in the dashboard; new GET /api/prompt-errors endpoint in routes/overview.py scans session JSONL files, with a 30s polling loop and incremental since deduplication.

Smoke commands

  • make test-api
  • python3 dashboard.py --port 8900

What to look at visually

  • http://localhost:8900 → Overview tab → top of page — confirm the red prompt-error banner is hidden when there are no errors and appears automatically (no reload needed) within 30s when an openclaw:prompt-error custom event is present in any session JSONL; verify each row shows time, provider badge, model name, and error-type pill
  • http://localhost:8900/api/prompt-errors?limit=10 (raw JSON) — confirm the errors array contains objects with timestamp, provider, model, api, error, runId, and sessionId keys; test with ?since=<unix_ms> to verify incremental filtering works

Likely failure modes

  • The since filter compares pdata.get("timestamp", 0) (a value from the data sub-object, which may be a ms integer or absent) against since_ms, but the top-level obj["timestamp"] used for sorting is an ISO string — if the data.timestamp field is missing or typed differently than expected, since filtering silently passes all events or drops all of them.
  • The JS _promptErrorLastTs tracker is updated from new Date(err.timestamp).getTime() (ms), but the backend since param is compared against pdata.get("timestamp", 0) which could be a different unit or format — mismatched units would cause the incremental poll to re-show the same errors on every 30s tick without ever clearing them.
  • The loadPromptErrors function is duplicated verbatim between clawmetry/static/js/app.js and the embedded JS in dashboard.py; the dashboard.py copy includes a var api = err.api || '' line that the app.js copy omits — this divergence means the api field is silently dropped in the static-file serving path.
  • No unit or integration tests are added for the new endpoint; a tests/test_prompt_errors.py would be the natural companion (compare tests/test_heartbeat.py pattern in PR #812).
  • WebSocket event flow to verify: OpenClaw gateway emits openclaw:prompt-error → OpenClaw writes the event as a {"type":"custom","customType":"openclaw:prompt-error","data":{...}} line into the session JSONL → /api/prompt-errors picks it up on next 512KB tail-read → loadPromptErrors() poll surfaces it in the banner within 30s.
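
The two timestamp bullets above come down to comparing values of different types. A possible remedy, sketched here under assumptions, is to coerce every timestamp to integer epoch milliseconds before filtering or sorting. to_epoch_ms and event_ms are hypothetical names; numeric values are assumed to already be in milliseconds, and ISO strings without an offset are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(value, default=0):
    """Coerce an ISO-8601 string or a numeric ms value to integer epoch ms."""
    if isinstance(value, bool):  # bool is an int subclass; reject it explicitly
        return default
    if isinstance(value, (int, float)):
        return int(value)  # assumption: numbers are already milliseconds
    if isinstance(value, str) and value:
        try:
            dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return default
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return (dt - EPOCH) // timedelta(milliseconds=1)
    return default


def event_ms(obj):
    """Prefer the inner data.timestamp; fall back to the outer ISO timestamp."""
    inner = to_epoch_ms((obj.get("data") or {}).get("timestamp"))
    return inner if inner else to_epoch_ms(obj.get("timestamp"))
```

With a single helper like event_ms used for both the since comparison and the sort key, the client can send back whichever timestamp it received and still get consistent filtering.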

Issue link

  • Closes #601 (confirmed in PR body and branch name fix/gh-clawmetry-601-prompt-error-alerts)

Generated by Claude Code

Owner Author

@vivekchand vivekchand left a comment

Test plan & review notes

Repo: vivekchand/clawmetry

What changed

  • New GET /api/prompt-errors endpoint in routes/overview.py scanning session JSONL files for openclaw:prompt-error custom events
  • Red alert banner on the Overview tab, auto-refreshing every 30s; deduplicates errors by timestamp

Smoke commands

  • python3 -c 'import ast; ast.parse(open("routes/overview.py").read())' — syntax clean
  • curl -sS http://localhost:8900/api/prompt-errors — expect {"errors": [...]} (empty array is fine if no errors in logs)
  • Inject a synthetic openclaw:prompt-error event into a test JSONL and re-check — should surface in the response
  • curl -sS http://localhost:8900/api/overview — confirm existing endpoint is unaffected

What to look at visually

  • http://localhost:8900 → Overview tab — red alert banner should appear if any prompt errors exist in the 30-day scan window

Likely failure modes from the diff

  • Synchronous JSONL scan: large session directories could make this slow on each 30s poll — check for a per-request limit on sessions scanned or a capped scan window
  • Timestamp deduplication: if two errors share the same timestamp (rare but possible), one would be silently dropped
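
To illustrate the capped scan mentioned above, the sketch below selects only recently modified .jsonl files and bounds how many are opened per request. recent_jsonl_files, max_age_s, and max_files are hypothetical names and defaults, not values taken from the diff.

```python
import os
import time


def recent_jsonl_files(sessions_dir, max_age_s=3600, max_files=50, now=None):
    """Return paths of .jsonl files modified within max_age_s, newest first."""
    now = time.time() if now is None else now
    candidates = []
    with os.scandir(sessions_dir) as entries:
        for entry in entries:
            if not entry.name.endswith(".jsonl") or not entry.is_file():
                continue
            mtime = entry.stat().st_mtime
            if now - mtime <= max_age_s:
                candidates.append((mtime, entry.path))
    candidates.sort(reverse=True)  # newest modification time first
    return [path for _, path in candidates[:max_files]]
```

The stat call per file is cheap compared with opening and parsing each file, so a filter like this should keep the 30s poll roughly proportional to the number of active sessions rather than the total.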

Issue link


Generated by Claude Code

Owner Author

@vivekchand vivekchand left a comment

Test plan & review notes

Repo: vivekchand/clawmetry

What changed

  • Adds /api/prompt-errors endpoint in routes/overview.py that scans session JSONL files for openclaw:prompt-error custom events, plus a red alert banner on the Overview tab that polls every 30s and deduplicates by timestamp.

Smoke commands

  • make test or make test-api
  • python3 dashboard.py --port 8900
  • curl -sS http://localhost:8900/api/prompt-errors — verify response shape is {"errors": [...]}
  • curl -sS "http://localhost:8900/api/prompt-errors?limit=5&since=0" — confirm since and limit params are respected
  • Drop a synthetic JSONL line ({"type":"custom","customType":"openclaw:prompt-error","timestamp":"2026-05-07T10:00:00Z","data":{"provider":"anthropic","model":"claude-3-opus","error":"rate_limit","runId":"r1","sessionId":"s1","api":"anthropic"}}) into a .jsonl file under ~/.openclaw/agents/main/sessions/ and confirm the banner appears on the Overview tab.
  • With a live gateway: trigger a real rate-limit or auth error and confirm it surfaces within the next 30s poll.

Likely failure modes from the diff

  • Timestamp field mismatch: the since filter compares against pdata.get("timestamp", 0) (a value pulled from inside data), but the JS deduplication tracks new Date(err.timestamp).getTime() using the outer obj["timestamp"] field. If those two fields differ (outer ISO string vs. inner ms integer), the deduplication window will drift and either never suppress repeats or suppress valid new errors.
  • Sort key uses outer ISO string lexicographically: errors.sort(key=lambda x: x.get("timestamp", ""), ...) works only if all timestamps are zero-padded ISO-8601. Mixed formats (or missing values defaulting to "") will mis-sort.
  • loadPromptErrors declared twice: the function appears verbatim in both clawmetry/static/js/app.js and in the embedded JS block inside dashboard.py, with a subtle difference — app.js omits the var api = err.api || '' line that dashboard.py's copy has. Whichever is loaded last wins; the discrepancy should be unified to avoid future confusion.
  • _promptErrorPollTimer not cleared on tab leave: the switchTab handler starts the poll when entering overview but there is no else branch that clears it when navigating away (unlike the cron timer pattern). This means the 30s interval keeps firing regardless of which tab is active.
  • Banner never self-dismisses after errors resolve: once _promptErrorLastTs advances, the since filter will exclude all older events on the next poll. If the only errors are "old", errors.length === 0 and the banner hides — that part is correct. But if a user stays on the tab for a long session with no new errors, the initial errors shown at since=0 remain visible until a page refresh. Consider resetting _promptErrorLastTs = 0 on full loadAll() calls.
  • 512KB tail read may split a JSONL line: if a very long entry straddles the size - 512000 seek point, line.strip() will yield a partial JSON string that json.loads silently skips. For correctness, discard only the first (potentially partial) line after the seek.
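
The tail-read concern in the last bullet can be handled by seeking one byte before the window and discarding up to the next newline. This is a sketch of that technique under assumptions, not the code in routes/overview.py: read_tail_events is a hypothetical name, and the 512000-byte window mirrors the figure quoted in this review.

```python
import json
import os

TAIL_BYTES = 512_000  # window size quoted in the review


def read_tail_events(path, tail_bytes=TAIL_BYTES):
    """Parse openclaw:prompt-error events from the last tail_bytes of a JSONL file."""
    events = []
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        if size > tail_bytes:
            # Start one byte early: if that byte is a newline, readline() eats
            # only the newline and the next line is whole; otherwise it eats
            # the remainder of a line that straddles the seek point.
            fh.seek(size - tail_bytes - 1)
            fh.readline()
        for raw in fh:
            line = raw.decode("utf-8", errors="replace").strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except ValueError:
                continue  # malformed line; skip rather than fail the request
            if (isinstance(obj, dict) and obj.get("type") == "custom"
                    and obj.get("customType") == "openclaw:prompt-error"):
                events.append(obj)
    return events
```

Seeking one byte early avoids throwing away a complete line when the window happens to start exactly on a line boundary.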

Issue link

  • Closes #601 (confirmed from PR body)

Generated by Claude Code

@vivekchand force-pushed the fix/gh-clawmetry-601-prompt-error-alerts branch from e2b2ea0 to 973bb0a on May 7, 2026 07:06
Owner Author

Test plan & review notes

Repo: vivekchand/clawmetry

What changed

  • New /api/prompt-errors endpoint in routes/overview.py that scans session JSONL for openclaw:prompt-error custom events; adds a polling red-alert banner on the Overview tab (auto-refreshes every 30s)

Smoke commands

  • make test or make test-api
  • python3 dashboard.py --port 8900
  • curl -sS http://localhost:8900/api/prompt-errors → expect {"errors": [...]} whose entries carry provider, model, error, and timestamp fields

What to look at visually

  • http://localhost:8900 → Overview tab — red alert banner should appear when any session JSONL contains openclaw:prompt-error events; absent when there are none (no empty-state flash)
  • Confirm the 30s auto-poll doesn't fire when the browser tab is backgrounded (check loadPromptErrors timer logic)

Likely failure modes from the diff

  • JSONL scanner on a workspace with 50+ sessions may be slow — worth a response-time sanity check
  • Timestamp-based deduplication: verify a session with the same timestamp but a different error type isn't silently dropped
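
One way to check the deduplication concern above is to compare against a composite key instead of the timestamp alone. The sketch below is illustrative only: dedupe_errors is a hypothetical helper, and the key fields are the payload keys named in the earlier review comments.

```python
def dedupe_errors(errors, seen=None):
    """Drop repeats using a composite key rather than the timestamp alone."""
    seen = set() if seen is None else seen
    fresh = []
    for err in errors:
        key = (
            err.get("timestamp"),
            err.get("sessionId"),
            err.get("runId"),
            err.get("provider"),
            err.get("model"),
            err.get("error"),
        )
        if key in seen:
            continue
        seen.add(key)
        fresh.append(err)
    return fresh
```

Passing the same seen set across polls lets two errors that share a timestamp but differ in type both survive, while an exact repeat is still suppressed.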

Issue link


Generated by Claude Code

Development

Successfully merging this pull request may close these issues.

[P0] Wire openclaw:prompt-error events into alerts stream
