feat: Wire openclaw:prompt-error events into alerts stream #797
vivekchand wants to merge 1 commit into main
60d6b93 to e2b2ea0
vivekchand left a comment
Test plan & review notes
What changed
- Adds a `/api/prompt-errors` endpoint that scans session JSONL files for `openclaw:prompt-error` custom events, plus a red alert banner on the Overview tab that polls the endpoint every 30 s and renders provider/model/error-type/timestamp rows.
Smoke commands
- `make test` or `make test-api`
- `python3 dashboard.py --port 8900`
- Trigger a prompt-error event (or mock one) and verify an alert fires: inject a JSONL line like `{"type":"custom","customType":"openclaw:prompt-error","timestamp":"2026-05-05T12:00:00.000Z","data":{"provider":"anthropic","model":"claude-3-opus","error":"rate_limit","timestamp":1777982400000}}` into `~/.openclaw/agents/main/sessions/test.jsonl`, then hit `/api/prompt-errors` and reload the Overview tab to confirm the banner appears.
Likely failure modes from the diff
- Alert storm / no server-side deduplication: The `since` filter relies on the client tracking `_promptErrorLastTs`, but that state resets on every page load. If a session file accumulates many errors they will all re-appear on each fresh load. Consider storing dismissed/seen IDs server-side (or in `localStorage`) and/or enforcing a default `since` window (e.g. last 1 h) in the endpoint.
- Timestamp comparison mismatch: The `since` filter compares `pdata.get("timestamp", 0)` (an integer in ms from the inner `data` dict) against the outer `obj.get("timestamp", "")` ISO string that is stored in the returned payload. If callers pass back the outer ISO timestamp as `since`, the integer comparison `ts <= since_ms` will always be `0 <= since_ms` and never filter anything. The two timestamp fields need to be kept consistent.
- Full-dir scan on every poll: Every 30 s the endpoint calls `os.listdir` + opens every `.jsonl` file up to 512 KB each. On a workspace with hundreds of long-running sessions this adds up. A lightweight index (mtime cache, or scanning only files modified in the last N minutes) would help.
- Gateway WS drop: The feature polls the filesystem rather than listening on the gateway WebSocket, so it will still work if the WS drops, but it also means errors that are only emitted live (not written to JSONL) would be missed. Worth confirming that OpenClaw always persists `openclaw:prompt-error` events to the session file.
- XSS via unsanitised fields: `provider`, `model`, and `error` values from JSONL are injected directly into `innerHTML` strings in `loadPromptErrors()`. A crafted session file could execute arbitrary JS. Use `textContent` assignment or escape the values before rendering.
- `app.js` / `dashboard.py` duplication: `loadPromptErrors()` and `_promptErrorLastTs` are defined in both `clawmetry/static/js/app.js` and the embedded JS in `dashboard.py`. They will be declared twice in any page that loads both, causing the `_promptErrorLastTs` state to be split. Confirm only one definition ends up in the rendered page, or de-duplicate.
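One possible way to keep the two timestamp fields consistent is to normalise every event to epoch milliseconds before comparing against `since`. The sketch below is illustrative only: the helper names `event_ts_ms` and `newer_than` are hypothetical (they do not appear in the diff), and it assumes `data.timestamp`, when present, is already epoch ms while the outer `timestamp` is an ISO-8601 string.

```python
from datetime import datetime, timezone


def event_ts_ms(obj):
    """Return an event's time in epoch milliseconds, or 0 if none is usable.

    Assumption: data.timestamp, when present, is epoch ms; the outer
    timestamp is an ISO-8601 string such as "2026-05-05T12:00:00.000Z".
    """
    data = obj.get("data")
    inner = data.get("timestamp") if isinstance(data, dict) else None
    if isinstance(inner, (int, float)) and not isinstance(inner, bool):
        return int(inner)
    outer = obj.get("timestamp")
    if isinstance(outer, str) and outer:
        try:
            # fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
            # so rewrite it as an explicit UTC offset first.
            parsed = datetime.fromisoformat(outer.replace("Z", "+00:00"))
        except ValueError:
            return 0
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return int(parsed.timestamp() * 1000)
    return 0


def newer_than(events, since_ms):
    """Keep only events strictly newer than since_ms (epoch ms)."""
    return [e for e in events if event_ts_ms(e) > since_ms]
```

If the endpoint also returned this normalised value, the client could pass it straight back as `since` without re-parsing the ISO string.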
Issue link
- Closes #601 — confirmed in PR body.
Generated by Claude Code
vivekchand left a comment
Test plan & review notes
What changed
- Adds a red alert banner to the Overview tab that surfaces `openclaw:prompt-error` custom events (rate limits, auth failures, context overflow, model-not-found) previously invisible in the dashboard; new `GET /api/prompt-errors` endpoint in `routes/overview.py` scans session JSONL files, with a 30s polling loop and incremental `since` deduplication.
Smoke commands
- `make test-api`
- `python3 dashboard.py --port 8900`
What to look at visually
- `http://localhost:8900` → Overview tab → top of page: confirm the red prompt-error banner is hidden when there are no errors and appears automatically (no reload needed) within 30s when an `openclaw:prompt-error` custom event is present in any session JSONL; verify each row shows time, provider badge, model name, and error-type pill
- `http://localhost:8900/api/prompt-errors?limit=10` (raw JSON): confirm the `errors` array contains objects with `timestamp`, `provider`, `model`, `api`, `error`, `runId`, and `sessionId` keys; test with `?since=<unix_ms>` to verify incremental filtering works
Likely failure modes
- The `since` filter compares `pdata.get("timestamp", 0)` (a value from the `data` sub-object, which may be a ms integer or absent) against `since_ms`, but the top-level `obj["timestamp"]` used for sorting is an ISO string. If the `data.timestamp` field is missing or typed differently than expected, `since` filtering silently passes all events or drops all of them.
- The JS `_promptErrorLastTs` tracker is updated from `new Date(err.timestamp).getTime()` (ms), but the backend `since` param is compared against `pdata.get("timestamp", 0)`, which could be a different unit or format. Mismatched units would cause the incremental poll to re-show the same errors on every 30s tick without ever clearing them.
- The `loadPromptErrors` function is duplicated almost verbatim between `clawmetry/static/js/app.js` and the embedded JS in `dashboard.py`; the `dashboard.py` copy includes a `var api = err.api || ''` line that the `app.js` copy omits, so the `api` field is silently dropped in the static-file serving path.
- No unit or integration tests are added for the new endpoint; a `tests/test_prompt_errors.py` would be the natural companion (compare the `tests/test_heartbeat.py` pattern in PR #812).
- WebSocket event flow to verify: OpenClaw gateway emits `openclaw:prompt-error` → OpenClaw writes the event as a `{"type":"custom","customType":"openclaw:prompt-error","data":{...}}` line into the session JSONL → `/api/prompt-errors` picks it up on the next 512KB tail-read → `loadPromptErrors()` poll surfaces it in the banner within 30s.
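The JSONL-to-endpoint step of that flow could be factored into a small pure function, which would also give a `tests/test_prompt_errors.py` something to exercise without a running server. This is a rough sketch, not the code from the diff: `extract_prompt_errors` is a hypothetical name, and the returned keys are a subset of the ones listed above.

```python
import json

PROMPT_ERROR_TYPE = "openclaw:prompt-error"


def extract_prompt_errors(lines):
    """Yield flattened prompt-error records from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        if not isinstance(obj, dict):
            continue
        if obj.get("type") != "custom" or obj.get("customType") != PROMPT_ERROR_TYPE:
            continue
        data = obj.get("data")
        if not isinstance(data, dict):
            data = {}
        yield {
            "timestamp": obj.get("timestamp", ""),
            "provider": data.get("provider", ""),
            "model": data.get("model", ""),
            "error": data.get("error", ""),
        }
```

A test could then feed it a handful of literal lines (one valid event, one unrelated event, one truncated line) and assert on the resulting list.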
Issue link
- Closes #601 (confirmed in PR body and branch name `fix/gh-clawmetry-601-prompt-error-alerts`)
Generated by Claude Code
vivekchand left a comment
Test plan & review notes
Repo: vivekchand/clawmetry
What changed
- New `GET /api/prompt-errors` endpoint in `routes/overview.py` scanning session JSONL files for `openclaw:prompt-error` custom events
- Red alert banner on the Overview tab, auto-refreshing every 30s; deduplicates errors by timestamp
Smoke commands
- `python3 -c 'import ast; ast.parse(open("routes/overview.py").read())'`: syntax clean
- `curl -sS http://localhost:8900/api/prompt-errors`: expect `{"errors": [...]}` (empty array is fine if no errors in logs)
- Inject a synthetic `openclaw:prompt-error` event into a test JSONL and re-check: it should surface in the response
- `curl -sS http://localhost:8900/api/overview`: confirm existing endpoint is unaffected
What to look at visually
- `http://localhost:8900` → Overview tab: red alert banner should appear if any prompt errors exist in the 30-day scan window
Likely failure modes from the diff
- Synchronous JSONL scan: large session directories could make this slow on each 30s poll — check for a per-request limit on sessions scanned or a capped scan window
- Timestamp deduplication: if two errors share the same timestamp (rare but possible), one would be silently dropped
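A capped scan along the lines suggested in the first bullet might look like the sketch below. It is an assumption-laden illustration rather than what the diff does: the function name `recent_session_files` and the default values for `max_age_s` and `max_files` are made up for the example.

```python
import os
import time


def recent_session_files(sessions_dir, max_age_s=3600, max_files=50, now=None):
    """Return up to max_files .jsonl paths modified within max_age_s, newest first."""
    now = time.time() if now is None else now
    candidates = []
    try:
        with os.scandir(sessions_dir) as entries:
            for entry in entries:
                if not entry.name.endswith(".jsonl"):
                    continue
                try:
                    if not entry.is_file():
                        continue
                    mtime = entry.stat().st_mtime
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if now - mtime <= max_age_s:
                    candidates.append((mtime, entry.path))
    except OSError:
        return []  # missing or unreadable directory
    candidates.sort(reverse=True)  # newest mtime first
    return [path for _, path in candidates[:max_files]]
```

Only the files returned here would then be opened on each poll, so the per-request cost is bounded by `max_files` rather than by the size of the sessions directory.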
Issue link
- Closes #601
Generated by Claude Code
vivekchand left a comment
Test plan & review notes
Repo: vivekchand/clawmetry
What changed
- Adds `/api/prompt-errors` endpoint in `routes/overview.py` that scans session JSONL files for `openclaw:prompt-error` custom events, plus a red alert banner on the Overview tab that polls every 30s and deduplicates by timestamp.
Smoke commands
- `make test` or `make test-api`
- `python3 dashboard.py --port 8900`
- `curl -sS http://localhost:8900/api/prompt-errors`: verify response shape is `{"errors": [...]}`
- `curl -sS "http://localhost:8900/api/prompt-errors?limit=5&since=0"`: confirm `since` and `limit` params are respected
- Drop a synthetic JSONL line (`{"type":"custom","customType":"openclaw:prompt-error","timestamp":"2026-05-07T10:00:00Z","data":{"provider":"anthropic","model":"claude-3-opus","error":"rate_limit","runId":"r1","sessionId":"s1","api":"anthropic"}}`) into a `.jsonl` file under `~/.openclaw/agents/main/sessions/` and confirm the banner appears on the Overview tab.
- With a live gateway: trigger a real rate-limit or auth error and confirm it surfaces within the next 30s poll.
Likely failure modes from the diff
- Timestamp field mismatch: the `since` filter compares against `pdata.get("timestamp", 0)` (a value pulled from inside `data`), but the JS deduplication tracks `new Date(err.timestamp).getTime()` using the outer `obj["timestamp"]` field. If those two fields differ (outer ISO string vs. inner ms integer), the deduplication window will drift and either never suppress repeats or suppress valid new errors.
- Sort key uses outer ISO string lexicographically: `errors.sort(key=lambda x: x.get("timestamp", ""), ...)` works only if all timestamps are zero-padded ISO-8601. Mixed formats (or missing values defaulting to `""`) will mis-sort.
- `loadPromptErrors` declared twice: the function appears almost verbatim in both `clawmetry/static/js/app.js` and in the embedded JS block inside `dashboard.py`, with one subtle difference: `app.js` omits the `var api = err.api || ''` line that `dashboard.py`'s copy has. Whichever is loaded last wins; the discrepancy should be unified to avoid future confusion.
- `_promptErrorPollTimer` not cleared on tab leave: the `switchTab` handler starts the poll when entering `overview` but there is no `else` branch that clears it when navigating away (unlike the cron timer pattern). This means the 30s interval keeps firing regardless of which tab is active.
- Banner never self-dismisses after errors resolve: once `_promptErrorLastTs` advances, the `since` filter will exclude all older events on the next poll. If the only errors are "old", `errors.length === 0` and the banner hides, which is correct. But if a user stays on the tab for a long session with no new errors, the initial errors shown at `since=0` remain visible until a page refresh. Consider resetting `_promptErrorLastTs = 0` on full `loadAll()` calls.
- 512KB tail read may split a JSONL line: if a very long entry straddles the `size - 512000` seek point, `line.strip()` will yield a partial JSON string that fails `json.loads` and is silently skipped. For correctness, discard only the first (potentially partial) line after the seek.
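The partial-line handling from the last bullet could be made explicit as in the sketch below. It is a minimal illustration under assumptions: `read_tail_lines` is a hypothetical helper, the file is treated as UTF-8, and lines are split on `\n` only.

```python
import os


def read_tail_lines(path, max_bytes=512_000):
    """Return complete, non-empty lines from the last max_bytes of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        start = max(0, size - max_bytes)
        if start > 0:
            # Read the byte just before the window to learn whether the
            # window begins exactly on a line boundary.
            fh.seek(start - 1)
            prev = fh.read(1)
            chunk = fh.read()
            if prev != b"\n":
                # Window starts mid-line: drop everything up to and
                # including the first newline.
                _, _, chunk = chunk.partition(b"\n")
        else:
            chunk = fh.read()
    lines = []
    for raw in chunk.split(b"\n"):
        text = raw.decode("utf-8", errors="replace").strip()
        if text:
            lines.append(text)
    return lines
```

Working in binary mode means the seek offset is a plain byte count, and a window that happens to start inside a multi-byte character only ever damages the line that is being discarded anyway.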
Issue link
- Closes #601 (confirmed from PR body)
Generated by Claude Code
Adds visibility for provider-level errors (rate limits, auth issues, context overflow, model-not-found) that were previously hidden.
- New /api/prompt-errors endpoint that scans session JSONL files for openclaw:prompt-error custom events
- Red banner on Overview tab showing recent prompt errors with: provider, model, error type, timestamp
- Auto-refreshes every 30s when on Overview tab
- Deduplicates errors using timestamp tracking
- [x] /api/prompt-errors endpoint filters for openclaw:prompt-error events
- [x] Banner/alert section on Overview tab showing recent prompt errors
- [x] Displays: provider, model, error type, timestamp
e2b2ea0 to 973bb0a
Closes #601
What
Adds visibility for provider-level errors (rate limits, auth issues, context overflow, model-not-found) that were previously ignored in the ClawMetry dashboard.
How
- `/api/prompt-errors` endpoint in `routes/overview.py` that scans session JSONL files for `openclaw:prompt-error` custom events
Changes
- `routes/overview.py`: New `api_prompt_errors()` endpoint
- `dashboard.py`: Added `loadPromptErrors()` JS function + polling timer
- `clawmetry/static/js/app.js`: Added `loadPromptErrors()` function + timer
- `clawmetry/templates/tabs/overview.html`: Added prompt error banner UI
Acceptance
- `/api/prompt-errors` endpoint filters for openclaw:prompt-error events