feat(agent): harden hierarchical stress benchmark by JaimeCernuda · Pull Request #322 · iowarp/clio-agent

JaimeCernuda · 2026-05-24T08:26:29Z

Summary

Expands the real-provider CLIO benchmark into a 21-case hierarchical stress campaign covering compaction, provider/model swap, dirty data, tier-3/nanoagent, NDP/SAC, surfaced errors, and visualization artifacts.
Hardens the CLIO/GACT execution paths that the campaign exposed: transient-provider retries for expert dispatch/compaction, ARC-backed compaction memory with exact retained evidence, retained-context dispatch cleanup, explicit file precedence in analysis, nested NDP ownership, Parquet/statistical promotion to analysis, and visualization-intent promotion.
Regenerates docs/ALCF_DEMO_BENCHMARK_REPORT.md from saved evidence, including provider/model/settings, route graph, handoffs, tools/results, child sessions, artifacts, timings, fixes, and caveats.
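For readers unfamiliar with the "transient-provider retries" mentioned above, the general pattern can be sketched as follows. This is a minimal illustration, not the code in this PR: `TransientProviderError`, `call_with_retries`, and every parameter name are hypothetical, and exponential backoff with jitter is just one common choice of retry policy.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical marker for retryable provider failures (e.g. timeouts)."""


def call_with_retries(fn, *, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientProviderError with exponential backoff.

    `sleep` is injectable so the policy can be tested without real delays.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientProviderError:
            if attempt == attempts:
                # Retry budget spent: surface the error instead of hiding it.
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            # Jitter keeps concurrent callers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Re-raising after the last attempt keeps persistent failures visible to the caller, which fits a benchmark that counts surfaced errors as an expected outcome.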

Real-provider benchmark evidence

Command:

uv run python scripts/run_demo_benchmark.py --base-url http://127.0.0.1:17983 --case-delay-s 5 --require-stress-criteria --output-jsonl tmp/clio-demo-benchmark-alcf-metis-20260524-stress-final4.jsonl --report docs/ALCF_DEMO_BENCHMARK_REPORT.md

Result:

21 cases total
19 clean passes
2 expected surfaced errors
0 partial recoveries
0 failures
stress audit passed: 10/10 complex demos, 6/5 long/high-event, 5/3 tier-3/nanoagent, 5/3 visualization artifacts, 2/2 surfaced errors, 1/1 compaction, 1/1 provider/model swap
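The audit line above can be read as a set of threshold checks. The sketch below assumes each `a/b` pair means "a observed against a required minimum of b", which is an interpretation rather than something the PR states. The dictionary keys and the `audit` helper are hypothetical names, not the ones used in scripts/run_demo_benchmark.py.

```python
# Counts copied from the result line; the observed/required reading is an assumption.
OBSERVED = {
    "complex_demos": 10,
    "long_high_event": 6,
    "tier3_nanoagent": 5,
    "visualization_artifacts": 5,
    "surfaced_errors": 2,
    "compaction": 1,
    "provider_model_swap": 1,
}
REQUIRED = {
    "complex_demos": 10,
    "long_high_event": 5,
    "tier3_nanoagent": 3,
    "visualization_artifacts": 3,
    "surfaced_errors": 2,
    "compaction": 1,
    "provider_model_swap": 1,
}


def audit(observed, required):
    """Return {criterion: (observed, minimum)} for every criterion that falls short."""
    return {
        name: (observed.get(name, 0), minimum)
        for name, minimum in required.items()
        if observed.get(name, 0) < minimum
    }


shortfalls = audit(OBSERVED, REQUIRED)
print("stress audit passed" if not shortfalls else f"stress audit failed: {shortfalls}")
```

Under this reading, exceeding a minimum (for example 6/5) still passes, and only a count below its threshold would fail the `--require-stress-criteria` gate.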

Verification

uv run ruff format scripts/run_demo_benchmark.py
uv run ruff check scripts/run_demo_benchmark.py
uv run python -m py_compile scripts/run_demo_benchmark.py
uv run python scripts/run_demo_benchmark.py --render-existing-jsonl tmp/clio-demo-benchmark-alcf-metis-20260524-stress-final4.jsonl --report tmp/clio-demo-benchmark-report-render-check.md
uv run pytest tests/ -q -> 1180 passed, 37 skipped in 136.01s

Caveats

The JSONL evidence file lives under ignored tmp/; the committed markdown report references it and contains the summarized evidence.
bar_chart_status.png was generated locally by the run and left uncommitted as a benchmark artifact, not a source file.
ALCF latency/token freshness may differ on future runs; the report records the provider/model/settings observed for this run.
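Because the JSONL evidence is not committed, anyone re-running the benchmark may want a quick way to compare their own file against the totals in the report. The sketch below tallies one field per record; the `status` field name and the `tally_statuses` helper are assumptions for illustration, since the record schema is not shown in this PR.

```python
import json
from collections import Counter


def tally_statuses(lines, field="status"):
    """Count values of `field` across JSONL records, skipping blank lines.

    `field` is a hypothetical key; adjust it to the real record schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(field, "unknown")] += 1
    return counts
```

A file object can be passed directly, for example `with open(path) as f: print(tally_statuses(f))`, where `path` points at a locally generated JSONL file.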

JaimeCernuda force-pushed the feat/stress-benchmark-audit-20260524 branch 2 times, most recently from 974ffa9 to ce9f0a3 on May 24, 2026 09:00

feat(agent): harden hierarchical stress benchmark

0c12056

JaimeCernuda force-pushed the feat/stress-benchmark-audit-20260524 branch from ce9f0a3 to 0c12056 on May 24, 2026 09:09

JaimeCernuda merged commit 058bb26 into develop May 24, 2026
1 check passed

JaimeCernuda deleted the feat/stress-benchmark-audit-20260524 branch May 24, 2026 09:13

JaimeCernuda merged 1 commit into develop from feat/stress-benchmark-audit-20260524
