feat(agent): harden hierarchical stress benchmark #322

Merged
JaimeCernuda merged 1 commit into develop from feat/stress-benchmark-audit-20260524 on May 24, 2026

@JaimeCernuda
Collaborator

Summary

  • expands the real-provider CLIO benchmark into a 21-case hierarchical stress campaign with compaction, provider/model swap, dirty data, tier-3/nanoagent, NDP/SAC, surfaced-error, and visualization artifact coverage
  • hardens CLIO/GACT execution paths found during the campaign: transient-provider retries for expert dispatch/compaction, ARC-backed compaction memory with exact retained evidence, retained-context dispatch cleanup, explicit file precedence in analysis, nested NDP ownership, Parquet/statistical promotion to analysis, and visualization-intent promotion
  • regenerates docs/ALCF_DEMO_BENCHMARK_REPORT.md from saved evidence with provider/model/settings, route graph, handoffs, tools/results, child sessions, artifacts, timings, fixes, and caveats
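
The transient-provider retries mentioned above could look roughly like the sketch below. This is not the code from this PR: `TransientProviderError`, `call_with_retries`, and every parameter name are hypothetical, and exponential backoff with jitter is an assumed policy chosen for illustration.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical marker for retryable provider failures (timeouts, HTTP 5xx, ...)."""


def call_with_retries(fn, *, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call fn(), retrying only on TransientProviderError.

    Assumed policy: exponential backoff (base_delay_s * 2**n) plus random jitter.
    The final failure is re-raised so the error is surfaced, not swallowed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientProviderError:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the helper testable without real delays. Any exception other than the transient marker propagates immediately, which matches the idea of keeping expected surfaced errors visible.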

Real-provider benchmark evidence

Command:

uv run python scripts/run_demo_benchmark.py --base-url http://127.0.0.1:17983 --case-delay-s 5 --require-stress-criteria --output-jsonl tmp/clio-demo-benchmark-alcf-metis-20260524-stress-final4.jsonl --report docs/ALCF_DEMO_BENCHMARK_REPORT.md

Result:

  • 21 cases total
  • 19 clean passes
  • 2 expected surfaced errors
  • 0 partial recoveries
  • 0 failures
  • stress audit passed (observed/required): 10/10 complex demos, 6/5 long/high-event, 5/3 tier-3/nanoagent, 5/3 visualization artifacts, 2/2 surfaced errors, 1/1 compaction, 1/1 provider/model swap
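
As a sanity check on the numbers above, here is a minimal sketch of a threshold audit. It assumes each `a/b` pair means observed count over required minimum; the dictionary keys and the `audit` function are hypothetical and are not taken from `scripts/run_demo_benchmark.py`.

```python
# Counts copied from the result list above; reading "a/b" as observed/required
# is an assumption.
OUTCOMES = {
    "clean_pass": 19,
    "expected_surfaced_error": 2,
    "partial_recovery": 0,
    "failure": 0,
}

REQUIRED = {
    "complex_demos": 10,
    "long_high_event": 5,
    "tier3_nanoagent": 3,
    "visualization_artifacts": 3,
    "surfaced_errors": 2,
    "compaction": 1,
    "provider_model_swap": 1,
}

OBSERVED = {
    "complex_demos": 10,
    "long_high_event": 6,
    "tier3_nanoagent": 5,
    "visualization_artifacts": 5,
    "surfaced_errors": 2,
    "compaction": 1,
    "provider_model_swap": 1,
}


def audit(observed, required):
    """Return {criterion: True/False}; a criterion passes when observed >= required."""
    return {name: observed.get(name, 0) >= minimum for name, minimum in required.items()}


if __name__ == "__main__":
    results = audit(OBSERVED, REQUIRED)
    print("total cases:", sum(OUTCOMES.values()))
    print("stress audit passed:", all(results.values()))
```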

Verification

  • uv run ruff format scripts/run_demo_benchmark.py
  • uv run ruff check scripts/run_demo_benchmark.py
  • uv run python -m py_compile scripts/run_demo_benchmark.py
  • uv run python scripts/run_demo_benchmark.py --render-existing-jsonl tmp/clio-demo-benchmark-alcf-metis-20260524-stress-final4.jsonl --report tmp/clio-demo-benchmark-report-render-check.md
  • uv run pytest tests/ -q -> 1180 passed, 37 skipped in 136.01s

Caveats

  • The JSONL evidence file lives under ignored tmp/; the committed markdown report references it and contains the summarized evidence.
  • bar_chart_status.png was generated locally by the run and left uncommitted as a benchmark artifact, not a source file.
  • ALCF latency/token freshness may differ on future runs; the report records the provider/model/settings observed for this run.

@JaimeCernuda force-pushed the feat/stress-benchmark-audit-20260524 branch 2 times, most recently from 974ffa9 to ce9f0a3 (May 24, 2026 09:00)
@JaimeCernuda force-pushed the feat/stress-benchmark-audit-20260524 branch from ce9f0a3 to 0c12056 (May 24, 2026 09:09)
@JaimeCernuda merged commit 058bb26 into develop (May 24, 2026)
1 check passed
@JaimeCernuda deleted the feat/stress-benchmark-audit-20260524 branch (May 24, 2026 09:13)