iowarp · JaimeCernuda · May 24, 2026 · May 24, 2026
diff --git a/.gitignore b/.gitignore
@@ -45,6 +45,7 @@ tool_cache/
 
 # Tool-generated files
 tool_output/
+.clio-agent-artifacts/
 *.tool.log
 mcp_*.log
 

diff --git a/TASK.md b/TASK.md
@@ -1,5 +1,20 @@
 # CLIO/GACT Provider Selector Polish Tasks
 
+## Current Resume Anchor
+
+- [ ] Active benchmark objective: build a true hierarchical CLIO stress benchmark campaign, not another smoke suite. Target natural scientific prompts where the orchestrator delegates to scoped experts, experts delegate to tier-3/nanoagents, tools are owned by expert tags/visibility, and results flow through discovery/staging -> format inspection -> analysis -> visualization with explicit evidence, artifacts, timings, and surfaced errors. Do not hardcode benchmark-specific routes, prompts, or fallback answers.
+- [ ] Current branch: `feat/hierarchical-demo-runner-evidence-20260524`. Uncommitted work after the latest ALCF run: `scripts/run_demo_benchmark.py`, `docs/ALCF_DEMO_BENCHMARK_REPORT.md`, and generated artifact `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. Inspect artifact size/path before deciding whether to commit it; avoid accidentally committing generated evidence unless intentional.
+- [x] Recently merged hierarchy slices into `develop`: PR #315 added registry hierarchy metadata and exposed tier-3 agents; PR #316 extracted executable `NDPExpert` and `SACFormatExpert`; PR #317 added `ExpertHandoff` trace metadata, GACT message metadata for `expert_handoffs`, and benchmark report support for observed handoff graphs. Full Python suite after #317: `1154 passed, 37 skipped`.
+- [x] Latest live ALCF/Metis evidence run used backend `http://127.0.0.1:17961`, provider `argonne`, API base `https://inference-api.alcf.anl.gov/resource_server/metis/api/v1`, model `gpt-oss-120b`, planner temperature `0`, max tokens `4096`, turn timeout `900s`, allowed root set to the repo. Smoke prompt through GACT answered "Paris is the capital of France." with `error_info=null`, confirming real ALCF inference through CLIO/GACT.
+- [x] Earlier demo runner execution on `17961` reported `15/15` using the old pass logic, but audit found this was too permissive because partial recovery metadata was counted as a normal pass. The current stricter run is the `17962` evidence below.
+- [x] Strongest current demo evidence: `ndp_seismic_waveform_to_plot` selected `visualization`, recorded handoffs `ndp_catalog`, `analysis`, `sac_format`, `visualization`, called NDP search/detail/stage tools plus `sac_inspect_archive`, `sac_compute_trace_statistics`, and `sac_plot_traces`, staged `Pachhai_etal_2023_ScP_data.tar`, found 11260 SAC traces, sampled/visualized traces, and wrote `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`.
+- [ ] Do not call the benchmark objective complete yet. The current 15-case run is useful evidence, but still too shallow for the original goal: most cases are short, only one case clearly has >10 combined tool/handoff events, no multi-minute stress case completed, no context-pressure/compaction case, no provider/model swap during active work, no large dirty data memory stress beyond current fixtures, and direct tool-action cases still lack rich handoff graph evidence.
+- [x] Inspected pass cases with `error_info` and tightened the benchmark runner: surfaced partial-recovery metadata is now outcome `partial`, not `pass`; expected missing-file errors remain `expected_error`.
+- [x] Direct planner-selected tool actions now record owning expert handoff events, not only nested expert dispatches. Current report shows direct HDF5/Parquet/visualization ownership evidence as counted handoff events.
+- [ ] Next benchmark expansion target: add deeper collaborator-grade prompts that stress hidden-task generality: EarthScope/NDP discovery with bounded waveform staging, local file search plus provider discovery, multi-format experiment audit, many parallel tool/nanoagent calls, context pressure/compaction, large-file refusal/memory safety, provider/model swap while work is active, and deliberate unavailable-resource/error-surfacing cases. The best 10 documented demos should be complex enough for external collaborators, not just 30-second route checks.
+- [ ] 2026-05-24 stricter ALCF rerun from current branch on `http://127.0.0.1:17962` with `CLIO_AGENT_MAX_STEPS=12` and `--case-delay-s 5`: `13/15` clean passes, `1` expected surfaced error, `1` partial recovery, `0` hard failures. HDF5 overview became clean after the higher step budget. Direct tool actions now record owner handoff events, so the report shows evidence such as `data x8`, `analysis x5`, and `visualization`. Remaining partial: `workflow_memory_followup` completed a Parquet schema observation and synthesized visible text, but planner continuation hit `litellm.RateLimitError: Tokens/minute limit exceeded`; report now labels it `partial`, not pass.
+- [ ] ALCF provider readiness reporting still has an inconsistency: `/v1/providers/lm` presets report Metis/Sophia `ready` with `Globus token validated`, while `/v1/health` reports LM `degraded` with `ALCF Globus token stored; validate before use`. The TUI/provider status should use the validated provider state, not stale conservative health text.
+
 ## Open Issues
 
 - [ ] Current ALCF demo benchmark is only a smoke/demo baseline, not a true CLIO hierarchical stress benchmark. Future benchmark work must target hierarchical intelligence: orchestrator -> scoped experts -> tier-3/nanoagents -> cross-expert result handoffs -> visible tool evidence -> artifacts/errors. Add and run complex workflows such as NDP seismic discovery -> staged dataset -> three-axis analysis -> visualization, mixed HDF5/BP5/Parquet/CSV experiment audit, dirty tabular quality review, context-pressure/compaction, large-file memory safety, provider/model swap during active work, and tool-ownership boundary tests. NDP discovery should be owned by `data` or a nested `ndp_catalog` agent, with `analysis` consuming discovered/staged data rather than directly owning NDP search. Benchmark completion requires at least ten human-demoable complex workflows, multiple >2 minute or >10-event runs, tier-3/nanoagent coverage, plotted artifacts, deliberate surfaced failures, and saved evidence for route/expert/tool/artifact/timing/error behavior. See `docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md`.