iowarp · JaimeCernuda · May 24, 2026 · May 24, 2026
diff --git a/TASK.md b/TASK.md
@@ -10,6 +10,9 @@
   - [x] Nested expert framework slice: registry capabilities now support `parent_id`, `source`, and planner visibility; NDP catalog tools are owned by a tier-3 `ndp_catalog` child under `data`; SAC archive/statistics/plotting tools are owned by a tier-3 `sac_format` child under `analysis`; GACT `/v1/agents?tier=3` exposes these rows so the TUI can render hierarchy/details.
   - [x] Nested expert execution slice: extracted executable `NDPExpert` and `SACFormatExpert` classes with their own tool filters/result metadata, registered them as real child agents, and made Data/Analysis delegate to those child experts instead of directly owning NDP/SAC tool lists. Caveat: this is still one-process/shared-model execution; the long-term target remains independently configurable/finetunable child models and a more generic multi-level traversal policy.
   - [ ] Architecture warning for upcoming hidden user tests: do not mark benchmark-specific prompt/tool tuning as success. The framework must pass unseen scientific tasks by using generic registered experts, tool tags, artifact contracts, and visible handoffs. Avoid flat namespaces and route strings that only work for one named benchmark. The desired traversal is hierarchical: orchestrator -> data manager -> provider-specific discovery experts such as NDP/EarthScope/local search -> staging/download utility -> format-specific experts such as SAC -> analysis -> visualization. SAC tools should be a narrowly named SAC MCP/tool surface, not a generic seismic catch-all; NDP should become a nested expert with its own prompt/context and future model boundary. Benchmark prompts should be inspired by real scientific workflows, including clio-kit/NDP/EarthScope style tasks, but implementation must remain scalable and reusable.
+  - [ ] Compaction anchor for the next benchmark/hierarchy pass: the user will test hidden scientific tasks, so success requires a generic hierarchy, not a demo-tuned path. The target pattern is `orchestrator -> data -> ndp_catalog/EarthScope/local_search -> utility staging/download -> analysis -> sac_format or other format experts -> visualization`, with each expert receiving scoped context and returning explicit evidence/artifacts/errors. `data` should coordinate discovery and staging rather than directly owning every provider's domain semantics; `ndp_catalog` should live under `data` with NDP-specific prompt/context/tools and eventually an independent model boundary. A seismic workflow should be represented through format/provider specialists such as a SAC MCP/expert, not a broad hardcoded `seismic_server` that exists only for one benchmark.
+  - [ ] Benchmark design rule: complex prompts should be natural scientific requests, not instructions that spell out CLIO internals such as "spawn nanoagents" or "use NDP then SAC". Good benchmark cases should force the system to infer delegation, issue multiple tool calls, use nested experts, stage or reject real data, pass discoveries across expert boundaries, produce plots when appropriate, and surface failures honestly. Include stressors that mocks miss: parallel calls, many tool events, large/bad files, context pressure/compaction, provider/model swaps during active work, unavailable resources, permission boundaries, and hidden tool-ownership mistakes.
+  - [ ] NDP/EarthScope direction: prioritize real scientific data workflows, especially EarthScope/NDP-style seismic discovery. A representative target is: find a relevant seismic dataset through NDP/EarthScope, inspect candidate resources, stage a bounded waveform file, analyze three-component or SAC traces through a format-specific expert, and visualize the result. `clio-kit` should be treated as a reference for provider semantics and MCP integration, but CLIO should expose reusable expert/tool boundaries instead of cloning benchmark-specific prompts or routes.
 - [x] No-guard cross-file routing could still fail after a correct planner attempt. ALCF/gpt-oss selected/attempted `analysis` for the four-file triage, but `_expert_file_compatibility_error()` rejected the analysis coordinator because the current-file context held only the first HDF5 path and ignored `coordinated_file_suffixes`. The compatibility check now evaluates question + file context together and allows registered coordinator experts for natural multi-file bundles. Evidence: first 14-case demo run failed `reasoning_cross_file_triage_nanoagents` with repeated compatibility errors; focused unit coverage added; rerun passed with `route_source=dspy`, selected `analysis`, six tool calls, and four tier-3 child sessions.
 - [x] Natural HDF5 dataset prompts required tool-shaped wording. A prompt like "Focus on plasma/electron_temperature..." should call `hdf5_analyze_dataset` without the user naming the tool. DataExpert now treats named dataset paths plus natural focus/chunk/statistics language as dataset-level analysis. Evidence: `test_natural_dataset_focus_uses_dataset_tool` and ALCF demo case `hdf5_dataset_focus` passed with `hdf5_analyze_dataset`.
 - [x] Memory demo prompt was too weak and allowed a chat answer with no fresh evidence. The demo runner now asks CLIO to compute schema/statistics while relying on the prior Parquet path, producing a real memory + tool-use case. Evidence: first run failed `workflow_memory_followup` as `chat` with no tools; rerun passed as `analysis` with `parquet_analyze_schema` plus five `parquet_compute_statistics` calls.
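
Reviewer note on the nested-expert items in this hunk: the `parent_id` ownership described above can be sketched as a small registry walk. This is a minimal illustration, not CLIO's actual registry code. The `ExpertRow` class, its field defaults, the `ancestry`/`children` helpers, and the tier numbers for `orchestrator`, `data`, `analysis`, and `visualization` are all assumptions; only the `parent_id`/`source` fields, the tier-3 status of `ndp_catalog` and `sac_format`, and their parents come from the task notes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExpertRow:
    """Hypothetical registry row; field names mirror the task notes."""
    expert_id: str
    tier: int
    parent_id: Optional[str] = None
    source: str = "builtin"        # assumed default value
    planner_visible: bool = True   # assumed default value


# Assumed layout: only the tier-3 rows and their parents come from the notes.
REGISTRY: Dict[str, ExpertRow] = {
    row.expert_id: row
    for row in [
        ExpertRow("orchestrator", tier=1),
        ExpertRow("data", tier=2, parent_id="orchestrator"),
        ExpertRow("analysis", tier=2, parent_id="orchestrator"),
        ExpertRow("visualization", tier=2, parent_id="orchestrator"),
        ExpertRow("ndp_catalog", tier=3, parent_id="data"),
        ExpertRow("sac_format", tier=3, parent_id="analysis"),
    ]
}


def ancestry(expert_id: str, registry: Dict[str, ExpertRow] = REGISTRY) -> List[str]:
    """Return the root-to-expert path by following parent_id links."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = expert_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in parent_id chain at {current!r}")
        if current not in registry:
            raise KeyError(f"unregistered expert {current!r}")
        seen.add(current)
        path.append(current)
        current = registry[current].parent_id
    return list(reversed(path))


def children(parent_id: str, registry: Dict[str, ExpertRow] = REGISTRY) -> List[str]:
    """Return the sorted ids of experts directly owned by parent_id."""
    return sorted(r.expert_id for r in registry.values() if r.parent_id == parent_id)
```

Under these assumptions, `ancestry("ndp_catalog")` walks up through `data` to the orchestrator. Because ownership is derived from registered rows rather than route strings, a new provider expert could be added without touching the traversal.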

diff --git a/docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md b/docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md
@@ -94,17 +94,21 @@ Current implementation evidence:
   `sac_compute_trace_statistics`, and `sac_plot_traces`.
 - The format tool surface is deliberately SAC-specific. It is exposed as a
   `sac` FastMCP server with `sac_*` tools, not as a generic seismic namespace.
+- CLIO now emits `expert_handoffs` metadata on GACT assistant messages and the
+  stress audit log. This is required evidence for staged workflows whose final
+  public route is `visualization` but whose work actually traversed `data`,
+  `ndp_catalog`, `analysis`, `sac_format`, and `visualization`.
 - Caveat: the completed staged waveform demo is SAC archive based. The original
   Salton Sea three-component MiniSEED path remains a future target because the
   discovered OSDF resource is large and requires a bounded Pelican/object
   selection path.
-- Architecture caveat: this implementation proves data-owned NDP discovery, but
-  NDP semantics still live inside the top-level DataExpert. The intended CLIO
-  hierarchy is `data -> ndp_catalog` or `data -> ndp_access`, where the nested
-  NDP expert owns NDP-specific prompt context, tools, dataset/resource ranking,
-  and eventually its own tuned model. Future benchmarks should include
-  EarthScope-oriented prompts and verify that NDP work is delegated to that
-  nested expert rather than handled directly by DataExpert.
+- Architecture caveat: `ndp_catalog` and `sac_format` are now executable nested
+  experts under `data` and `analysis`, respectively, but they still run in the
+  same process and usually share the same provider/model. The intended long-term
+  CLIO hierarchy keeps their prompt/context/tool surfaces separate and eventually
+  lets each nested expert use its own tuned model. Future benchmarks should
+  include EarthScope-oriented prompts and verify that NDP work is delegated to
+  `ndp_catalog` rather than handled directly by DataExpert.
 
 ### 2. Mixed Scientific Run Audit
 
@@ -233,7 +237,8 @@ Every benchmark run should save:
 - Prompt and scenario ID.
 - Provider/model/context settings.
 - Route decision and route source.
-- Expert handoff graph.
+- Expert handoff graph from `metadata.expert_handoffs`; final selected route is
+  not sufficient evidence for hierarchy.
 - Per-expert context summary.
 - Tool calls with arguments, results, errors, and duration.
 - Child/nanoagent sessions and their status.
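
Reviewer note on the handoff-graph requirement: a benchmark check could compare the experts recorded in `metadata.expert_handoffs` against the hierarchy a scenario expects, instead of trusting the final route. The sketch below assumes each handoff entry is a dict with `from` and `to` keys and that the message carries a top-level `route` field. Neither shape is confirmed by this diff, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def traversed_experts(message: Dict[str, Any]) -> List[str]:
    """Collect expert names from metadata.expert_handoffs in first-seen order.

    Assumes each handoff is a dict with "from" and "to" keys (hypothetical shape).
    """
    handoffs = message.get("metadata", {}).get("expert_handoffs", [])
    seen: List[str] = []
    for handoff in handoffs:
        for name in (handoff.get("from"), handoff.get("to")):
            if name and name not in seen:
                seen.append(name)
    return seen


def missing_experts(message: Dict[str, Any], required: Iterable[str]) -> List[str]:
    """Return the required experts that never appear in the handoff graph."""
    seen = set(traversed_experts(message))
    return [name for name in required if name not in seen]


REQUIRED = ["data", "ndp_catalog", "analysis", "sac_format", "visualization"]

# Hypothetical staged-workflow message: the public route is only "visualization".
staged = {
    "route": "visualization",
    "metadata": {
        "expert_handoffs": [
            {"from": "orchestrator", "to": "data"},
            {"from": "data", "to": "ndp_catalog"},
            {"from": "orchestrator", "to": "analysis"},
            {"from": "analysis", "to": "sac_format"},
            {"from": "orchestrator", "to": "visualization"},
        ]
    },
}

# Same final route, but no recorded handoffs: hierarchy is unproven.
flat = {"route": "visualization", "metadata": {}}
```

Both messages report the same final route, yet only `staged` passes `missing_experts(staged, REQUIRED)` with an empty result. This is why the saved artifacts need the handoff graph rather than the route alone.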