diff --git a/.gitignore b/.gitignore
index 44f6538..bf7f920 100644
--- a/.gitignore
+++ b/.gitignore
@@ -45,6 +45,7 @@ tool_cache/
 
 # Tool-generated files
 tool_output/
+.clio-agent-artifacts/
 *.tool.log
 mcp_*.log
 
diff --git a/TASK.md b/TASK.md
index bee03e4..f956a9e 100644
--- a/TASK.md
+++ b/TASK.md
@@ -1,5 +1,20 @@
 # CLIO/GACT Provider Selector Polish Tasks
 
+## Current Resume Anchor
+
+- [ ] Active benchmark objective: build a true hierarchical CLIO stress benchmark campaign, not another smoke suite. Target natural scientific prompts where the orchestrator delegates to scoped experts, experts delegate to tier-3/nanoagents, tools are owned by expert tags/visibility, and results flow through discovery/staging -> format inspection -> analysis -> visualization with explicit evidence, artifacts, timings, and surfaced errors. Do not hardcode benchmark-specific routes, prompts, or fallback answers.
+- [ ] Current branch: `feat/hierarchical-demo-runner-evidence-20260524`. Uncommitted work after the latest ALCF run: `scripts/run_demo_benchmark.py`, `docs/ALCF_DEMO_BENCHMARK_REPORT.md`, and generated artifact `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. Inspect artifact size/path before deciding whether to commit it; avoid accidentally committing generated evidence unless intentional.
+- [x] Recently merged hierarchy slices into `develop`: PR #315 added registry hierarchy metadata and exposed tier-3 agents; PR #316 extracted executable `NDPExpert` and `SACFormatExpert`; PR #317 added `ExpertHandoff` trace metadata, GACT message metadata for `expert_handoffs`, and benchmark report support for observed handoff graphs. Full Python suite after #317: `1154 passed, 37 skipped`.
+- [x] Latest live ALCF/Metis evidence run used backend `http://127.0.0.1:17961`, provider `argonne`, API base `https://inference-api.alcf.anl.gov/resource_server/metis/api/v1`, model `gpt-oss-120b`, planner temperature `0`, max tokens `4096`, turn timeout `900s`, allowed root set to the repo. Smoke prompt through GACT answered "Paris is the capital of France." with `error_info=null`, confirming real ALCF inference through CLIO/GACT.
+- [x] Earlier demo runner execution on `17961` reported `15/15` using the old pass logic, but audit found this was too permissive because partial recovery metadata was counted as a normal pass. The current stricter run is the `17962` evidence below.
+- [x] Strongest current demo evidence: `ndp_seismic_waveform_to_plot` selected `visualization`, recorded handoffs `ndp_catalog`, `analysis`, `sac_format`, `visualization`, called NDP search/detail/stage tools plus `sac_inspect_archive`, `sac_compute_trace_statistics`, and `sac_plot_traces`, staged `Pachhai_etal_2023_ScP_data.tar`, found 11260 SAC traces, sampled/visualized traces, and wrote `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`.
+- [ ] Do not call the benchmark objective complete yet. The current 15-case run is useful evidence, but it is still too shallow for the original goal: most cases are short, only one case clearly has >10 combined tool/handoff events, no multi-minute stress case completed, no context-pressure/compaction case ran, no provider/model swap happened during active work, there was no large dirty-data memory stress beyond current fixtures, and direct tool-action cases still lack rich handoff graph evidence.
+- [x] Inspected pass cases with `error_info` and tightened the benchmark runner: surfaced partial-recovery metadata is now outcome `partial`, not `pass`; expected missing-file errors remain `expected_error`.
+- [x] Direct planner-selected tool actions now record owning expert handoff events, not only nested expert dispatches. Current report shows direct HDF5/Parquet/visualization ownership evidence as counted handoff events.
+- [ ] Next benchmark expansion target: add deeper collaborator-grade prompts that stress hidden-task generality: EarthScope/NDP discovery with bounded waveform staging, local file search plus provider discovery, multi-format experiment audit, many parallel tool/nanoagent calls, context pressure/compaction, large-file refusal/memory safety, provider/model swap while work is active, and deliberate unavailable-resource/error-surfacing cases. The best 10 documented demos should be complex enough for external collaborators, not just 30-second route checks.
+- [ ] 2026-05-24 stricter ALCF rerun from current branch on `http://127.0.0.1:17962` with `CLIO_AGENT_MAX_STEPS=12` and `--case-delay-s 5`: `13/15` clean passes, `1` expected surfaced error, `1` partial recovery, `0` hard failures. HDF5 overview became clean after the higher step budget. Direct tool actions now record owner handoff events, so the report shows evidence such as `data x8`, `analysis x5`, and `visualization`. Remaining partial: `workflow_memory_followup` completed a Parquet schema observation and synthesized visible text, but the planner continuation hit `litellm.RateLimitError: Tokens/minute limit exceeded`; the report now labels it `partial`, not `pass`.
+- [ ] ALCF provider readiness reporting still has an inconsistency: `/v1/providers/lm` presets report Metis/Sophia `ready` with `Globus token validated`, while `/v1/health` reports LM `degraded` with `ALCF Globus token stored; validate before use`. The TUI/provider status should use the validated provider state, not stale conservative health text.
+
 ## Open Issues
 
 - [ ] Current ALCF demo benchmark is only a smoke/demo baseline, not a true CLIO hierarchical stress benchmark. Future benchmark work must target hierarchical intelligence: orchestrator -> scoped experts -> tier-3/nanoagents -> cross-expert result handoffs -> visible tool evidence -> artifacts/errors. Add and run complex workflows such as NDP seismic discovery -> staged dataset -> three-axis analysis -> visualization, mixed HDF5/BP5/Parquet/CSV experiment audit, dirty tabular quality review, context-pressure/compaction, large-file memory safety, provider/model swap during active work, and tool-ownership boundary tests. NDP discovery should be owned by `data` or a nested `ndp_catalog` agent, with `analysis` consuming discovered/staged data rather than directly owning NDP search. Benchmark completion requires at least ten human-demoable complex workflows, multiple >2 minute or >10-event runs, tier-3/nanoagent coverage, plotted artifacts, deliberate surfaced failures, and saved evidence for route/expert/tool/artifact/timing/error behavior. See `docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md`.
diff --git a/docs/ALCF_DEMO_BENCHMARK_REPORT.md b/docs/ALCF_DEMO_BENCHMARK_REPORT.md
index cd7daed..7e20058 100644
--- a/docs/ALCF_DEMO_BENCHMARK_REPORT.md
+++ b/docs/ALCF_DEMO_BENCHMARK_REPORT.md
@@ -1,28 +1,29 @@
 # CLIO ALCF Demo Benchmark Report
 
-Generated: 2026-05-23 21:30:35 Central Daylight Time
-Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-hierarchical-ndp-data-v3.jsonl`
+Generated: 2026-05-24 00:12:33 Central Daylight Time
+Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-alcf-metis-handoff.jsonl`
 
-Result: 14/14 cases passed.
+Result: 13/15 clean passes, 1 expected surfaced error, 1 partial recovery, 0 failures.
 
 ## All Cases
 
-| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| workflow_hdf5_overview | tooling | auto | dspy | yes | data | hdf5_analyze_file, hdf5_list_datasets | 0 | 17.6s |
-| workflow_parquet_profile | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 20.1s |
-| workflow_memory_followup | memory | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 22.2s |
-| workflow_csv_event_schema | analysis | auto | dspy | yes | analysis | csv_read_table | 0 | 10.1s |
-| workflow_visual_dashboard | visualization | auto | dspy | yes | visualization | plot_summary | 0 | 8.1s |
-| hdf5_dataset_focus | tooling | auto | dspy | yes | data | hdf5_analyze_dataset | 0 | 7.1s |
-| cross_file_triage_nanoagents | multi-agent | auto | guard | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s |
-| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 2.5s |
-| adios_bp5_container | tooling | auto | guard | yes | data | adios_inspect_file | 0 | 0.5s |
-| reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | yes | data | adios_inspect_file | 0 | 6.5s |
-| dirty_parquet_quality | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 2.5s |
-| ndp_catalog_discovery | external-catalog | auto | dspy | yes | data | ndp_list_organizations, ndp_search_datasets | 0 | 8.5s |
-| visual_scatter_artifact | visualization | auto | dspy | yes | visualization | plot_scatter | 0 | 6.1s |
-| missing_hdf5_error | hardening | auto | dspy | yes | data | hdf5_list_datasets | 0 | 4.5s |
+| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| workflow_hdf5_overview | tooling | auto | dspy | pass | data | data x8 | hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset | 0 | 13.1s |
+| workflow_parquet_profile | analysis | auto | dspy | pass | analysis | analysis x5 | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 5.1s |
+| workflow_memory_followup | memory | auto | dspy | partial | analysis | analysis | parquet_analyze_schema | 0 | 36.3s |
+| workflow_csv_event_schema | analysis | auto | dspy | pass | analysis | analysis | csv_read_table | 0 | 3.5s |
+| workflow_visual_dashboard | visualization | auto | dspy | pass | visualization | visualization | plot_summary | 0 | 3.1s |
+| hdf5_dataset_focus | tooling | auto | dspy | pass | data | data | hdf5_analyze_dataset | 0 | 2.5s |
+| cross_file_triage_nanoagents | multi-agent | auto | guard | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s |
+| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 1.0s |
+| adios_bp5_container | tooling | auto | guard | pass | data | data | adios_inspect_file | 0 | 0.5s |
+| reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | pass | data | data | adios_inspect_file | 0 | 2.5s |
+| dirty_parquet_quality | analysis | auto | dspy | pass | analysis | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 1.5s |
+| ndp_catalog_discovery | external-catalog | auto | dspy | pass | ndp_catalog | ndp_catalog | ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource | 0 | 11.1s |
+| ndp_seismic_waveform_to_plot | hierarchical-science | auto | dspy | pass | visualization | ndp_catalog, analysis, sac_format, visualization | ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces | 0 | 20.6s |
+| visual_scatter_artifact | visualization | auto | dspy | pass | visualization | visualization | plot_scatter | 0 | 2.0s |
+| missing_hdf5_error | hardening | auto | dspy | expected_error | data | data | hdf5_list_datasets | 0 | 2.5s |
 
 ## Best 10 Demo Prompts
 
@@ -33,6 +34,7 @@ Category: multi-agent
 Routing mode: `auto`
 Status: pass
 Selected agent: `analysis`
+Expert handoffs: analysis
 Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table
 Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent
 Artifacts: none
@@ -65,7 +67,7 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar
 
 Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x.
 
-Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c
+Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf
 ```
 
 ### 2. No-guard cross-file triage
@@ -75,10 +77,11 @@ Category: planner-hardening
 Routing mode: `reasoning_only`
 Status: pass
 Selected agent: `analysis`
+Expert handoffs: analysis
 Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table
 Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent
 Artifacts: none
-Elapsed: 2.5s
+Elapsed: 1.0s
 
 Prompt:
 
@@ -107,77 +110,75 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar
 
 Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x.
 
-Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c
+Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf
 ```
 
-### 3. Memory follow-up without repeating path
+### 3. NDP seismic waveform discovery to plot
 
-Case: `workflow_memory_followup`
-Category: memory
+Case: `ndp_seismic_waveform_to_plot`
+Category: hierarchical-science
 Routing mode: `auto`
 Status: pass
-Selected agent: `analysis`
-Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
+Selected agent: `visualization`
+Expert handoffs: ndp_catalog, analysis, sac_format, visualization
+Tools: ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces
 Child sessions: none
-Artifacts: none
-Elapsed: 22.2s
+Artifacts: D:\Libraries\Documents\projects\clio-agent\.clio-agent-artifacts\charts\sac_traces_Pachhai_etal_2023_ScP_data.png
+Elapsed: 20.6s
 
 Prompt:
 
 ```text
-Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. Do not ask me for the path again.
+Find a bounded seismic waveform dataset from a seismological or Earth-science organization in the National Data Platform. Choose a usable resource, stage it if it is small enough, inspect the waveform content, compute representative trace statistics, and produce a plot artifact. If a candidate is too large or unavailable, surface that as the result instead of inventing a plot.
 ```
 
-What to see: CLIO resolves the previously profiled Parquet file from session context.
+What to see: CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform resource, analyzes SAC traces through sac_format, and creates a PNG plot.
 
-Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths.
+Why this is interesting: This is the core hierarchical science demo: provider discovery, data access, format-specific analysis, and visualization without the user naming internal agents.
 
 Observed excerpt:
 
 ```text
-**Parquet file:** `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet`
-
-| Property | Value |
-|----------|-------|
-| **Rows** | 3 000 |
-| **Row groups** | 8 |
-| **File size** | 154 818 bytes |
-| **Created by** | parquet‑cpp‑arrow v22.0.0 |
-| **Key‑value metadata** | `scenario = clio-stress-benchmark`<br>`source = deterministic` |
-| **Columns (10 total, 4 shown below)** | `sample_id (int64, nullable)`<br>`run_id (string, nullable)`<br>`site (string, nullable)`<br>`temperature_k (double, nullable)`<br>`pressure_pa (double, nullable)`<br>`humidity_pct (double, nullable)`<br>`vibration_mm_s (double, nullable)`<br>`anomaly_score (double, nullable)`<br>… (2 more columns not listed) |
+Data stage:
+Queried the National Data Platform catalog through clio-kit MCP.
 
-### Column‑level statistics (useful for anomaly triage)
+Organizations matched: 0
 
-| Column | Min | Max | Mean | Std‑dev | Median | Nulls | Unique values |
-|--------|-----|-----|------|-------
+Datasets matched: 9
+- Data for: Improved Characterization of Ultralow-velocity Zones through Advances in Bayesian Inversion of ScP Waveforms (marriott_library_hive; PDF, TAR, TXT; resources: Pachhai_etal_2023_ScP_data.tar)
+- Data for: A Compositional Component to the Samoa Ultralow-velocity Zone Revealed through 2- and 3-D  Waveform Modeling of SKS and SKKS Differential Travel-times and Amplitudes (marriott_library_hive; GZ, TXT; resources: 2021_Samoa_Data.tar.gz)
+- Salton Sea Seismic Data (ucr-earth-and-planetary-sciences; formats not listed; resources: Salton Sea Seismic Waveforms)
+- USGS_LPC_VA_Central_Seismic_2013_LAS_2015 (usgs-3dep; GEOJSON; resources: spatial_extents_USGS_LPC_VA_Central_Seismic_2013_LAS_2015.json)
+- Central U.S. ARRA Lidar, New Madrid Seismic Zone (opentopogra
 ```
 
-### 4. Targeted scatter plot
+### 4. HDF5 fusion file overview
 
-Case: `visual_scatter_artifact`
-Category: visualization
+Case: `workflow_hdf5_overview`
+Category: tooling
 Routing mode: `auto`
 Status: pass
-Selected agent: `visualization`
-Tools: plot_scatter
+Selected agent: `data`
+Expert handoffs: data x8
+Tools: hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset
 Child sessions: none
-Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png, D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png
-Elapsed: 6.1s
+Artifacts: none
+Elapsed: 13.1s
 
 Prompt:
 
 ```text
-Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal.
+I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention?
 ```
 
-What to see: Visualization expert chooses plot_scatter and saves a PNG artifact.
+What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression.
 
-Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool.
+Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis.
 
 Observed excerpt:
 
 ```text
-The scatter plot (saved at D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png) visualizes the relationship between vibration_mm_s (x‑axis) and anomaly_score (y‑axis), showing whether higher vibration values tend to correspond to higher anomaly scores.
+The file contains 7 datasets: axes/diagnostic_channel (12‑int32, no compression), axes/radius_norm (64‑float64, no compression), axes/time_hours (96‑float64, gzip‑4, chunk 24), diagnostics/heat_flux (96×12‑float32, gzip‑5, units MW/m²), plasma/density (96×64‑float64, gzip‑4, units m⁻³), plasma/electron_temperature (96×64‑float32, gzip‑6, units eV), quality/flags (96×12‑int8, gzip‑4); overall 5 of 7 datasets are gzip‑compressed with overall compression ratio ~1.12.
 ```
 
 ### 5. Parquet facility profile
@@ -187,10 +188,11 @@ Category: analysis
 Routing mode: `auto`
 Status: pass
 Selected agent: `analysis`
+Expert handoffs: analysis x5
 Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
 Child sessions: none
 Artifacts: none
-Elapsed: 20.1s
+Elapsed: 5.1s
 
 Prompt:
 
@@ -205,20 +207,63 @@ Why this is interesting: Checks statistical tool calls and model feedback from m
 Observed excerpt:
 
 ```text
-The file has 10 columns (including temperature_k, pressure_pa, humidity_pct, anomaly_score), 3000 rows, 8 row groups, and the columns show plausible ranges: temperature 267–321 K, pressure 98 kPa–104 kPa, humidity 4–96 %, anomaly_score 0.06–6.35 (mean 1.6), indicating sane data.
+The Parquet file has 10 columns (sample_id int64, run_id string, site string, temperature_k double, pressure_pa double, humidity_pct double, vibration_mm_s double, anomaly_score double, plus two others), 3000 rows across 8 row groups, and the temperature, pressure, humidity, and anomaly_score statistics fall within expected physical ranges, so they appear sane.
+```
+
+### 6. NDP catalog discovery
+
+Case: `ndp_catalog_discovery`
+Category: external-catalog
+Routing mode: `auto`
+Status: pass
+Selected agent: `ndp_catalog`
+Expert handoffs: ndp_catalog
+Tools: ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource
+Child sessions: none
+Artifacts: none
+Elapsed: 11.1s
+
+Prompt:
+
+```text
+Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download.
+```
+
+What to see: Data expert delegates discovery to NDP tools through the CLIO gateway.
+
+Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data.
+
+Observed excerpt:
+
+```text
+Queried the National Data Platform catalog through clio-kit MCP.
+
+Organizations matched: 3
+- noaa-global-systems-laboratory
+- noaa-national-centers-for-environmental-information-ncei
+- noaa-demo
+
+Datasets matched: 4
+- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network)
+- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010)
+- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit)
+- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data)
+
+Staging note: CLIO staged the selected NDP resource at D:\Libraries\Documents\projects\clio-
 ```
 
-### 6. Dirty Parquet quality review
+### 7. Dirty Parquet quality review
 
 Case: `dirty_parquet_quality`
 Category: analysis
 Routing mode: `auto`
 Status: pass
 Selected agent: `analysis`
+Expert handoffs: analysis
 Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
 Child sessions: none
 Artifacts: none
-Elapsed: 2.5s
+Elapsed: 1.5s
 
 Prompt:
 
@@ -247,35 +292,7 @@ site: null_count=0, unique_count=6
 temperature_k: min=267.0871843867271, max=320.74640790864, mean=293.98862688749205, median=294.1027798559147, std=7.517766411153003, null_count=18, unique_count=2982
 pressure_pa: min=98435.39430602924, max=103998.62554169081, mean=101230.9576727976, median=101229.54620674164, std=767.0242855536812, null_count=14, unique_count=2986
 
-Recommendatio
-```
-
-### 7. HDF5 fusion file overview
-
-Case: `workflow_hdf5_overview`
-Category: tooling
-Routing mode: `auto`
-Status: pass
-Selected agent: `data`
-Tools: hdf5_analyze_file, hdf5_list_datasets
-Child sessions: none
-Artifacts: none
-Elapsed: 17.6s
-
-Prompt:
-
-```text
-I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention?
-```
-
-What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression.
-
-Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis.
-
-Observed excerpt:
-
-```text
-The HDF5 file holds seven datasets—axes/diagnostic_channel (12 int32), axes/radius_norm (64 float64), axes/time_hours (96 float64), diagnostics/heat_flux (96×12 float32, units MW/m²), plasma/density (96×64 float64, units m⁻³), plasma/electron_temperature (96×64 float32, units eV), and quality/flags (96×12 int8)—with five of them compressed, giving a total stored size of 72 331 bytes versus 80 816 bytes raw (overall compression ratio ≈1.12).
+Recommendati
 ```
 
 ### 8. Follow-up visualization artifact
@@ -285,10 +302,11 @@ Category: visualization
 Routing mode: `auto`
 Status: pass
 Selected agent: `visualization`
+Expert handoffs: visualization
 Tools: plot_summary
 Child sessions: none
 Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png
-Elapsed: 8.1s
+Elapsed: 3.1s
 
 Prompt:
 
@@ -303,75 +321,89 @@ Why this is interesting: Shows multi-turn handoff from analysis to visualization
 Observed excerpt:
 
 ```text
-The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and provides a compact 2×2 summary of the facility_measurements Parquet file, visualizing the key columns (temperature, pressure, humidity, and anomaly_score) with histograms and overview plots of their distributions and relationships.
+The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and it presents a 2×2 summary of the facility measurements dataset, visualizing key numeric columns such as temperature, pressure, humidity, and anomaly score.
 ```
 
-### 9. NDP catalog discovery
+### 9. Targeted scatter plot
 
-Case: `ndp_catalog_discovery`
-Category: external-catalog
+Case: `visual_scatter_artifact`
+Category: visualization
 Routing mode: `auto`
 Status: pass
-Selected agent: `data`
-Tools: ndp_list_organizations, ndp_search_datasets
+Selected agent: `visualization`
+Expert handoffs: visualization
+Tools: plot_scatter
 Child sessions: none
-Artifacts: none
-Elapsed: 8.5s
+Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png
+Elapsed: 2.0s
 
 Prompt:
 
 ```text
-Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download.
+Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal.
 ```
 
-What to see: Data expert delegates discovery to NDP tools through the CLIO gateway.
+What to see: Visualization expert chooses plot_scatter and saves a PNG artifact.
 
-Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data.
+Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool.
 
 Observed excerpt:
 
 ```text
-Queried the National Data Platform catalog through clio-kit MCP.
-
-Organizations matched: 3
-- noaa-global-systems-laboratory
-- noaa-national-centers-for-environmental-information-ncei
-- noaa-demo
-
-Datasets matched: 4
-- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network)
-- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data)
-- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010)
-- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit)
-
-Recommendations:
-Treat these as discovery results owned by the data stage. Use ndp_get_datase
+The scatter plot (facility_measurements_scatter.png) visualizes vibration_mm_s versus anomaly_score, revealing whether higher vibration levels correspond to higher anomaly scores—typically a positive trend indicates that increased vibration is associated with greater anomaly likelihood.
 ```
 
-### 10. No-guard ADIOS/BP5 route
+### 10. Memory follow-up without repeating path
 
-Case: `reasoning_adios_bp5_container`
-Category: planner-hardening
-Routing mode: `reasoning_only`
-Status: pass
-Selected agent: `data`
-Tools: adios_inspect_file
+Case: `workflow_memory_followup`
+Category: memory
+Routing mode: `auto`
+Status: partial
+Selected agent: `analysis`
+Expert handoffs: analysis
+Tools: parquet_analyze_schema
 Child sessions: none
 Artifacts: none
-Elapsed: 6.5s
+Elapsed: 36.3s
 
 Prompt:
 
 ```text
-This ADIOS BP5 output came from a Gray-Scott run: "D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\gray scott noise 0.01 data.bp5". Tell me what the container looks like, whether profiling metadata is present, and what extra runtime is needed if variable-level metadata is unavailable.
+Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. Do not ask me for the path again.
 ```
 
-What to see: With routing guards disabled, the planner still selects the data expert.
+What to see: CLIO resolves the previously profiled Parquet file from session context.
 
-Why this is interesting: Checks that BP5 routing is not only a hardcoded suffix guard behavior.
+Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths.
 
 Observed excerpt:
 
 ```text
-The BP5 container is a directory containing five files (data.0, md.0, md.idx, mmd.0, profiling.json) totaling about 84 MB, it includes profiling metadata (profiling.json), and you need to install the ADIOS2 Python bindings/runtime to read variable‑level metadata.
+**Facility Measurements – Quick Anomaly‑Triaging Summary**
+
+| Item | Value |
+|------|-------|
+| **File** | `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet` |
+| **Rows** | 3 000 |
+| **Row groups** | 8 |
+| **File size** | 154 818 bytes |
+| **Created by** | parquet‑cpp‑arrow v22.0.0 |
+| **Key‑value metadata** | `scenario = clio-stress-benchmark`<br>`source = deterministic` |
+
+### Schema (10 columns)
+
+| Column | Type | Nullable | Role for Anomaly Triage |
+|--------|------|----------|--------------------------|
+| `sample_id` | int64 | Yes | Identifier |
+| `run_id` | string | Yes | Identifier |
+| `site` | string | Yes | Identifier / grouping |
+| `temperature_k` | double | Yes | Primary sensor metric |
+| `pressure_pa` | double | Yes | Primary sensor metric |
+| `humidity_pct` | double | Yes | Primary sensor metric |
+| `vibration_mm_s` | double
 ```
+
+## Partial Recovery Caveats
+
+- `workflow_memory_followup`: Agent planner failed after completed tool observations.
+  stage=post_observation_planning, tools=parquet_analyze_schema
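
Reviewer note: the caveat above is produced by the runner's split between partial and blocking errors. The sketch below restates that rule as a standalone script, assuming the stage set and the `details.partial` flag shown in the `scripts/run_demo_benchmark.py` hunks of this diff. The `classify` helper and the sample payloads are hypothetical and only illustrate the rule; they are not part of the change.

```python
from __future__ import annotations

from typing import Any

# Stages the runner treats as documented partial recovery (copied from this diff).
PARTIAL_STAGES = {
    "post_observation_planning",
    "parallel_validation_recovery",
    "step_limit_after_observations",
}


def partial_error(message: dict[str, Any]) -> dict[str, Any] | None:
    """Return error_info only when it describes a documented partial recovery."""
    error_info = message.get("error_info")
    if not isinstance(error_info, dict):
        return None
    details = error_info.get("details")
    if not isinstance(details, dict) or details.get("partial") is not True:
        return None
    return error_info if details.get("stage") in PARTIAL_STAGES else None


def blocking_error(message: dict[str, Any]) -> dict[str, Any] | None:
    """Return error_info for any error that is not a partial recovery."""
    error_info = message.get("error_info")
    if not isinstance(error_info, dict) or partial_error(message) is not None:
        return None
    return error_info


def classify(message: dict[str, Any]) -> str:
    """Hypothetical helper: name the error class of one turn."""
    if partial_error(message) is not None:
        return "partial"
    if blocking_error(message) is not None:
        return "blocking"
    return "clean"


# Illustrative payloads, not captured from a real run.
partial_turn = {
    "error_info": {
        "message": "Agent planner failed after completed tool observations.",
        "details": {"partial": True, "stage": "post_observation_planning"},
    }
}
blocking_turn = {
    "error_info": {"error": "provider_error", "details": {"stage": "answer_synthesis"}}
}
clean_turn = {"error_info": None}

for turn in (partial_turn, blocking_turn, clean_turn):
    print(classify(turn))
```

Under this rule a turn with `partial: true` but an unlisted stage is still treated as blocking, so a new recovery stage has to be added to the set before it is reported as a caveat.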
diff --git a/scripts/run_demo_benchmark.py b/scripts/run_demo_benchmark.py
index c96d5b5..20576bb 100644
--- a/scripts/run_demo_benchmark.py
+++ b/scripts/run_demo_benchmark.py
@@ -36,9 +36,10 @@ class DemoCase:
     expected: str
     session_group: str
     timeout_s: float = 480.0
-    expected_agent: str = ""
+    expected_agent: str | tuple[str, ...] = ""
     expected_tool_prefixes: tuple[str, ...] = ()
     expected_tools: tuple[str, ...] = ()
+    expected_handoff_agents: tuple[str, ...] = ()
     expected_terms: tuple[str, ...] = ()
     min_children: int = 0
     expects_error: bool = False
@@ -82,23 +83,62 @@ def artifacts(self) -> list[str]:
         """Return artifact path candidates found in tools or text."""
         return _artifact_paths(self.message)
 
+    @property
+    def expert_handoffs(self) -> list[dict[str, Any]]:
+        """Return expert handoff provenance metadata."""
+        return _expert_handoffs(self.message)
+
+    @property
+    def handoff_agent_ids(self) -> list[str]:
+        """Return unique expert IDs observed in handoff provenance."""
+        seen: set[str] = set()
+        ids: list[str] = []
+        for row in self.expert_handoffs:
+            agent_id = str(row.get("agent_id") or "")
+            if not agent_id or agent_id in seen:
+                continue
+            seen.add(agent_id)
+            ids.append(agent_id)
+        return ids
+
+    @property
+    def handoff_event_count(self) -> int:
+        """Return the number of recorded expert handoff events."""
+        return len(self.expert_handoffs)
+
     @property
     def blocking_error(self) -> dict[str, Any] | None:
-        """Return blocking error_info, allowing documented partial recovery."""
+        """Return non-partial error_info that fails a case unless it expects an error."""
         return _blocking_error(self.message)
 
+    @property
+    def partial_error(self) -> dict[str, Any] | None:
+        """Return surfaced partial-recovery metadata, if this turn had any."""
+        return _partial_error(self.message)
+
     @property
     def passed(self) -> bool:
         """Return whether this case satisfied its declared expectations."""
         if self.case.expects_error:
             return self.blocking_error is not None and not self.text.strip()
-        if self.blocking_error is not None:
+        if self.partial_error is not None:
             return False
-        if self.case.expected_agent and self.selected_agent != self.case.expected_agent:
+        if self.blocking_error is not None:
             return False
+        if self.case.expected_agent:
+            expected_agents = (
+                (self.case.expected_agent,)
+                if isinstance(self.case.expected_agent, str)
+                else self.case.expected_agent
+            )
+            if self.selected_agent not in expected_agents:
+                return False
         for expected_tool in self.case.expected_tools:
             if expected_tool not in self.tool_names:
                 return False
+        for expected_agent in self.case.expected_handoff_agents:
+            if expected_agent not in self.handoff_agent_ids:
+                return False
         for prefix in self.case.expected_tool_prefixes:
             if not any(name.startswith(prefix) for name in self.tool_names):
                 return False
@@ -110,12 +150,22 @@ def passed(self) -> bool:
             return False
         return True
 
+    @property
+    def outcome(self) -> str:
+        """Return the outcome category: pass, fail, partial, or expected_error."""
+        if self.case.expects_error:
+            return "expected_error" if self.passed else "fail"
+        if self.partial_error is not None:
+            return "partial"
+        return "pass" if self.passed else "fail"
+
     @property
     def complexity_score(self) -> int:
         """Score cases for the best-demo report."""
         return (
             len(set(self.tool_names)) * 3
             + len(self.tools)
+            + len(self.expert_handoffs) * 4
             + len(self.child_sessions) * 6
             + len(self.artifacts) * 4
             + len(self.case.complexity_tags) * 2
@@ -145,25 +195,36 @@ def _tools(message: dict[str, Any]) -> list[dict[str, Any]]:
     return rows if isinstance(rows, list) else []
 
 
+def _expert_handoffs(message: dict[str, Any]) -> list[dict[str, Any]]:
+    rows = (message.get("metadata") or {}).get("expert_handoffs") or []
+    return rows if isinstance(rows, list) else []
+
+
 def _tool_name(row: dict[str, Any]) -> str:
     return str(row.get("name") or row.get("tool") or "")
 
 
 def _blocking_error(message: dict[str, Any]) -> dict[str, Any] | None:
+    error_info = message.get("error_info")
+    if not isinstance(error_info, dict):
+        return None
+    if _partial_error(message) is not None:
+        return None
+    return error_info
+
+
+def _partial_error(message: dict[str, Any]) -> dict[str, Any] | None:
     error_info = message.get("error_info")
     if not isinstance(error_info, dict):
         return None
     details = error_info.get("details")
-    if (
-        isinstance(details, dict)
-        and details.get("partial") is True
-        and details.get("stage")
-        in {
-            "post_observation_planning",
-            "parallel_validation_recovery",
-            "step_limit_after_observations",
-        }
-    ):
+    if not isinstance(details, dict) or details.get("partial") is not True:
+        return None
+    if details.get("stage") not in {
+        "post_observation_planning",
+        "parallel_validation_recovery",
+        "step_limit_after_observations",
+    }:
         return None
     return error_info
 
@@ -240,13 +301,18 @@ def _case_row(result: DemoResult) -> dict[str, Any]:
         "session_id": result.session_id,
         "elapsed_s": round(result.elapsed_s, 3),
         "passed": result.passed,
+        "outcome": result.outcome,
         "selected_agent": result.selected_agent,
         "routing_decision": _routing_decision(result.message),
         "tools_called": result.tools,
         "tool_names": result.tool_names,
+        "expert_handoffs": result.expert_handoffs,
+        "handoff_agent_ids": result.handoff_agent_ids,
+        "handoff_event_count": result.handoff_event_count,
         "child_sessions": result.child_sessions,
         "artifacts": result.artifacts,
         "error_info": result.message.get("error_info"),
+        "partial_error": result.partial_error,
         "stop_reason": result.message.get("stop_reason"),
         "provider": result.provider,
         "routing_mode": result.case.routing_mode,
@@ -465,8 +531,9 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
             title="NDP catalog discovery",
             category="external-catalog",
             session_group="ndp",
-            expected_agent="data",
+            expected_agent=("data", "ndp_catalog"),
             expected_tool_prefixes=("ndp_",),
+            expected_handoff_agents=("ndp_catalog",),
             expected_terms=("dataset",),
             timeout_s=620.0,
             complexity_tags=("ndp", "clio-kit", "external-mcp"),
@@ -481,6 +548,42 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
                 "analysis consumes staged data."
             ),
         ),
+        DemoCase(
+            case_id="ndp_seismic_waveform_to_plot",
+            title="NDP seismic waveform discovery to plot",
+            category="hierarchical-science",
+            session_group="ndp_seismic",
+            expected_agent=("visualization", "analysis", "data", "ndp_catalog"),
+            expected_tool_prefixes=("ndp_", "sac_"),
+            expected_handoff_agents=("ndp_catalog", "sac_format"),
+            expected_terms=("SAC", ".png"),
+            timeout_s=900.0,
+            complexity_tags=(
+                "ndp",
+                "earthscience",
+                "tier-3",
+                "sac",
+                "data-analysis-visualization",
+                "artifact",
+            ),
+            prompt=(
+                "Find a bounded seismic waveform dataset from a seismological or "
+                "Earth-science organization in the National Data Platform. Choose a usable "
+                "resource, stage it if it is small enough, inspect the waveform content, "
+                "compute representative trace statistics, and produce a plot artifact. If a "
+                "candidate is too large or unavailable, surface that as the result instead "
+                "of inventing a plot."
+            ),
+            expected=(
+                "CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform "
+                "resource, analyzes SAC traces through sac_format, and creates a PNG plot."
+            ),
+            why=(
+                "This is the core hierarchical science demo: provider discovery, data "
+                "access, format-specific analysis, and visualization without the user naming "
+                "internal agents."
+            ),
+        ),
         DemoCase(
             case_id="visual_scatter_artifact",
             title="Targeted scatter plot",
@@ -536,7 +639,14 @@ def _create_sessions(http: httpx.Client, cases: list[DemoCase]) -> dict[str, str
     return session_ids
 
 
-def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path: Path) -> int:
+def run_benchmark(
+    base_url: str,
+    data_dir: Path,
+    output_jsonl: Path,
+    report_path: Path,
+    *,
+    case_delay_s: float = 0.0,
+) -> int:
     """Run demo cases and write JSONL plus a markdown report."""
     manifest = create_benchmark_data(data_dir)
     cases = _make_cases(manifest)
@@ -573,20 +683,25 @@ def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path
                 results.append(result)
                 log.write(json.dumps(_case_row(result), ensure_ascii=False, default=str) + "\n")
                 log.flush()
-                status = "PASS" if result.passed else "FAIL"
+                status = result.outcome.upper()
                 print(
                     f"  {status} agent={result.selected_agent or '-'} "
                     f"tools={','.join(result.tool_names) or '-'} "
                     f"children={len(result.child_sessions)} elapsed={elapsed_s:.1f}s",
                     flush=True,
                 )
+                if case_delay_s > 0 and index < len(cases):
+                    time.sleep(case_delay_s)
 
     report_path.write_text(_render_report(results, output_jsonl), encoding="utf-8")
     return 0 if all(result.passed for result in results) else 1
 
 
 def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
-    passed = sum(1 for result in results if result.passed)
+    clean_passes = sum(1 for result in results if result.outcome == "pass")
+    expected_errors = sum(1 for result in results if result.outcome == "expected_error")
+    partials = sum(1 for result in results if result.outcome == "partial")
+    failures = sum(1 for result in results if result.outcome == "fail")
     best = sorted(results, key=lambda result: result.complexity_score, reverse=True)[:10]
     lines = [
         "# CLIO ALCF Demo Benchmark Report",
@@ -594,12 +709,16 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
         f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S %Z')}",
         f"Evidence JSONL: `{output_jsonl}`",
         "",
-        f"Result: {passed}/{len(results)} cases passed.",
+        (
+            f"Result: {clean_passes}/{len(results)} clean passes, "
+            f"{expected_errors} expected surfaced errors, {partials} partial recoveries, "
+            f"{failures} failures."
+        ),
         "",
         "## All Cases",
         "",
-        "| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed |",
-        "| --- | --- | --- | --- | --- | --- | --- | --- | --- |",
+        "| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed |",
+        "| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |",
     ]
     for result in results:
         route_source = (_routing_decision(result.message).get("metadata") or {}).get(
@@ -613,8 +732,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
                     result.case.category,
                     result.case.routing_mode,
                     str(route_source),
-                    "yes" if result.passed else "no",
+                    result.outcome,
                     result.selected_agent or "-",
+                    _handoff_summary(result) or "-",
                     ", ".join(result.tool_names) or "-",
                     str(len(result.child_sessions)),
                     f"{result.elapsed_s:.1f}s",
@@ -626,6 +746,7 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
     lines.extend(["", "## Best 10 Demo Prompts", ""])
     for rank, result in enumerate(best, start=1):
         tool_text = ", ".join(result.tool_names) or "none"
+        handoff_text = _handoff_summary(result) or "none"
         artifact_text = ", ".join(result.artifacts) or "none"
         child_text = ", ".join(
             child.get("title", child.get("id", "")) for child in result.child_sessions
@@ -637,8 +758,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
                 f"Case: `{result.case.case_id}`",
                 f"Category: {result.case.category}",
                 f"Routing mode: `{result.case.routing_mode}`",
-                f"Status: {'pass' if result.passed else 'fail'}",
+                f"Status: {result.outcome}",
                 f"Selected agent: `{result.selected_agent or '-'}`",
+                f"Expert handoffs: {handoff_text}",
                 f"Tools: {tool_text}",
                 f"Child sessions: {child_text or 'none'}",
                 f"Artifacts: {artifact_text}",
@@ -663,10 +785,22 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
             ]
         )
 
-    failures = [result for result in results if not result.passed]
-    if failures:
+    partial_results = [result for result in results if result.outcome == "partial"]
+    if partial_results:
+        lines.extend(["## Partial Recovery Caveats", ""])
+        for result in partial_results:
+            details = (result.partial_error or {}).get("details") or {}
+            lines.extend(
+                [
+                    f"- `{result.case.case_id}`: {result.partial_error.get('message') if result.partial_error else 'partial recovery'}",
+                    f"  stage={details.get('stage')}, tools={', '.join(result.tool_names) or '-'}",
+                ]
+            )
+
+    failed_results = [result for result in results if result.outcome == "fail"]
+    if failed_results:
         lines.extend(["## Failures To Investigate", ""])
-        for result in failures:
+        for result in failed_results:
             lines.extend(
                 [
                     f"- `{result.case.case_id}`: expected {result.case.expected}",
@@ -678,6 +812,20 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
     return "\n".join(lines).strip() + "\n"
 
 
+def _handoff_summary(result: DemoResult) -> str:
+    """Return compact handoff display text with repeated event counts."""
+    counts: dict[str, int] = {}
+    for row in result.expert_handoffs:
+        agent_id = str(row.get("agent_id") or "")
+        if not agent_id:
+            continue
+        counts[agent_id] = counts.get(agent_id, 0) + 1
+    parts = [
+        agent_id if count == 1 else f"{agent_id} x{count}" for agent_id, count in counts.items()
+    ]
+    return ", ".join(parts)
+
+
 def main() -> None:
     """CLI entry point."""
     parser = argparse.ArgumentParser(description=__doc__)
@@ -704,6 +852,12 @@ def main() -> None:
         default=Path("docs/ALCF_DEMO_BENCHMARK_REPORT.md"),
         help="Output markdown report path.",
     )
+    parser.add_argument(
+        "--case-delay-s",
+        type=float,
+        default=0.0,
+        help="Optional cooldown in seconds between cases, for rate-limited real providers.",
+    )
     args = parser.parse_args()
     raise SystemExit(
         run_benchmark(
@@ -711,6 +865,7 @@ def main() -> None:
             args.data_dir.resolve(),
             args.output_jsonl.resolve(),
             args.report.resolve(),
+            case_delay_s=max(0.0, args.case_delay_s),
         )
     )
 
diff --git a/src/clio_agent/agent.py b/src/clio_agent/agent.py
index 1d024cc..ad7b135 100644
--- a/src/clio_agent/agent.py
+++ b/src/clio_agent/agent.py
@@ -564,6 +564,19 @@ def forward(
                 answer = ""
         except Exception as e:
             success = False
+            inferred_selected = self._selected_expert_from_trace(trace)
+            if selected == "chat" and inferred_selected != "chat":
+                selected = inferred_selected
+                route = RouteDecision(
+                    target=selected,
+                    source=route.source,
+                    reason=(
+                        f"{route.reason} Selected expert inferred from completed "
+                        "tool provenance after the turn failed."
+                    ).strip(),
+                    confidence=route.confidence,
+                    capabilities=route.capabilities,
+                )
             if isinstance(e, RoutingError):
                 error_info = self._with_recovery_actions(e.to_dict())
                 answer = ""
@@ -1745,8 +1758,22 @@ def _execute_tool_action(
                 )
             }
 
+        owner = self._selected_expert_for_tool(tool_name)
         if tool_name in visualization_tools:
-            return self._execute_visualization_tool(tool_name, visualization_tools[tool_name], args)
+            result = self._execute_visualization_tool(
+                tool_name,
+                visualization_tools[tool_name],
+                args,
+            )
+            self._record_direct_tool_handoff(
+                trace,
+                expert_id=owner,
+                tool_name=tool_name,
+                args=args,
+                result=result,
+                duration_ms=self._last_tool_duration_ms(trace, tool_name),
+            )
+            return result
 
         start = time.time()
         try:
@@ -1764,8 +1791,49 @@ def _execute_tool_action(
             duration_ms=duration_ms,
             ok=tool_result_ok(result),
         )
+        self._record_direct_tool_handoff(
+            trace,
+            expert_id=owner,
+            tool_name=tool_name,
+            args=args,
+            result=result,
+            duration_ms=duration_ms,
+        )
         return result
 
+    @staticmethod
+    def _last_tool_duration_ms(trace: RunTrace, tool_name: str) -> float:
+        """Return the latest recorded duration for a tool in an active trace."""
+        for observation in reversed(trace.tools):
+            if observation.tool == tool_name:
+                return observation.duration_ms
+        return 0.0
+
+    def _record_direct_tool_handoff(
+        self,
+        trace: RunTrace,
+        *,
+        expert_id: str,
+        tool_name: str,
+        args: Mapping[str, Any],
+        result: Any,
+        duration_ms: float,
+    ) -> None:
+        """Record the expert boundary for a planner-selected direct tool call."""
+        if not expert_id:
+            return
+        self._record_expert_handoff(
+            trace,
+            expert_id=expert_id,
+            dispatch_target=tool_name,
+            stage="direct_tool",
+            input_summary=f"{tool_name}({', '.join(sorted(str(key) for key in args))})",
+            result=result,
+            status="success" if tool_result_ok(result) else "failure",
+            duration_ms=duration_ms,
+            error=None if tool_result_ok(result) else self._coerce_text(result),
+        )
+
     def _tool_action_scope_error(
         self,
         tool_name: str,
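
Reviewer note: `_record_direct_tool_handoff` builds `input_summary` from the argument names only, never the values. The sketch below isolates that expression in a free function (`input_summary` here is a hypothetical name for illustration) to show what ends up in the handoff record.

```python
from collections.abc import Mapping
from typing import Any


def input_summary(tool_name: str, args: Mapping[str, Any]) -> str:
    """Format a tool call as name(sorted argument names), omitting values."""
    return f"{tool_name}({', '.join(sorted(str(key) for key in args))})"


print(input_summary("parquet_analyze_schema", {"filepath": "data.parquet"}))
# parquet_analyze_schema(filepath)

# "example_tool" is a made-up tool name; keys are sorted regardless of call order.
print(input_summary("example_tool", {"y": 1, "x": 2}))
# example_tool(x, y)
```

Because values are dropped, file paths and other argument contents do not appear in the summary; two calls to the same tool with different inputs produce identical summaries.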
diff --git a/tests/test_core/test_agent_planner.py b/tests/test_core/test_agent_planner.py
index 33b7e7d..7ee4b5e 100644
--- a/tests/test_core/test_agent_planner.py
+++ b/tests/test_core/test_agent_planner.py
@@ -14,7 +14,7 @@
 import pytest
 
 from clio_agent.agent import ClioAgent, cancellation_checker
-from clio_agent.errors import CancellationError, RoutingError
+from clio_agent.errors import CancellationError, ProviderError, RoutingError
 from clio_agent.harness import RouteDecision, RunTrace
 from clio_agent.registry.registry import AgentCapability
 from clio_agent.tools.execution import set_global_tool_observer
@@ -689,6 +689,67 @@ def test_shell_tool_still_runs_for_utility_diagnostics(self, agent):
         assert answer == "It is today."
         agent._execute_tool_action.assert_called_once()
 
+    def test_direct_tool_action_records_owner_handoff(self, agent):
+        trace = _trace()
+        agent._known_tool_names = MagicMock(return_value={"parquet_analyze_schema"})
+        agent.tool_executor.call_tool = MagicMock(
+            return_value={"filepath": "data.parquet", "num_rows": 3, "ok": True}
+        )
+
+        result = agent._execute_tool_action(
+            "parquet_analyze_schema",
+            {"filepath": "data.parquet"},
+            trace,
+        )
+
+        assert result["num_rows"] == 3
+        assert trace.tools[0].tool == "parquet_analyze_schema"
+        assert len(trace.expert_handoffs) == 1
+        handoff = trace.expert_handoffs[0]
+        assert handoff.agent_id == "analysis"
+        assert handoff.dispatch_target == "parquet_analyze_schema"
+        assert handoff.stage == "direct_tool"
+        assert handoff.status == "success"
+
+    def test_provider_error_after_tool_keeps_inferred_expert(self, agent):
+        agent._plan_next_action = MagicMock(
+            side_effect=[
+                {
+                    "action": "tool",
+                    "tool": "parquet_analyze_schema",
+                    "args": {"filepath": "data.parquet"},
+                    "reason": "inspect parquet",
+                },
+                {"action": "answer", "answer": "", "reason": "answer from observations"},
+            ]
+        )
+
+        def execute(tool_name, raw_args, trace, **_kwargs):
+            trace.record_tool(
+                tool=tool_name,
+                params=raw_args,
+                result={"filepath": "data.parquet", "num_rows": 3, "ok": True},
+                duration_ms=1.0,
+                ok=True,
+            )
+            return {"filepath": "data.parquet", "num_rows": 3, "ok": True}
+
+        agent._execute_tool_action = MagicMock(side_effect=execute)
+        agent._synthesize_agent_answer = MagicMock(
+            side_effect=ProviderError(
+                "CLIO could not synthesize a final answer from the completed observations.",
+                details={"stage": "answer_synthesis", "original_error": "rate limit"},
+            )
+        )
+
+        result = agent.forward("Summarize data.parquet", session_id="provider-error-inference")
+
+        assert result.selected_expert == "analysis"
+        assert result.answer == ""
+        assert result.error_info is not None
+        assert result.error_info["error"] == "provider_error"
+        assert "inferred from completed tool provenance" in result.route_reason
+
     def test_shell_scope_validation_ignores_compiled_tool_context(self, agent, tmp_path):
         csv_path = tmp_path / "events.csv"
         csv_path.write_text("event_id,status\n1,ok\n", encoding="utf-8")