diff --git a/.gitignore b/.gitignore
index 44f6538..bf7f920 100644
--- a/.gitignore
+++ b/.gitignore
@@ -45,6 +45,7 @@ tool_cache/
# Tool-generated files
tool_output/
+.clio-agent-artifacts/
*.tool.log
mcp_*.log
diff --git a/TASK.md b/TASK.md
index bee03e4..f956a9e 100644
--- a/TASK.md
+++ b/TASK.md
@@ -1,5 +1,20 @@
# CLIO/GACT Provider Selector Polish Tasks
+## Current Resume Anchor
+
+- [ ] Active benchmark objective: build a true hierarchical CLIO stress benchmark campaign, not another smoke suite. Target natural scientific prompts where the orchestrator delegates to scoped experts, experts delegate to tier-3/nanoagents, tools are owned by expert tags/visibility, and results flow through discovery/staging -> format inspection -> analysis -> visualization with explicit evidence, artifacts, timings, and surfaced errors. Do not hardcode benchmark-specific routes, prompts, or fallback answers.
+- [ ] Current branch: `feat/hierarchical-demo-runner-evidence-20260524`. Uncommitted work after the latest ALCF run: `scripts/run_demo_benchmark.py`, `docs/ALCF_DEMO_BENCHMARK_REPORT.md`, and generated artifact `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. Inspect artifact size/path before deciding whether to commit it; avoid accidentally committing generated evidence unless intentional.
+- [x] Recently merged hierarchy slices into `develop`: PR #315 added registry hierarchy metadata and exposed tier-3 agents; PR #316 extracted executable `NDPExpert` and `SACFormatExpert`; PR #317 added `ExpertHandoff` trace metadata, GACT message metadata for `expert_handoffs`, and benchmark report support for observed handoff graphs. Full Python suite after #317: `1154 passed, 37 skipped`.
+- [x] Latest live ALCF/Metis evidence run used backend `http://127.0.0.1:17961`, provider `argonne`, API base `https://inference-api.alcf.anl.gov/resource_server/metis/api/v1`, model `gpt-oss-120b`, planner temperature `0`, max tokens `4096`, turn timeout `900s`, allowed root set to the repo. Smoke prompt through GACT answered "Paris is the capital of France." with `error_info=null`, confirming real ALCF inference through CLIO/GACT.
+- [x] Earlier demo runner execution on `17961` reported `15/15` using the old pass logic, but audit found this was too permissive because partial recovery metadata was counted as a normal pass. The current stricter run is the `17962` evidence below.
+- [x] Strongest current demo evidence: `ndp_seismic_waveform_to_plot` selected `visualization`, recorded handoffs `ndp_catalog`, `analysis`, `sac_format`, `visualization`, called NDP search/detail/stage tools plus `sac_inspect_archive`, `sac_compute_trace_statistics`, and `sac_plot_traces`, staged `Pachhai_etal_2023_ScP_data.tar`, found 11260 SAC traces, sampled/visualized traces, and wrote `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`.
+- [ ] Do not call the benchmark objective complete yet. The current 15-case run is useful evidence, but still too shallow for the original goal: most cases are short, only one case clearly has >10 combined tool/handoff events, no multi-minute stress case completed, no context-pressure/compaction case, no provider/model swap during active work, no large dirty data memory stress beyond current fixtures, and direct tool-action cases still lack rich handoff graph evidence.
+- [x] Inspected pass cases with `error_info` and tightened the benchmark runner: surfaced partial-recovery metadata is now outcome `partial`, not `pass`; expected missing-file errors remain `expected_error`.
+- [x] Direct planner-selected tool actions now record owning expert handoff events, not only nested expert dispatches. Current report shows direct HDF5/Parquet/visualization ownership evidence as counted handoff events.
+- [ ] Next benchmark expansion target: add deeper collaborator-grade prompts that stress hidden-task generality: EarthScope/NDP discovery with bounded waveform staging, local file search plus provider discovery, multi-format experiment audit, many parallel tool/nanoagent calls, context pressure/compaction, large-file refusal/memory safety, provider/model swap while work is active, and deliberate unavailable-resource/error-surfacing cases. The best 10 documented demos should be complex enough for external collaborators, not just 30-second route checks.
+- [ ] 2026-05-24 stricter ALCF rerun from current branch on `http://127.0.0.1:17962` with `CLIO_AGENT_MAX_STEPS=12` and `--case-delay-s 5`: `13/15` clean passes, `1` expected surfaced error, `1` partial recovery, `0` hard failures. HDF5 overview became clean after the higher step budget. Direct tool actions now record owner handoff events, so the report shows evidence such as `data x8`, `analysis x5`, and `visualization`. Remaining partial: `workflow_memory_followup` completed a Parquet schema observation and synthesized visible text, but planner continuation hit `litellm.RateLimitError: Tokens/minute limit exceeded`; report now labels it `partial`, not pass.
+- [ ] ALCF provider readiness reporting still has an inconsistency: `/v1/providers/lm` presets report Metis/Sophia `ready` with `Globus token validated`, while `/v1/health` reports LM `degraded` with `ALCF Globus token stored; validate before use`. The TUI/provider status should use the validated provider state, not stale conservative health text.
+
## Open Issues
- [ ] Current ALCF demo benchmark is only a smoke/demo baseline, not a true CLIO hierarchical stress benchmark. Future benchmark work must target hierarchical intelligence: orchestrator -> scoped experts -> tier-3/nanoagents -> cross-expert result handoffs -> visible tool evidence -> artifacts/errors. Add and run complex workflows such as NDP seismic discovery -> staged dataset -> three-axis analysis -> visualization, mixed HDF5/BP5/Parquet/CSV experiment audit, dirty tabular quality review, context-pressure/compaction, large-file memory safety, provider/model swap during active work, and tool-ownership boundary tests. NDP discovery should be owned by `data` or a nested `ndp_catalog` agent, with `analysis` consuming discovered/staged data rather than directly owning NDP search. Benchmark completion requires at least ten human-demoable complex workflows, multiple >2 minute or >10-event runs, tier-3/nanoagent coverage, plotted artifacts, deliberate surfaced failures, and saved evidence for route/expert/tool/artifact/timing/error behavior. See `docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md`.
diff --git a/docs/ALCF_DEMO_BENCHMARK_REPORT.md b/docs/ALCF_DEMO_BENCHMARK_REPORT.md
index cd7daed..7e20058 100644
--- a/docs/ALCF_DEMO_BENCHMARK_REPORT.md
+++ b/docs/ALCF_DEMO_BENCHMARK_REPORT.md
@@ -1,28 +1,29 @@
# CLIO ALCF Demo Benchmark Report
-Generated: 2026-05-23 21:30:35 Central Daylight Time
-Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-hierarchical-ndp-data-v3.jsonl`
+Generated: 2026-05-24 00:12:33 Central Daylight Time
+Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-alcf-metis-handoff.jsonl`
-Result: 14/14 cases passed.
+Result: 13/15 clean passes, 1 expected surfaced error, 1 partial recovery, 0 failures.
## All Cases
-| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| workflow_hdf5_overview | tooling | auto | dspy | yes | data | hdf5_analyze_file, hdf5_list_datasets | 0 | 17.6s |
-| workflow_parquet_profile | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 20.1s |
-| workflow_memory_followup | memory | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 22.2s |
-| workflow_csv_event_schema | analysis | auto | dspy | yes | analysis | csv_read_table | 0 | 10.1s |
-| workflow_visual_dashboard | visualization | auto | dspy | yes | visualization | plot_summary | 0 | 8.1s |
-| hdf5_dataset_focus | tooling | auto | dspy | yes | data | hdf5_analyze_dataset | 0 | 7.1s |
-| cross_file_triage_nanoagents | multi-agent | auto | guard | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s |
-| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 2.5s |
-| adios_bp5_container | tooling | auto | guard | yes | data | adios_inspect_file | 0 | 0.5s |
-| reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | yes | data | adios_inspect_file | 0 | 6.5s |
-| dirty_parquet_quality | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 2.5s |
-| ndp_catalog_discovery | external-catalog | auto | dspy | yes | data | ndp_list_organizations, ndp_search_datasets | 0 | 8.5s |
-| visual_scatter_artifact | visualization | auto | dspy | yes | visualization | plot_scatter | 0 | 6.1s |
-| missing_hdf5_error | hardening | auto | dspy | yes | data | hdf5_list_datasets | 0 | 4.5s |
+| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| workflow_hdf5_overview | tooling | auto | dspy | pass | data | data x8 | hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset | 0 | 13.1s |
+| workflow_parquet_profile | analysis | auto | dspy | pass | analysis | analysis x5 | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 5.1s |
+| workflow_memory_followup | memory | auto | dspy | partial | analysis | analysis | parquet_analyze_schema | 0 | 36.3s |
+| workflow_csv_event_schema | analysis | auto | dspy | pass | analysis | analysis | csv_read_table | 0 | 3.5s |
+| workflow_visual_dashboard | visualization | auto | dspy | pass | visualization | visualization | plot_summary | 0 | 3.1s |
+| hdf5_dataset_focus | tooling | auto | dspy | pass | data | data | hdf5_analyze_dataset | 0 | 2.5s |
+| cross_file_triage_nanoagents | multi-agent | auto | guard | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s |
+| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 1.0s |
+| adios_bp5_container | tooling | auto | guard | pass | data | data | adios_inspect_file | 0 | 0.5s |
+| reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | pass | data | data | adios_inspect_file | 0 | 2.5s |
+| dirty_parquet_quality | analysis | auto | dspy | pass | analysis | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 1.5s |
+| ndp_catalog_discovery | external-catalog | auto | dspy | pass | ndp_catalog | ndp_catalog | ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource | 0 | 11.1s |
+| ndp_seismic_waveform_to_plot | hierarchical-science | auto | dspy | pass | visualization | ndp_catalog, analysis, sac_format, visualization | ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces | 0 | 20.6s |
+| visual_scatter_artifact | visualization | auto | dspy | pass | visualization | visualization | plot_scatter | 0 | 2.0s |
+| missing_hdf5_error | hardening | auto | dspy | expected_error | data | data | hdf5_list_datasets | 0 | 2.5s |
## Best 10 Demo Prompts
@@ -33,6 +34,7 @@ Category: multi-agent
Routing mode: `auto`
Status: pass
Selected agent: `analysis`
+Expert handoffs: analysis
Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table
Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent
Artifacts: none
@@ -65,7 +67,7 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar
Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x.
-Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c
+Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf
```
### 2. No-guard cross-file triage
@@ -75,10 +77,11 @@ Category: planner-hardening
Routing mode: `reasoning_only`
Status: pass
Selected agent: `analysis`
+Expert handoffs: analysis
Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table
Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent
Artifacts: none
-Elapsed: 2.5s
+Elapsed: 1.0s
Prompt:
@@ -107,77 +110,75 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar
Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x.
-Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c
+Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf
```
-### 3. Memory follow-up without repeating path
+### 3. NDP seismic waveform discovery to plot
-Case: `workflow_memory_followup`
-Category: memory
+Case: `ndp_seismic_waveform_to_plot`
+Category: hierarchical-science
Routing mode: `auto`
Status: pass
-Selected agent: `analysis`
-Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
+Selected agent: `visualization`
+Expert handoffs: ndp_catalog, analysis, sac_format, visualization
+Tools: ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces
Child sessions: none
-Artifacts: none
-Elapsed: 22.2s
+Artifacts: D:\Libraries\Documents\projects\clio-agent\.clio-agent-artifacts\charts\sac_traces_Pachhai_etal_2023_ScP_data.png
+Elapsed: 20.6s
Prompt:
```text
-Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. Do not ask me for the path again.
+Find a bounded seismic waveform dataset from a seismological or Earth-science organization in the National Data Platform. Choose a usable resource, stage it if it is small enough, inspect the waveform content, compute representative trace statistics, and produce a plot artifact. If a candidate is too large or unavailable, surface that as the result instead of inventing a plot.
```
-What to see: CLIO resolves the previously profiled Parquet file from session context.
+What to see: CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform resource, analyzes SAC traces through sac_format, and creates a PNG plot.
-Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths.
+Why this is interesting: This is the core hierarchical science demo: provider discovery, data access, format-specific analysis, and visualization without the user naming internal agents.
Observed excerpt:
```text
-**Parquet file:** `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet`
-
-| Property | Value |
-|----------|-------|
-| **Rows** | 3 000 |
-| **Row groups** | 8 |
-| **File size** | 154 818 bytes |
-| **Created by** | parquet‑cpp‑arrow v22.0.0 |
-| **Key‑value metadata** | `scenario = clio-stress-benchmark`<br>`source = deterministic` |
-| **Columns (10 total, 4 shown below)** | `sample_id (int64, nullable)`<br>`run_id (string, nullable)`<br>`site (string, nullable)`<br>`temperature_k (double, nullable)`<br>`pressure_pa (double, nullable)`<br>`humidity_pct (double, nullable)`<br>`vibration_mm_s (double, nullable)`<br>`anomaly_score (double, nullable)`<br>… (2 more columns not listed) |
+Data stage:
+Queried the National Data Platform catalog through clio-kit MCP.
-### Column‑level statistics (useful for anomaly triage)
+Organizations matched: 0
-| Column | Min | Max | Mean | Std‑dev | Median | Nulls | Unique values |
-|--------|-----|-----|------|-------
+Datasets matched: 9
+- Data for: Improved Characterization of Ultralow-velocity Zones through Advances in Bayesian Inversion of ScP Waveforms (marriott_library_hive; PDF, TAR, TXT; resources: Pachhai_etal_2023_ScP_data.tar)
+- Data for: A Compositional Component to the Samoa Ultralow-velocity Zone Revealed through 2- and 3-D Waveform Modeling of SKS and SKKS Differential Travel-times and Amplitudes (marriott_library_hive; GZ, TXT; resources: 2021_Samoa_Data.tar.gz)
+- Salton Sea Seismic Data (ucr-earth-and-planetary-sciences; formats not listed; resources: Salton Sea Seismic Waveforms)
+- USGS_LPC_VA_Central_Seismic_2013_LAS_2015 (usgs-3dep; GEOJSON; resources: spatial_extents_USGS_LPC_VA_Central_Seismic_2013_LAS_2015.json)
+- Central U.S. ARRA Lidar, New Madrid Seismic Zone (opentopogra
```
-### 4. Targeted scatter plot
+### 4. HDF5 fusion file overview
-Case: `visual_scatter_artifact`
-Category: visualization
+Case: `workflow_hdf5_overview`
+Category: tooling
Routing mode: `auto`
Status: pass
-Selected agent: `visualization`
-Tools: plot_scatter
+Selected agent: `data`
+Expert handoffs: data x8
+Tools: hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset
Child sessions: none
-Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png, D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png
-Elapsed: 6.1s
+Artifacts: none
+Elapsed: 13.1s
Prompt:
```text
-Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal.
+I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention?
```
-What to see: Visualization expert chooses plot_scatter and saves a PNG artifact.
+What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression.
-Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool.
+Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis.
Observed excerpt:
```text
-The scatter plot (saved at D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png) visualizes the relationship between vibration_mm_s (x‑axis) and anomaly_score (y‑axis), showing whether higher vibration values tend to correspond to higher anomaly scores.
+The file contains 7 datasets: axes/diagnostic_channel (12‑int32, no compression), axes/radius_norm (64‑float64, no compression), axes/time_hours (96‑float64, gzip‑4, chunk 24), diagnostics/heat_flux (96×12‑float32, gzip‑5, units MW/m²), plasma/density (96×64‑float64, gzip‑4, units m⁻³), plasma/electron_temperature (96×64‑float32, gzip‑6, units eV), quality/flags (96×12‑int8, gzip‑4); overall 5 of 7 datasets are gzip‑compressed with overall compression ratio ~1.12.
```
### 5. Parquet facility profile
@@ -187,10 +188,11 @@ Category: analysis
Routing mode: `auto`
Status: pass
Selected agent: `analysis`
+Expert handoffs: analysis x5
Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
Child sessions: none
Artifacts: none
-Elapsed: 20.1s
+Elapsed: 5.1s
Prompt:
@@ -205,20 +207,63 @@ Why this is interesting: Checks statistical tool calls and model feedback from m
Observed excerpt:
```text
-The file has 10 columns (including temperature_k, pressure_pa, humidity_pct, anomaly_score), 3000 rows, 8 row groups, and the columns show plausible ranges: temperature 267–321 K, pressure 98 kPa–104 kPa, humidity 4–96 %, anomaly_score 0.06–6.35 (mean 1.6), indicating sane data.
+The Parquet file has 10 columns (sample_id int64, run_id string, site string, temperature_k double, pressure_pa double, humidity_pct double, vibration_mm_s double, anomaly_score double, plus two others), 3000 rows across 8 row groups, and the temperature, pressure, humidity, and anomaly_score statistics fall within expected physical ranges, so they appear sane.
+```
+
+### 6. NDP catalog discovery
+
+Case: `ndp_catalog_discovery`
+Category: external-catalog
+Routing mode: `auto`
+Status: pass
+Selected agent: `ndp_catalog`
+Expert handoffs: ndp_catalog
+Tools: ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource
+Child sessions: none
+Artifacts: none
+Elapsed: 11.1s
+
+Prompt:
+
+```text
+Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download.
+```
+
+What to see: Data expert delegates discovery to NDP tools through the CLIO gateway.
+
+Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data.
+
+Observed excerpt:
+
+```text
+Queried the National Data Platform catalog through clio-kit MCP.
+
+Organizations matched: 3
+- noaa-global-systems-laboratory
+- noaa-national-centers-for-environmental-information-ncei
+- noaa-demo
+
+Datasets matched: 4
+- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network)
+- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010)
+- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit)
+- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data)
+
+Staging note: CLIO staged the selected NDP resource at D:\Libraries\Documents\projects\clio-
```
-### 6. Dirty Parquet quality review
+### 7. Dirty Parquet quality review
Case: `dirty_parquet_quality`
Category: analysis
Routing mode: `auto`
Status: pass
Selected agent: `analysis`
+Expert handoffs: analysis
Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics
Child sessions: none
Artifacts: none
-Elapsed: 2.5s
+Elapsed: 1.5s
Prompt:
@@ -247,35 +292,7 @@ site: null_count=0, unique_count=6
temperature_k: min=267.0871843867271, max=320.74640790864, mean=293.98862688749205, median=294.1027798559147, std=7.517766411153003, null_count=18, unique_count=2982
pressure_pa: min=98435.39430602924, max=103998.62554169081, mean=101230.9576727976, median=101229.54620674164, std=767.0242855536812, null_count=14, unique_count=2986
-Recommendatio
-```
-
-### 7. HDF5 fusion file overview
-
-Case: `workflow_hdf5_overview`
-Category: tooling
-Routing mode: `auto`
-Status: pass
-Selected agent: `data`
-Tools: hdf5_analyze_file, hdf5_list_datasets
-Child sessions: none
-Artifacts: none
-Elapsed: 17.6s
-
-Prompt:
-
-```text
-I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention?
-```
-
-What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression.
-
-Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis.
-
-Observed excerpt:
-
-```text
-The HDF5 file holds seven datasets—axes/diagnostic_channel (12 int32), axes/radius_norm (64 float64), axes/time_hours (96 float64), diagnostics/heat_flux (96×12 float32, units MW/m²), plasma/density (96×64 float64, units m⁻³), plasma/electron_temperature (96×64 float32, units eV), and quality/flags (96×12 int8)—with five of them compressed, giving a total stored size of 72 331 bytes versus 80 816 bytes raw (overall compression ratio ≈1.12).
+Recommendati
```
### 8. Follow-up visualization artifact
@@ -285,10 +302,11 @@ Category: visualization
Routing mode: `auto`
Status: pass
Selected agent: `visualization`
+Expert handoffs: visualization
Tools: plot_summary
Child sessions: none
Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png
-Elapsed: 8.1s
+Elapsed: 3.1s
Prompt:
@@ -303,75 +321,89 @@ Why this is interesting: Shows multi-turn handoff from analysis to visualization
Observed excerpt:
```text
-The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and provides a compact 2×2 summary of the facility_measurements Parquet file, visualizing the key columns (temperature, pressure, humidity, and anomaly_score) with histograms and overview plots of their distributions and relationships.
+The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and it presents a 2×2 summary of the facility measurements dataset, visualizing key numeric columns such as temperature, pressure, humidity, and anomaly score.
```
-### 9. NDP catalog discovery
+### 9. Targeted scatter plot
-Case: `ndp_catalog_discovery`
-Category: external-catalog
+Case: `visual_scatter_artifact`
+Category: visualization
Routing mode: `auto`
Status: pass
-Selected agent: `data`
-Tools: ndp_list_organizations, ndp_search_datasets
+Selected agent: `visualization`
+Expert handoffs: visualization
+Tools: plot_scatter
Child sessions: none
-Artifacts: none
-Elapsed: 8.5s
+Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png
+Elapsed: 2.0s
Prompt:
```text
-Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download.
+Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal.
```
-What to see: Data expert delegates discovery to NDP tools through the CLIO gateway.
+What to see: Visualization expert chooses plot_scatter and saves a PNG artifact.
-Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data.
+Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool.
Observed excerpt:
```text
-Queried the National Data Platform catalog through clio-kit MCP.
-
-Organizations matched: 3
-- noaa-global-systems-laboratory
-- noaa-national-centers-for-environmental-information-ncei
-- noaa-demo
-
-Datasets matched: 4
-- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network)
-- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data)
-- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010)
-- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit)
-
-Recommendations:
-Treat these as discovery results owned by the data stage. Use ndp_get_datase
+The scatter plot (facility_measurements_scatter.png) visualizes vibration_mm_s versus anomaly_score, revealing whether higher vibration levels correspond to higher anomaly scores—typically a positive trend indicates that increased vibration is associated with greater anomaly likelihood.
```
-### 10. No-guard ADIOS/BP5 route
+### 10. Memory follow-up without repeating path
-Case: `reasoning_adios_bp5_container`
-Category: planner-hardening
-Routing mode: `reasoning_only`
-Status: pass
-Selected agent: `data`
-Tools: adios_inspect_file
+Case: `workflow_memory_followup`
+Category: memory
+Routing mode: `auto`
+Status: partial
+Selected agent: `analysis`
+Expert handoffs: analysis
+Tools: parquet_analyze_schema
Child sessions: none
Artifacts: none
-Elapsed: 6.5s
+Elapsed: 36.3s
Prompt:
```text
-This ADIOS BP5 output came from a Gray-Scott run: "D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\gray scott noise 0.01 data.bp5". Tell me what the container looks like, whether profiling metadata is present, and what extra runtime is needed if variable-level metadata is unavailable.
+Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. Do not ask me for the path again.
```
-What to see: With routing guards disabled, the planner still selects the data expert.
+What to see: CLIO resolves the previously profiled Parquet file from session context.
-Why this is interesting: Checks that BP5 routing is not only a hardcoded suffix guard behavior.
+Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths.
Observed excerpt:
```text
-The BP5 container is a directory containing five files (data.0, md.0, md.idx, mmd.0, profiling.json) totaling about 84 MB, it includes profiling metadata (profiling.json), and you need to install the ADIOS2 Python bindings/runtime to read variable‑level metadata.
+**Facility Measurements – Quick Anomaly‑Triaging Summary**
+
+| Item | Value |
+|------|-------|
+| **File** | `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet` |
+| **Rows** | 3 000 |
+| **Row groups** | 8 |
+| **File size** | 154 818 bytes |
+| **Created by** | parquet‑cpp‑arrow v22.0.0 |
+| **Key‑value metadata** | `scenario = clio-stress-benchmark`<br>`source = deterministic` |
+
+### Schema (10 columns)
+
+| Column | Type | Nullable | Role for Anomaly Triage |
+|--------|------|----------|--------------------------|
+| `sample_id` | int64 | Yes | Identifier |
+| `run_id` | string | Yes | Identifier |
+| `site` | string | Yes | Identifier / grouping |
+| `temperature_k` | double | Yes | Primary sensor metric |
+| `pressure_pa` | double | Yes | Primary sensor metric |
+| `humidity_pct` | double | Yes | Primary sensor metric |
+| `vibration_mm_s` | double
```
+
+## Partial Recovery Caveats
+
+- `workflow_memory_followup`: Agent planner failed after completed tool observations.
+ stage=post_observation_planning, tools=parquet_analyze_schema
diff --git a/scripts/run_demo_benchmark.py b/scripts/run_demo_benchmark.py
index c96d5b5..20576bb 100644
--- a/scripts/run_demo_benchmark.py
+++ b/scripts/run_demo_benchmark.py
@@ -36,9 +36,10 @@ class DemoCase:
expected: str
session_group: str
timeout_s: float = 480.0
- expected_agent: str = ""
+ expected_agent: str | tuple[str, ...] = ""
expected_tool_prefixes: tuple[str, ...] = ()
expected_tools: tuple[str, ...] = ()
+ expected_handoff_agents: tuple[str, ...] = ()
expected_terms: tuple[str, ...] = ()
min_children: int = 0
expects_error: bool = False
@@ -82,23 +83,62 @@ def artifacts(self) -> list[str]:
"""Return artifact path candidates found in tools or text."""
return _artifact_paths(self.message)
+ @property
+ def expert_handoffs(self) -> list[dict[str, Any]]:
+ """Return expert handoff provenance metadata."""
+ return _expert_handoffs(self.message)
+
+ @property
+ def handoff_agent_ids(self) -> list[str]:
+ """Return unique expert IDs observed in handoff provenance."""
+ seen: set[str] = set()
+ ids: list[str] = []
+ for row in self.expert_handoffs:
+ agent_id = str(row.get("agent_id") or "")
+ if not agent_id or agent_id in seen:
+ continue
+ seen.add(agent_id)
+ ids.append(agent_id)
+ return ids
+
+ @property
+ def handoff_event_count(self) -> int:
+ """Return the number of recorded expert handoff events."""
+ return len(self.expert_handoffs)
+
@property
def blocking_error(self) -> dict[str, Any] | None:
- """Return blocking error_info, allowing documented partial recovery."""
+        """Return error_info that fails a normal case or satisfies an error-focused one."""
return _blocking_error(self.message)
+ @property
+ def partial_error(self) -> dict[str, Any] | None:
+ """Return surfaced partial-recovery metadata, if this turn had any."""
+ return _partial_error(self.message)
+
@property
def passed(self) -> bool:
"""Return whether this case satisfied its declared expectations."""
if self.case.expects_error:
return self.blocking_error is not None and not self.text.strip()
- if self.blocking_error is not None:
+ if self.partial_error is not None:
return False
- if self.case.expected_agent and self.selected_agent != self.case.expected_agent:
+ if self.blocking_error is not None:
return False
+ if self.case.expected_agent:
+ expected_agents = (
+ (self.case.expected_agent,)
+ if isinstance(self.case.expected_agent, str)
+ else self.case.expected_agent
+ )
+ if self.selected_agent not in expected_agents:
+ return False
for expected_tool in self.case.expected_tools:
if expected_tool not in self.tool_names:
return False
+ for expected_agent in self.case.expected_handoff_agents:
+ if expected_agent not in self.handoff_agent_ids:
+ return False
for prefix in self.case.expected_tool_prefixes:
if not any(name.startswith(prefix) for name in self.tool_names):
return False
@@ -110,12 +150,22 @@ def passed(self) -> bool:
return False
return True
+ @property
+ def outcome(self) -> str:
+ """Return a human-readable outcome category."""
+ if self.case.expects_error:
+ return "expected_error" if self.passed else "fail"
+ if self.partial_error is not None:
+ return "partial"
+ return "pass" if self.passed else "fail"
+
@property
def complexity_score(self) -> int:
"""Score cases for the best-demo report."""
return (
len(set(self.tool_names)) * 3
+ len(self.tools)
+ + len(self.expert_handoffs) * 4
+ len(self.child_sessions) * 6
+ len(self.artifacts) * 4
+ len(self.case.complexity_tags) * 2
@@ -145,25 +195,36 @@ def _tools(message: dict[str, Any]) -> list[dict[str, Any]]:
return rows if isinstance(rows, list) else []
+def _expert_handoffs(message: dict[str, Any]) -> list[dict[str, Any]]:
+ rows = (message.get("metadata") or {}).get("expert_handoffs") or []
+ return rows if isinstance(rows, list) else []
+
+
def _tool_name(row: dict[str, Any]) -> str:
return str(row.get("name") or row.get("tool") or "")
def _blocking_error(message: dict[str, Any]) -> dict[str, Any] | None:
+ error_info = message.get("error_info")
+ if not isinstance(error_info, dict):
+ return None
+ if _partial_error(message) is not None:
+ return None
+ return error_info
+
+
+def _partial_error(message: dict[str, Any]) -> dict[str, Any] | None:
error_info = message.get("error_info")
if not isinstance(error_info, dict):
return None
details = error_info.get("details")
- if (
- isinstance(details, dict)
- and details.get("partial") is True
- and details.get("stage")
- in {
- "post_observation_planning",
- "parallel_validation_recovery",
- "step_limit_after_observations",
- }
- ):
+ if not isinstance(details, dict) or details.get("partial") is not True:
+ return None
+ if details.get("stage") not in {
+ "post_observation_planning",
+ "parallel_validation_recovery",
+ "step_limit_after_observations",
+ }:
return None
return error_info
@@ -240,13 +301,18 @@ def _case_row(result: DemoResult) -> dict[str, Any]:
"session_id": result.session_id,
"elapsed_s": round(result.elapsed_s, 3),
"passed": result.passed,
+ "outcome": result.outcome,
"selected_agent": result.selected_agent,
"routing_decision": _routing_decision(result.message),
"tools_called": result.tools,
"tool_names": result.tool_names,
+ "expert_handoffs": result.expert_handoffs,
+ "handoff_agent_ids": result.handoff_agent_ids,
+ "handoff_event_count": result.handoff_event_count,
"child_sessions": result.child_sessions,
"artifacts": result.artifacts,
"error_info": result.message.get("error_info"),
+ "partial_error": result.partial_error,
"stop_reason": result.message.get("stop_reason"),
"provider": result.provider,
"routing_mode": result.case.routing_mode,
@@ -465,8 +531,9 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
title="NDP catalog discovery",
category="external-catalog",
session_group="ndp",
- expected_agent="data",
+ expected_agent=("data", "ndp_catalog"),
expected_tool_prefixes=("ndp_",),
+ expected_handoff_agents=("ndp_catalog",),
expected_terms=("dataset",),
timeout_s=620.0,
complexity_tags=("ndp", "clio-kit", "external-mcp"),
@@ -481,6 +548,42 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
"analysis consumes staged data."
),
),
+ DemoCase(
+ case_id="ndp_seismic_waveform_to_plot",
+ title="NDP seismic waveform discovery to plot",
+ category="hierarchical-science",
+ session_group="ndp_seismic",
+ expected_agent=("visualization", "analysis", "data", "ndp_catalog"),
+ expected_tool_prefixes=("ndp_", "sac_"),
+ expected_handoff_agents=("ndp_catalog", "sac_format"),
+ expected_terms=("SAC", ".png"),
+ timeout_s=900.0,
+ complexity_tags=(
+ "ndp",
+ "earthscience",
+ "tier-3",
+ "sac",
+ "data-analysis-visualization",
+ "artifact",
+ ),
+ prompt=(
+ "Find a bounded seismic waveform dataset from a seismological or "
+ "Earth-science organization in the National Data Platform. Choose a usable "
+ "resource, stage it if it is small enough, inspect the waveform content, "
+ "compute representative trace statistics, and produce a plot artifact. If a "
+ "candidate is too large or unavailable, surface that as the result instead "
+ "of inventing a plot."
+ ),
+ expected=(
+ "CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform "
+ "resource, analyzes SAC traces through sac_format, and creates a PNG plot."
+ ),
+ why=(
+ "This is the core hierarchical science demo: provider discovery, data "
+ "access, format-specific analysis, and visualization without the user naming "
+ "internal agents."
+ ),
+ ),
DemoCase(
case_id="visual_scatter_artifact",
title="Targeted scatter plot",
@@ -536,7 +639,14 @@ def _create_sessions(http: httpx.Client, cases: list[DemoCase]) -> dict[str, str
return session_ids
-def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path: Path) -> int:
+def run_benchmark(
+ base_url: str,
+ data_dir: Path,
+ output_jsonl: Path,
+ report_path: Path,
+ *,
+ case_delay_s: float = 0.0,
+) -> int:
"""Run demo cases and write JSONL plus a markdown report."""
manifest = create_benchmark_data(data_dir)
cases = _make_cases(manifest)
@@ -573,20 +683,25 @@ def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path
results.append(result)
log.write(json.dumps(_case_row(result), ensure_ascii=False, default=str) + "\n")
log.flush()
- status = "PASS" if result.passed else "FAIL"
+ status = result.outcome.upper()
print(
f" {status} agent={result.selected_agent or '-'} "
f"tools={','.join(result.tool_names) or '-'} "
f"children={len(result.child_sessions)} elapsed={elapsed_s:.1f}s",
flush=True,
)
+ if case_delay_s > 0 and index < len(cases):
+ time.sleep(case_delay_s)
report_path.write_text(_render_report(results, output_jsonl), encoding="utf-8")
return 0 if all(result.passed for result in results) else 1
def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
- passed = sum(1 for result in results if result.passed)
+ clean_passes = sum(1 for result in results if result.outcome == "pass")
+ expected_errors = sum(1 for result in results if result.outcome == "expected_error")
+ partials = sum(1 for result in results if result.outcome == "partial")
+ failures = sum(1 for result in results if result.outcome == "fail")
best = sorted(results, key=lambda result: result.complexity_score, reverse=True)[:10]
lines = [
"# CLIO ALCF Demo Benchmark Report",
@@ -594,12 +709,16 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S %Z')}",
f"Evidence JSONL: `{output_jsonl}`",
"",
- f"Result: {passed}/{len(results)} cases passed.",
+ (
+ f"Result: {clean_passes}/{len(results)} clean passes, "
+ f"{expected_errors} expected surfaced errors, {partials} partial recoveries, "
+ f"{failures} failures."
+ ),
"",
"## All Cases",
"",
- "| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed |",
- "| --- | --- | --- | --- | --- | --- | --- | --- | --- |",
+ "| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed |",
+ "| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |",
]
for result in results:
route_source = (_routing_decision(result.message).get("metadata") or {}).get(
@@ -613,8 +732,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
result.case.category,
result.case.routing_mode,
str(route_source),
- "yes" if result.passed else "no",
+ result.outcome,
result.selected_agent or "-",
+ _handoff_summary(result) or "-",
", ".join(result.tool_names) or "-",
str(len(result.child_sessions)),
f"{result.elapsed_s:.1f}s",
@@ -626,6 +746,7 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
lines.extend(["", "## Best 10 Demo Prompts", ""])
for rank, result in enumerate(best, start=1):
tool_text = ", ".join(result.tool_names) or "none"
+ handoff_text = _handoff_summary(result) or "none"
artifact_text = ", ".join(result.artifacts) or "none"
child_text = ", ".join(
child.get("title", child.get("id", "")) for child in result.child_sessions
@@ -637,8 +758,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
f"Case: `{result.case.case_id}`",
f"Category: {result.case.category}",
f"Routing mode: `{result.case.routing_mode}`",
- f"Status: {'pass' if result.passed else 'fail'}",
+ f"Status: {result.outcome}",
f"Selected agent: `{result.selected_agent or '-'}`",
+ f"Expert handoffs: {handoff_text}",
f"Tools: {tool_text}",
f"Child sessions: {child_text or 'none'}",
f"Artifacts: {artifact_text}",
@@ -663,10 +785,22 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
]
)
- failures = [result for result in results if not result.passed]
- if failures:
+ partial_results = [result for result in results if result.outcome == "partial"]
+ if partial_results:
+ lines.extend(["## Partial Recovery Caveats", ""])
+ for result in partial_results:
+ details = (result.partial_error or {}).get("details") or {}
+ lines.extend(
+ [
+ f"- `{result.case.case_id}`: {result.partial_error.get('message') if result.partial_error else 'partial recovery'}",
+ f" stage={details.get('stage')}, tools={', '.join(result.tool_names) or '-'}",
+ ]
+ )
+
+ failed_results = [result for result in results if result.outcome == "fail"]
+ if failed_results:
lines.extend(["## Failures To Investigate", ""])
- for result in failures:
+ for result in failed_results:
lines.extend(
[
f"- `{result.case.case_id}`: expected {result.case.expected}",
@@ -678,6 +812,20 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
return "\n".join(lines).strip() + "\n"
+def _handoff_summary(result: DemoResult) -> str:
+ """Return compact handoff display text with repeated event counts."""
+ counts: dict[str, int] = {}
+ for row in result.expert_handoffs:
+ agent_id = str(row.get("agent_id") or "")
+ if not agent_id:
+ continue
+ counts[agent_id] = counts.get(agent_id, 0) + 1
+ parts = [
+ agent_id if count == 1 else f"{agent_id} x{count}" for agent_id, count in counts.items()
+ ]
+ return ", ".join(parts)
+
+
def main() -> None:
"""CLI entry point."""
parser = argparse.ArgumentParser(description=__doc__)
@@ -704,6 +852,12 @@ def main() -> None:
default=Path("docs/ALCF_DEMO_BENCHMARK_REPORT.md"),
help="Output markdown report path.",
)
+ parser.add_argument(
+ "--case-delay-s",
+ type=float,
+ default=0.0,
+ help="Optional cooldown between cases for rate-limited real providers.",
+ )
args = parser.parse_args()
raise SystemExit(
run_benchmark(
@@ -711,6 +865,7 @@ def main() -> None:
args.data_dir.resolve(),
args.output_jsonl.resolve(),
args.report.resolve(),
+ case_delay_s=max(0.0, args.case_delay_s),
)
)
diff --git a/src/clio_agent/agent.py b/src/clio_agent/agent.py
index 1d024cc..ad7b135 100644
--- a/src/clio_agent/agent.py
+++ b/src/clio_agent/agent.py
@@ -564,6 +564,19 @@ def forward(
answer = ""
except Exception as e:
success = False
+ inferred_selected = self._selected_expert_from_trace(trace)
+ if selected == "chat" and inferred_selected != "chat":
+ selected = inferred_selected
+ route = RouteDecision(
+ target=selected,
+ source=route.source,
+ reason=(
+ f"{route.reason} Selected expert inferred from completed "
+ "tool provenance after the turn failed."
+ ).strip(),
+ confidence=route.confidence,
+ capabilities=route.capabilities,
+ )
if isinstance(e, RoutingError):
error_info = self._with_recovery_actions(e.to_dict())
answer = ""
@@ -1745,8 +1758,22 @@ def _execute_tool_action(
)
}
+ owner = self._selected_expert_for_tool(tool_name)
if tool_name in visualization_tools:
- return self._execute_visualization_tool(tool_name, visualization_tools[tool_name], args)
+ result = self._execute_visualization_tool(
+ tool_name,
+ visualization_tools[tool_name],
+ args,
+ )
+ self._record_direct_tool_handoff(
+ trace,
+ expert_id=owner,
+ tool_name=tool_name,
+ args=args,
+ result=result,
+ duration_ms=self._last_tool_duration_ms(trace, tool_name),
+ )
+ return result
start = time.time()
try:
@@ -1764,8 +1791,49 @@ def _execute_tool_action(
duration_ms=duration_ms,
ok=tool_result_ok(result),
)
+ self._record_direct_tool_handoff(
+ trace,
+ expert_id=owner,
+ tool_name=tool_name,
+ args=args,
+ result=result,
+ duration_ms=duration_ms,
+ )
return result
+ @staticmethod
+ def _last_tool_duration_ms(trace: RunTrace, tool_name: str) -> float:
+ """Return the latest recorded duration for a tool in an active trace."""
+ for observation in reversed(trace.tools):
+ if observation.tool == tool_name:
+ return observation.duration_ms
+ return 0.0
+
+ def _record_direct_tool_handoff(
+ self,
+ trace: RunTrace,
+ *,
+ expert_id: str,
+ tool_name: str,
+ args: Mapping[str, Any],
+ result: Any,
+ duration_ms: float,
+ ) -> None:
+ """Record the expert boundary for a planner-selected direct tool call."""
+ if not expert_id:
+ return
+ self._record_expert_handoff(
+ trace,
+ expert_id=expert_id,
+ dispatch_target=tool_name,
+ stage="direct_tool",
+ input_summary=f"{tool_name}({', '.join(sorted(str(key) for key in args))})",
+ result=result,
+ status="success" if tool_result_ok(result) else "failure",
+ duration_ms=duration_ms,
+ error=None if tool_result_ok(result) else self._coerce_text(result),
+ )
+
def _tool_action_scope_error(
self,
tool_name: str,
diff --git a/tests/test_core/test_agent_planner.py b/tests/test_core/test_agent_planner.py
index 33b7e7d..7ee4b5e 100644
--- a/tests/test_core/test_agent_planner.py
+++ b/tests/test_core/test_agent_planner.py
@@ -14,7 +14,7 @@
import pytest
from clio_agent.agent import ClioAgent, cancellation_checker
-from clio_agent.errors import CancellationError, RoutingError
+from clio_agent.errors import CancellationError, ProviderError, RoutingError
from clio_agent.harness import RouteDecision, RunTrace
from clio_agent.registry.registry import AgentCapability
from clio_agent.tools.execution import set_global_tool_observer
@@ -689,6 +689,67 @@ def test_shell_tool_still_runs_for_utility_diagnostics(self, agent):
assert answer == "It is today."
agent._execute_tool_action.assert_called_once()
+ def test_direct_tool_action_records_owner_handoff(self, agent):
+ trace = _trace()
+ agent._known_tool_names = MagicMock(return_value={"parquet_analyze_schema"})
+ agent.tool_executor.call_tool = MagicMock(
+ return_value={"filepath": "data.parquet", "num_rows": 3, "ok": True}
+ )
+
+ result = agent._execute_tool_action(
+ "parquet_analyze_schema",
+ {"filepath": "data.parquet"},
+ trace,
+ )
+
+ assert result["num_rows"] == 3
+ assert trace.tools[0].tool == "parquet_analyze_schema"
+ assert len(trace.expert_handoffs) == 1
+ handoff = trace.expert_handoffs[0]
+ assert handoff.agent_id == "analysis"
+ assert handoff.dispatch_target == "parquet_analyze_schema"
+ assert handoff.stage == "direct_tool"
+ assert handoff.status == "success"
+
+ def test_provider_error_after_tool_keeps_inferred_expert(self, agent):
+ agent._plan_next_action = MagicMock(
+ side_effect=[
+ {
+ "action": "tool",
+ "tool": "parquet_analyze_schema",
+ "args": {"filepath": "data.parquet"},
+ "reason": "inspect parquet",
+ },
+ {"action": "answer", "answer": "", "reason": "answer from observations"},
+ ]
+ )
+
+ def execute(tool_name, raw_args, trace, **_kwargs):
+ trace.record_tool(
+ tool=tool_name,
+ params=raw_args,
+ result={"filepath": "data.parquet", "num_rows": 3, "ok": True},
+ duration_ms=1.0,
+ ok=True,
+ )
+ return {"filepath": "data.parquet", "num_rows": 3, "ok": True}
+
+ agent._execute_tool_action = MagicMock(side_effect=execute)
+ agent._synthesize_agent_answer = MagicMock(
+ side_effect=ProviderError(
+ "CLIO could not synthesize a final answer from the completed observations.",
+ details={"stage": "answer_synthesis", "original_error": "rate limit"},
+ )
+ )
+
+ result = agent.forward("Summarize data.parquet", session_id="provider-error-inference")
+
+ assert result.selected_expert == "analysis"
+ assert result.answer == ""
+ assert result.error_info is not None
+ assert result.error_info["error"] == "provider_error"
+ assert "inferred from completed tool provenance" in result.route_reason
+
def test_shell_scope_validation_ignores_compiled_tool_context(self, agent, tmp_path):
csv_path = tmp_path / "events.csv"
csv_path.write_text("event_id,status\n1,ok\n", encoding="utf-8")