diff --git a/.gitignore b/.gitignore index 44f6538..bf7f920 100644 --- a/.gitignore +++ b/.gitignore @@ -45,6 +45,7 @@ tool_cache/ # Tool-generated files tool_output/ +.clio-agent-artifacts/ *.tool.log mcp_*.log diff --git a/TASK.md b/TASK.md index bee03e4..f956a9e 100644 --- a/TASK.md +++ b/TASK.md @@ -1,5 +1,20 @@ # CLIO/GACT Provider Selector Polish Tasks +## Current Resume Anchor + +- [ ] Active benchmark objective: build a true hierarchical CLIO stress benchmark campaign, not another smoke suite. Target natural scientific prompts where the orchestrator delegates to scoped experts, experts delegate to tier-3/nanoagents, tools are owned by expert tags/visibility, and results flow through discovery/staging -> format inspection -> analysis -> visualization with explicit evidence, artifacts, timings, and surfaced errors. Do not hardcode benchmark-specific routes, prompts, or fallback answers. +- [ ] Current branch: `feat/hierarchical-demo-runner-evidence-20260524`. Uncommitted work after the latest ALCF run: `scripts/run_demo_benchmark.py`, `docs/ALCF_DEMO_BENCHMARK_REPORT.md`, and generated artifact `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. Inspect artifact size/path before deciding whether to commit it; avoid accidentally committing generated evidence unless intentional. +- [x] Recently merged hierarchy slices into `develop`: PR #315 added registry hierarchy metadata and exposed tier-3 agents; PR #316 extracted executable `NDPExpert` and `SACFormatExpert`; PR #317 added `ExpertHandoff` trace metadata, GACT message metadata for `expert_handoffs`, and benchmark report support for observed handoff graphs. Full Python suite after #317: `1154 passed, 37 skipped`. 
+- [x] Latest live ALCF/Metis evidence run used backend `http://127.0.0.1:17961`, provider `argonne`, API base `https://inference-api.alcf.anl.gov/resource_server/metis/api/v1`, model `gpt-oss-120b`, planner temperature `0`, max tokens `4096`, turn timeout `900s`, allowed root set to the repo. Smoke prompt through GACT answered "Paris is the capital of France." with `error_info=null`, confirming real ALCF inference through CLIO/GACT. +- [x] Earlier demo runner execution on `17961` reported `15/15` using the old pass logic, but audit found this was too permissive because partial recovery metadata was counted as a normal pass. The current stricter run is the `17962` evidence below. +- [x] Strongest current demo evidence: `ndp_seismic_waveform_to_plot` selected `visualization`, recorded handoffs `ndp_catalog`, `analysis`, `sac_format`, `visualization`, called NDP search/detail/stage tools plus `sac_inspect_archive`, `sac_compute_trace_statistics`, and `sac_plot_traces`, staged `Pachhai_etal_2023_ScP_data.tar`, found 11260 SAC traces, sampled/visualized traces, and wrote `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. +- [ ] Do not call the benchmark objective complete yet. The current 15-case run is useful evidence, but still too shallow for the original goal: most cases are short, only one case clearly has >10 combined tool/handoff events, no multi-minute stress case completed, no context-pressure/compaction case, no provider/model swap during active work, no large dirty data memory stress beyond current fixtures, and direct tool-action cases still lack rich handoff graph evidence. +- [x] Inspected pass cases with `error_info` and tightened the benchmark runner: surfaced partial-recovery metadata is now outcome `partial`, not `pass`; expected missing-file errors remain `expected_error`. +- [x] Direct planner-selected tool actions now record owning expert handoff events, not only nested expert dispatches. 
Current report shows direct HDF5/Parquet/visualization ownership evidence as counted handoff events. +- [ ] Next benchmark expansion target: add deeper collaborator-grade prompts that stress hidden-task generality: EarthScope/NDP discovery with bounded waveform staging, local file search plus provider discovery, multi-format experiment audit, many parallel tool/nanoagent calls, context pressure/compaction, large-file refusal/memory safety, provider/model swap while work is active, and deliberate unavailable-resource/error-surfacing cases. The best 10 documented demos should be complex enough for external collaborators, not just 30-second route checks. +- [ ] 2026-05-24 stricter ALCF rerun from current branch on `http://127.0.0.1:17962` with `CLIO_AGENT_MAX_STEPS=12` and `--case-delay-s 5`: `13/15` clean passes, `1` expected surfaced error, `1` partial recovery, `0` hard failures. HDF5 overview became clean after the higher step budget. Direct tool actions now record owner handoff events, so the report shows evidence such as `data x8`, `analysis x5`, and `visualization`. Remaining partial: `workflow_memory_followup` completed a Parquet schema observation and synthesized visible text, but planner continuation hit `litellm.RateLimitError: Tokens/minute limit exceeded`; report now labels it `partial`, not pass. +- [ ] ALCF provider readiness reporting still has an inconsistency: `/v1/providers/lm` presets report Metis/Sophia `ready` with `Globus token validated`, while `/v1/health` reports LM `degraded` with `ALCF Globus token stored; validate before use`. The TUI/provider status should use the validated provider state, not stale conservative health text. + ## Open Issues - [ ] Current ALCF demo benchmark is only a smoke/demo baseline, not a true CLIO hierarchical stress benchmark. Future benchmark work must target hierarchical intelligence: orchestrator -> scoped experts -> tier-3/nanoagents -> cross-expert result handoffs -> visible tool evidence -> artifacts/errors. 
Add and run complex workflows such as NDP seismic discovery -> staged dataset -> three-axis analysis -> visualization, mixed HDF5/BP5/Parquet/CSV experiment audit, dirty tabular quality review, context-pressure/compaction, large-file memory safety, provider/model swap during active work, and tool-ownership boundary tests. NDP discovery should be owned by `data` or a nested `ndp_catalog` agent, with `analysis` consuming discovered/staged data rather than directly owning NDP search. Benchmark completion requires at least ten human-demoable complex workflows, multiple >2 minute or >10-event runs, tier-3/nanoagent coverage, plotted artifacts, deliberate surfaced failures, and saved evidence for route/expert/tool/artifact/timing/error behavior. See `docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md`. diff --git a/docs/ALCF_DEMO_BENCHMARK_REPORT.md b/docs/ALCF_DEMO_BENCHMARK_REPORT.md index cd7daed..7e20058 100644 --- a/docs/ALCF_DEMO_BENCHMARK_REPORT.md +++ b/docs/ALCF_DEMO_BENCHMARK_REPORT.md @@ -1,28 +1,29 @@ # CLIO ALCF Demo Benchmark Report -Generated: 2026-05-23 21:30:35 Central Daylight Time -Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-hierarchical-ndp-data-v3.jsonl` +Generated: 2026-05-24 00:12:33 Central Daylight Time +Evidence JSONL: `D:\Libraries\Documents\projects\clio-agent\tmp\clio-demo-benchmark-alcf-metis-handoff.jsonl` -Result: 14/14 cases passed. +Result: 13/15 clean passes, 1 expected surfaced errors, 1 partial recoveries, 0 failures. 
## All Cases -| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | -| workflow_hdf5_overview | tooling | auto | dspy | yes | data | hdf5_analyze_file, hdf5_list_datasets | 0 | 17.6s | -| workflow_parquet_profile | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 20.1s | -| workflow_memory_followup | memory | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 22.2s | -| workflow_csv_event_schema | analysis | auto | dspy | yes | analysis | csv_read_table | 0 | 10.1s | -| workflow_visual_dashboard | visualization | auto | dspy | yes | visualization | plot_summary | 0 | 8.1s | -| hdf5_dataset_focus | tooling | auto | dspy | yes | data | hdf5_analyze_dataset | 0 | 7.1s | -| cross_file_triage_nanoagents | multi-agent | auto | guard | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s | -| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | yes | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 2.5s | -| adios_bp5_container | tooling | auto | guard | yes | data | adios_inspect_file | 0 | 0.5s | -| reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | yes | data | adios_inspect_file | 0 | 6.5s | -| dirty_parquet_quality | analysis | auto | dspy | yes | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 2.5s | -| ndp_catalog_discovery | external-catalog | auto | dspy 
| yes | data | ndp_list_organizations, ndp_search_datasets | 0 | 8.5s | -| visual_scatter_artifact | visualization | auto | dspy | yes | visualization | plot_scatter | 0 | 6.1s | -| missing_hdf5_error | hardening | auto | dspy | yes | data | hdf5_list_datasets | 0 | 4.5s | +| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| workflow_hdf5_overview | tooling | auto | dspy | pass | data | data x8 | hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset | 0 | 13.1s | +| workflow_parquet_profile | analysis | auto | dspy | pass | analysis | analysis x5 | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 5.1s | +| workflow_memory_followup | memory | auto | dspy | partial | analysis | analysis | parquet_analyze_schema | 0 | 36.3s | +| workflow_csv_event_schema | analysis | auto | dspy | pass | analysis | analysis | csv_read_table | 0 | 3.5s | +| workflow_visual_dashboard | visualization | auto | dspy | pass | visualization | visualization | plot_summary | 0 | 3.1s | +| hdf5_dataset_focus | tooling | auto | dspy | pass | data | data | hdf5_analyze_dataset | 0 | 2.5s | +| cross_file_triage_nanoagents | multi-agent | auto | guard | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 0.5s | +| reasoning_cross_file_triage_nanoagents | planner-hardening | reasoning_only | dspy | pass | analysis | analysis | hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table | 4 | 1.0s | +| adios_bp5_container | tooling | auto | guard | pass | data | data | adios_inspect_file | 0 | 0.5s | +| 
reasoning_adios_bp5_container | planner-hardening | reasoning_only | dspy | pass | data | data | adios_inspect_file | 0 | 2.5s | +| dirty_parquet_quality | analysis | auto | dspy | pass | analysis | analysis | parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics | 0 | 1.5s | +| ndp_catalog_discovery | external-catalog | auto | dspy | pass | ndp_catalog | ndp_catalog | ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource | 0 | 11.1s | +| ndp_seismic_waveform_to_plot | hierarchical-science | auto | dspy | pass | visualization | ndp_catalog, analysis, sac_format, visualization | ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces | 0 | 20.6s | +| visual_scatter_artifact | visualization | auto | dspy | pass | visualization | visualization | plot_scatter | 0 | 2.0s | +| missing_hdf5_error | hardening | auto | dspy | expected_error | data | data | hdf5_list_datasets | 0 | 2.5s | ## Best 10 Demo Prompts @@ -33,6 +34,7 @@ Category: multi-agent Routing mode: `auto` Status: pass Selected agent: `analysis` +Expert handoffs: analysis Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent Artifacts: none @@ -65,7 +67,7 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x. -Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c +Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf ``` ### 2. 
No-guard cross-file triage @@ -75,10 +77,11 @@ Category: planner-hardening Routing mode: `reasoning_only` Status: pass Selected agent: `analysis` +Expert handoffs: analysis Tools: hdf5_analyze_file, hdf5_list_datasets, adios_inspect_file, parquet_analyze_schema, parquet_compute_statistics, csv_read_table Child sessions: csv_validator subagent, analysis_validator subagent, adios_validator subagent, data_validator subagent Artifacts: none -Elapsed: 2.5s +Elapsed: 1.0s Prompt: @@ -107,77 +110,75 @@ Inspected HDF5 file D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmar Compression summary: 5 compressed, 2 uncompressed. Overall raw-to-stored ratio is about 1.12x. -Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf c +Compression is partially configured. Review uncompressed numeric datasets and consider chunked gzip or lzf ``` -### 3. Memory follow-up without repeating path +### 3. NDP seismic waveform discovery to plot -Case: `workflow_memory_followup` -Category: memory +Case: `ndp_seismic_waveform_to_plot` +Category: hierarchical-science Routing mode: `auto` Status: pass -Selected agent: `analysis` -Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics +Selected agent: `visualization` +Expert handoffs: ndp_catalog, analysis, sac_format, visualization +Tools: ndp_list_organizations, ndp_search_datasets, ndp_search_datasets, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource, sac_inspect_archive, sac_compute_trace_statistics, sac_plot_traces Child sessions: none -Artifacts: none -Elapsed: 22.2s +Artifacts: D:\Libraries\Documents\projects\clio-agent\.clio-agent-artifacts\charts\sac_traces_Pachhai_etal_2023_ScP_data.png +Elapsed: 20.6s Prompt: ```text -Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. 
Do not ask me for the path again. +Find a bounded seismic waveform dataset from a seismological or Earth-science organization in the National Data Platform. Choose a usable resource, stage it if it is small enough, inspect the waveform content, compute representative trace statistics, and produce a plot artifact. If a candidate is too large or unavailable, surface that as the result instead of inventing a plot. ``` -What to see: CLIO resolves the previously profiled Parquet file from session context. +What to see: CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform resource, analyzes SAC traces through sac_format, and creates a PNG plot. -Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths. +Why this is interesting: This is the core hierarchical science demo: provider discovery, data access, format-specific analysis, and visualization without the user naming internal agents. Observed excerpt: ```text -**Parquet file:** `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet` - -| Property | Value | -|----------|-------| -| **Rows** | 3 000 | -| **Row groups** | 8 | -| **File size** | 154 818 bytes | -| **Created by** | parquet‑cpp‑arrow v22.0.0 | -| **Key‑value metadata** | `scenario = clio-stress-benchmark`
`source = deterministic` | -| **Columns (10 total, 4 shown below)** | `sample_id (int64, nullable)`
`run_id (string, nullable)`
`site (string, nullable)`
`temperature_k (double, nullable)`
`pressure_pa (double, nullable)`
`humidity_pct (double, nullable)`
`vibration_mm_s (double, nullable)`
`anomaly_score (double, nullable)`
… (2 more columns not listed) | +Data stage: +Queried the National Data Platform catalog through clio-kit MCP. -### Column‑level statistics (useful for anomaly triage) +Organizations matched: 0 -| Column | Min | Max | Mean | Std‑dev | Median | Nulls | Unique values | -|--------|-----|-----|------|------- +Datasets matched: 9 +- Data for: Improved Characterization of Ultralow-velocity Zones through Advances in Bayesian Inversion of ScP Waveforms (marriott_library_hive; PDF, TAR, TXT; resources: Pachhai_etal_2023_ScP_data.tar) +- Data for: A Compositional Component to the Samoa Ultralow-velocity Zone Revealed through 2- and 3-D Waveform Modeling of SKS and SKKS Differential Travel-times and Amplitudes (marriott_library_hive; GZ, TXT; resources: 2021_Samoa_Data.tar.gz) +- Salton Sea Seismic Data (ucr-earth-and-planetary-sciences; formats not listed; resources: Salton Sea Seismic Waveforms) +- USGS_LPC_VA_Central_Seismic_2013_LAS_2015 (usgs-3dep; GEOJSON; resources: spatial_extents_USGS_LPC_VA_Central_Seismic_2013_LAS_2015.json) +- Central U.S. ARRA Lidar, New Madrid Seismic Zone (opentopogra ``` -### 4. Targeted scatter plot +### 4. 
HDF5 fusion file overview -Case: `visual_scatter_artifact` -Category: visualization +Case: `workflow_hdf5_overview` +Category: tooling Routing mode: `auto` Status: pass -Selected agent: `visualization` -Tools: plot_scatter +Selected agent: `data` +Expert handoffs: data x8 +Tools: hdf5_analyze_file, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset, hdf5_analyze_dataset Child sessions: none -Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png, D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png -Elapsed: 6.1s +Artifacts: none +Elapsed: 13.1s Prompt: ```text -Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal. +I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention? ``` -What to see: Visualization expert chooses plot_scatter and saves a PNG artifact. +What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression. -Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool. +Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis. Observed excerpt: ```text -The scatter plot (saved at D:\\Libraries\\Documents\\projects\\clio-agent\\tmp\\clio-benchmark-data\\facility_measurements_scatter.png) visualizes the relationship between vibration_mm_s (x‑axis) and anomaly_score (y‑axis), showing whether higher vibration values tend to correspond to higher anomaly scores. 
+The file contains 7 datasets: axes/diagnostic_channel (12‑int32, no compression), axes/radius_norm (64‑float64, no compression), axes/time_hours (96‑float64, gzip‑4, chunk 24), diagnostics/heat_flux (96×12‑float32, gzip‑5, units MW/m²), plasma/density (96×64‑float64, gzip‑4, units m⁻³), plasma/electron_temperature (96×64‑float32, gzip‑6, units eV), quality/flags (96×12‑int8, gzip‑4); overall 5 of 7 datasets are gzip‑compressed with overall compression ratio ~1.12. ``` ### 5. Parquet facility profile @@ -187,10 +188,11 @@ Category: analysis Routing mode: `auto` Status: pass Selected agent: `analysis` +Expert handoffs: analysis x5 Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics Child sessions: none Artifacts: none -Elapsed: 20.1s +Elapsed: 5.1s Prompt: @@ -205,20 +207,63 @@ Why this is interesting: Checks statistical tool calls and model feedback from m Observed excerpt: ```text -The file has 10 columns (including temperature_k, pressure_pa, humidity_pct, anomaly_score), 3000 rows, 8 row groups, and the columns show plausible ranges: temperature 267–321 K, pressure 98 kPa–104 kPa, humidity 4–96 %, anomaly_score 0.06–6.35 (mean 1.6), indicating sane data. +The Parquet file has 10 columns (sample_id int64, run_id string, site string, temperature_k double, pressure_pa double, humidity_pct double, vibration_mm_s double, anomaly_score double, plus two others), 3000 rows across 8 row groups, and the temperature, pressure, humidity, and anomaly_score statistics fall within expected physical ranges, so they appear sane. +``` + +### 6. 
NDP catalog discovery + +Case: `ndp_catalog_discovery` +Category: external-catalog +Routing mode: `auto` +Status: pass +Selected agent: `ndp_catalog` +Expert handoffs: ndp_catalog +Tools: ndp_list_organizations, ndp_search_datasets, ndp_get_dataset_details, ndp_stage_resource +Child sessions: none +Artifacts: none +Elapsed: 11.1s + +Prompt: + +```text +Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download. +``` + +What to see: Data expert delegates discovery to NDP tools through the CLIO gateway. + +Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data. + +Observed excerpt: + +```text +Queried the National Data Platform catalog through clio-kit MCP. + +Organizations matched: 3 +- noaa-global-systems-laboratory +- noaa-national-centers-for-environmental-information-ncei +- noaa-demo + +Datasets matched: 4 +- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network) +- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010) +- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit) +- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data) + +Staging note: CLIO staged the selected NDP resource at D:\Libraries\Documents\projects\clio- ``` -### 6. Dirty Parquet quality review +### 7. 
Dirty Parquet quality review Case: `dirty_parquet_quality` Category: analysis Routing mode: `auto` Status: pass Selected agent: `analysis` +Expert handoffs: analysis Tools: parquet_analyze_schema, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics, parquet_compute_statistics Child sessions: none Artifacts: none -Elapsed: 2.5s +Elapsed: 1.5s Prompt: @@ -247,35 +292,7 @@ site: null_count=0, unique_count=6 temperature_k: min=267.0871843867271, max=320.74640790864, mean=293.98862688749205, median=294.1027798559147, std=7.517766411153003, null_count=18, unique_count=2982 pressure_pa: min=98435.39430602924, max=103998.62554169081, mean=101230.9576727976, median=101229.54620674164, std=767.0242855536812, null_count=14, unique_count=2986 -Recommendatio -``` - -### 7. HDF5 fusion file overview - -Case: `workflow_hdf5_overview` -Category: tooling -Routing mode: `auto` -Status: pass -Selected agent: `data` -Tools: hdf5_analyze_file, hdf5_list_datasets -Child sessions: none -Artifacts: none -Elapsed: 17.6s - -Prompt: - -```text -I need to brief collaborators on this fusion output: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\fusion_run.h5. What datasets are inside, what shapes and units matter, and what compression details should I mention? -``` - -What to see: Data expert uses HDF5 tools and summarizes datasets, units, and compression. - -Why this is interesting: Proves real HDF5 path handling, tool argument generation, and grounded synthesis. - -Observed excerpt: - -```text -The HDF5 file holds seven datasets—axes/diagnostic_channel (12 int32), axes/radius_norm (64 float64), axes/time_hours (96 float64), diagnostics/heat_flux (96×12 float32, units MW/m²), plasma/density (96×64 float64, units m⁻³), plasma/electron_temperature (96×64 float32, units eV), and quality/flags (96×12 int8)—with five of them compressed, giving a total stored size of 72 331 bytes versus 80 816 bytes raw (overall compression ratio ≈1.12). 
+Recommendati ``` ### 8. Follow-up visualization artifact @@ -285,10 +302,11 @@ Category: visualization Routing mode: `auto` Status: pass Selected agent: `visualization` +Expert handoffs: visualization Tools: plot_summary Child sessions: none Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png -Elapsed: 8.1s +Elapsed: 3.1s Prompt: @@ -303,75 +321,89 @@ Why this is interesting: Shows multi-turn handoff from analysis to visualization Observed excerpt: ```text -The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and provides a compact 2×2 summary of the facility_measurements Parquet file, visualizing the key columns (temperature, pressure, humidity, and anomaly_score) with histograms and overview plots of their distributions and relationships. +The dashboard was saved to D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_dashboard.png and it presents a 2×2 summary of the facility measurements dataset, visualizing key numeric columns such as temperature, pressure, humidity, and anomaly score. ``` -### 9. NDP catalog discovery +### 9. Targeted scatter plot -Case: `ndp_catalog_discovery` -Category: external-catalog +Case: `visual_scatter_artifact` +Category: visualization Routing mode: `auto` Status: pass -Selected agent: `data` -Tools: ndp_list_organizations, ndp_search_datasets +Selected agent: `visualization` +Expert handoffs: visualization +Tools: plot_scatter Child sessions: none -Artifacts: none -Elapsed: 8.5s +Artifacts: D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements_scatter.png +Elapsed: 2.0s Prompt: ```text -Find a few NOAA or climate-related datasets in the National Data Platform catalog that might complement this facility data. Summarize what you found and what I should verify before download. 
+Create a scatter plot from D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet with vibration_mm_s on the x-axis and anomaly_score on the y-axis. Save it as a PNG and explain what relationship the plot is meant to reveal. ``` -What to see: Data expert delegates discovery to NDP tools through the CLIO gateway. +What to see: Visualization expert chooses plot_scatter and saves a PNG artifact. -Why this is interesting: Exercises external catalog discovery as a data-stage capability, before analysis consumes staged data. +Why this is interesting: Checks whether a specific visualization intent maps to the right chart tool. Observed excerpt: ```text -Queried the National Data Platform catalog through clio-kit MCP. - -Organizations matched: 3 -- noaa-global-systems-laboratory -- noaa-national-centers-for-environmental-information-ncei -- noaa-demo - -Datasets matched: 4 -- Full Climate Connectivity Network (california-landscape-metrics; HTML, SHP, WFS, WMS; resources: [WMS] Full Climate Connectivity Network) -- Environment Canada Climate Data (ndp-admin-ingest; GEOJSON, CSV; resources: Environment Canada Climate Data) -- Climate Refugia - Baseline (Historical) 1981 - 2010 (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Climate Refugia - Baseline (Historical) 1981 - 2010) -- Change in Average Climatic Water Deficit (california-landscape-metrics; GEOTIFF, HTML, WCS, WMS; resources: [WMS] Change in Average Climatic Water Deficit) - -Recommendations: -Treat these as discovery results owned by the data stage. Use ndp_get_datase +The scatter plot (facility_measurements_scatter.png) visualizes vibration_mm_s versus anomaly_score, revealing whether higher vibration levels correspond to higher anomaly scores—typically a positive trend indicates that increased vibration is associated with greater anomaly likelihood. ``` -### 10. No-guard ADIOS/BP5 route +### 10. 
Memory follow-up without repeating path -Case: `reasoning_adios_bp5_container` -Category: planner-hardening -Routing mode: `reasoning_only` -Status: pass -Selected agent: `data` -Tools: adios_inspect_file +Case: `workflow_memory_followup` +Category: memory +Routing mode: `auto` +Status: partial +Selected agent: `analysis` +Expert handoffs: analysis +Tools: parquet_analyze_schema Child sessions: none Artifacts: none -Elapsed: 6.5s +Elapsed: 36.3s Prompt: ```text -This ADIOS BP5 output came from a Gray-Scott run: "D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\gray scott noise 0.01 data.bp5". Tell me what the container looks like, whether profiling metadata is present, and what extra runtime is needed if variable-level metadata is unavailable. +Based on the Parquet file we just profiled, compute whatever schema or column statistics you need for a quick anomaly triage view. Do not ask me for the path again. ``` -What to see: With routing guards disabled, the planner still selects the data expert. +What to see: CLIO resolves the previously profiled Parquet file from session context. -Why this is interesting: Checks that BP5 routing is not only a hardcoded suffix guard behavior. +Why this is interesting: Demonstrates session memory and current-file resolution instead of copy/paste paths. Observed excerpt: ```text -The BP5 container is a directory containing five files (data.0, md.0, md.idx, mmd.0, profiling.json) totaling about 84 MB, it includes profiling metadata (profiling.json), and you need to install the ADIOS2 Python bindings/runtime to read variable‑level metadata. 
+**Facility Measurements – Quick Anomaly‑Triaging Summary**
+
+| Item | Value |
+|------|-------|
+| **File** | `D:\Libraries\Documents\projects\clio-agent\tmp\clio-benchmark-data\facility_measurements.parquet` |
+| **Rows** | 3 000 |
+| **Row groups** | 8 |
+| **File size** | 154 818 bytes |
+| **Created by** | parquet‑cpp‑arrow v22.0.0 |
+| **Key‑value metadata** | `scenario = clio-stress-benchmark` `source = deterministic` |
+
+### Schema (10 columns)
+
+| Column | Type | Nullable | Role for Anomaly Triage |
+|--------|------|----------|--------------------------|
+| `sample_id` | int64 | Yes | Identifier |
+| `run_id` | string | Yes | Identifier |
+| `site` | string | Yes | Identifier / grouping |
+| `temperature_k` | double | Yes | Primary sensor metric |
+| `pressure_pa` | double | Yes | Primary sensor metric |
+| `humidity_pct` | double | Yes | Primary sensor metric |
+| `vibration_mm_s` | double
 ```
+
+## Partial Recovery Caveats
+
+- `workflow_memory_followup`: Agent planner failed after completed tool observations.
+ stage=post_observation_planning, tools=parquet_analyze_schema
diff --git a/scripts/run_demo_benchmark.py b/scripts/run_demo_benchmark.py
index c96d5b5..20576bb 100644
--- a/scripts/run_demo_benchmark.py
+++ b/scripts/run_demo_benchmark.py
@@ -36,9 +36,10 @@ class DemoCase:
     expected: str
     session_group: str
     timeout_s: float = 480.0
-    expected_agent: str = ""
+    expected_agent: str | tuple[str, ...] = ""
     expected_tool_prefixes: tuple[str, ...] = ()
     expected_tools: tuple[str, ...] = ()
+    expected_handoff_agents: tuple[str, ...] = ()
     expected_terms: tuple[str, ...] = ()
     min_children: int = 0
     expects_error: bool = False
@@ -82,23 +83,62 @@ def artifacts(self) -> list[str]:
         """Return artifact path candidates found in tools or text."""
         return _artifact_paths(self.message)
 
+    @property
+    def expert_handoffs(self) -> list[dict[str, Any]]:
+        """Return expert handoff provenance metadata."""
+        return _expert_handoffs(self.message)
+
+    @property
+    def handoff_agent_ids(self) -> list[str]:
+        """Return unique expert IDs observed in handoff provenance."""
+        seen: set[str] = set()
+        ids: list[str] = []
+        for row in self.expert_handoffs:
+            agent_id = str(row.get("agent_id") or "")
+            if not agent_id or agent_id in seen:
+                continue
+            seen.add(agent_id)
+            ids.append(agent_id)
+        return ids
+
+    @property
+    def handoff_event_count(self) -> int:
+        """Return the number of recorded expert handoff events."""
+        return len(self.expert_handoffs)
+
     @property
     def blocking_error(self) -> dict[str, Any] | None:
-        """Return blocking error_info, allowing documented partial recovery."""
+        """Return error_info that should fail or satisfy error-focused cases."""
         return _blocking_error(self.message)
 
+    @property
+    def partial_error(self) -> dict[str, Any] | None:
+        """Return surfaced partial-recovery metadata, if this turn had any."""
+        return _partial_error(self.message)
+
     @property
     def passed(self) -> bool:
         """Return whether this case satisfied its declared expectations."""
         if self.case.expects_error:
             return self.blocking_error is not None and not self.text.strip()
-        if self.blocking_error is not None:
+        if self.partial_error is not None:
             return False
-        if self.case.expected_agent and self.selected_agent != self.case.expected_agent:
+        if self.blocking_error is not None:
             return False
+        if self.case.expected_agent:
+            expected_agents = (
+                (self.case.expected_agent,)
+                if isinstance(self.case.expected_agent, str)
+                else self.case.expected_agent
+            )
+            if self.selected_agent not in expected_agents:
+                return False
         for expected_tool in self.case.expected_tools:
             if expected_tool not in self.tool_names:
                 return False
+        for expected_agent in self.case.expected_handoff_agents:
+            if expected_agent not in self.handoff_agent_ids:
+                return False
         for prefix in self.case.expected_tool_prefixes:
             if not any(name.startswith(prefix) for name in self.tool_names):
                 return False
@@ -110,12 +150,22 @@ def passed(self) -> bool:
                 return False
         return True
 
+    @property
+    def outcome(self) -> str:
+        """Return a human-readable outcome category."""
+        if self.case.expects_error:
+            return "expected_error" if self.passed else "fail"
+        if self.partial_error is not None:
+            return "partial"
+        return "pass" if self.passed else "fail"
+
     @property
     def complexity_score(self) -> int:
         """Score cases for the best-demo report."""
         return (
             len(set(self.tool_names)) * 3
             + len(self.tools)
+            + len(self.expert_handoffs) * 4
             + len(self.child_sessions) * 6
             + len(self.artifacts) * 4
             + len(self.case.complexity_tags) * 2
@@ -145,25 +195,36 @@ def _tools(message: dict[str, Any]) -> list[dict[str, Any]]:
     return rows if isinstance(rows, list) else []
 
 
+def _expert_handoffs(message: dict[str, Any]) -> list[dict[str, Any]]:
+    rows = (message.get("metadata") or {}).get("expert_handoffs") or []
+    return rows if isinstance(rows, list) else []
+
+
 def _tool_name(row: dict[str, Any]) -> str:
     return str(row.get("name") or row.get("tool") or "")
 
 
 def _blocking_error(message: dict[str, Any]) -> dict[str, Any] | None:
+    error_info = message.get("error_info")
+    if not isinstance(error_info, dict):
+        return None
+    if _partial_error(message) is not None:
+        return None
+    return error_info
+
+
+def _partial_error(message: dict[str, Any]) -> dict[str, Any] | None:
     error_info = message.get("error_info")
     if not isinstance(error_info, dict):
         return None
     details = error_info.get("details")
-    if (
-        isinstance(details, dict)
-        and details.get("partial") is True
-        and details.get("stage")
-        in {
-            "post_observation_planning",
-            "parallel_validation_recovery",
-            "step_limit_after_observations",
-        }
-    ):
+    if not isinstance(details, dict) or details.get("partial") is not True:
+        return None
+    if details.get("stage") not in {
+        "post_observation_planning",
+        "parallel_validation_recovery",
+        "step_limit_after_observations",
+    }:
         return None
     return error_info
 
@@ -240,13 +301,18 @@ def _case_row(result: DemoResult) -> dict[str, Any]:
         "session_id": result.session_id,
         "elapsed_s": round(result.elapsed_s, 3),
         "passed": result.passed,
+        "outcome": result.outcome,
         "selected_agent": result.selected_agent,
         "routing_decision": _routing_decision(result.message),
         "tools_called": result.tools,
         "tool_names": result.tool_names,
+        "expert_handoffs": result.expert_handoffs,
+        "handoff_agent_ids": result.handoff_agent_ids,
+        "handoff_event_count": result.handoff_event_count,
         "child_sessions": result.child_sessions,
         "artifacts": result.artifacts,
         "error_info": result.message.get("error_info"),
+        "partial_error": result.partial_error,
         "stop_reason": result.message.get("stop_reason"),
         "provider": result.provider,
         "routing_mode": result.case.routing_mode,
@@ -465,8 +531,9 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
             title="NDP catalog discovery",
             category="external-catalog",
             session_group="ndp",
-            expected_agent="data",
+            expected_agent=("data", "ndp_catalog"),
             expected_tool_prefixes=("ndp_",),
+            expected_handoff_agents=("ndp_catalog",),
             expected_terms=("dataset",),
             timeout_s=620.0,
             complexity_tags=("ndp", "clio-kit", "external-mcp"),
@@ -481,6 +548,42 @@ def _make_cases(manifest: dict[str, Any]) -> list[DemoCase]:
                 "analysis consumes staged data."
             ),
         ),
+        DemoCase(
+            case_id="ndp_seismic_waveform_to_plot",
+            title="NDP seismic waveform discovery to plot",
+            category="hierarchical-science",
+            session_group="ndp_seismic",
+            expected_agent=("visualization", "analysis", "data", "ndp_catalog"),
+            expected_tool_prefixes=("ndp_", "sac_"),
+            expected_handoff_agents=("ndp_catalog", "sac_format"),
+            expected_terms=("SAC", ".png"),
+            timeout_s=900.0,
+            complexity_tags=(
+                "ndp",
+                "earthscience",
+                "tier-3",
+                "sac",
+                "data-analysis-visualization",
+                "artifact",
+            ),
+            prompt=(
+                "Find a bounded seismic waveform dataset from a seismological or "
+                "Earth-science organization in the National Data Platform. Choose a usable "
+                "resource, stage it if it is small enough, inspect the waveform content, "
+                "compute representative trace statistics, and produce a plot artifact. If a "
+                "candidate is too large or unavailable, surface that as the result instead "
+                "of inventing a plot."
+            ),
+            expected=(
+                "CLIO delegates NDP discovery to ndp_catalog, stages a bounded waveform "
+                "resource, analyzes SAC traces through sac_format, and creates a PNG plot."
+            ),
+            why=(
+                "This is the core hierarchical science demo: provider discovery, data "
+                "access, format-specific analysis, and visualization without the user naming "
+                "internal agents."
+            ),
+        ),
         DemoCase(
             case_id="visual_scatter_artifact",
             title="Targeted scatter plot",
@@ -536,7 +639,14 @@ def _create_sessions(http: httpx.Client, cases: list[DemoCase]) -> dict[str, str
     return session_ids
 
 
-def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path: Path) -> int:
+def run_benchmark(
+    base_url: str,
+    data_dir: Path,
+    output_jsonl: Path,
+    report_path: Path,
+    *,
+    case_delay_s: float = 0.0,
+) -> int:
     """Run demo cases and write JSONL plus a markdown report."""
     manifest = create_benchmark_data(data_dir)
     cases = _make_cases(manifest)
@@ -573,20 +683,25 @@ def run_benchmark(base_url: str, data_dir: Path, output_jsonl: Path, report_path
             results.append(result)
             log.write(json.dumps(_case_row(result), ensure_ascii=False, default=str) + "\n")
             log.flush()
-            status = "PASS" if result.passed else "FAIL"
+            status = result.outcome.upper()
             print(
                 f" {status} agent={result.selected_agent or '-'} "
                 f"tools={','.join(result.tool_names) or '-'} "
                 f"children={len(result.child_sessions)} elapsed={elapsed_s:.1f}s",
                 flush=True,
             )
+            if case_delay_s > 0 and index < len(cases):
+                time.sleep(case_delay_s)
 
     report_path.write_text(_render_report(results, output_jsonl), encoding="utf-8")
     return 0 if all(result.passed for result in results) else 1
 
 
 def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
-    passed = sum(1 for result in results if result.passed)
+    clean_passes = sum(1 for result in results if result.outcome == "pass")
+    expected_errors = sum(1 for result in results if result.outcome == "expected_error")
+    partials = sum(1 for result in results if result.outcome == "partial")
+    failures = sum(1 for result in results if result.outcome == "fail")
     best = sorted(results, key=lambda result: result.complexity_score, reverse=True)[:10]
     lines = [
         "# CLIO ALCF Demo Benchmark Report",
@@ -594,12 +709,16 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
         f"Generated: {time.strftime('%Y-%m-%d %H:%M:%S %Z')}",
         f"Evidence JSONL: `{output_jsonl}`",
         "",
-        f"Result: {passed}/{len(results)} cases passed.",
+        (
+            f"Result: {clean_passes}/{len(results)} clean passes, "
+            f"{expected_errors} expected surfaced errors, {partials} partial recoveries, "
+            f"{failures} failures."
+        ),
         "",
         "## All Cases",
         "",
-        "| Case | Category | Mode | Source | Pass | Agent | Tools | Children | Elapsed |",
-        "| --- | --- | --- | --- | --- | --- | --- | --- | --- |",
+        "| Case | Category | Mode | Source | Outcome | Agent | Handoffs | Tools | Children | Elapsed |",
+        "| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |",
     ]
     for result in results:
         route_source = (_routing_decision(result.message).get("metadata") or {}).get(
@@ -613,8 +732,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
                     result.case.category,
                     result.case.routing_mode,
                     str(route_source),
-                    "yes" if result.passed else "no",
+                    result.outcome,
                     result.selected_agent or "-",
+                    _handoff_summary(result) or "-",
                     ", ".join(result.tool_names) or "-",
                     str(len(result.child_sessions)),
                     f"{result.elapsed_s:.1f}s",
@@ -626,6 +746,7 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
     lines.extend(["", "## Best 10 Demo Prompts", ""])
     for rank, result in enumerate(best, start=1):
         tool_text = ", ".join(result.tool_names) or "none"
+        handoff_text = _handoff_summary(result) or "none"
         artifact_text = ", ".join(result.artifacts) or "none"
         child_text = ", ".join(
             child.get("title", child.get("id", "")) for child in result.child_sessions
@@ -637,8 +758,9 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
                 f"Case: `{result.case.case_id}`",
                 f"Category: {result.case.category}",
                 f"Routing mode: `{result.case.routing_mode}`",
-                f"Status: {'pass' if result.passed else 'fail'}",
+                f"Status: {result.outcome}",
                 f"Selected agent: `{result.selected_agent or '-'}`",
+                f"Expert handoffs: {handoff_text}",
                 f"Tools: {tool_text}",
                 f"Child sessions: {child_text or 'none'}",
                 f"Artifacts: {artifact_text}",
@@ -663,10 +785,22 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
             ]
         )
 
-    failures = [result for result in results if not result.passed]
-    if failures:
+    partial_results = [result for result in results if result.outcome == "partial"]
+    if partial_results:
+        lines.extend(["## Partial Recovery Caveats", ""])
+        for result in partial_results:
+            details = (result.partial_error or {}).get("details") or {}
+            lines.extend(
+                [
+                    f"- `{result.case.case_id}`: {result.partial_error.get('message') if result.partial_error else 'partial recovery'}",
+                    f" stage={details.get('stage')}, tools={', '.join(result.tool_names) or '-'}",
+                ]
+            )
+
+    failed_results = [result for result in results if result.outcome == "fail"]
+    if failed_results:
         lines.extend(["## Failures To Investigate", ""])
-        for result in failures:
+        for result in failed_results:
             lines.extend(
                 [
                     f"- `{result.case.case_id}`: expected {result.case.expected}",
@@ -678,6 +812,20 @@ def _render_report(results: list[DemoResult], output_jsonl: Path) -> str:
     return "\n".join(lines).strip() + "\n"
 
 
+def _handoff_summary(result: DemoResult) -> str:
+    """Return compact handoff display text with repeated event counts."""
+    counts: dict[str, int] = {}
+    for row in result.expert_handoffs:
+        agent_id = str(row.get("agent_id") or "")
+        if not agent_id:
+            continue
+        counts[agent_id] = counts.get(agent_id, 0) + 1
+    parts = [
+        agent_id if count == 1 else f"{agent_id} x{count}" for agent_id, count in counts.items()
+    ]
+    return ", ".join(parts)
+
+
 def main() -> None:
     """CLI entry point."""
     parser = argparse.ArgumentParser(description=__doc__)
@@ -704,6 +852,12 @@ def main() -> None:
         default=Path("docs/ALCF_DEMO_BENCHMARK_REPORT.md"),
         help="Output markdown report path.",
     )
+    parser.add_argument(
+        "--case-delay-s",
+        type=float,
+        default=0.0,
+        help="Optional cooldown between cases for rate-limited real providers.",
+    )
     args = parser.parse_args()
     raise SystemExit(
         run_benchmark(
@@ -711,6 +865,7 @@ def main() -> None:
             args.data_dir.resolve(),
             args.output_jsonl.resolve(),
             args.report.resolve(),
+            case_delay_s=max(0.0, args.case_delay_s),
         )
     )
 
diff --git a/src/clio_agent/agent.py b/src/clio_agent/agent.py
index 1d024cc..ad7b135 100644
--- a/src/clio_agent/agent.py
+++ b/src/clio_agent/agent.py
@@ -564,6 +564,19 @@ def forward(
                 answer = ""
         except Exception as e:
             success = False
+            inferred_selected = self._selected_expert_from_trace(trace)
+            if selected == "chat" and inferred_selected != "chat":
+                selected = inferred_selected
+                route = RouteDecision(
+                    target=selected,
+                    source=route.source,
+                    reason=(
+                        f"{route.reason} Selected expert inferred from completed "
+                        "tool provenance after the turn failed."
+                    ).strip(),
+                    confidence=route.confidence,
+                    capabilities=route.capabilities,
+                )
             if isinstance(e, RoutingError):
                 error_info = self._with_recovery_actions(e.to_dict())
                 answer = ""
@@ -1745,8 +1758,22 @@ def _execute_tool_action(
                 )
             }
 
+        owner = self._selected_expert_for_tool(tool_name)
         if tool_name in visualization_tools:
-            return self._execute_visualization_tool(tool_name, visualization_tools[tool_name], args)
+            result = self._execute_visualization_tool(
+                tool_name,
+                visualization_tools[tool_name],
+                args,
+            )
+            self._record_direct_tool_handoff(
+                trace,
+                expert_id=owner,
+                tool_name=tool_name,
+                args=args,
+                result=result,
+                duration_ms=self._last_tool_duration_ms(trace, tool_name),
+            )
+            return result
 
         start = time.time()
         try:
@@ -1764,8 +1791,49 @@ def _execute_tool_action(
             duration_ms=duration_ms,
             ok=tool_result_ok(result),
         )
+        self._record_direct_tool_handoff(
+            trace,
+            expert_id=owner,
+            tool_name=tool_name,
+            args=args,
+            result=result,
+            duration_ms=duration_ms,
+        )
         return result
 
+    @staticmethod
+    def _last_tool_duration_ms(trace: RunTrace, tool_name: str) -> float:
+        """Return the latest recorded duration for a tool in an active trace."""
+        for observation in reversed(trace.tools):
+            if observation.tool == tool_name:
+                return observation.duration_ms
+        return 0.0
+
+    def _record_direct_tool_handoff(
+        self,
+        trace: RunTrace,
+        *,
+        expert_id: str,
+        tool_name: str,
+        args: Mapping[str, Any],
+        result: Any,
+        duration_ms: float,
+    ) -> None:
+        """Record the expert boundary for a planner-selected direct tool call."""
+        if not expert_id:
+            return
+        self._record_expert_handoff(
+            trace,
+            expert_id=expert_id,
+            dispatch_target=tool_name,
+            stage="direct_tool",
+            input_summary=f"{tool_name}({', '.join(sorted(str(key) for key in args))})",
+            result=result,
+            status="success" if tool_result_ok(result) else "failure",
+            duration_ms=duration_ms,
+            error=None if tool_result_ok(result) else self._coerce_text(result),
+        )
+
     def _tool_action_scope_error(
         self,
         tool_name: str,
diff --git a/tests/test_core/test_agent_planner.py b/tests/test_core/test_agent_planner.py
index 33b7e7d..7ee4b5e 100644
--- a/tests/test_core/test_agent_planner.py
+++ b/tests/test_core/test_agent_planner.py
@@ -14,7 +14,7 @@ import pytest
 
 
 from clio_agent.agent import ClioAgent, cancellation_checker
-from clio_agent.errors import CancellationError, RoutingError
+from clio_agent.errors import CancellationError, ProviderError, RoutingError
 from clio_agent.harness import RouteDecision, RunTrace
 from clio_agent.registry.registry import AgentCapability
 from clio_agent.tools.execution import set_global_tool_observer
@@ -689,6 +689,67 @@ def test_shell_tool_still_runs_for_utility_diagnostics(self, agent):
         assert answer == "It is today."
         agent._execute_tool_action.assert_called_once()
 
+    def test_direct_tool_action_records_owner_handoff(self, agent):
+        trace = _trace()
+        agent._known_tool_names = MagicMock(return_value={"parquet_analyze_schema"})
+        agent.tool_executor.call_tool = MagicMock(
+            return_value={"filepath": "data.parquet", "num_rows": 3, "ok": True}
+        )
+
+        result = agent._execute_tool_action(
+            "parquet_analyze_schema",
+            {"filepath": "data.parquet"},
+            trace,
+        )
+
+        assert result["num_rows"] == 3
+        assert trace.tools[0].tool == "parquet_analyze_schema"
+        assert len(trace.expert_handoffs) == 1
+        handoff = trace.expert_handoffs[0]
+        assert handoff.agent_id == "analysis"
+        assert handoff.dispatch_target == "parquet_analyze_schema"
+        assert handoff.stage == "direct_tool"
+        assert handoff.status == "success"
+
+    def test_provider_error_after_tool_keeps_inferred_expert(self, agent):
+        agent._plan_next_action = MagicMock(
+            side_effect=[
+                {
+                    "action": "tool",
+                    "tool": "parquet_analyze_schema",
+                    "args": {"filepath": "data.parquet"},
+                    "reason": "inspect parquet",
+                },
+                {"action": "answer", "answer": "", "reason": "answer from observations"},
+            ]
+        )
+
+        def execute(tool_name, raw_args, trace, **_kwargs):
+            trace.record_tool(
+                tool=tool_name,
+                params=raw_args,
+                result={"filepath": "data.parquet", "num_rows": 3, "ok": True},
+                duration_ms=1.0,
+                ok=True,
+            )
+            return {"filepath": "data.parquet", "num_rows": 3, "ok": True}
+
+        agent._execute_tool_action = MagicMock(side_effect=execute)
+        agent._synthesize_agent_answer = MagicMock(
+            side_effect=ProviderError(
+                "CLIO could not synthesize a final answer from the completed observations.",
+                details={"stage": "answer_synthesis", "original_error": "rate limit"},
+            )
+        )
+
+        result = agent.forward("Summarize data.parquet", session_id="provider-error-inference")
+
+        assert result.selected_expert == "analysis"
+        assert result.answer == ""
+        assert result.error_info is not None
+        assert result.error_info["error"] == "provider_error"
+        assert "inferred from completed tool provenance" in result.route_reason
+
     def test_shell_scope_validation_ignores_compiled_tool_context(self, agent, tmp_path):
         csv_path = tmp_path / "events.csv"
         csv_path.write_text("event_id,status\n1,ok\n", encoding="utf-8")