feat(benchmark): record hierarchical handoff evidence by JaimeCernuda · Pull Request #318 · iowarp/clio-agent

JaimeCernuda · 2026-05-24T05:19:38Z

Summary

- Records direct planner-selected tool calls as handoff events attributed to the expert that owns the tool.
- Keeps the inferred expert attribution when a provider error occurs after a successful tool observation.
- Tightens the demo benchmark runner so a partial recovery is reported as partial, not as a clean pass.
- Adds the natural NDP -> SAC -> visualization demo case and refreshes the ALCF benchmark report.
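The first two summary items can be illustrated with a minimal sketch. It is not the code from this PR: the `HandoffLog` class, the `TOOL_OWNERS` mapping, and the tool names are hypothetical. Only the expert names `ndp_catalog`, `sac_format`, and `visualization` come from the description. The sketch assumes each tool has exactly one owning expert and that a provider error should leave the last inferred expert unchanged.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical tool -> owning expert mapping (tool names are illustrative).
TOOL_OWNERS = {
    "search_catalog": "ndp_catalog",
    "read_sac": "sac_format",
    "plot_waveform": "visualization",
}


@dataclass
class HandoffLog:
    """Tracks which expert owns the work, based on the tools the planner calls."""

    events: List[Tuple[str, str, str]] = field(default_factory=list)
    current_expert: Optional[str] = None

    def record_tool_call(self, tool: str) -> None:
        """Record a handoff when a direct tool call moves work to a new owner."""
        owner = TOOL_OWNERS.get(tool)
        if owner is None or owner == self.current_expert:
            return  # unknown tool, or still the same expert: no handoff
        source = self.current_expert or "planner"
        self.events.append((source, owner, tool))
        self.current_expert = owner

    def record_provider_error(self, message: str) -> Optional[str]:
        """Keep the inferred expert: an error after a tool call does not reset it."""
        return self.current_expert
```

Calling `record_tool_call` with `"search_catalog"` and then `"read_sac"` yields two events, and a later `record_provider_error` still reports `sac_format` as the attributed expert.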

Live ALCF/Metis run on gpt-oss-120b: 13/15 clean passes, 1 expected surfaced error, 1 partial recovery, 0 hard failures
Strongest demo: ndp_seismic_waveform_to_plot with ndp_catalog -> analysis -> sac_format -> visualization handoffs and SAC PNG artifact
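The four outcome categories in the run summary can be sketched as a small classifier. This is an assumption about how a runner might separate them, not the logic in `scripts/run_demo_benchmark.py`: the `Outcome` enum, the `classify` function, and its three boolean inputs are hypothetical names chosen for illustration.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    CLEAN_PASS = "clean_pass"
    EXPECTED_ERROR = "expected_error"
    PARTIAL_RECOVERY = "partial_recovery"
    HARD_FAILURE = "hard_failure"


def classify(answered: bool, error_seen: bool, error_expected: bool = False) -> Outcome:
    """Classify one benchmark case (hypothetical rules).

    - An error the case was designed to surface counts as EXPECTED_ERROR.
    - No usable answer counts as HARD_FAILURE.
    - An answer produced despite an error counts as PARTIAL_RECOVERY,
      so it is never reported as a clean pass.
    """
    if error_seen and error_expected:
        return Outcome.EXPECTED_ERROR
    if not answered:
        return Outcome.HARD_FAILURE
    if error_seen:
        return Outcome.PARTIAL_RECOVERY
    return Outcome.CLEAN_PASS


# Rebuild the tally reported above: 13 clean, 1 expected error, 1 partial.
cases = (
    [classify(answered=True, error_seen=False)] * 13
    + [classify(answered=False, error_seen=True, error_expected=True)]
    + [classify(answered=True, error_seen=True)]
)
tally = Counter(cases)
```

The design point is the order of the checks: the error flag is tested before a case can fall through to `CLEAN_PASS`, which keeps a recovered run from being counted as clean.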

uv run ruff check src/ tests/ scripts/run_demo_benchmark.py scripts/create_benchmark_data.py scripts/create_demo_data.py
uv run pytest tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_direct_tool_action_records_owner_handoff tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_provider_error_after_tool_keeps_inferred_expert tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_tool_observation_then_planner_failure_synthesizes_partial_answer tests/test_core/test_agent_dispatch.py::TestForwardDispatch::test_parquet_file_followup_reuses_last_session_path -q
uv run pytest tests/ -> 1156 passed, 37 skipped

Commit 18c0a90: feat(benchmark): record hierarchical handoff evidence

JaimeCernuda merged commit 49daf09 into develop May 24, 2026
1 check failed

JaimeCernuda deleted the feat/hierarchical-demo-runner-evidence-20260524 branch May 24, 2026 05:20