feat(benchmark): record hierarchical handoff evidence #318

Merged: JaimeCernuda merged 1 commit into develop from feat/hierarchical-demo-runner-evidence-20260524 on May 24, 2026

@JaimeCernuda
Collaborator

Summary

  • records direct planner-selected tool calls as handoff events to the expert that owns the tool
  • keeps inferred expert attribution when provider errors happen after successful tool observations
  • tightens the demo benchmark runner so partial recovery is reported as partial, not as a clean pass
  • adds the natural NDP -> SAC -> visualization demo case and refreshes the ALCF benchmark report
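
The first two points above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `TOOL_OWNERS`, the tool names in it, and `RunTrace` are hypothetical names invented for the example. It assumes each tool is owned by exactly one expert and that the inferred expert is the owner of the most recent tool call.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tool-to-expert ownership map (tool names are made up for illustration).
TOOL_OWNERS = {
    "search_catalog": "ndp_catalog",
    "read_sac": "sac_format",
    "plot_waveform": "visualization",
}


@dataclass
class RunTrace:
    """Minimal record of one agent run (hypothetical structure)."""

    handoffs: List[str] = field(default_factory=list)
    observations: int = 0
    provider_error: Optional[str] = None

    def record_tool_call(self, tool: str) -> None:
        """Record a handoff to the expert owning `tool`, skipping consecutive repeats."""
        owner = TOOL_OWNERS.get(tool, "unknown")
        if not self.handoffs or self.handoffs[-1] != owner:
            self.handoffs.append(owner)
        self.observations += 1

    def record_provider_error(self, message: str) -> None:
        """Store the error without clearing handoffs, so attribution survives."""
        self.provider_error = message

    @property
    def inferred_expert(self) -> Optional[str]:
        """Expert inferred from the most recent handoff, if any."""
        return self.handoffs[-1] if self.handoffs else None


trace = RunTrace()
trace.record_tool_call("search_catalog")
trace.record_tool_call("read_sac")
trace.record_provider_error("provider timeout")
print(trace.handoffs, trace.inferred_expert)
```

Keeping the error in a separate field, rather than resetting the trace, is what lets the attribution from earlier successful observations survive a later provider failure.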

Evidence

  • Live ALCF/Metis run on gpt-oss-120b: 13/15 clean passes, 1 expected surfaced error, 1 partial recovery, 0 hard failures
  • Strongest demo: ndp_seismic_waveform_to_plot, with ndp_catalog -> analysis -> sac_format -> visualization handoffs and a SAC PNG artifact
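
For readers unfamiliar with the four outcome buckets reported above, here is one way such a tally could be produced. It is a sketch under stated assumptions, not the logic in scripts/run_demo_benchmark.py: the label strings, `classify`, and `summarize` are hypothetical, and the rule that an unexpected error after at least one tool observation counts as partial recovery is an assumption drawn from the Summary wording.

```python
from collections import Counter
from typing import Iterable

# Hypothetical outcome labels; the real runner may name these differently.
CLEAN = "clean_pass"
EXPECTED_ERROR = "expected_error"
PARTIAL = "partial_recovery"
HARD_FAILURE = "hard_failure"


def classify(error: bool, error_expected: bool, observations: int) -> str:
    """Map one case to an outcome label (assumed rules, see lead-in)."""
    if not error:
        return CLEAN
    if error_expected:
        return EXPECTED_ERROR
    # An unexpected error after useful tool output is partial, never clean.
    return PARTIAL if observations > 0 else HARD_FAILURE


def summarize(outcomes: Iterable[str]) -> str:
    """Render a one-line tally of outcome labels."""
    outcomes = list(outcomes)
    counts = Counter(outcomes)
    return (
        f"clean={counts[CLEAN]}/{len(outcomes)} "
        f"expected_error={counts[EXPECTED_ERROR]} "
        f"partial={counts[PARTIAL]} "
        f"hard_failure={counts[HARD_FAILURE]}"
    )


# Counts taken from the run reported above: 13 clean, 1 expected error, 1 partial.
outcomes = [CLEAN] * 13 + [EXPECTED_ERROR, PARTIAL]
print(summarize(outcomes))
# prints: clean=13/15 expected_error=1 partial=1 hard_failure=0
```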

Verification

  • uv run ruff check src/ tests/ scripts/run_demo_benchmark.py scripts/create_benchmark_data.py scripts/create_demo_data.py
  • uv run pytest tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_direct_tool_action_records_owner_handoff tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_provider_error_after_tool_keeps_inferred_expert tests/test_core/test_agent_planner.py::TestRunAgentLoop::test_tool_observation_then_planner_failure_synthesizes_partial_answer tests/test_core/test_agent_dispatch.py::TestForwardDispatch::test_parquet_file_followup_reuses_last_session_path -q
  • uv run pytest tests/ -> 1156 passed, 37 skipped

@JaimeCernuda JaimeCernuda merged commit 49daf09 into develop May 24, 2026
1 check failed
@JaimeCernuda JaimeCernuda deleted the feat/hierarchical-demo-runner-evidence-20260524 branch May 24, 2026 05:20