Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -45,6 +45,7 @@ tool_cache/

# Tool-generated files
tool_output/
.clio-agent-artifacts/
*.tool.log
mcp_*.log

15 changes: 15 additions & 0 deletions TASK.md
@@ -1,5 +1,20 @@
# CLIO/GACT Provider Selector Polish Tasks

## Current Resume Anchor

- [ ] Active benchmark objective: build a true hierarchical CLIO stress benchmark campaign, not another smoke suite. Target natural scientific prompts where the orchestrator delegates to scoped experts, experts delegate to tier-3/nanoagents, tools are owned by expert tags/visibility, and results flow through discovery/staging -> format inspection -> analysis -> visualization with explicit evidence, artifacts, timings, and surfaced errors. Do not hardcode benchmark-specific routes, prompts, or fallback answers.
- [ ] Current branch: `feat/hierarchical-demo-runner-evidence-20260524`. Uncommitted work after the latest ALCF run: `scripts/run_demo_benchmark.py`, `docs/ALCF_DEMO_BENCHMARK_REPORT.md`, and generated artifact `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`. Inspect artifact size/path before deciding whether to commit it; avoid accidentally committing generated evidence unless intentional.
- [x] Recently merged hierarchy slices into `develop`: PR #315 added registry hierarchy metadata and exposed tier-3 agents; PR #316 extracted executable `NDPExpert` and `SACFormatExpert`; PR #317 added `ExpertHandoff` trace metadata, GACT message metadata for `expert_handoffs`, and benchmark report support for observed handoff graphs. Full Python suite after #317: `1154 passed, 37 skipped`.
- [x] Latest live ALCF/Metis evidence run used backend `http://127.0.0.1:17961`, provider `argonne`, API base `https://inference-api.alcf.anl.gov/resource_server/metis/api/v1`, model `gpt-oss-120b`, planner temperature `0`, max tokens `4096`, turn timeout `900s`, allowed root set to the repo. Smoke prompt through GACT answered "Paris is the capital of France." with `error_info=null`, confirming real ALCF inference through CLIO/GACT.
- [x] An earlier demo runner execution on `17961` reported `15/15` under the old pass logic, but an audit found that logic too permissive because partial-recovery metadata was counted as a normal pass. The current stricter run is the `17962` evidence below.
- [x] Strongest current demo evidence: `ndp_seismic_waveform_to_plot` selected `visualization`, recorded handoffs `ndp_catalog`, `analysis`, `sac_format`, `visualization`, called NDP search/detail/stage tools plus `sac_inspect_archive`, `sac_compute_trace_statistics`, and `sac_plot_traces`, staged `Pachhai_etal_2023_ScP_data.tar`, found 11260 SAC traces, sampled/visualized traces, and wrote `.clio-agent-artifacts/charts/sac_traces_Pachhai_etal_2023_ScP_data.png`.
- [ ] Do not call the benchmark objective complete yet. The current 15-case run is useful evidence, but still too shallow for the original goal: most cases are short, only one case clearly has >10 combined tool/handoff events, no multi-minute stress case completed, there is no context-pressure/compaction case, no provider/model swap during active work, no large dirty-data memory stress beyond the current fixtures, and direct tool-action cases still lack rich handoff graph evidence.
- [x] Inspected pass cases with `error_info` and tightened the benchmark runner: surfaced partial-recovery metadata is now outcome `partial`, not `pass`; expected missing-file errors remain `expected_error`.
- [x] Direct planner-selected tool actions now record owning expert handoff events, not only nested expert dispatches. Current report shows direct HDF5/Parquet/visualization ownership evidence as counted handoff events.
- [ ] Next benchmark expansion target: add deeper collaborator-grade prompts that stress hidden-task generality: EarthScope/NDP discovery with bounded waveform staging, local file search plus provider discovery, multi-format experiment audit, many parallel tool/nanoagent calls, context pressure/compaction, large-file refusal/memory safety, provider/model swap while work is active, and deliberate unavailable-resource/error-surfacing cases. The best 10 documented demos should be complex enough for external collaborators, not just 30-second route checks.
- [ ] 2026-05-24 stricter ALCF rerun from current branch on `http://127.0.0.1:17962` with `CLIO_AGENT_MAX_STEPS=12` and `--case-delay-s 5`: `13/15` clean passes, `1` expected surfaced error, `1` partial recovery, `0` hard failures. HDF5 overview became clean after the higher step budget. Direct tool actions now record owner handoff events, so the report shows evidence such as `data x8`, `analysis x5`, and `visualization`. Remaining partial: `workflow_memory_followup` completed a Parquet schema observation and synthesized visible text, but planner continuation hit `litellm.RateLimitError: Tokens/minute limit exceeded`; report now labels it `partial`, not pass.
- [ ] ALCF provider readiness reporting still has an inconsistency: `/v1/providers/lm` presets report Metis/Sophia `ready` with `Globus token validated`, while `/v1/health` reports LM `degraded` with `ALCF Globus token stored; validate before use`. The TUI/provider status should use the validated provider state, not stale conservative health text.
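
The stricter outcome labels described in the items above (surfaced partial-recovery metadata yields `partial`, expected missing-file errors remain `expected_error`) could be sketched roughly as follows. This is a hypothetical illustration, not the logic in `scripts/run_demo_benchmark.py`: the function name `classify_outcome`, its parameters, and the `fail` label for hard failures are assumptions made for the example.

```python
from typing import Optional


def classify_outcome(
    completed: bool,
    error_info: Optional[dict],
    expects_error: bool = False,
) -> str:
    """Map one benchmark case result to an outcome label.

    Hypothetical sketch: `completed` means the case produced a usable answer,
    `error_info` is whatever error metadata the run surfaced (None if clean),
    and `expects_error` marks cases that are designed to surface an error.
    """
    if expects_error:
        # A deliberate error case passes only if the error was actually surfaced.
        return "expected_error" if error_info is not None else "fail"
    if error_info is None:
        return "pass" if completed else "fail"
    # Error metadata on a completed case is a partial recovery, never a clean pass.
    return "partial" if completed else "fail"
```

Under these assumptions, a case like `workflow_memory_followup` (visible text synthesized, then a rate-limit error on continuation) would be labeled `partial` rather than `pass`.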

## Open Issues

- [ ] Current ALCF demo benchmark is only a smoke/demo baseline, not a true CLIO hierarchical stress benchmark. Future benchmark work must target hierarchical intelligence: orchestrator -> scoped experts -> tier-3/nanoagents -> cross-expert result handoffs -> visible tool evidence -> artifacts/errors. Add and run complex workflows such as NDP seismic discovery -> staged dataset -> three-axis analysis -> visualization, mixed HDF5/BP5/Parquet/CSV experiment audit, dirty tabular quality review, context-pressure/compaction, large-file memory safety, provider/model swap during active work, and tool-ownership boundary tests. NDP discovery should be owned by `data` or a nested `ndp_catalog` agent, with `analysis` consuming discovered/staged data rather than directly owning NDP search. Benchmark completion requires at least ten human-demoable complex workflows, multiple >2 minute or >10-event runs, tier-3/nanoagent coverage, plotted artifacts, deliberate surfaced failures, and saved evidence for route/expert/tool/artifact/timing/error behavior. See `docs/HIERARCHICAL_STRESS_BENCHMARK_PLAN.md`.
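
The completion criteria in the item above could be checked mechanically against saved run evidence. The sketch below is only an illustration under stated assumptions: `RunEvidence`, `completion_gaps`, and all field names are hypothetical, "multiple" long runs is interpreted as at least two, and none of this is taken from the repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunEvidence:
    """Hypothetical per-workflow evidence record (field names are assumptions)."""
    name: str
    duration_s: float
    event_count: int  # combined tool and handoff events
    used_tier3: bool = False
    artifact_paths: List[str] = field(default_factory=list)
    surfaced_error: bool = False
    human_demoable: bool = True


def completion_gaps(
    runs: List[RunEvidence],
    min_workflows: int = 10,
    min_long_runs: int = 2,  # assumption: "multiple" means at least two
) -> List[str]:
    """Return the unmet completion criteria; an empty list means all are met."""
    gaps = []
    demoable = [r for r in runs if r.human_demoable]
    if len(demoable) < min_workflows:
        gaps.append(f"need {min_workflows} demoable workflows, have {len(demoable)}")
    # A run counts as "long" if it exceeds 2 minutes or 10 events.
    long_runs = [r for r in runs if r.duration_s > 120 or r.event_count > 10]
    if len(long_runs) < min_long_runs:
        gaps.append(f"need {min_long_runs} long runs, have {len(long_runs)}")
    if not any(r.used_tier3 for r in runs):
        gaps.append("no tier-3/nanoagent coverage")
    if not any(r.artifact_paths for r in runs):
        gaps.append("no plotted artifacts")
    if not any(r.surfaced_error for r in runs):
        gaps.append("no deliberate surfaced failure")
    return gaps
```

Returning a list of named gaps rather than a single boolean keeps the report explicit about which criterion is still missing.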