fix: gracefully skip cloud-only evaluators in local execution mode #106

Merged
placerda merged 15 commits into develop from fix/graceful-skip-cloud-only-evaluators on Apr 27, 2026

Dongbumlee (Collaborator) commented on Apr 23, 2026

Summary

Validates issue #79 — Agent Framework multi-agent workflow evaluation with AgentOps. Fixes cloud-only evaluator crash in local mode, adds Agent Framework multi-agent sample, and documents the evaluator availability gap.

Key Finding: Cloud vs Local Evaluator Limitation

The Foundry Cloud Eval API re-runs prompts against models/agents — it cannot score pre-computed outputs from a callable. This means:

| Path | Scores our callable output? | Evaluators |
|---|---|---|
| Local SDK (execution_mode: local) | ✅ Yes | 3 of 6 (SDK gap) |
| Foundry Cloud (execution_mode: remote) | ❌ Re-runs model | All 6 |

Conclusion: execution_mode: local is the correct path for callable-based workflows. Cloud-only evaluators are gracefully skipped with warnings until the azure-ai-evaluation SDK adds them.

Changes

Core Fix: Graceful skip for cloud-only evaluators

  • eval_engine.py: _CloudOnlyEvaluatorError + warn+skip in _build_foundry_evaluator_runtimes
  • runner.py: Warn (not crash) for missing scores; skip threshold checks
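
The warn-and-skip behavior in eval_engine.py can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `load_evaluator_class`, `build_runtimes`, the logger name, and the `fake_sdk` stand-in are hypothetical, and the only assumption carried over from the PR is that a missing SDK attribute is treated as "cloud-only" rather than as a fatal error.

```python
import logging
import types

logger = logging.getLogger("agentops.sketch")  # hypothetical logger name


class CloudOnlyEvaluatorError(Exception):
    """Signals that an evaluator class is not present in the local SDK."""


def load_evaluator_class(sdk_module, class_name):
    # A missing attribute means "cloud-only", not a crash.
    try:
        return getattr(sdk_module, class_name)
    except AttributeError as exc:
        raise CloudOnlyEvaluatorError(class_name) from exc


def build_runtimes(sdk_module, class_names):
    runtimes, skipped = {}, []
    for name in class_names:
        try:
            runtimes[name] = load_evaluator_class(sdk_module, name)
        except CloudOnlyEvaluatorError:
            logger.warning("Skipping %s: not available in the local SDK", name)
            skipped.append(name)
    return runtimes, skipped


# Stand-in for the SDK module; only one evaluator exists locally here.
fake_sdk = types.SimpleNamespace(ToolCallAccuracyEvaluator=object)
runtimes, skipped = build_runtimes(
    fake_sdk, ["ToolCallAccuracyEvaluator", "TaskCompletionEvaluator"]
)
```

Returning the skipped names alongside the runtimes lets the caller exclude them from later threshold checks.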

Multi-Agent Workflow Sample (Microsoft Agent Framework)

  • multi_agent_workflow.py: Router→Specialist pattern using agent_framework.Agent + @tool + FoundryChatClient
  • Agents created dynamically, tool calls captured via @tool wrapper
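
For readers unfamiliar with the Router→Specialist shape, here is a framework-free sketch of the routing step. It deliberately does not use the Agent Framework API: the `Specialist` dataclass, the `route` function, and the keyword rule are hypothetical stand-ins for the Router agent's model-driven decision, and the tool bodies return canned strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # canned value, for illustration only


def convert_currency(amount: float, source: str, target: str) -> str:
    return f"{amount} {source} converted to {target}"  # no real rates


@dataclass(frozen=True)
class Specialist:
    name: str
    tools: Dict[str, Callable[..., str]]


SPECIALISTS = {
    "weather": Specialist("WeatherSpecialist", {"get_weather": get_weather}),
    "finance": Specialist("FinanceSpecialist", {"convert_currency": convert_currency}),
}


def route(query: str) -> Specialist:
    # Hypothetical keyword rule standing in for the Router agent's decision.
    key = "weather" if "weather" in query.lower() else "finance"
    return SPECIALISTS[key]


specialist = route("What is the weather in Paris?")
print(specialist.name, sorted(specialist.tools))
```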

Additional Improvements

  • run-agent-local.yaml: New template for local agent workflow
  • agent_framework_adapter.py: Single-agent Foundry adapter (Threads/Runs + Responses API)
  • callable_adapter.py: Added Agent Framework reference
  • local_adapter_backend.py: Capture tool_calls in results
  • eval_engine.py: Suppress SDK "Conversation history" warning
  • agent_workflow_baseline.yaml: Document evaluator availability + TaskAdherence behavior
  • Fix pre-existing lint errors (ruff, mypy)
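
The tool_calls capture in local_adapter_backend.py amounts to normalizing whatever the callable or subprocess returns. A rough sketch, assuming the output is either a dict or a JSON string with `response` and `tool_calls` keys; the helper name `extract_tool_calls` and the fallback behavior for plain text are assumptions, not the PR's implementation.

```python
import json


def extract_tool_calls(raw_output):
    """Return (response_text, tool_calls) from a callable result or subprocess stdout."""
    if isinstance(raw_output, str):
        try:
            raw_output = json.loads(raw_output)
        except json.JSONDecodeError:
            # Plain-text output: no tool calls to report.
            return raw_output, []
    if isinstance(raw_output, dict):
        calls = raw_output.get("tool_calls") or []
        return str(raw_output.get("response", "")), list(calls)
    return str(raw_output), []


text, calls = extract_tool_calls(
    '{"response": "ok", "tool_calls": [{"name": "get_weather"}]}'
)
```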

Evaluator Availability

| Evaluator | Local SDK | Foundry Cloud |
|---|---|---|
| ToolCallAccuracyEvaluator | ✅ | ✅ |
| IntentResolutionEvaluator | ✅ | ✅ |
| TaskAdherenceEvaluator | ✅ | ✅ |
| TaskCompletionEvaluator | ❌ Skipped | ✅ |
| ToolSelectionEvaluator | ❌ Skipped | ✅ |
| ToolInputAccuracyEvaluator | ❌ Skipped | ✅ |

E2E Test Results

Local: Agent Framework Multi-Agent Workflow

agent_framework.Agent (Router) → routes to specialist
agent_framework.Agent (Specialist) + @tool → auto-executes tools
→ tool_calls captured, response returned, evaluators scored
  • 5/5 rows ✅ — correct routing (weather/finance/search)
  • 5/5 tool calls ✅ — get_weather, convert_currency, search_news, calculate_compound_interest, search_flights
  • Scores: ToolCallAccuracy 3.0, IntentResolution 3.2

Cloud: Foundry Agent

  • 5/5 rows ✅ — all 6 evaluators scored via cloud API

CI: ruff ✅ | mypy ✅ | 282 tests passed ✅

Acceptance Criteria (Issue #79)

| Criteria | Status |
|---|---|
| Local multi-agent workflow executes via AgentOps | ✅ Microsoft Agent Framework (Agent + @tool + FoundryChatClient) |
| Tool-call metadata captured | @tool decorator captures calls; stored in backend_metrics.json |
| Workflow bundle produces actionable results | ✅ results.json + report.md with scores and thresholds |
| Docs/schema gaps identified | ✅ Cloud eval can't score pre-computed outputs; 3 SDK evaluators missing |
| Follow-up items created | ✅ All addressed in this PR |

Closes #79

Dongbumlee and others added 15 commits April 23, 2026 12:37

When running evaluations locally (hosting: local, execution_mode: local),
some Foundry evaluators (TaskCompletionEvaluator, ToolSelectionEvaluator,
ToolInputAccuracyEvaluator, TaskNavigationEfficiencyEvaluator) are only
available via the Foundry Cloud Evaluation API (builtin.*) and do not
exist in the azure-ai-evaluation Python SDK.

Previously, using agent_workflow_baseline bundle with local execution
crashed with 'Unknown built-in Foundry evaluator class'. Now these
evaluators are gracefully skipped with clear warning messages, and
threshold checks exclude skipped evaluators from validation.

Changes:
- eval_engine.py: Add _CloudOnlyEvaluatorError sentinel; catch
  AttributeError in _load_foundry_evaluator_callable and raise
  cloud-only error; catch in _build_foundry_evaluator_runtimes
  and warn+skip
- runner.py: Change _validate_enabled_evaluators_scored to warn
  instead of raise for missing scores; skip threshold checks for
  evaluators with no scores in _evaluate_item_thresholds and
  _summarize_thresholds_from_items
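
The runner.py side of the change (warn instead of raise, and skip threshold checks for evaluators with no score) could look something like this. It is a sketch under assumptions: `evaluate_thresholds` is a hypothetical helper, and treating a threshold as a minimum score (`>=`) is a guess at the semantics, not something the commit states.

```python
import logging

logger = logging.getLogger("agentops.sketch")  # hypothetical logger name


def evaluate_thresholds(scores, thresholds):
    """Compare scores to minimum thresholds, skipping evaluators with no score."""
    results = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            # Skipped (for example, cloud-only) evaluator: warn, do not fail the run.
            logger.warning("No score for %s; threshold check skipped", name)
            continue
        results[name] = score >= minimum
    return results


outcome = evaluate_thresholds(
    {"ToolCallAccuracyEvaluator": 3.0},
    {"ToolCallAccuracyEvaluator": 3.0, "TaskCompletionEvaluator": 3.0},
)
```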

Tested against:
- Local callable with agent_workflow_baseline: 3 evaluators skipped,
  3 scored successfully
- Cloud model-direct: all evaluators submitted to Foundry API
- Cloud Foundry agent: all 6 evaluators scored via cloud API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…noise

Follow-up items from issue #79 E2E validation:

1. Add run-agent-local.yaml template to agentops init scaffold —
   combines type:agent + hosting:local + framework:agent_framework
   with agent_workflow_baseline bundle for local workflow testing.

2. Document TaskAdherenceEvaluator behavior — SDK expects multi-turn
   conversation format (list of message dicts with role/content);
   single-turn plain text inputs produce low scores. Added note to
   agent_workflow_baseline.yaml bundle description.

3. Suppress 'Conversation history could not be parsed' warning —
   add logging filter for azure-ai-evaluation SDK loggers that
   emit this warning on every single-turn row (expected, harmless).

4. Capture callable return tool_calls in results — extract
   tool_calls from callable/subprocess output and include in
   backend_metrics.json row_metrics for post-analysis.
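
Item 3 (suppressing the "Conversation history could not be parsed" warning) can be done with a standard `logging.Filter`. A minimal sketch: the logger name `azure.ai.evaluation` is an assumption about where the SDK emits the message, and `SuppressMessageFilter` is a hypothetical class name.

```python
import logging


class SuppressMessageFilter(logging.Filter):
    """Drop log records whose message contains a given fragment."""

    def __init__(self, fragment):
        super().__init__()
        self.fragment = fragment

    def filter(self, record):
        # Returning False drops the record; everything else passes through.
        return self.fragment not in record.getMessage()


sdk_logger = logging.getLogger("azure.ai.evaluation")  # assumed logger name
sdk_logger.addFilter(
    SuppressMessageFilter("Conversation history could not be parsed")
)
```

Note that a filter attached to a logger applies only to records logged on that exact logger, not to records that propagate up from its descendants. If the SDK logs from child loggers, the filter has to be attached to each of them or to the handler instead.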

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add agent_framework_adapter.py, a dedicated callable adapter that
invokes Azure AI Foundry agents locally via the Foundry REST API.

Supports two agent ID patterns:
- 'asst_*' IDs: Threads/Runs API (create thread → message → run → poll)
- Named agents (e.g. 'FoundryAgent'): Responses API with agent_reference
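
The choice between the two patterns reduces to a prefix check on the agent ID. A tiny sketch of that dispatch; the function name and the returned labels are hypothetical, and the actual REST calls are omitted.

```python
def select_api(agent_id: str) -> str:
    """Pick the API family for an agent ID, following the two patterns above."""
    # 'asst_*' IDs use Threads/Runs; any other name goes through the Responses API.
    return "threads_runs" if agent_id.startswith("asst_") else "responses"


print(select_api("asst_abc123"))
print(select_api("FoundryAgent"))
```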

The adapter extracts response text and tool_calls from the API response,
matching the contract expected by agent_workflow_baseline evaluators.

Also updates:
- callable_adapter.py: Add Option 4 pointing to agent_framework_adapter
- run-agent-local.yaml: Document both adapter options
- initializer.py: Include adapter in agentops init scaffold (26 files)

Tested E2E against real FoundryAgent via Responses API — all 5 rows
processed with evaluator scores returned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add multi_agent_workflow.py, a sample multi-agent orchestration
callable that uses Azure AI Agent Framework SDK (Assistants API)
to dynamically create and coordinate agents:

  Router Agent → analyzes query, selects specialist
  WeatherSpecialist → get_weather tool
  FinanceSpecialist → convert_currency, calculate_compound_interest tools
  SearchSpecialist → search_news, search_flights tools

Agents are created dynamically per evaluation row and cleaned up
after use. Tool calls are captured from requires_action/submit
flow and returned in the callable response.

E2E validated: all 5 smoke-agent-tools rows processed with correct
routing (weather/finance/search), tool invocations with correct
arguments, and evaluator scores (ToolCallAccuracy: 3.0,
IntentResolution: 3.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…from org/repo)

Add --from flag to 'agentops skills install' for installing community skills
from GitHub repositories following the agentskills.io standard.

Features:
- Parse github:org/repo[@ref] references with version pinning
- Download repo tarball (single request, no API pagination)
- Extract skill following agentskills.io convention (SKILL.md + references/)
- Platform-aware installation (Copilot, Claude, Cursor)
- Security: path sanitization, scripts/ blocked by default, traversal prevention
- Provenance: .installed-from.json tracks source repo, ref, and installed files
- Auth: GITHUB_TOKEN / GH_TOKEN env var support for private repos
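
As an illustration of the `github:org/repo[@ref]` parsing, here is one way it might be written. The `SkillRef` type, `parse_skill_ref` name, and the exact character classes in the regex are assumptions; the only behavior taken from this commit is that the `github:` prefix and the `@ref` suffix are both optional.

```python
import re
from typing import NamedTuple, Optional


class SkillRef(NamedTuple):
    org: str
    repo: str
    ref: Optional[str]  # None means "use the default branch"


_REF_RE = re.compile(
    r"^(?:github:)?"
    r"(?P<org>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"(?:@(?P<ref>[A-Za-z0-9_./-]+))?$"
)


def parse_skill_ref(text: str) -> SkillRef:
    match = _REF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Not a valid skill reference: {text!r}")
    return SkillRef(match["org"], match["repo"], match["ref"])


print(parse_skill_ref("github:org/repo@v1.0"))
print(parse_skill_ref("donlee/pptx-designer"))
```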

Usage:
  agentops skills install --from donlee/pptx-designer
  agentops skills install --from github:org/repo@v1.0 --platform claude
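
The traversal prevention listed under Security is typically a resolve-and-compare check on every archive member before it is written. A sketch under assumptions: `safe_target` and `allow_scripts` are hypothetical names, and blocking only a top-level `scripts/` directory is a guess at what "blocked by default" covers. Requires Python 3.9+ for `Path.is_relative_to`.

```python
import tempfile
from pathlib import Path


def safe_target(dest_root: Path, member_name: str, allow_scripts: bool = False) -> Path:
    """Resolve an archive member under dest_root, rejecting unsafe paths."""
    root = dest_root.resolve()
    target = (root / member_name).resolve()
    # Rejects '../' escapes and absolute member names alike.
    if not target.is_relative_to(root):
        raise ValueError(f"Path escapes install directory: {member_name!r}")
    parts = target.relative_to(root).parts
    if parts and parts[0] == "scripts" and not allow_scripts:
        raise ValueError(f"scripts/ is blocked by default: {member_name!r}")
    return target


install_root = Path(tempfile.mkdtemp())
print(safe_target(install_root, "references/guide.md"))
```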

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix 2 lint errors that existed on develop branch:
- skills.py: remove unused import 'Any'
- test_skills.py: remove unused variable 'result'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- multi_agent_workflow.py: explicit str()/list() casts for SPECIALISTS dict values
- local_adapter_backend.py: add assert for adapter_command before shlex.split

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…k SDK

Replace OpenAI Assistants API (openai.AzureOpenAI.beta.assistants)
with Azure AI Agent Framework SDK (azure-ai-agents AgentsClient):

- Use AgentsClient.create_agent() for dynamic agent creation
- Use FunctionTool with real Python callables (not JSON schemas)
- Use ToolSet + enable_auto_function_calls() for automatic tool execution
- Use create_thread_and_process_run() for the agent execution loop
- Fix tool function signatures for string args from SDK

The multi-agent workflow now properly uses the Agent Framework pattern:
  AgentsClient → create_agent() → FunctionTool → ToolSet
  → enable_auto_function_calls() → create_thread_and_process_run()

E2E validated: all 5 rows, correct routing, tool_calls captured,
evaluators scored (ToolCallAccuracy: 3.0, IntentResolution: 3.6).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace azure-ai-agents AgentsClient with the actual Microsoft Agent
Framework SDK (pip install agent-framework agent-framework-foundry):

- Agent: creates agents with instructions and @tool-decorated functions
- FoundryChatClient: connects to Azure AI Foundry model deployments
- Agent.run(): executes agent with automatic tool call handling
- @tool decorator: wraps Python functions for Agent Framework

The workflow pattern is now:
  FoundryChatClient → Agent(client, tools=[@tool]) → Agent.run(query)
  Router Agent.run() → determines specialist → Specialist Agent.run()

Tool calls captured via @tool wrapper functions. Agent Framework
auto-executes tools and logs 'Function X succeeded'.

E2E: 5/5 rows, correct routing, all tools called with correct args.
Scores: ToolCallAccuracy 3.0, IntentResolution 3.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use 'from agent_framework.foundry import FoundryChatClient' matching
the official Microsoft Agent Framework samples, instead of the
internal 'agent_framework_foundry' package import.

Reference: microsoft/agent-framework samples/03-workflows/
  step2_agents_in_a_workflow.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rewrite multi_agent_workflow.py to follow the official Microsoft
Agent Framework workflow pattern (microsoft/agent-framework samples):

- WorkflowBuilder with edges connecting Router, Coordinator, Specialists
- Custom RoutingCoordinator(Executor) with @handler for routing logic
- AgentExecutor wraps each Agent for workflow integration
- ctx.send_message() for inter-agent communication
- ctx.yield_output() for workflow output

Workflow executes as proper supersteps:
  Superstep 1: Router → Coordinator (routing decision)
  Superstep 2: Specialist + @tool auto-execution
  Superstep 3: Coordinator → yield output

E2E: 5/5 rows, correct routing, tools executed, evaluators scored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace azure-ai-agents AgentsClient with Microsoft Agent Framework
FoundryAgent (agent_framework.foundry.FoundryAgent) for pre-deployed
agent evaluation.

Reference: microsoft/agent-framework samples/02-agents/providers/
           foundry/foundry_agent_basic.py

E2E validated: 5/5 rows against FoundryAgent, evaluators scored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ture

The adapter was returning only response text with no tool_calls,
making tool-related evaluators produce meaningless scores.

Now provides local @tool implementations that Agent Framework
auto-executes when the Foundry agent makes tool calls. Invocations
are captured and returned for evaluator scoring.

Flow:
  run_evaluation(input, context)
    → FoundryAgent(tools=[get_weather, ...]).run(input)
      → Agent calls get_weather → Framework auto-executes locally
      → _captured_tool_calls records the invocation
    → return {response, tool_calls}
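
The capture step in this flow can be illustrated without the framework. In the sketch below, `capture` is a hypothetical plain-Python decorator standing in for the adapter's @tool wrapper, the agent run is replaced by a direct tool call, and the shape of each recorded entry (`name`, `arguments`, `result`) is an assumption. Only `run_evaluation`, `_captured_tool_calls`, and the `{response, tool_calls}` return shape come from the commit message.

```python
import functools

_captured_tool_calls = []


def capture(func):
    """Record each keyword-argument invocation of a tool function."""

    @functools.wraps(func)
    def wrapper(**kwargs):
        result = func(**kwargs)
        _captured_tool_calls.append(
            {"name": func.__name__, "arguments": kwargs, "result": result}
        )
        return result

    return wrapper


@capture
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # canned value, for illustration only


def run_evaluation(input_text, context=None):
    _captured_tool_calls.clear()  # start each evaluation row with a clean log
    # Stand-in for the agent run, which would decide to call the tool itself.
    answer = get_weather(city="Paris")
    return {"response": answer, "tool_calls": list(_captured_tool_calls)}


result = run_evaluation("What is the weather in Paris?")
```

Clearing the module-level list at the start of each call keeps tool calls from one row out of the next row's results; a per-call list passed through a closure would avoid the shared state entirely.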

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… tools

Replace FoundryAgent (requires server-side tool declarations) with
Agent + FoundryChatClient (tools defined entirely in code).

Both adapters now use the same Microsoft Agent Framework pattern:
- agent_framework_adapter: single Agent with @tool functions
- multi_agent_workflow: WorkflowBuilder + Router + Coordinator + Specialists

E2E validated both: tool_calls captured in all 5 rows for both adapters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

placerda merged commit a362e5c into develop on Apr 27, 2026
12 checks passed