fix: gracefully skip cloud-only evaluators in local execution mode by Dongbumlee · Pull Request #106 · Azure/agentops

Dongbumlee · 2026-04-23T19:37:56Z

Summary

Validates issue #79 — Agent Framework multi-agent workflow evaluation with AgentOps. Fixes cloud-only evaluator crash in local mode, adds Agent Framework multi-agent sample, and documents the evaluator availability gap.

Key Finding: Cloud vs Local Evaluator Limitation

The Foundry Cloud Eval API re-runs prompts against models/agents — it cannot score pre-computed outputs from a callable. This means:

Path	Scores our callable output?	Evaluators
Local SDK (`execution_mode: local`)	✅ Yes	3 of 6 (SDK gap)
Foundry Cloud (`execution_mode: remote`)	❌ Re-runs model	All 6

Conclusion: execution_mode: local is the correct path for callable-based workflows. Cloud-only evaluators are gracefully skipped with warnings until the azure-ai-evaluation SDK adds them.

Changes

Core Fix: Graceful skip for cloud-only evaluators

eval_engine.py: _CloudOnlyEvaluatorError + warn+skip in _build_foundry_evaluator_runtimes
runner.py: Warn (not crash) for missing scores; skip threshold checks
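
The warn-and-skip behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in eval_engine.py: `CloudOnlyEvaluatorError`, `load_evaluator`, `build_runtimes`, and the `local_sdk` namespace are hypothetical stand-ins, and the lambdas return canned scores instead of calling a real evaluator.

```python
import logging
from types import SimpleNamespace
from typing import Callable, Dict, List

logger = logging.getLogger("agentops.sketch")


class CloudOnlyEvaluatorError(Exception):
    """Hypothetical sentinel: the evaluator exists only in the cloud API."""


# Hypothetical stand-in for the evaluator classes a local SDK module exposes.
local_sdk = SimpleNamespace(
    ToolCallAccuracyEvaluator=lambda row: 3.0,
    IntentResolutionEvaluator=lambda row: 3.0,
)


def load_evaluator(name: str) -> Callable[[dict], float]:
    """Return a local evaluator, or raise the sentinel if the SDK lacks it."""
    try:
        return getattr(local_sdk, name)
    except AttributeError as exc:
        raise CloudOnlyEvaluatorError(name) from exc


def build_runtimes(names: List[str]) -> Dict[str, Callable[[dict], float]]:
    """Load each requested evaluator; warn about and skip cloud-only ones."""
    runtimes: Dict[str, Callable[[dict], float]] = {}
    for name in names:
        try:
            runtimes[name] = load_evaluator(name)
        except CloudOnlyEvaluatorError:
            logger.warning("Skipping %s: not available in local execution mode", name)
    return runtimes
```

Raising a dedicated exception type, rather than letting the AttributeError propagate, lets the caller distinguish "known to be cloud-only" from a genuine misconfiguration.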

Multi-Agent Workflow Sample (Microsoft Agent Framework)

multi_agent_workflow.py: Router→Specialist pattern using agent_framework.Agent + @tool + FoundryChatClient
Agents created dynamically, tool calls captured via @tool wrapper

Additional Improvements

run-agent-local.yaml: New template for local agent workflow
agent_framework_adapter.py: Single-agent Foundry adapter (Threads/Runs + Responses API)
callable_adapter.py: Added Agent Framework reference
local_adapter_backend.py: Capture tool_calls in results
eval_engine.py: Suppress SDK "Conversation history" warning
agent_workflow_baseline.yaml: Document evaluator availability + TaskAdherence behavior
Fix pre-existing lint errors (ruff, mypy)

Evaluator Availability

Evaluator	Local SDK	Foundry Cloud
ToolCallAccuracyEvaluator	✅	✅
IntentResolutionEvaluator	✅	✅
TaskAdherenceEvaluator	✅	✅
TaskCompletionEvaluator	❌ Skipped	✅
ToolSelectionEvaluator	❌ Skipped	✅
ToolInputAccuracyEvaluator	❌ Skipped	✅

E2E Test Results

Local: Agent Framework Multi-Agent Workflow

agent_framework.Agent (Router) → routes to specialist
agent_framework.Agent (Specialist) + @tool → auto-executes tools
→ tool_calls captured, response returned, evaluators scored

5/5 rows ✅ — correct routing (weather/finance/search)
5/5 tool calls ✅ — get_weather, convert_currency, search_news, calculate_compound_interest, search_flights
Scores: ToolCallAccuracy 3.0, IntentResolution 3.2

Cloud: Foundry Agent

5/5 rows ✅ — all 6 evaluators scored via cloud API

CI: ruff ✅ | mypy ✅ | 282 tests passed ✅

Acceptance Criteria (Issue #79)

Criteria	Status
Local multi-agent workflow executes via AgentOps	✅ Microsoft Agent Framework (Agent + @tool + FoundryChatClient)
Tool-call metadata captured	✅ @tool decorator captures calls; stored in backend_metrics.json
Workflow bundle produces actionable results	✅ results.json + report.md with scores and thresholds
Docs/schema gaps identified	✅ Cloud eval can't score pre-computed outputs; 3 SDK evaluators missing
Follow-up items created	✅ All addressed in this PR

Closes #79

When running evaluations locally (hosting: local, execution_mode: local), some Foundry evaluators (TaskCompletionEvaluator, ToolSelectionEvaluator, ToolInputAccuracyEvaluator, TaskNavigationEfficiencyEvaluator) are only available via the Foundry Cloud Evaluation API (builtin.*) and do not exist in the azure-ai-evaluation Python SDK.

Previously, using agent_workflow_baseline bundle with local execution crashed with 'Unknown built-in Foundry evaluator class'. Now these evaluators are gracefully skipped with clear warning messages, and threshold checks exclude skipped evaluators from validation.

Changes:
- eval_engine.py: Add _CloudOnlyEvaluatorError sentinel; catch AttributeError in _load_foundry_evaluator_callable and raise cloud-only error; catch in _build_foundry_evaluator_runtimes and warn+skip
- runner.py: Change _validate_enabled_evaluators_scored to warn instead of raise for missing scores; skip threshold checks for evaluators with no scores in _evaluate_item_thresholds and _summarize_thresholds_from_items

Tested against:
- Local callable with agent_workflow_baseline: 3 evaluators skipped, 3 scored successfully
- Cloud model-direct: all evaluators submitted to Foundry API
- Cloud Foundry agent: all 6 evaluators scored via cloud API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
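
The runner-side half of this change (threshold checks that ignore evaluators with no score) might look something like the sketch below. The function name `evaluate_thresholds` and the "score must be at least the threshold" comparison are assumptions made for illustration; the real runner.py may compare differently.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("agentops.sketch")


def evaluate_thresholds(
    scores: Dict[str, Optional[float]],
    thresholds: Dict[str, float],
) -> Dict[str, bool]:
    """Check each threshold, ignoring evaluators that produced no score."""
    results: Dict[str, bool] = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            # A skipped evaluator neither passes nor fails; it is left out.
            logger.warning("No score for %s; threshold check skipped", name)
            continue
        # Assumption: a threshold is a minimum acceptable score.
        results[name] = score >= minimum
    return results
```

Leaving skipped evaluators out of the result, instead of marking them failed, keeps a local run from failing only because an evaluator is unavailable in the SDK.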

…noise

Follow-up items from issue #79 E2E validation:

1. Add run-agent-local.yaml template to agentops init scaffold: combines type:agent + hosting:local + framework:agent_framework with agent_workflow_baseline bundle for local workflow testing.
2. Document TaskAdherenceEvaluator behavior: SDK expects multi-turn conversation format (list of message dicts with role/content); single-turn plain text inputs produce low scores. Added note to agent_workflow_baseline.yaml bundle description.
3. Suppress 'Conversation history could not be parsed' warning: add logging filter for azure-ai-evaluation SDK loggers that emit this warning on every single-turn row (expected, harmless).
4. Capture callable return tool_calls in results: extract tool_calls from callable/subprocess output and include in backend_metrics.json row_metrics for post-analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
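
Item 3 above can be illustrated with the standard library's `logging.Filter`. This is a sketch only: the class name is hypothetical, and the logger name `azure.ai.evaluation` is an assumption about where the SDK emits the warning, not something this PR states.

```python
import logging


class SuppressConversationHistoryWarning(logging.Filter):
    """Drop records whose message contains the known harmless warning text."""

    NEEDLE = "Conversation history could not be parsed"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; True lets it through.
        return self.NEEDLE not in record.getMessage()


# Assumption: the SDK logs this warning on a logger with this name.
sdk_logger = logging.getLogger("azure.ai.evaluation")
sdk_logger.addFilter(SuppressConversationHistoryWarning())
```

A filter attached to a logger only applies to records logged on that exact logger, not on its children, so if the warning comes from a child logger the filter would need to be attached there or to a handler instead.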

Add agent_framework_adapter.py, a dedicated callable adapter that invokes Azure AI Foundry agents locally via the Foundry REST API.

Supports two agent ID patterns:
- 'asst_*' IDs: Threads/Runs API (create thread → message → run → poll)
- Named agents (e.g. 'FoundryAgent'): Responses API with agent_reference

The adapter extracts response text and tool_calls from the API response, matching the contract expected by agent_workflow_baseline evaluators.

Also updates:
- callable_adapter.py: Add Option 4 pointing to agent_framework_adapter
- run-agent-local.yaml: Document both adapter options
- initializer.py: Include adapter in agentops init scaffold (26 files)

Tested E2E against real FoundryAgent via Responses API: all 5 rows processed with evaluator scores returned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
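
The choice between the two API paths depends only on the shape of the agent ID, which can be sketched as a small dispatch function. `select_api` and the returned labels are hypothetical names for illustration; the sketch makes no REST calls.

```python
from typing import Literal

ApiKind = Literal["threads_runs", "responses"]


def select_api(agent_id: str) -> ApiKind:
    """Pick the API path from the agent ID pattern described in the commit."""
    if agent_id.startswith("asst_"):
        # 'asst_*' IDs: create thread -> message -> run -> poll.
        return "threads_runs"
    # Anything else is treated as a named agent, addressed by agent_reference.
    return "responses"
```

For example, `select_api("asst_abc123")` selects the Threads/Runs path, while `select_api("FoundryAgent")` selects the Responses path.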

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add multi_agent_workflow.py, a sample multi-agent orchestration callable that uses Azure AI Agent Framework SDK (Assistants API) to dynamically create and coordinate agents:

Router Agent → analyzes query, selects specialist
WeatherSpecialist → get_weather tool
FinanceSpecialist → convert_currency, calculate_compound_interest tools
SearchSpecialist → search_news, search_flights tools

Agents are created dynamically per evaluation row and cleaned up after use. Tool calls are captured from requires_action/submit flow and returned in the callable response.

E2E validated: all 5 smoke-agent-tools rows processed with correct routing (weather/finance/search), tool invocations with correct arguments, and evaluator scores (ToolCallAccuracy: 3.0, IntentResolution: 3.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
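
The Router → Specialist shape can be shown without any agent SDK. In the sample the router is an LLM-backed agent; the keyword matcher below is a deliberately simple stand-in, and `_KEYWORDS` and `route` are hypothetical names. Only the specialist and tool names come from the commit message.

```python
from typing import Dict, List, Tuple

# Specialist -> tools it owns, as listed in the commit message.
SPECIALISTS: Dict[str, List[str]] = {
    "WeatherSpecialist": ["get_weather"],
    "FinanceSpecialist": ["convert_currency", "calculate_compound_interest"],
    "SearchSpecialist": ["search_news", "search_flights"],
}

# Hypothetical keyword hints standing in for the router agent's judgment.
_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "WeatherSpecialist": ("weather", "forecast", "temperature"),
    "FinanceSpecialist": ("currency", "convert", "interest"),
}


def route(query: str) -> str:
    """Return the specialist name for a query; default to search."""
    lowered = query.lower()
    for specialist, words in _KEYWORDS.items():
        if any(word in lowered for word in words):
            return specialist
    return "SearchSpecialist"
```

The routing result is just a key into `SPECIALISTS`, so the caller can look up which tools to hand to the chosen specialist.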


…from org/repo)

Add --from flag to 'agentops skills install' for installing community skills from GitHub repositories following the agentskills.io standard.

Features:
- Parse github:org/repo[@ref] references with version pinning
- Download repo tarball (single request, no API pagination)
- Extract skill following agentskills.io convention (SKILL.md + references/)
- Platform-aware installation (Copilot, Claude, Cursor)
- Security: path sanitization, scripts/ blocked by default, traversal prevention
- Provenance: .installed-from.json tracks source repo, ref, and installed files
- Auth: GITHUB_TOKEN / GH_TOKEN env var support for private repos

Usage:
  agentops skills install --from donlee/pptx-designer
  agentops skills install --from github:org/repo@v1.0 --platform claude

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
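
Two of the features listed above, reference parsing and path sanitization, lend themselves to a short sketch. The regular expression, `SkillSource`, `parse_skill_source`, and `is_safe_member` are all hypothetical and are not taken from skills.py. The sketch also assumes member paths are already relative to the skill root.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Accepts "org/repo", "github:org/repo", and an optional "@ref" suffix.
_REF_PATTERN = re.compile(
    r"^(?:github:)?(?P<org>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"(?:@(?P<ref>[^\s@]+))?$"
)


@dataclass(frozen=True)
class SkillSource:
    org: str
    repo: str
    ref: Optional[str]  # None means "no version pinned"


def parse_skill_source(value: str) -> SkillSource:
    """Split a skill reference into org, repo, and optional ref."""
    match = _REF_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"Not a valid skill reference: {value!r}")
    return SkillSource(match["org"], match["repo"], match["ref"])


def is_safe_member(path: str) -> bool:
    """Reject absolute paths, '..' traversal, and anything under scripts/."""
    parts = PurePosixPath(path).parts
    if not parts or path.startswith("/"):
        return False
    if ".." in parts:
        return False
    return parts[0] != "scripts"
```

Checking every archive member before writing it, rather than after extraction, means a malicious path never touches the filesystem.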

Fix 2 lint errors that existed on develop branch:
- skills.py: remove unused import 'Any'
- test_skills.py: remove unused variable 'result'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- multi_agent_workflow.py: explicit str()/list() casts for SPECIALISTS dict values
- local_adapter_backend.py: add assert for adapter_command before shlex.split

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…k SDK

Replace OpenAI Assistants API (openai.AzureOpenAI.beta.assistants) with Azure AI Agent Framework SDK (azure-ai-agents AgentsClient):
- Use AgentsClient.create_agent() for dynamic agent creation
- Use FunctionTool with real Python callables (not JSON schemas)
- Use ToolSet + enable_auto_function_calls() for automatic tool execution
- Use create_thread_and_process_run() for the agent execution loop
- Fix tool function signatures for string args from SDK

The multi-agent workflow now properly uses the Agent Framework pattern:
AgentsClient → create_agent() → FunctionTool → ToolSet → enable_auto_function_calls() → create_thread_and_process_run()

E2E validated: all 5 rows, correct routing, tool_calls captured, evaluators scored (ToolCallAccuracy: 3.0, IntentResolution: 3.6).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Replace azure-ai-agents AgentsClient with the actual Microsoft Agent Framework SDK (pip install agent-framework agent-framework-foundry):
- Agent: creates agents with instructions and @tool-decorated functions
- FoundryChatClient: connects to Azure AI Foundry model deployments
- Agent.run(): executes agent with automatic tool call handling
- @tool decorator: wraps Python functions for Agent Framework

The workflow pattern is now:
FoundryChatClient → Agent(client, tools=[@tool]) → Agent.run(query)
Router Agent.run() → determines specialist → Specialist Agent.run()

Tool calls captured via @tool wrapper functions. Agent Framework auto-executes tools and logs 'Function X succeeded'.

E2E: 5/5 rows, correct routing, all tools called with correct args. Scores: ToolCallAccuracy 3.0, IntentResolution 3.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Use 'from agent_framework.foundry import FoundryChatClient' matching the official Microsoft Agent Framework samples, instead of the internal 'agent_framework_foundry' package import.

Reference: microsoft/agent-framework samples/03-workflows/step2_agents_in_a_workflow.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


Rewrite multi_agent_workflow.py to follow the official Microsoft Agent Framework workflow pattern (microsoft/agent-framework samples):
- WorkflowBuilder with edges connecting Router, Coordinator, Specialists
- Custom RoutingCoordinator(Executor) with @handler for routing logic
- AgentExecutor wraps each Agent for workflow integration
- ctx.send_message() for inter-agent communication
- ctx.yield_output() for workflow output

Workflow executes as proper supersteps:
Superstep 1: Router → Coordinator (routing decision)
Superstep 2: Specialist + @tool auto-execution
Superstep 3: Coordinator → yield output

E2E: 5/5 rows, correct routing, tools executed, evaluators scored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace azure-ai-agents AgentsClient with Microsoft Agent Framework FoundryAgent (agent_framework.foundry.FoundryAgent) for pre-deployed agent evaluation.

Reference: microsoft/agent-framework samples/02-agents/providers/foundry/foundry_agent_basic.py

E2E validated: 5/5 rows against FoundryAgent, evaluators scored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


…ture

The adapter was returning only response text with no tool_calls, making tool-related evaluators produce meaningless scores. Now provides local @tool implementations that Agent Framework auto-executes when the Foundry agent makes tool calls. Invocations are captured and returned for evaluator scoring.

Flow:
run_evaluation(input, context)
→ FoundryAgent(tools=[get_weather, ...]).run(input)
→ Agent calls get_weather
→ Framework auto-executes locally
→ _captured_tool_calls records the invocation
→ return {response, tool_calls}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
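
The capture step in this flow can be sketched in plain Python. The `capture_calls` decorator below is a hypothetical stand-in, not the Agent Framework's @tool, and the shape of each recorded entry (name, arguments, result) is an assumption. The direct call to `get_weather` stands in for the framework executing a tool call requested by the agent.

```python
import functools
from typing import Any, Callable, Dict, List, Optional

_captured_tool_calls: List[Dict[str, Any]] = []


def capture_calls(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record every keyword-argument invocation of the wrapped tool."""

    @functools.wraps(func)
    def wrapper(**kwargs: Any) -> Any:
        result = func(**kwargs)
        _captured_tool_calls.append(
            {"name": func.__name__, "arguments": kwargs, "result": result}
        )
        return result

    return wrapper


@capture_calls
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # canned response for the sketch


def run_evaluation(input_text: str, context: Optional[dict] = None) -> Dict[str, Any]:
    """Return the response together with the tool calls made to produce it."""
    _captured_tool_calls.clear()  # start each evaluation row with a clean log
    response = get_weather(city=input_text)  # stand-in for the agent's tool call
    return {"response": response, "tool_calls": list(_captured_tool_calls)}
```

Clearing the module-level list at the start of each call keeps one row's tool calls from leaking into the next; a per-call list passed through a closure would avoid the shared state altogether.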


… tools

Replace FoundryAgent (requires server-side tool declarations) with Agent + FoundryChatClient (tools defined entirely in code).

Both adapters now use the same Microsoft Agent Framework pattern:
- agent_framework_adapter: single Agent with @tool functions
- multi_agent_workflow: WorkflowBuilder + Router + Coordinator + Specialists

E2E validated both: tool_calls captured in all 5 rows for both adapters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Dongbumlee and others added 15 commits April 23, 2026 12:37

fix: resolve E402 lint error — move logger after imports

4fb1836

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

fix: resolve pre-existing ruff lint errors (F401, F841)

0f30bf0


placerda mentioned this pull request Apr 27, 2026

AgentOps Revamp — First Useful Release (1.0.0) #107

Closed

placerda merged commit a362e5c into develop Apr 27, 2026
12 checks passed

placerda merged 15 commits into develop from fix/graceful-skip-cloud-only-evaluators


2 participants
