fix: gracefully skip cloud-only evaluators in local execution mode #106
Merged
When running evaluations locally (hosting: local, execution_mode: local), some Foundry evaluators (TaskCompletionEvaluator, ToolSelectionEvaluator, ToolInputAccuracyEvaluator, TaskNavigationEfficiencyEvaluator) are only available via the Foundry Cloud Evaluation API (builtin.*) and do not exist in the azure-ai-evaluation Python SDK.

Previously, using agent_workflow_baseline bundle with local execution crashed with 'Unknown built-in Foundry evaluator class'. Now these evaluators are gracefully skipped with clear warning messages, and threshold checks exclude skipped evaluators from validation.

Changes:
- eval_engine.py: Add _CloudOnlyEvaluatorError sentinel; catch AttributeError in _load_foundry_evaluator_callable and raise cloud-only error; catch in _build_foundry_evaluator_runtimes and warn+skip
- runner.py: Change _validate_enabled_evaluators_scored to warn instead of raise for missing scores; skip threshold checks for evaluators with no scores in _evaluate_item_thresholds and _summarize_thresholds_from_items

Tested against:
- Local callable with agent_workflow_baseline: 3 evaluators skipped, 3 scored successfully
- Cloud model-direct: all evaluators submitted to Foundry API
- Cloud Foundry agent: all 6 evaluators scored via cloud API

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
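The skip behavior described in this commit can be sketched roughly as follows. This is an illustration only, not the code in eval_engine.py or runner.py: the names `CloudOnlyEvaluatorError`, `load_evaluator`, `build_runtimes`, and `check_thresholds` are hypothetical stand-ins, the `module_name` parameter exists only so the sketch runs without the SDK installed, and catching `ImportError` alongside `AttributeError` is an assumption added for that reason.

```python
import importlib
import logging
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("agentops.sketch")


class CloudOnlyEvaluatorError(Exception):
    """Sentinel: the evaluator is not present in the locally importable SDK."""


def load_evaluator(class_name: str, module_name: str = "azure.ai.evaluation") -> Callable:
    """Look up an evaluator by name; raise the sentinel if the module lacks it."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # Assumption: a missing attribute means "cloud-only", as the commit describes.
        raise CloudOnlyEvaluatorError(class_name) from exc


def build_runtimes(
    names: List[str], module_name: str = "azure.ai.evaluation"
) -> Tuple[Dict[str, Callable], List[str]]:
    """Load every requested evaluator, warning about and skipping cloud-only ones."""
    runtimes: Dict[str, Callable] = {}
    skipped: List[str] = []
    for name in names:
        try:
            runtimes[name] = load_evaluator(name, module_name)
        except CloudOnlyEvaluatorError:
            logger.warning("Skipping %s: not available for local execution", name)
            skipped.append(name)
    return runtimes, skipped


def check_thresholds(
    scores: Dict[str, Optional[float]], thresholds: Dict[str, float]
) -> Dict[str, bool]:
    """Evaluate minimum-score thresholds, ignoring evaluators that have no score."""
    results: Dict[str, bool] = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        if score is None:
            continue  # skipped evaluator: excluded from validation
        results[name] = score >= minimum
    return results
```

Raising a dedicated exception type lets the caller tell "this evaluator cannot run locally" apart from a genuine configuration error, which should presumably still fail loudly.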
…noise

Follow-up items from issue #79 E2E validation:

1. Add run-agent-local.yaml template to agentops init scaffold: combines type:agent + hosting:local + framework:agent_framework with agent_workflow_baseline bundle for local workflow testing.
2. Document TaskAdherenceEvaluator behavior: SDK expects multi-turn conversation format (list of message dicts with role/content); single-turn plain text inputs produce low scores. Added note to agent_workflow_baseline.yaml bundle description.
3. Suppress 'Conversation history could not be parsed' warning: add logging filter for azure-ai-evaluation SDK loggers that emit this warning on every single-turn row (expected, harmless).
4. Capture callable return tool_calls in results: extract tool_calls from callable/subprocess output and include in backend_metrics.json row_metrics for post-analysis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
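For the warning suppression in this commit, a minimal sketch of a standard-library logging filter is shown here. The class and function names are hypothetical, and which logger names the SDK actually uses is an assumption that would need checking against the installed package.

```python
import logging


class SuppressMessageFilter(logging.Filter):
    """Drop log records whose formatted message contains a given fragment."""

    def __init__(self, fragment: str) -> None:
        super().__init__()
        self._fragment = fragment

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; True lets it through.
        return self._fragment not in record.getMessage()


def install_message_filter(logger_name: str, fragment: str) -> SuppressMessageFilter:
    """Attach the filter to one named logger and return it so it can be removed later."""
    message_filter = SuppressMessageFilter(fragment)
    logging.getLogger(logger_name).addFilter(message_filter)
    return message_filter


# Hypothetical usage; the logger name below is an assumption, not confirmed by the PR:
# install_message_filter("azure.ai.evaluation", "Conversation history could not be parsed")
```

A filter attached to a logger applies only to records logged on that exact logger, not to records from its child loggers, so each emitting logger needs the filter (or it can be attached to a handler instead).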
Add agent_framework_adapter.py, a dedicated callable adapter that invokes Azure AI Foundry agents locally via the Foundry REST API.

Supports two agent ID patterns:
- 'asst_*' IDs: Threads/Runs API (create thread → message → run → poll)
- Named agents (e.g. 'FoundryAgent'): Responses API with agent_reference

The adapter extracts response text and tool_calls from the API response, matching the contract expected by agent_workflow_baseline evaluators.

Also updates:
- callable_adapter.py: Add Option 4 pointing to agent_framework_adapter
- run-agent-local.yaml: Document both adapter options
- initializer.py: Include adapter in agentops init scaffold (26 files)

Tested E2E against real FoundryAgent via Responses API: all 5 rows processed with evaluator scores returned.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add multi_agent_workflow.py, a sample multi-agent orchestration callable that uses Azure AI Agent Framework SDK (Assistants API) to dynamically create and coordinate agents:

  Router Agent → analyzes query, selects specialist
  WeatherSpecialist → get_weather tool
  FinanceSpecialist → convert_currency, calculate_compound_interest tools
  SearchSpecialist → search_news, search_flights tools

Agents are created dynamically per evaluation row and cleaned up after use. Tool calls are captured from requires_action/submit flow and returned in the callable response.

E2E validated: all 5 smoke-agent-tools rows processed with correct routing (weather/finance/search), tool invocations with correct arguments, and evaluator scores (ToolCallAccuracy: 3.0, IntentResolution: 3.8).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…from org/repo)

Add --from flag to 'agentops skills install' for installing community skills from GitHub repositories following the agentskills.io standard.

Features:
- Parse github:org/repo[@ref] references with version pinning
- Download repo tarball (single request, no API pagination)
- Extract skill following agentskills.io convention (SKILL.md + references/)
- Platform-aware installation (Copilot, Claude, Cursor)
- Security: path sanitization, scripts/ blocked by default, traversal prevention
- Provenance: .installed-from.json tracks source repo, ref, and installed files
- Auth: GITHUB_TOKEN / GH_TOKEN env var support for private repos

Usage:
  agentops skills install --from donlee/pptx-designer
  agentops skills install --from github:org/repo@v1.0 --platform claude

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
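As a rough illustration of the reference parsing and path checks listed above, the sketch below accepts both forms shown in the usage lines. It is not the code in skills.py: `SkillSource`, `parse_skill_source`, and `is_safe_member` are hypothetical names, the allowed character set for org and repo is an assumption, and the path check only handles POSIX-style archive member names.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Optional "github:" prefix, then org/repo, then an optional "@ref" pin.
_SOURCE_PATTERN = re.compile(
    r"^(?:github:)?(?P<org>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+)"
    r"(?:@(?P<ref>[^\s@]+))?$"
)


@dataclass(frozen=True)
class SkillSource:
    org: str
    repo: str
    ref: Optional[str]  # None means "no pin given"


def parse_skill_source(text: str) -> SkillSource:
    """Parse 'org/repo', 'github:org/repo', or either form with '@ref' appended."""
    match = _SOURCE_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"Not a valid skill source: {text!r}")
    return SkillSource(match["org"], match["repo"], match["ref"])


def is_safe_member(relative_path: str) -> bool:
    """Reject absolute paths, parent-directory traversal, and anything under scripts/."""
    path = PurePosixPath(relative_path)
    if not path.parts or path.is_absolute() or ".." in path.parts:
        return False
    if path.parts[0] == "scripts":
        return False  # blocked by default, per the feature list
    return True
```

Checking each archive member before writing it keeps a crafted tarball from placing files outside the install directory.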
Fix 2 lint errors that existed on develop branch:
- skills.py: remove unused import 'Any'
- test_skills.py: remove unused variable 'result'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- multi_agent_workflow.py: explicit str()/list() casts for SPECIALISTS dict values
- local_adapter_backend.py: add assert for adapter_command before shlex.split

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…k SDK

Replace OpenAI Assistants API (openai.AzureOpenAI.beta.assistants) with Azure AI Agent Framework SDK (azure-ai-agents AgentsClient):
- Use AgentsClient.create_agent() for dynamic agent creation
- Use FunctionTool with real Python callables (not JSON schemas)
- Use ToolSet + enable_auto_function_calls() for automatic tool execution
- Use create_thread_and_process_run() for the agent execution loop
- Fix tool function signatures for string args from SDK

The multi-agent workflow now properly uses the Agent Framework pattern:
  AgentsClient → create_agent() → FunctionTool → ToolSet → enable_auto_function_calls() → create_thread_and_process_run()

E2E validated: all 5 rows, correct routing, tool_calls captured, evaluators scored (ToolCallAccuracy: 3.0, IntentResolution: 3.6).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace azure-ai-agents AgentsClient with the actual Microsoft Agent Framework SDK (pip install agent-framework agent-framework-foundry):
- Agent: creates agents with instructions and @tool-decorated functions
- FoundryChatClient: connects to Azure AI Foundry model deployments
- Agent.run(): executes agent with automatic tool call handling
- @tool decorator: wraps Python functions for Agent Framework

The workflow pattern is now:
  FoundryChatClient → Agent(client, tools=[@tool]) → Agent.run(query)
  Router Agent.run() → determines specialist → Specialist Agent.run()

Tool calls captured via @tool wrapper functions. Agent Framework auto-executes tools and logs 'Function X succeeded'.

E2E: 5/5 rows, correct routing, all tools called with correct args. Scores: ToolCallAccuracy 3.0, IntentResolution 3.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use 'from agent_framework.foundry import FoundryChatClient' matching the official Microsoft Agent Framework samples, instead of the internal 'agent_framework_foundry' package import.

Reference: microsoft/agent-framework samples/03-workflows/ step2_agents_in_a_workflow.py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite multi_agent_workflow.py to follow the official Microsoft Agent Framework workflow pattern (microsoft/agent-framework samples):
- WorkflowBuilder with edges connecting Router, Coordinator, Specialists
- Custom RoutingCoordinator(Executor) with @handler for routing logic
- AgentExecutor wraps each Agent for workflow integration
- ctx.send_message() for inter-agent communication
- ctx.yield_output() for workflow output

Workflow executes as proper supersteps:
  Superstep 1: Router → Coordinator (routing decision)
  Superstep 2: Specialist + @tool auto-execution
  Superstep 3: Coordinator → yield output

E2E: 5/5 rows, correct routing, tools executed, evaluators scored.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace azure-ai-agents AgentsClient with Microsoft Agent Framework
FoundryAgent (agent_framework.foundry.FoundryAgent) for pre-deployed
agent evaluation.
Reference: microsoft/agent-framework samples/02-agents/providers/
foundry/foundry_agent_basic.py
E2E validated: 5/5 rows against FoundryAgent, evaluators scored.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ture

The adapter was returning only response text with no tool_calls, making tool-related evaluators produce meaningless scores. Now provides local @tool implementations that Agent Framework auto-executes when the Foundry agent makes tool calls. Invocations are captured and returned for evaluator scoring.

Flow:
  run_evaluation(input, context)
  → FoundryAgent(tools=[get_weather, ...]).run(input)
  → Agent calls get_weather
  → Framework auto-executes locally
  → _captured_tool_calls records the invocation
  → return {response, tool_calls}

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
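The capture step in this flow can be illustrated without the Agent Framework at all. In the sketch below, the agent is replaced by a direct call to the tool, `capture_tool_call` is a hypothetical plain-Python decorator standing in for the framework's @tool wrapper, and the shape of each recorded entry (name, arguments, result) is an assumption rather than the adapter's actual contract.

```python
import functools
from typing import Any, Callable, Dict, List, Optional

# Module-level record of tool invocations for the current evaluation row.
_captured_tool_calls: List[Dict[str, Any]] = []


def capture_tool_call(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so each keyword-argument invocation is recorded after it runs."""

    @functools.wraps(func)
    def wrapper(**kwargs: Any) -> Any:
        result = func(**kwargs)
        _captured_tool_calls.append(
            {"name": func.__name__, "arguments": dict(kwargs), "result": result}
        )
        return result

    return wrapper


@capture_tool_call
def get_weather(city: str) -> str:
    """Stub tool; a real implementation would query a weather service."""
    return f"Sunny in {city}"


def run_evaluation(input_text: str, context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the response together with the tool calls made while producing it."""
    _captured_tool_calls.clear()  # start each row with an empty record
    # Stand-in for the agent run: here the "agent" always calls get_weather.
    response = get_weather(city=input_text)
    return {"response": response, "tool_calls": list(_captured_tool_calls)}
```

Clearing the record at the start of each call keeps invocations from one row from leaking into the next; a shared module-level list would not be safe if rows were processed concurrently.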
… tools

Replace FoundryAgent (requires server-side tool declarations) with Agent + FoundryChatClient (tools defined entirely in code).

Both adapters now use the same Microsoft Agent Framework pattern:
- agent_framework_adapter: single Agent with @tool functions
- multi_agent_workflow: WorkflowBuilder + Router + Coordinator + Specialists

E2E validated both: tool_calls captured in all 5 rows for both adapters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Validates issue #79 — Agent Framework multi-agent workflow evaluation with AgentOps. Fixes cloud-only evaluator crash in local mode, adds Agent Framework multi-agent sample, and documents the evaluator availability gap.
Key Finding: Cloud vs Local Evaluator Limitation
The Foundry Cloud Eval API re-runs prompts against models/agents — it cannot score pre-computed outputs from a callable. This means:
(Comparison of `execution_mode: local` and `execution_mode: remote`; table contents not recoverable.)

Conclusion: `execution_mode: local` is the correct path for callable-based workflows. Cloud-only evaluators are gracefully skipped with warnings until the `azure-ai-evaluation` SDK adds them.

Changes
Core Fix: Graceful skip for cloud-only evaluators
- `eval_engine.py`: `_CloudOnlyEvaluatorError` + warn+skip in `_build_foundry_evaluator_runtimes`
- `runner.py`: Warn (not crash) for missing scores; skip threshold checks

Multi-Agent Workflow Sample (Microsoft Agent Framework)
- `multi_agent_workflow.py`: Router→Specialist pattern using `agent_framework.Agent` + `@tool` + `FoundryChatClient`
- `@tool` wrapper

Additional Improvements
- `run-agent-local.yaml`: New template for local agent workflow
- `agent_framework_adapter.py`: Single-agent Foundry adapter (Threads/Runs + Responses API)
- `callable_adapter.py`: Added Agent Framework reference
- `local_adapter_backend.py`: Capture `tool_calls` in results
- `eval_engine.py`: Suppress SDK "Conversation history" warning
- `agent_workflow_baseline.yaml`: Document evaluator availability + TaskAdherence behavior

Evaluator Availability
E2E Test Results
Local: Agent Framework Multi-Agent Workflow
Cloud: Foundry Agent
CI: ruff ✅ | mypy ✅ | 282 tests passed ✅
Acceptance Criteria (Issue #79)
Closes #79