Intro
Evaluating a mission's performance against expectations is essential to ensuring consistency across runs and protecting against drift. Because squadron is purpose-built to move through a series of logical gateways and to produce structured data outputs, evals fit naturally into the model.
Goal
Great evals test outcomes. They are not intended to assess the reasoning or logic used to reach those outcomes. Based on eval results, developers can debug their own missions by digging into the reasoning and logic to understand how a mission arrived at its outcomes.
Solution Proposal
Add a config-driven evaluation (eval) framework. This framework will allow users of squadron to create test cases in the form of LLM-as-a-judge workflows, complete with measures that rate the success or quality of a mission, or categorize/label aspects of it, based on predefined criteria.
On the squadron side, there will be an "eval" agent, designed similarly to the commander or a regular agent, but with its own set of tools and its own system prompts.
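As an illustration of what a config-driven test case with a measure might look like, here is a minimal Python sketch. Every name in it (EvalCase, BooleanMeasure, CategoryMeasure, RangeMeasure, is_valid_verdict) is hypothetical and not part of squadron; it only shows how the three measure types (true/false, categories, ranges) could be declared and how a judge's verdict could be checked against them.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple, Union


@dataclass(frozen=True)
class BooleanMeasure:
    """Hypothetical measure: the judge answers true or false."""
    name: str


@dataclass(frozen=True)
class CategoryMeasure:
    """Hypothetical measure: the judge picks one predefined label."""
    name: str
    categories: Tuple[str, ...]


@dataclass(frozen=True)
class RangeMeasure:
    """Hypothetical measure: the judge scores within [minimum, maximum]."""
    name: str
    minimum: float
    maximum: float


Measure = Union[BooleanMeasure, CategoryMeasure, RangeMeasure]


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test case as it might be loaded from config."""
    name: str
    mission: str
    criteria: str                 # predefined criteria handed to the judge
    measure: Measure
    task: Optional[str] = None    # None means the eval covers the full mission


def is_valid_verdict(measure: Measure, value: Any) -> bool:
    """Check that a judge's verdict has the shape the measure expects."""
    if isinstance(measure, BooleanMeasure):
        return isinstance(value, bool)
    if isinstance(measure, CategoryMeasure):
        return value in measure.categories
    if isinstance(measure, RangeMeasure):
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and measure.minimum <= value <= measure.maximum
    return False


case = EvalCase(
    name="summary-quality",
    mission="weekly-report",
    criteria="The final answer summarizes every dataset the mission created.",
    measure=RangeMeasure(name="quality", minimum=1, maximum=5),
)
```

Keeping the measure separate from the criteria means a malformed judge response can be rejected before it reaches any analytics.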
Eval abilities
- to create test cases that measure a variety of conditions using LLM-as-a-judge assessments (results can be true/false, categories, or ranges)
- to run on a sample of mission runs (once every x runs)
- to run an eval against historical runs of a mission, selected by condition (largely date-based, or something like "the last 5 runs")
- to focus on actions, outputs, datasets, and final answers. Judges will not have access to the session itself, only to the actions it took and the data it called or created along the way
- to scope an eval to a single task or to the full mission
- to verify that certain conditionals were triggered
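The sampling and historical-selection abilities above can be sketched as follows. This is a minimal sketch under stated assumptions: MissionRun, should_sample, and select_historical_runs are hypothetical names, and it assumes each run carries a 1-based run counter and a completion timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class MissionRun:
    """Hypothetical record of one mission run."""
    run_number: int        # assumed 1-based counter per mission
    finished_at: datetime


def should_sample(run_number: int, every: int) -> bool:
    """Return True for one run out of every `every` runs."""
    if every < 1:
        raise ValueError("every must be >= 1")
    return run_number % every == 0


def select_historical_runs(
    runs: List[MissionRun],
    last_n: Optional[int] = None,
    since: Optional[datetime] = None,
) -> List[MissionRun]:
    """Select past runs by date and/or by "last N", oldest first."""
    ordered = sorted(runs, key=lambda run: run.finished_at)
    if since is not None:
        ordered = [run for run in ordered if run.finished_at >= since]
    if last_n is not None:
        # ordered[-0:] would return everything, so handle 0 explicitly.
        ordered = ordered[-last_n:] if last_n > 0 else []
    return ordered


runs = [MissionRun(run_number=i, finished_at=datetime(2024, 1, i)) for i in range(1, 8)]
last_five = select_historical_runs(runs, last_n=5)
recent = select_historical_runs(runs, since=datetime(2024, 1, 6))
```

Applying the date filter before the "last N" cut means the two conditions combine as "the last N runs since a given date", which is one reasonable reading of the requirement, not the only one.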
Supporting abilities
- trigger an eval, which simply means running the mission and then running the eval
- an eval results analytics page for each eval group (for command center)
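Triggering an eval, as described above, is a two-step sequence: run the mission, then hand its recorded results to the judge. The sketch below assumes hypothetical names (RunRecord, trigger_eval) and stubs both the mission and the judge with plain functions; in practice the judge would be an LLM call. The record deliberately holds only actions, outputs, datasets, and the final answer, reflecting the rule that judges do not see the session itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical judge input: what the mission did and produced, not the session."""
    actions: List[str]
    outputs: Dict[str, Any]
    datasets: Dict[str, Any]
    final_answer: str


def trigger_eval(
    run_mission: Callable[[], RunRecord],
    judge: Callable[[RunRecord], Any],
) -> Any:
    """Run the mission first, then run the eval on what it recorded."""
    record = run_mission()
    return judge(record)


def stub_mission() -> RunRecord:
    # Stand-in for a real mission run.
    return RunRecord(
        actions=["fetch_orders", "write_dataset"],
        outputs={"row_count": 3},
        datasets={"orders": [1, 2, 3]},
        final_answer="Processed 3 orders.",
    )


def stub_judge(record: RunRecord) -> bool:
    # Stand-in for an LLM judge: checks that a dataset was written.
    return "write_dataset" in record.actions


verdict = trigger_eval(stub_mission, stub_judge)
```

Passing the mission and judge in as callables keeps the trigger path identical whether the eval is started manually or by sampling.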