Evals

### Intro

Evaluating a mission's performance against expectations is essential for keeping runs consistent and guarding against drift. Because squadron is purpose-built to move through a series of logical gateways and produce structured data outputs, evals fit naturally into the model.

### Goal

Great evals test outcomes, not the reasoning or logic used to reach them. Based on eval results, developers can debug their own missions by digging into the reasoning and logic to understand how the mission arrived at those outcomes.

### Solution Proposal

Add a config-driven evaluation (eval) framework. It will let squadron users create test cases in the form of LLM-as-a-judge workflows, with measures that rate success or quality, or that categorize/label aspects of a mission, based on predefined criteria.

On the squadron side, there will be an "eval" agent, designed similarly to the commander or a regular agent, but with its own set of tools and its own system prompts.
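
To make the config-driven idea concrete, here is a minimal sketch of what a test case and its measure could look like once loaded from config. Every name below (`EvalCase`, `BooleanMeasure`, `CategoryMeasure`, `RangeMeasure`, `is_valid_verdict`) is hypothetical; this proposal does not define a schema, so treat this as one possible shape rather than the intended design.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple, Union

# Hypothetical types: squadron's real eval config schema is not defined yet.

@dataclass(frozen=True)
class BooleanMeasure:
    name: str  # judge answers true/false

@dataclass(frozen=True)
class CategoryMeasure:
    name: str
    labels: Tuple[str, ...]  # judge picks exactly one label

@dataclass(frozen=True)
class RangeMeasure:
    name: str
    low: float
    high: float  # judge returns a number in [low, high]

Measure = Union[BooleanMeasure, CategoryMeasure, RangeMeasure]

@dataclass(frozen=True)
class EvalCase:
    name: str
    mission: str
    criteria: str                # predefined criteria handed to the judge
    measure: Measure
    task: Optional[str] = None   # None means the eval covers the full mission

def is_valid_verdict(measure: Measure, verdict: Any) -> bool:
    """Check that a judge's verdict matches the shape the measure expects."""
    if isinstance(measure, BooleanMeasure):
        return isinstance(verdict, bool)
    if isinstance(measure, CategoryMeasure):
        return verdict in measure.labels
    if isinstance(measure, RangeMeasure):
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(verdict, (int, float)) and not isinstance(verdict, bool)
        return is_number and measure.low <= verdict <= measure.high
    return False

# Example: a 1-5 quality rating for a whole (hypothetical) mission.
case = EvalCase(
    name="summary-quality",
    mission="weekly-report",
    criteria="The final answer summarizes every dataset the mission produced.",
    measure=RangeMeasure(name="quality", low=1, high=5),
)
```

Validating the verdict against the measure keeps judge output structured, which is what would let results be aggregated later.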

### Eval abilities

- to create test cases that measure a variety of conditions using LLM-as-a-judge assessments (results can be true/false, categories, or ranges)
- to run on a sample of mission runs (1 in every x runs)
- to run an eval against historical runs of a mission, selected by condition (largely date-based, or something like "last 5 runs")
- to focus on actions, outputs, datasets, and final answers. Judges will not have access to the session itself, only to the actions it took and the data it called or created along the way
- to scope an eval to a single task or to the full mission
- to verify that certain conditionals were triggered
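
The sampling and historical-run abilities above could be sketched as two small helpers. The names `MissionRun`, `should_sample`, and `select_runs`, and the idea that each run carries a run number and a finish time, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass(frozen=True)
class MissionRun:
    # Hypothetical record; the real run metadata may differ.
    run_number: int        # 1-based count of runs of this mission
    finished_at: datetime

def should_sample(run_number: int, every_n: int) -> bool:
    """Return True for 1 in every `every_n` runs (run every_n, 2*every_n, ...)."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return run_number % every_n == 0

def select_runs(
    runs: List[MissionRun],
    since: Optional[datetime] = None,
    last_n: Optional[int] = None,
) -> List[MissionRun]:
    """Pick historical runs by date, by 'last N runs', or both (date filter first)."""
    picked = sorted(runs, key=lambda r: r.finished_at)
    if since is not None:
        picked = [r for r in picked if r.finished_at >= since]
    if last_n is not None:
        # Guard last_n == 0: picked[-0:] would return the whole list.
        picked = picked[-last_n:] if last_n > 0 else []
    return picked

# Example: seven runs on consecutive days, then "last 5 runs".
runs = [MissionRun(i, datetime(2024, 1, i)) for i in range(1, 8)]
recent = select_runs(runs, last_n=5)
```

A modulo check is the simplest deterministic reading of "1 in every x runs"; random sampling at rate 1/x would be an equally plausible alternative.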

### Supporting abilities
- trigger an eval, which simply means running the mission and then running the eval
- eval results analytics page for each eval group (for command center)
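
As a rough illustration of both supporting abilities, the sketch below triggers an eval by running a mission and then judging it, and then tallies verdicts the way an analytics page might. The function names, the record keys (`actions`, `outputs`, `datasets`, `final_answer`, `session`), and the callable-based interface are all assumptions, not squadron APIs.

```python
from collections import Counter
from typing import Any, Callable, Dict, List

# Hypothetical keys a judge is allowed to see; the session itself is excluded.
JUDGE_VISIBLE_KEYS = ("actions", "outputs", "datasets", "final_answer")

def trigger_eval(
    run_mission: Callable[[], Dict[str, Any]],
    run_eval: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Run the mission, then run the eval on the judge-visible part of its record."""
    record = run_mission()
    judge_view = {k: record[k] for k in JUDGE_VISIBLE_KEYS if k in record}
    return run_eval(judge_view)

def summarize(verdicts: List[Any]) -> Dict[str, int]:
    """Count how often each verdict occurred within one eval group."""
    return dict(Counter(str(v) for v in verdicts))

# Example with stand-in callables in place of a real mission and judge.
def fake_mission() -> Dict[str, Any]:
    return {"session": "raw transcript", "actions": ["search"], "final_answer": "42"}

def fake_judge(view: Dict[str, Any]) -> bool:
    return "session" not in view and view["final_answer"] == "42"

verdict = trigger_eval(fake_mission, fake_judge)
```

Passing the mission and judge as callables keeps the sketch independent of how squadron actually launches either one.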



