Evals #11

@mlund01

Intro

Evaluating a mission's performance against expectations is essential to keeping runs consistent and protecting against drift. Because squadron is purpose-built to move through a series of logical gateways and to produce structured data outputs, evals fit naturally into its model.

Goal

Great evals test outcomes. They are not intended to assess the reasoning or logic used to reach those outcomes. Based on eval results, developers can then debug their own missions by digging into the reasoning and logic to understand how the mission arrived at its outcomes.

Solution Proposal

Add a config-driven evaluation (eval) framework. This framework will allow users of squadron to create test cases in the form of LLM-as-a-judge workflows, complete with measures that rate the success or quality of a mission, or categorize/label aspects of it, based on predefined criteria.

On the squadron side, there will be an "eval" agent, designed similarly to the commander or a regular agent, but with its own set of tools and its own system prompts.
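
To make the three measure kinds concrete (true/false, categories, ranges), here is a minimal Python sketch of how a test case and its measures might be modeled, and how a judge's verdict could be checked against them. Every name here (`EvalCase`, `BooleanMeasure`, `CategoryMeasure`, `RangeMeasure`, `invalid_verdicts`) is hypothetical and only illustrates the idea; it is not squadron's actual config schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class BooleanMeasure:
    """Hypothetical true/false measure."""
    name: str

    def validate(self, value: object) -> bool:
        return isinstance(value, bool)


@dataclass(frozen=True)
class CategoryMeasure:
    """Hypothetical measure whose value must be one of a fixed set of labels."""
    name: str
    labels: tuple[str, ...]

    def validate(self, value: object) -> bool:
        return isinstance(value, str) and value in self.labels


@dataclass(frozen=True)
class RangeMeasure:
    """Hypothetical numeric measure bounded by low and high (inclusive)."""
    name: str
    low: float
    high: float

    def validate(self, value: object) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        return self.low <= value <= self.high


Measure = Union[BooleanMeasure, CategoryMeasure, RangeMeasure]


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical test case: predefined criteria plus the measures to report."""
    name: str
    mission: str
    criteria: str
    measures: tuple[Measure, ...]


def invalid_verdicts(case: EvalCase, verdict: dict) -> list[str]:
    """Return names of measures whose judge value is missing or malformed."""
    return [
        m.name
        for m in case.measures
        if m.name not in verdict or not m.validate(verdict[m.name])
    ]
```

Validating the judge's structured verdict before storing it keeps malformed LLM output from silently skewing results.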

Eval abilities

  • to create test cases that measure a variety of conditions using LLM-as-a-judge assessments (results can be true/false, categories, or ranges)
  • to have these run once every x mission runs
  • to run an eval against historical runs of a mission, selected by conditions (largely date-based, or something like "last 5 runs")
  • to focus on actions, outputs, datasets, and final answers. The judges will not have access to the session itself, only the actions it called and the data it created along the way
  • to scope by a task or by the full mission
  • to verify that certain conditionals were triggered
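
A few of the abilities above can be sketched in Python: sampling once every x runs, selecting historical runs by date or "last N", and limiting what the judge sees. This is a rough illustration under assumptions. `MissionRun`, `should_sample`, `select_runs`, and `judge_view` are hypothetical names, and the fields on a run are guesses at what squadron might record.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class MissionRun:
    """Hypothetical record of one mission run."""
    run_number: int
    finished_at: datetime
    actions: tuple[str, ...] = ()
    outputs: dict = field(default_factory=dict)
    datasets: dict = field(default_factory=dict)
    final_answer: str = ""
    session_transcript: str = ""  # assumed field; never passed to the judge


def should_sample(run_number: int, every: int) -> bool:
    """Run the eval once every `every` mission runs."""
    return every > 0 and run_number % every == 0


def select_runs(
    runs: list[MissionRun],
    last_n: Optional[int] = None,
    since: Optional[datetime] = None,
) -> list[MissionRun]:
    """Pick historical runs by date and/or by 'last N', oldest first."""
    selected = sorted(runs, key=lambda r: r.finished_at)
    if since is not None:
        selected = [r for r in selected if r.finished_at >= since]
    if last_n is not None:
        selected = selected[-last_n:] if last_n > 0 else []
    return selected


def judge_view(run: MissionRun) -> dict:
    """Expose only actions, outputs, datasets, and the final answer."""
    return {
        "actions": list(run.actions),
        "outputs": dict(run.outputs),
        "datasets": dict(run.datasets),
        "final_answer": run.final_answer,
    }
```

Building the judge's input through a single function such as `judge_view` makes the "no session access" rule easy to audit, since the session is simply never copied into the payload.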

Supporting abilities

  • trigger an eval, which simply means run the mission, then run the eval
  • eval results analytics page for each eval group (for command center)
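
As a rough sketch of the two supporting abilities, triggering an eval can be modeled as composing "run the mission" with "run the eval", and an analytics page could start from a simple aggregate such as a pass rate. The functions `trigger_eval` and `pass_rate`, and the dict shapes they use, are assumptions for illustration only, not squadron's API.

```python
from typing import Callable, List


def trigger_eval(
    run_mission: Callable[[], dict],
    run_eval: Callable[[dict], dict],
) -> dict:
    """Run the mission first, then run the eval on that run's result."""
    run = run_mission()
    return run_eval(run)


def pass_rate(results: List[dict], measure: str) -> float:
    """Fraction of results where a boolean measure is True.

    Results that lack the measure, or hold a non-boolean value, are ignored.
    Returns 0.0 when no result carries the measure.
    """
    values = [r[measure] for r in results if isinstance(r.get(measure), bool)]
    return sum(values) / len(values) if values else 0.0
```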
