Objective: Playground for experimenting with automated and manual evaluations for learning lab agents.
| Who | Todo |
|---|---|
| Ivor | Create a GitHub Action that generates a response with an agent based on smoke-test.jsonl and stores a new JSONL in results with the added response data |
| Juliane | Research the best way to compare a response to the ground truth and decide what should go in evaluate_agent.py, preferably using something lightweight. Note: the existing evaluate_agent.py is only reference/inspiration; a new script should probably be created |
| MJ | Create a representative test dataset |
learning-lab-agents-evals/
├── readme.md
├── .github/workflows/
│   └── run-evaluation.yml
├── agents/
│   ├── drift-detector.agent.md
│   ├── freshness-report.agent.md
│   ├── learn-content-planner.agent.md
│   ├── learn-module-writer.agent.md
│   ├── learn-unit-writer.agent.md
│   ├── light-unit-writer.agent.md
│   └── skilling-session-pptx.agent.md
├── evaluation/
│   ├── scripts/
│   │   └── evaluate_agent.py
│   └── data/
│       ├── technical-accuracy-results/
│       │   └── smoke-test-results.jsonl
│       └── technical-accuracy-tests/
│           └── smoke-test.jsonl
└── reports/
.github/workflows/run-evaluation.yml: GitHub Actions workflow that runs evaluations on demand. It takes a test file, an agent file, and a results file as inputs, calls the agent for each query, and commits the results.
agents/: The agent definition files under test. Each .agent.md file defines a learning lab agent whose system prompt is used during evaluation.
evaluation/scripts/evaluate_agent.py: Orchestrates a full evaluation run for a single agent. It generates a response for each query and scores it against the ground truth, producing a results JSONL file and a report in one step.
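evaluation/data/technical-accuracy-tests/: Input JSONL files. Each line is a JSON object with a query field and a ground_truth field.
evaluation/data/technical-accuracy-results/: Output JSONL files produced by evaluate_agent.py. Each line is a JSON object with query, response, and ground_truth fields.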
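The input and output record shapes described above can be sketched as a small round trip. This is only an illustration, not the actual evaluate_agent.py: the function name add_responses and the echo_agent stub are hypothetical, and a real run would call the agent instead of the stub.

```python
import json
from pathlib import Path
from typing import Callable


def add_responses(test_path: Path, results_path: Path,
                  generate: Callable[[str], str]) -> int:
    """Copy each test record to results_path with an added "response" field.

    `generate` is a hypothetical callable standing in for the agent call.
    Returns the number of records written.
    """
    results_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with test_path.open(encoding="utf-8") as src, \
            results_path.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # tolerate blank lines in the input file
            record = json.loads(line)
            result = {
                "query": record["query"],
                "response": generate(record["query"]),
                "ground_truth": record["ground_truth"],
            }
            dst.write(json.dumps(result, ensure_ascii=False) + "\n")
            count += 1
    return count


def echo_agent(query: str) -> str:
    """Hypothetical stub; replace with a real agent call."""
    return f"stub response to: {query}"
```

Calling add_responses(Path("smoke-test.jsonl"), Path("smoke-test-results.jsonl"), echo_agent) would write one result line per test line, keeping query and ground_truth unchanged.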
reports/: Markdown reports generated by evaluate_agent.py, with pass rates and per-query scores.
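How responses are scored against ground truth is still an open question in this repo, so the following is only one lightweight possibility, not the method evaluate_agent.py uses. It assumes plain string similarity from the standard library's difflib and a hypothetical pass threshold of 0.6, and it shows how a pass rate and per-query scores could be rendered as a Markdown report.

```python
from difflib import SequenceMatcher


def similarity(response: str, ground_truth: str) -> float:
    """Case-insensitive character-level similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, response.lower(), ground_truth.lower()).ratio()


def build_report(results: list[dict], threshold: float = 0.6) -> str:
    """Render a Markdown report with the pass rate and per-query scores.

    `threshold` is a hypothetical cutoff; a query passes when its
    similarity score is at least this value.
    """
    rows = []
    passed = 0
    for r in results:
        score = similarity(r["response"], r["ground_truth"])
        ok = score >= threshold
        passed += ok
        rows.append(f"| {r['query']} | {score:.2f} | {'pass' if ok else 'fail'} |")
    pass_rate = passed / len(results) if results else 0.0
    header = [
        f"Pass rate: {pass_rate:.0%} ({passed}/{len(results)})",
        "",
        "| Query | Score | Result |",
        "|---|---|---|",
    ]
    return "\n".join(header + rows)
```

String similarity is cheap and dependency-free, but it rewards surface overlap rather than technical correctness, so it is probably best treated as a baseline to compare other approaches against.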