Learning lab agents evals

Objective: Playground for experimenting with automated and manual evaluations for learning lab agents.

This week's todos

| Who | Todo |
| --- | --- |
| Ivor | Create a GitHub Action that generates a response with an agent for the queries in `smoke-test.jsonl` and stores a new JSONL file in results with the response data added. |
| Juliane | Research the best way to compare a response to the ground truth, that is, what should go in `evaluate_agent.py`, preferably using something lightweight. Note: the existing `evaluate_agent.py` is just reference/inspiration; a new script should probably be created. |
| MJ | Create a representative test dataset. |

Repository structure

learning-lab-agents-evals/
├── readme.md
├── .github/workflows/
│   └── run-evaluation.yml
├── agents/
│   ├── drift-detector.agent.md
│   ├── freshness-report.agent.md
│   ├── learn-content-planner.agent.md
│   ├── learn-module-writer.agent.md
│   ├── learn-unit-writer.agent.md
│   ├── light-unit-writer.agent.md
│   └── skilling-session-pptx.agent.md
├── evaluation/
│   ├── scripts/
│   │   └── evaluate_agent.py
│   └── data/
│       ├── technical-accuracy-results/
│       │   └── smoke-test-results.jsonl
│       └── technical-accuracy-tests/
│           └── smoke-test.jsonl
└── reports/

`.github/workflows/run-evaluation.yml`

GitHub Actions workflow that runs evaluations on demand. Takes a test file, agent file, and results file as inputs, calls the agent for each query, and commits the results.
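As an illustration of the three inputs, here is a minimal sketch of a command-line entry point that a workflow step could call. The flag names (`--test-file`, `--agent-file`, `--results-file`) and the function name `parse_args` are assumptions made for this example; the actual workflow inputs may be named differently.

```python
import argparse
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three inputs the workflow description mentions.

    All flag names here are hypothetical, chosen only for illustration.
    """
    parser = argparse.ArgumentParser(
        description="Run an evaluation for one agent against one test file."
    )
    parser.add_argument("--test-file", type=Path, required=True,
                        help="Input JSONL with one test case per line")
    parser.add_argument("--agent-file", type=Path, required=True,
                        help="The .agent.md definition to evaluate")
    parser.add_argument("--results-file", type=Path, required=True,
                        help="Where to write the results JSONL")
    return parser.parse_args(argv)


# Example invocation using paths from the repository structure.
args = parse_args([
    "--test-file", "evaluation/data/technical-accuracy-tests/smoke-test.jsonl",
    "--agent-file", "agents/learn-unit-writer.agent.md",
    "--results-file", "evaluation/data/technical-accuracy-results/smoke-test-results.jsonl",
])
```

Keeping the inputs as explicit flags means the same script can be run locally and from the workflow without changes.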

`agents/`

The agent definition files being tested. Each `.agent.md` file defines a learning lab agent whose system prompt is used during evaluation.

`evaluation/scripts/evaluate_agent.py`

Orchestrates a full evaluation run for a single agent. Generates responses for each query and scores them against ground truth, producing a result JSONL and a report in one step.
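The scoring step could start from something as lightweight as a string-similarity ratio from the standard library. The sketch below is one possible approach, not the method used by the existing script: the function names `normalize`, `score`, and `passes`, and the `0.6` threshold, are all assumptions for illustration. Character-level similarity is a crude proxy for technical accuracy, so it is probably best treated as a baseline.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not affect the score."""
    return " ".join(text.lower().split())


def score(response: str, ground_truth: str) -> float:
    """Return a similarity ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    matcher = SequenceMatcher(
        None, normalize(response), normalize(ground_truth), autojunk=False
    )
    return matcher.ratio()


def passes(response: str, ground_truth: str, threshold: float = 0.6) -> bool:
    """Hypothetical pass rule: the threshold is an assumption and would need tuning."""
    return score(response, ground_truth) >= threshold
```

For example, `score("Hello   World", "hello world")` is `1.0` after normalization, while two strings with no characters in common score `0.0`.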

`evaluation/data/technical-accuracy-tests/`

Input JSONL files. Each line is a JSON object with `query` and `ground_truth` fields.
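A minimal sketch of reading such a file, assuming each line is a standalone JSON object with exactly the two keys named above. The function name `read_test_cases` and the sample record are made up for illustration and are not taken from `smoke-test.jsonl`.

```python
import json
from typing import Iterable, Iterator

REQUIRED_KEYS = {"query", "ground_truth"}


def read_test_cases(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one test case per non-empty JSONL line, checking the required keys."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, such as a trailing newline
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        yield record


# Illustrative content only.
sample_lines = [
    '{"query": "What does this agent produce?", "ground_truth": "A unit draft."}',
    "",
]
cases = list(read_test_cases(sample_lines))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `read_test_cases(open(path, encoding="utf-8"))`.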

`evaluation/data/technical-accuracy-results/`

Output JSONL files produced by `evaluate_agent.py`. Each line is a JSON object with `query`, `response`, and `ground_truth` fields.
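The result format can be sketched as each test case plus a `response` field, written one JSON object per line. In this sketch, `write_results` and the `respond` callable are hypothetical names; `respond` stands in for whatever actually calls the agent, which the source does not specify.

```python
import io
import json
from typing import Callable, Iterable, TextIO


def write_results(
    cases: Iterable[dict], respond: Callable[[str], str], out: TextIO
) -> int:
    """Write one result line per test case and return the number of lines written."""
    count = 0
    for case in cases:
        result = {
            "query": case["query"],
            "response": respond(case["query"]),  # the only field added to the input
            "ground_truth": case["ground_truth"],
        }
        out.write(json.dumps(result, ensure_ascii=False) + "\n")
        count += 1
    return count


# Demo with an in-memory buffer and a stub in place of a real agent call.
buffer = io.StringIO()
written = write_results(
    [{"query": "ping", "ground_truth": "pong"}],
    respond=lambda query: f"echo: {query}",
    out=buffer,
)
```

Passing the agent call in as a function keeps the file handling testable without network access.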

`reports/`

Markdown reports generated by `evaluate_agent.py` with pass rates and per-query scores.
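A report of this kind might be rendered along the following lines. This is a sketch under stated assumptions: the function name `render_report`, the layout, the `score` key on each row, and the pass threshold are all hypothetical and not taken from the existing script.

```python
from __future__ import annotations


def render_report(agent_name: str, scored: list[dict], threshold: float) -> str:
    """Render a Markdown report with an overall pass rate and one table row per query.

    Each item in `scored` is assumed to have a "query" string and a numeric "score".
    """
    total = len(scored)
    passed = sum(1 for row in scored if row["score"] >= threshold)
    pass_rate = passed / total if total else 0.0
    lines = [
        f"# Evaluation report: {agent_name}",
        "",
        f"Pass rate: {passed}/{total} ({pass_rate:.0%})",
        "",
        "| # | Query | Score | Result |",
        "| --- | --- | --- | --- |",
    ]
    for index, row in enumerate(scored, start=1):
        query = row["query"].replace("|", "\\|")  # keep table cells intact
        outcome = "pass" if row["score"] >= threshold else "fail"
        lines.append(f"| {index} | {query} | {row['score']:.2f} | {outcome} |")
    return "\n".join(lines) + "\n"


# Illustrative scores only; these are not real evaluation results.
report = render_report(
    "learn-unit-writer",
    [{"query": "first", "score": 0.9}, {"query": "second", "score": 0.2}],
    threshold=0.6,
)
```

Returning the report as a string leaves it to the caller to decide where under `reports/` the file is written.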

