
JulianePadrao/learning-lab-agents-evals

 
 


Learning lab agents evals

Objective: Playground for experimenting with automated and manual evaluations for learning lab agents.

This week's todos

| Who | Todo |
| --- | --- |
| Ivor | Create a GitHub Action that generates a response with an agent based on smoke-test.jsonl and stores a new JSONL file in results with the added response data |
| Juliane | Research the best way to compare a response to ground truth (that is, what should go in evaluate_agent.py), preferably using something lightweight. Note: the existing evaluate_agent.py is just reference/inspiration; a new script should probably be created |
| MJ | Create a representative test dataset |

Repository structure

learning-lab-agents-evals/
├── readme.md
├── .github/workflows/
│   └── run-evaluation.yml
├── agents/
│   ├── drift-detector.agent.md
│   ├── freshness-report.agent.md
│   ├── learn-content-planner.agent.md
│   ├── learn-module-writer.agent.md
│   ├── learn-unit-writer.agent.md
│   ├── light-unit-writer.agent.md
│   └── skilling-session-pptx.agent.md
├── evaluation/
│   ├── scripts/
│   │   └── evaluate_agent.py
│   └── data/
│       ├── technical-accuracy-results/
│       │   └── smoke-test-results.jsonl
│       └── technical-accuracy-tests/
│           └── smoke-test.jsonl
└── reports/

.github/workflows/run-evaluation.yml

GitHub Actions workflow that runs evaluations on demand. Takes a test file, agent file, and results file as inputs, calls the agent for each query, and commits the results.
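
The "call the agent for each query" step might look roughly like the sketch below. It is only an illustration: `generate_responses`, `call_agent`, and `echo_agent` are hypothetical names, and the real workflow may call the agent quite differently. The agent call is passed in as a plain function so the loop can be tried without any model access.

```python
def generate_responses(cases, system_prompt, call_agent):
    """Return one result record per test case.

    `call_agent` is a hypothetical callable (system_prompt, query) -> str;
    swap in whatever client the workflow actually uses.
    """
    results = []
    for case in cases:
        response = call_agent(system_prompt, case["query"])
        results.append({
            "query": case["query"],
            "response": response,
            "ground_truth": case["ground_truth"],
        })
    return results


# Stand-in agent so the loop can run offline (assumption, not a real agent).
def echo_agent(system_prompt, query):
    return f"echo: {query}"


cases = [{"query": "What is a JSONL file?",
          "ground_truth": "One JSON object per line."}]
results = generate_responses(cases, "You are a helpful agent.", echo_agent)
```

Keeping the agent call injectable means the same loop can be exercised in a unit test and in the workflow run.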

agents/

The agent definition files being tested. Each .agent.md defines a learning lab agent whose system prompt is used during evaluation.
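
A minimal sketch of turning an agent file into a system prompt follows. It assumes the .agent.md files may start with a `---` delimited front matter block that should not be sent to the model; this README does not confirm that layout, and `load_system_prompt` and `strip_front_matter` are hypothetical helper names.

```python
from pathlib import Path


def strip_front_matter(text):
    """Drop a leading '---' ... '---' block if one is present (assumed layout)."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[i + 1:]).strip()
    # No front matter (or no closing delimiter): use the whole file.
    return text.strip()


def load_system_prompt(path):
    """Read an agent definition file and return the text used as the system prompt."""
    return strip_front_matter(Path(path).read_text(encoding="utf-8"))
```

If the agent files turn out to have no front matter, `strip_front_matter` returns the text unchanged apart from trimming surrounding whitespace.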

evaluation/scripts/evaluate_agent.py

Orchestrates a full evaluation run for a single agent. Generates responses for each query and scores them against ground truth, producing a result JSONL and a report in one step.
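
How responses should be scored is still an open question (see this week's todos), so the following is only one lightweight candidate, not the method evaluate_agent.py uses. It computes a token-overlap F1 score between a response and its ground truth using the standard library only; `tokenize` and `token_f1` are hypothetical names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(response, ground_truth):
    """Token-overlap F1 in [0, 1]; 1.0 means identical bags of tokens."""
    resp = Counter(tokenize(response))
    truth = Counter(tokenize(ground_truth))
    overlap = sum((resp & truth).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(truth.values())
    return 2 * precision * recall / (precision + recall)
```

Token overlap is cheap and deterministic, but it rewards shared wording rather than technical correctness, so it is probably best treated as a baseline to compare other approaches against.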

evaluation/data/technical-accuracy-tests/

Input JSONL files. Each line has a query and ground_truth.
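
As an illustration of that format, a test file could be loaded and checked as sketched below. The `query` and `ground_truth` keys come from the description above; the function names and the choice to skip blank lines are assumptions.

```python
import json

REQUIRED_KEYS = ("query", "ground_truth")


def parse_test_cases(lines):
    """Parse JSONL lines into dicts, checking that the required keys are present."""
    cases = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines (assumption)
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {number}: missing {missing}")
        cases.append(record)
    return cases


def load_test_cases(path):
    """Read a test JSONL file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_test_cases(handle)
```

Failing early on a malformed line keeps a bad test file from producing a partial results file.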

evaluation/data/technical-accuracy-results/

Output JSONL files produced by evaluate_agent.py. Each line has query, response, and ground_truth.
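
Writing records in that shape could be as small as the sketch below. The helper names `to_jsonl` and `write_results` are hypothetical; only the three keys are taken from the description above.

```python
import json


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


def write_results(path, records):
    """Write result records (query, response, ground_truth) to a JSONL file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_jsonl(records))
```

`ensure_ascii=False` keeps any non-ASCII characters in responses readable in the committed file.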

reports/

Markdown reports generated by evaluate_agent.py with pass rates and per-query scores.
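
A report with a pass rate and per-query scores might be rendered along these lines. This is a sketch under assumptions: the actual report layout is not documented here, `render_report` is a hypothetical name, and the 0.5 pass threshold is an arbitrary placeholder.

```python
def render_report(agent_name, scored, threshold=0.5):
    """Render a Markdown report from (query, score) pairs.

    A query passes when its score is >= threshold (placeholder value).
    """
    passed = sum(1 for _, score in scored if score >= threshold)
    total = len(scored)
    rate = passed / total if total else 0.0
    lines = [
        f"# Evaluation report: {agent_name}",
        "",
        f"Pass rate: {passed}/{total} ({rate:.0%})",
        "",
        "| Query | Score | Result |",
        "| --- | --- | --- |",
    ]
    for query, score in scored:
        verdict = "pass" if score >= threshold else "fail"
        lines.append(f"| {query} | {score:.2f} | {verdict} |")
    return "\n".join(lines) + "\n"


report = render_report("demo", [("q1", 0.9), ("q2", 0.2)])
```

Queries that contain a `|` character would need escaping before being placed in the table; the sketch leaves that out.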
