Migrated from spboyer/waza#385
Summary
Design a shared eval and grader registry for waza — inspired by OpenAI Evals' 800+ community registry but adapted for agent evaluation.
Context
Per docs/research/waza-vs-openai-evals.md, the lack of a shared registry is waza's #1 competitive disadvantage (Row 10 of that comparison). This epic covers the full registry design.
Sub-issues
Peter's Ideas (verbatim)
- The registry of shared evals is interesting. Graders particularly.
- OpenAI's are all in their repo as YAML files
- Consuming their format could be interesting
- Go module style: just point to a repo and that is your grader or eval
- Being able to construct your eval from a set of known graders is interesting
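To make the "Go module style" idea concrete, here is a minimal sketch of how a grader or eval reference might be parsed into a repository, a sub-path, and an optional pinned version. The reference syntax `host/owner/repo[/sub/path][@version]`, the `GraderRef` type, and the `ParseGraderRef` function are hypothetical assumptions for illustration, not an existing waza API.

```go
package main

import (
	"fmt"
	"strings"
)

// GraderRef is a hypothetical parsed form of a Go-module-style reference
// such as "github.com/example/waza-graders/regex@v1.2.0".
type GraderRef struct {
	Repo    string // host/owner/repo, e.g. "github.com/example/waza-graders"
	Path    string // sub-path inside the repo; empty means the repo root
	Version string // tag or commit after "@"; empty means unpinned
}

// ParseGraderRef splits a reference of the assumed form
// host/owner/repo[/sub/path][@version] into its parts.
func ParseGraderRef(ref string) (GraderRef, error) {
	body, version, hasAt := strings.Cut(ref, "@")
	if hasAt && version == "" {
		return GraderRef{}, fmt.Errorf("reference %q has an empty version after @", ref)
	}
	parts := strings.Split(body, "/")
	if len(parts) < 3 {
		return GraderRef{}, fmt.Errorf("reference %q needs at least host/owner/repo", ref)
	}
	for _, p := range parts {
		if p == "" {
			return GraderRef{}, fmt.Errorf("reference %q contains an empty path segment", ref)
		}
	}
	return GraderRef{
		Repo:    strings.Join(parts[:3], "/"),
		Path:    strings.Join(parts[3:], "/"),
		Version: version,
	}, nil
}

func main() {
	ref, err := ParseGraderRef("github.com/example/waza-graders/regex@v1.2.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Prints: repo=github.com/example/waza-graders path=regex version=v1.2.0
	fmt.Printf("repo=%s path=%s version=%s\n", ref.Repo, ref.Path, ref.Version)
}
```

Pinning with `@version` mirrors how `go.mod` pins dependencies; how unpinned references resolve, and how fetched graders are cached, would be questions for the design document.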
Deliverable
Design document at docs/research/waza-eval-registry-design.md — design only, no implementation.
Non-goals (for now)
- Implementation — this is design research only
- NOT a single JSON file for the registry; the registry needs a more robust structure
- Not building the actual CLI commands yet