hyoka

An evaluation tool for AI agents. hyoka runs prompts through AI agents via the GitHub Copilot SDK, scores the generated output against extensible grader criteria, and produces detailed pass/fail reports with workspace delta tracking.

Installation

Prerequisites:

Go 1.26.1+
GitHub Copilot CLI — must be installed and authenticated (setup guide)

From source (recommended for development):

git clone https://github.com/ronniegeraghty/hyoka.git
cd hyoka
go install ./...
hyoka version

From GitHub (latest release):

go install github.com/ronniegeraghty/hyoka@latest
hyoka version

Quick Start

Run your first evaluation in under 5 minutes:

# 1. List available prompts
hyoka list --service key-vault --language python

# 2. Run a single evaluation
hyoka run \
  --prompt-id key-vault-dp-python-crud \
  --config baseline/claude-opus-4.6

# 3. View results in your browser
hyoka serve
# Open http://localhost:8080

What just happened? hyoka:

1. Loaded the key-vault-dp-python-crud prompt
2. Spawned a Copilot session with Claude Opus 4.6
3. Captured the agent's output
4. Evaluated it against 5 grader types (builder, complexity, prompt adherence, behavior, AI review)
5. Produced a pass/fail report with detailed grading breakdown

Examples

Sample Prompt

Prompts are markdown files with YAML frontmatter. Here's a minimal example:

---
id: key-vault-dp-python-crud
properties:
  service: key-vault
  plane: data-plane
  language: python
  category: crud
  difficulty: basic
  description: Create, read, update, and delete secrets using Azure Key Vault SDK for Python
  sdk_package: azure-keyvault-secrets
  doc_url: https://learn.microsoft.com/python/api/overview/azure/keyvault-secrets-readme
  created: '2025-01-15'
  author: ronniegeraghty
tags:
- secrets
- crud
---

Write a Python script that demonstrates CRUD operations for Azure Key Vault secrets. The script should:

1. Create a secret with name and value
2. Retrieve the secret value
3. Update the secret value
4. Delete the secret
5. Use DefaultAzureCredential for authentication

See more: Prompt Authoring Guide
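
Since a prompt file combines YAML frontmatter with a markdown body, any loader first has to separate the two. The sketch below shows one way to do that; it is not hyoka's own loader. The function name `split_frontmatter` is hypothetical, and the code assumes the frontmatter is delimited by lines containing only `---`, as in the sample above.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a prompt file into (frontmatter, body).

    Assumes the file starts with a line containing only '---' and that
    the frontmatter ends at the next such line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter: everything is the body
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter opened with '---' but never closed")


# Shortened version of the sample prompt above.
sample = """---
id: key-vault-dp-python-crud
properties:
  service: key-vault
  language: python
---

Write a Python script that demonstrates CRUD operations.
"""

frontmatter, body = split_frontmatter(sample)
```

The returned frontmatter string could then be handed to a YAML parser (for example PyYAML's `yaml.safe_load`) to read `id`, `properties`, and `tags`.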

Sample Config

Configs define the generator model, reviewer models, and available tools:

configs:
  - name: baseline/claude-opus-4.6
    description: "Claude Opus 4.6 with no MCP or skills"
    generator:
      model: "claude-opus-4.6"
    reviewer:
      models:
        - "claude-opus-4.6"
        - "gpt-5.3-codex"

See more: Configuration Guide

Sample Criteria

Criteria files define grading rules matched by prompt attributes:

attributes:
  service: key-vault
  language: python
criteria:
  - name: SDK Package Import
    type: code_pattern
    pattern: 'from azure\.keyvault\.secrets import'
    weight: 0.15
  - name: DefaultAzureCredential Usage
    type: code_pattern
    pattern: 'DefaultAzureCredential\(\)'
    weight: 0.20

See more: Grader Configuration Schema
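
To make the `code_pattern` criteria concrete, here is a rough sketch of how regex patterns and weights like the ones above could be applied to generated code. The patterns and weights are copied from the sample; the scoring rule (matched weight divided by total weight) and the name `evaluate_patterns` are assumptions for illustration, and hyoka's real graders may combine weights differently.

```python
import re

# Criteria copied from the sample above; the scoring rule below is an assumption.
CRITERIA = [
    {"name": "SDK Package Import",
     "pattern": r"from azure\.keyvault\.secrets import", "weight": 0.15},
    {"name": "DefaultAzureCredential Usage",
     "pattern": r"DefaultAzureCredential\(\)", "weight": 0.20},
]


def evaluate_patterns(code: str, criteria: list[dict]) -> dict:
    """Return per-criterion pass/fail plus the matched share of total weight."""
    results = {c["name"]: re.search(c["pattern"], code) is not None
               for c in criteria}
    total = sum(c["weight"] for c in criteria)
    matched = sum(c["weight"] for c in criteria if results[c["name"]])
    return {"results": results, "score": matched / total if total else 0.0}


# A fragment of the kind of code an agent might generate for the sample prompt.
generated = (
    "from azure.identity import DefaultAzureCredential\n"
    "from azure.keyvault.secrets import SecretClient\n"
    "credential = DefaultAzureCredential()\n"
)
report = evaluate_patterns(generated, CRITERIA)
```

Here both patterns match, so every entry in `report["results"]` is true. Code that only constructs the credential would match the second criterion alone and earn a partial score.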

Commands

hyoka provides commands for running evaluations, browsing results, and managing prompts. For complete flag documentation, see CLI Reference.

| Command | Description |
| --- | --- |
| `hyoka run` | Run evaluations against prompts with specified configs |
| `hyoka list` | List prompts matching filter criteria |
| `hyoka serve` | Launch local web UI for browsing reports |
| `hyoka compare` | Compare evaluation results (configs, runs, or time periods) |
| `hyoka init` | Scaffold a `.hyoka` project directory |
| `hyoka validate` | Validate prompt frontmatter against schema |
| `hyoka check-env` | Check for required language toolchains and tools |
| `hyoka clean` | Remove stale session state and orphaned processes |

Filtering prompts:

# By service
hyoka run --service key-vault --config baseline/claude-opus-4.6

# By language
hyoka run --language python --config baseline/claude-opus-4.6

# Combine filters (AND logic)
hyoka run --service storage --language dotnet --plane data-plane \
  --config baseline/claude-opus-4.6

# Single prompt by ID
hyoka run --prompt-id storage-dp-dotnet-auth \
  --config baseline/claude-opus-4.6

# Dry run — list matches without executing
hyoka run --service storage --config baseline/claude-opus-4.6 --dry-run

See all commands and flags: CLI Reference
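
The AND logic used when filters are combined can be pictured with a small sketch. This is an illustration only, not hyoka's code: the `PROMPTS` list is a hypothetical in-memory index, the property values for `storage-dp-dotnet-auth` are inferred from its ID, and the helper names `matches` and `select` are made up.

```python
def matches(properties: dict, filters: dict) -> bool:
    """True when every supplied filter equals the prompt's property (AND logic)."""
    return all(properties.get(key) == value for key, value in filters.items())


# Hypothetical in-memory prompt index; the real tool reads prompt files from disk.
PROMPTS = [
    {"id": "key-vault-dp-python-crud",
     "properties": {"service": "key-vault", "plane": "data-plane",
                    "language": "python"}},
    {"id": "storage-dp-dotnet-auth",
     "properties": {"service": "storage", "plane": "data-plane",
                    "language": "dotnet"}},
]


def select(prompts: list[dict], **filters: str) -> list[str]:
    """Return the IDs of prompts that satisfy all filters."""
    return [p["id"] for p in prompts if matches(p["properties"], filters)]


selected = select(PROMPTS, service="storage", language="dotnet", plane="data-plane")
```

Each additional filter can only narrow the result: `select(PROMPTS, plane="data-plane")` returns both IDs, while adding `service="storage"` leaves one.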

Safety & Guardrails

hyoka includes built-in protections that keep evaluation runs safe, bounded, and predictable by default.

Generator Guardrails

Every evaluation session is automatically aborted if it exceeds any of these limits:

| Limit | Default | Flag | Purpose |
| --- | --- | --- | --- |
| Session actions | 50 | `--max-session-actions` | Limits reasoning, response, and tool call actions per session |
| File count | 50 | `--max-files` | Prevents excessive file creation (counts new files + deleted starters) |

Prompts can override defaults via frontmatter. Resolution order: prompt frontmatter > config YAML > CLI flag > engine default.
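
The resolution order and the file-count rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's implementation: the key names `max_session_actions` and `max_files` are hypothetical stand-ins for whatever the frontmatter and config actually use, and both function names are made up.

```python
from typing import Optional

# Engine defaults taken from the table above; the key names are hypothetical.
ENGINE_DEFAULTS = {"max_session_actions": 50, "max_files": 50}


def resolve_limit(name: str,
                  frontmatter: Optional[dict] = None,
                  config: Optional[dict] = None,
                  cli: Optional[dict] = None) -> int:
    """Return the first value that is set, in the documented order:
    prompt frontmatter > config YAML > CLI flag > engine default."""
    for source in (frontmatter, config, cli):
        if source and source.get(name) is not None:
            return source[name]
    return ENGINE_DEFAULTS[name]


def changed_file_count(starters: set, current: set) -> int:
    """New files plus deleted starter files, as described for the file-count limit."""
    return len(current - starters) + len(starters - current)


# Config YAML outranks the CLI flag, so the limit resolves to 30.
limit = resolve_limit("max_files", config={"max_files": 30}, cli={"max_files": 80})

# Two new files (app.py, util.py) plus one deleted starter (README.md) gives 3.
count = changed_file_count({"README.md", "main.py"}, {"main.py", "app.py", "util.py"})
```

A session would then be aborted once `count` exceeds `limit`, or once the action counter exceeds its own resolved limit.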

Safety Boundaries

By default, generators prevent real Azure resource provisioning. Agents use:

Mock data, environment variables, and local emulators
Bicep/ARM/Terraform templates instead of live az CLI commands
Placeholder values like os.Getenv("AZURE_STORAGE_CONNECTION_STRING")

Use --allow-cloud to permit real resource provisioning.

Other Protections

Fan-out confirmation: Asks for confirmation when more than 10 evaluations would run (skip with -y)
Process lifecycle: Auto-terminates spawned Copilot processes on completion or interrupt (SIGTERM → SIGKILL)
Smart concurrency: Defaults to CPU core count (capped at 8); --max-sessions limits concurrent instances
Prompt discovery: Clear error messages with near-miss detection for misnamed files

See more: Guardrails Documentation
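
Three of the protections above are simple enough to sketch. The code below is an illustration of the described behavior, not hyoka's implementation. All function names are hypothetical, and the 5-second grace period between the termination request and the forced kill is an assumption, since the text does not state one.

```python
import os
import subprocess
import sys


def default_max_sessions(cpu_count=None, cap=8):
    """CPU core count capped at 8, as described above."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cores, cap))


def needs_confirmation(evaluations, assume_yes=False, threshold=10):
    """Ask before running more than `threshold` evaluations unless -y was given."""
    return evaluations > threshold and not assume_yes


def stop_process(proc, grace_seconds=5.0):
    """Request a graceful stop (SIGTERM on POSIX), then force-kill
    (SIGKILL on POSIX) if the process is still alive after the grace period."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.terminate()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


# Demo: start a long-sleeping child process and shut it down.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(child)
```

Escalating from a termination request to a forced kill gives a well-behaved child the chance to clean up while still guaranteeing that nothing is left running.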

Contributing

We welcome contributions! To get started:

Read the contributing guide: CONTRIBUTING.md
Check the architecture docs: docs/architecture.md
Browse open issues: GitHub Issues

Quick development loop:

# Clone and build
git clone https://github.com/ronniegeraghty/hyoka.git
cd hyoka
go build ./...

# Run tests
go test -race ./...

# Test site (frontend); the subshell keeps your shell in the repo root
(cd site && npm test)

# Rebuild the site bundle (embedded via go:embed).
# Required whenever you change anything under site/src/**: a CI check
# (site-bundle-freshness) will fail the PR if site/dist/ is stale.
(cd site && npm run build)

# Test with a live eval (fastest feedback)
hyoka run --prompt-id key-vault-dp-python-crud \
  --config baseline/claude-opus-4.6

See also:

Roadmap — completed phases and what's planned
CLI Reference — all commands and flags
Configuration Guide — config file format and options

License

MIT
