Run evaluator on local models via MindRouter — internal VM webhook with async commit-back #16

## Summary

The evaluator in #14 assumes GitHub-hosted runners calling provider APIs directly. That works for cloud-hosted models (Anthropic, OpenAI, Google) but not for our **local models served by MindRouter**, which sits behind a VPN. Additionally, running all models inside a GitHub Actions job is constrained by the 6-hour job timeout on GitHub-hosted runners — fine for most cloud models, potentially tight for larger local-model matrices.

Proposal: stand up a small authenticated web service on our internal VM (the one with a publicly-reachable port) that the GitHub Action calls as a **webhook trigger**. The service runs the evaluator asynchronously against MindRouter-served local models, then commits and pushes `leaderboard.json` updates back to the repo when finished. The GitHub Action fires and exits in seconds — no timeout exposure.

## Architecture

The VM already holds local clones of both `AI4RA/prompt-library` and `AI4RA/evaluation-data-sets`. The webhook payload is therefore trivially small — just a skill name and a replicate count — because the VM has everything else it needs on disk.

```
GitHub Action (trigger)                Internal VM (eval-webhook service)           MindRouter              GitHub
──────────────────────                 ─────────────────────────────────            ──────────              ──────
  POST /evaluate ────────────────▶  auth check → start job → return 202
  { skill_name, replicates }            │
                                        ├─ git -C prompt-library pull
                                        ├─ git -C evaluation-data-sets pull
                                        │
                                        ├─ for each (case, replicate, model):
                                        │    POST MindRouter ────────────────▶
                                        │    ◀──────────── response ──────────
                                        │
                                        ├─ score with deterministic Python
                                        ├─ update local leaderboard.json
                                        ├─ git checkout -b eval-results/...
                                        ├─ git commit + push ───────────────────────────────────────────▶
                                        └─ gh pr create ───────────────────────────────────────────────▶
```

**Key property: almost nothing moves over the wire.** Prompts, fixtures, ground truth, and schemas never cross the webhook boundary — the VM reads them from its own clones. The GitHub Action sends ~50 bytes of JSON; the VM replies with a job ID.

## Components

### 1. Webhook service (new, lives on internal VM)

- Small FastAPI / Flask / Axum app — language is an implementation detail.
- **Endpoints:**
  - `POST /evaluate` — body: `{ skill_name, replicates }`. Validates auth, starts the job if the service is idle, and returns `202` with `{ job_id }`; a busy service returns `409` instead (see Concurrency). No `models` list, and `git_sha` is optional (see Reproducibility): by default the VM pulls HEAD of both repos and evaluates against every local model configured in `.evaluator/models.yaml` for which MindRouter has a backend.
  - `GET /jobs/<id>` — returns job status (`queued` / `running` / `completed` / `failed`) and result summary.
  - `GET /health` — unauthenticated liveness probe.
- **Worker** (same process or a sidecar queue — decide during impl):
  - `git -C <prompt-library> pull --ff-only` and same for the evaluation-data-sets clone.
  - Reads fixtures from the local evaluation-data-sets checkout (same convention as #14).
  - For each `(case, replicate, model)` combination, calls MindRouter.
  - Scores outputs with the same Python scoring library used in #14 (must be shared — do not reimplement metrics).
  - Writes `leaderboard.json` entries locally, commits with a service account on a new branch `eval-results/<skill>-<timestamp>`, pushes, and opens a PR via `gh pr create`.
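The worker's inner loop over `(case, replicate, model)` can be sketched as follows. This is a minimal sketch, not the service's actual code: `run_matrix` and the `call_model` callable are hypothetical names standing in for whatever client the service uses to reach MindRouter, and fixtures are modeled as plain strings.

```python
from itertools import product
from typing import Callable, Iterable

# Hypothetical signature: (model, case) -> raw model output.
ModelCall = Callable[[str, str], str]


def run_matrix(
    cases: Iterable[str],
    replicates: int,
    models: Iterable[str],
    call_model: ModelCall,
) -> list:
    """Call the model once per (case, replicate, model) and collect raw outputs."""
    results = []
    for case, replicate, model in product(cases, range(replicates), models):
        results.append(
            {
                "case": case,
                "replicate": replicate,
                "model": model,
                "output": call_model(model, case),
            }
        )
    return results
```

Keeping the model call behind a plain callable means the loop can be exercised with a stub, without MindRouter or VPN access.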

Because the VM pulls HEAD at the start of the job rather than a fixed SHA, there is a small race: if a new commit lands between trigger and pull, the eval captures the newer state. Acceptable tradeoff for keeping the payload tiny; revisit if it causes confusion.

### 2. GitHub Action trigger

- New workflow that fires on MAJOR/MINOR version bumps (same trigger policy as #14).
- For each changed component, send one `POST /evaluate` with `{ skill_name, replicates }`. One request per skill — the VM handles the full model matrix locally.
- Payload is ~50 bytes. No repo clone on the runner, no large artifacts shipped. A shell step with curl is the entire action (per the curl-only sketch in the comment thread).
- Workflow exits within seconds. The only evidence of success a contributor sees is a follow-up PR opened by the webhook service with the updated `leaderboard.json`.

### 3. Shared scoring library

- #14 should factor scoring (accuracy + reproducibility metrics) into a library module importable from both:
  - `tools/evaluate_skill.py` (cloud models, GitHub-hosted runners)
  - the webhook service (local models, internal VM)
- Treat it as the contract — both execution paths must produce identical `leaderboard.json` entries for identical inputs.
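One way to make the "identical entries for identical inputs" contract testable is to keep scoring a pure function and to serialize entries canonically. The sketch below is illustrative only: the real metric definitions and leaderboard schema belong to #14, so `score_replicates`, `canonical_entry`, both placeholder metrics, and every entry field other than `prompt_commit` are assumptions.

```python
import json
from collections import Counter


def score_replicates(outputs: list, expected: str) -> dict:
    """Placeholder metrics (assumed for illustration, not #14's definitions)."""
    if not outputs:
        raise ValueError("need at least one replicate output")
    n = len(outputs)
    # Accuracy: share of replicates that exactly match the ground truth.
    accuracy = sum(o == expected for o in outputs) / n
    # Reproducibility: share of replicates that agree with the most common output.
    reproducibility = Counter(outputs).most_common(1)[0][1] / n
    return {"accuracy": accuracy, "reproducibility": reproducibility}


def canonical_entry(skill: str, model: str, prompt_commit: str, scores: dict) -> str:
    """Serialize with sorted keys so both execution paths emit identical bytes."""
    entry = {"skill": skill, "model": model, "prompt_commit": prompt_commit, **scores}
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))
```

With a shape like this, a single test that feeds the same fixture outputs through both execution paths and compares the serialized strings would enforce the contract.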

## Concurrency

The webhook service runs **exactly one evaluation at a time** — no queue, no worker pool, single-slot.

- Idle: accepts a `POST /evaluate`, pulls the repos immediately, kicks off the eval, returns `202`.
- Busy: a second `POST /evaluate` returns `409 Conflict` with body `{ busy: true, current_job: { skill, started_at } }`.

The corresponding GitHub Action's curl receives a non-2xx response and **fails red**. No retry. No queue. This is a deliberate design choice — it eliminates the SHA/HEAD race entirely (the `git pull` happens at receive time, matching the triggering commit if Actions are serialized) and keeps VM state trivial.
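The idle/busy behavior described above can be sketched with a lock-guarded slot. This is a minimal sketch under assumptions: `SingleSlot`, `try_start`, `finish`, and the `job_id` format are hypothetical names, and the returned tuple is only the status code and body that a `POST /evaluate` handler could send.

```python
import threading
from datetime import datetime, timezone


class SingleSlot:
    """Admit at most one evaluation at a time; reject everything else."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = None  # {"skill": ..., "started_at": ...} while busy

    def try_start(self, skill: str) -> tuple:
        """Return (status_code, body) for a POST /evaluate request."""
        with self._lock:
            if self._current is not None:
                # Busy: report the running job instead of queueing.
                return 409, {"busy": True, "current_job": dict(self._current)}
            started_at = datetime.now(timezone.utc).isoformat()
            self._current = {"skill": skill, "started_at": started_at}
            # Hypothetical job-id scheme; the issue does not define one.
            return 202, {"job_id": f"{skill}-{started_at}"}

    def finish(self) -> None:
        """Free the slot when the evaluation completes or fails."""
        with self._lock:
            self._current = None
```

The worker would need to call `finish()` in a `finally` block so that a crashed evaluation does not leave the service permanently busy.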

Cost of this model: if two version-bumping commits land within the same eval window (~10 minutes per evaluation), the second one's Action fails and that commit never gets evaluated unless someone re-triggers.

Mitigations (pick one during implementation, optional):

- `workflow_dispatch` for manual re-trigger (see the dedicated section below — it's a first-class entrypoint, not an afterthought).
- Nightly sweeper workflow that scans the last 24h of merged commits, identifies MAJOR/MINOR bumps whose components have no matching `leaderboard.json` entry, and re-fires the webhook for each.
- Curl `--retry` with exponential backoff. **Not recommended** — the Action stays running during retry, reintroducing timeout risk that the whole webhook design was meant to avoid.
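The core of the nightly sweeper option is a set difference between recent version bumps and existing leaderboard entries. The sketch below assumes hypothetical data shapes: each bump is a dict with `skill` and `sha` keys, and each leaderboard entry carries `skill` and `prompt_commit`. How bumps are extracted from the last 24h of git history is left out.

```python
def find_unevaluated(bumps: list, leaderboard: list) -> list:
    """Return bumps that have no leaderboard entry for the same skill and commit."""
    evaluated = {(e["skill"], e["prompt_commit"]) for e in leaderboard}
    missing = []
    seen = set()
    for bump in bumps:
        key = (bump["skill"], bump["sha"])
        # Skip bumps already evaluated, and de-duplicate repeated bumps.
        if key in evaluated or key in seen:
            continue
        seen.add(key)
        missing.append(bump)
    return missing
```

Each returned bump would map to one `POST /evaluate` carrying that `skill` and `sha`, sent one at a time so the sweeper itself does not trip the busy rejection.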

## Reproducibility: triggering SHA in the payload

Even with busy-rejection eliminating the queue race, two small things motivate passing the triggering commit SHA in the payload:

1. Manual `workflow_dispatch` triggers that target a specific historical commit.
2. The leaderboard's `prompt_commit` field (from #14's schema) should name the exact commit evaluated, not "whatever HEAD was when the VM pulled."

Updated payload (`git_sha` is optional and defaults to HEAD at pull time):

```json
{
  "skill_name": "sponsor-doc-defaults-udm",
  "replicates": 10,
  "git_sha": "abc123..."
}
```

When `git_sha` is present the VM does `git fetch && git checkout <sha>` (detached HEAD, read-only eval), writes its result, then returns to `main` before the commit-back step on a fresh branch. When absent, the VM does `git pull --ff-only` on `main` and records whatever SHA that resolved to. Payload grows from ~50 bytes to roughly 100 bytes with a full 40-character SHA, which is still trivial.
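The two checkout paths can be expressed as a small planning function that returns the git commands to run. This is a sketch under assumptions: `checkout_plan` is a hypothetical helper, the hex-only SHA check is an added safeguard not specified in this issue, and `git rev-parse HEAD` is one way to record the SHA that was actually evaluated.

```python
import re
from typing import Optional

# Accept abbreviated or full hex SHAs only, so the value cannot be read as a git option.
_SHA_RE = re.compile(r"[0-9a-f]{7,40}")


def checkout_plan(repo_dir: str, git_sha: Optional[str] = None) -> list:
    """Return the git commands (as argv lists) to run before the evaluation."""
    git = ["git", "-C", repo_dir]
    if git_sha is None:
        # Default path: fast-forward main, then record the resolved SHA.
        return [git + ["pull", "--ff-only"], git + ["rev-parse", "HEAD"]]
    if not _SHA_RE.fullmatch(git_sha):
        raise ValueError(f"not a hex commit SHA: {git_sha!r}")
    # Pinned path: fetch, then check out the commit (detached HEAD).
    return [git + ["fetch"], git + ["checkout", git_sha], git + ["rev-parse", "HEAD"]]
```

Each argv list can be passed to `subprocess.run(cmd, check=True)`; building argv lists rather than a shell string keeps the SHA out of any shell parsing.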

## Manual trigger (`workflow_dispatch`)

The trigger workflow must expose a `workflow_dispatch` entrypoint so any contributor can re-fire an eval from the Actions tab without needing a new commit. This is the primary recovery path when an Action fails red because the VM was busy.

Inputs:

- `skill_name` (required, string): must match a `components/<name>/` folder.
- `replicates` (optional, integer, default **10**): same meaning as #14.
- `git_sha` (optional, string): the commit to check out on the VM. Defaults to current `main` HEAD when omitted.

Sketch:

```yaml
on:
  workflow_dispatch:
    inputs:
      skill_name:
        description: Component to evaluate (must match components/<name>/)
        required: true
        type: string
      replicates:
        description: Replicate runs per fixture
        required: false
        default: 10
        type: number
      git_sha:
        description: Commit SHA to evaluate (defaults to main HEAD)
        required: false
        type: string
  # automatic trigger for MAJOR/MINOR version bumps lives alongside this
```

The `curl` step assembles the payload from the inputs:

```yaml
      - name: Trigger webhook
        run: |
          # Inputs arrive through env vars rather than inline expression expansion,
          # so a crafted input value cannot inject shell syntax into this script.
          curl -fsS -X POST "https://eval.internal.ai4ra/evaluate" \
            -H "Authorization: Bearer ${MINDROUTER_EVAL_WEBHOOK_TOKEN}" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
                  --arg skill "$SKILL_NAME" \
                  --argjson replicates "$REPLICATES" \
                  --arg sha "$GIT_SHA" \
                  '{skill_name: $skill, replicates: $replicates} + (if $sha == "" then {} else {git_sha: $sha} end)')"
        env:
          MINDROUTER_EVAL_WEBHOOK_TOKEN: ${{ secrets.MINDROUTER_EVAL_WEBHOOK_TOKEN }}
          SKILL_NAME: ${{ inputs.skill_name }}
          REPLICATES: ${{ inputs.replicates || 10 }}
          GIT_SHA: ${{ inputs.git_sha }}
```
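On the receiving side, the service would need to validate the same three fields before acting on them. The following is a minimal sketch, assuming a hypothetical `parse_payload` helper and an assumed lowercase-hyphenated naming rule for skills; the real check for `skill_name` is that a matching `components/<name>/` folder exists.

```python
import re

_SKILL_RE = re.compile(r"[a-z0-9][a-z0-9-]*")  # assumed naming rule, not from the issue
_SHA_RE = re.compile(r"[0-9a-f]{7,40}")


def parse_payload(body: dict) -> dict:
    """Validate a decoded POST /evaluate body and apply the replicates default."""
    skill = body.get("skill_name")
    if not isinstance(skill, str) or not _SKILL_RE.fullmatch(skill):
        raise ValueError("skill_name must be a lowercase, hyphenated name")
    replicates = body.get("replicates", 10)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(replicates, bool) or not isinstance(replicates, int) or replicates < 1:
        raise ValueError("replicates must be a positive integer")
    parsed = {"skill_name": skill, "replicates": replicates}
    sha = body.get("git_sha")
    if sha is not None:
        if not isinstance(sha, str) or not _SHA_RE.fullmatch(sha):
            raise ValueError("git_sha must be a 7-40 character hex string")
        parsed["git_sha"] = sha
    return parsed
```

Rejecting anything outside these patterns also keeps path separators and option-like strings away from the filesystem lookups and git commands that run later.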

Use cases covered:

- **Recovery**: an earlier Action failed red because the VM was busy; manually dispatch with the same `skill_name` to re-run.
- **Historical eval**: evaluate an older commit by passing `git_sha`.
- **Bumping replicates on demand**: increase `replicates` for a component whose reproducibility is suspect without changing any defaults.
- **Smoke-testing the webhook**: dispatch against any skill after redeploying the VM service.

## Auth

- Shared secret between GitHub Actions and the webhook service. Store in GitHub Actions secret `MINDROUTER_EVAL_WEBHOOK_TOKEN`.
- Webhook service verifies `Authorization: Bearer <token>` header. Constant-time compare.
- HTTPS required. If the VM's public port doesn't already have TLS, use a cloudflared tunnel, Tailscale Funnel, or a Let's Encrypt cert — do not run plain HTTP.
- Optional hardening: IP allowlist restricted to GitHub Actions IP ranges (refresh periodically from the `actions` key of [github/meta](https://api.github.com/meta)).
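The bearer-token check with a constant-time compare can be sketched with the standard library's `hmac.compare_digest`. `is_authorized` is a hypothetical helper name; how the header is read depends on the web framework chosen.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check an `Authorization: Bearer <token>` header in constant time."""
    prefix = "Bearer "
    if not expected_token:
        # Refuse everything if the service was started without a token.
        return False
    if not authorization_header or not authorization_header.startswith(prefix):
        return False
    presented = authorization_header[len(prefix):]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected_token.encode("utf-8")
    )
```

A handler would return `401` when this is false, before touching the job slot or the repos.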

## Commit-back mechanism

Two viable shapes — decide before implementation:

- **Option A (direct push to `main`):** fastest, minimal plumbing. Webhook uses a deploy key or fine-grained PAT with repo contents: write. Downside: no review gate; bad leaderboard entries ship immediately.
- **Option B (PR):** webhook pushes to a branch `eval-results/<skill>-<git_sha>-<timestamp>` and opens a PR tagged for auto-merge or with a skip-CI label. Cleaner audit trail; slight lag. This is probably the right choice given the Pages dashboard in #15 will render whatever lands on main.

Recommend Option B.
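For Option B, the branch name can be derived deterministically from the job's inputs. This is a sketch under assumptions: `results_branch` is a hypothetical helper, and both the 7-character SHA abbreviation and the compact UTC timestamp format are illustrative choices (a compact format avoids `:`, which git does not allow in ref names).

```python
from datetime import datetime, timezone
from typing import Optional


def results_branch(skill: str, git_sha: str, now: Optional[datetime] = None) -> str:
    """Build eval-results/<skill>-<git_sha>-<timestamp> for the commit-back PR."""
    # `now` is expected to be timezone-aware; it is normalized to UTC.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return f"eval-results/{skill}-{git_sha[:7]}-{stamp}"
```

Including the timestamp keeps repeated evaluations of the same skill and commit from colliding on a branch name.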

## Relationship to other issues

- **#14** (Python evaluator): defines the scoring library and leaderboard schema that this issue reuses. The webhook service is a *different deployment* of the same evaluation logic — not a fork.
- **#15** (Pages dashboard): reads whatever ends up in the repo. Unaffected by how results get there.
- **evaluation-data-sets#1**: the webhook service reads fixtures from that repo; no new convention needed.

## Acceptance criteria

- [ ] Webhook service deployed to internal VM and reachable on the public port with HTTPS + token auth.
- [ ] Successful end-to-end run: GitHub Action sends `{ skill_name, replicates }` → webhook pulls both repos → runs evaluator against at least one MindRouter-served model → opens PR with updated `leaderboard.json` → PR merges cleanly.
- [ ] Shared scoring library extracted (common module between `tools/evaluate_skill.py` and the webhook worker).
- [ ] `workflow_dispatch` entrypoint wired with `skill_name` (required), `replicates` (optional, default 10), and `git_sha` (optional) inputs.
- [ ] Auth token rotation documented.
- [ ] Service logs retained (location + retention policy — decide).
- [ ] README updated with the webhook endpoint URL, auth model, and how to add a new local model via `.evaluator/models.yaml`.

## Open questions

- **Where does the webhook service live in this repo?** Options: new top-level dir (`services/eval-webhook/`), separate repo, or internal-only repo. Recommend in-repo for traceability.
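- **Queue/worker backing:** the single-slot design in Concurrency needs no queue. If that is ever relaxed, choose between a single-process in-memory queue (simple, but a crash loses state) and a lightweight broker (Redis/SQLite). Start in-memory; revisit when concurrency becomes real.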
- **Concurrency limit:** how many simultaneous evaluations can the VM handle without swamping MindRouter? Probably a config knob, default 1.
- **Failure reporting:** when a webhook run fails, how does the contributor find out? Options: GitHub Issue opened by the service, PR comment on the triggering PR, Slack webhook. Pick one.
- **Service account / bot identity for the commits back:** dedicated GitHub App vs. machine-user PAT. App is cleaner; PAT is faster to set up.



