Run evaluator on local models via MindRouter — internal VM webhook with async commit-back #16

@nate-layman

Summary

The evaluator in #14 assumes GitHub-hosted runners calling provider APIs directly. That works for cloud-hosted models (Anthropic, OpenAI, Google) but not for our local models served by MindRouter, which sits behind VPN. Additionally, running all models inside a GitHub Actions job is constrained by the 6-hour job timeout — fine for most cloud models, potentially tight for larger local-model matrices.

Proposal: stand up a small authenticated web service on our internal VM (the one with a publicly-reachable port) that the GitHub Action calls as a webhook trigger. The service runs the evaluator asynchronously against MindRouter-served local models, then commits and pushes leaderboard.json updates back to the repo when finished. The GitHub Action fires and exits in seconds — no timeout exposure.

Architecture

The VM already holds local clones of both AI4RA/prompt-library and AI4RA/evaluation-data-sets. The webhook payload is therefore trivially small — just a skill name and a replicate count — because the VM has everything else it needs on disk.

GitHub Action (trigger)                Internal VM (eval-webhook service)           MindRouter              GitHub
──────────────────────                 ─────────────────────────────────            ──────────              ──────
  POST /evaluate ────────────────▶  auth check → enqueue job → return 202
  { skill_name, replicates }            │
                                        ├─ git -C prompt-library pull
                                        ├─ git -C evaluation-data-sets pull
                                        │
                                        ├─ for each (case, replicate, model):
                                        │    POST MindRouter ────────────────▶
                                        │    ◀──────────── response ──────────
                                        │
                                        ├─ score with deterministic Python
                                        ├─ update local leaderboard.json
                                        ├─ git checkout -b eval-results/...
                                        ├─ git commit + push ───────────────────────────────────────────▶
                                        └─ gh pr create ───────────────────────────────────────────────▶

Key property: almost nothing moves over the wire. Prompts, fixtures, ground truth, and schemas never cross the webhook boundary — the VM reads them from its own clones. The GitHub Action sends ~50 bytes of JSON; the VM replies with a job ID.
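The loop in the diagram issues one MindRouter request per (case, replicate, model) combination, so the request count is the product of the three dimensions. A minimal sketch of that enumeration follows; the fixture and model names are placeholders, and `eval_matrix` is a hypothetical helper rather than anything in the existing evaluator.

```python
from itertools import product


def eval_matrix(cases, replicates, models):
    """Return an iterator of (case, replicate, model), one per MindRouter request."""
    return product(cases, range(1, replicates + 1), models)


# Hypothetical fixture and model names, for illustration only.
cases = ["case-a", "case-b", "case-c"]
models = ["local-model-1", "local-model-2"]

calls = list(eval_matrix(cases, 10, models))
print(len(calls))  # 3 cases * 10 replicates * 2 models = 60
print(calls[0])    # ('case-a', 1, 'local-model-1')
```

Because the count grows multiplicatively, adding a model or raising replicates lengthens the run on the VM while leaving the webhook payload unchanged.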

Components

1. Webhook service (new, lives on internal VM)

  • Small FastAPI / Flask / Axum app — language is an implementation detail.
  • Endpoints:
    • POST /evaluate — body: { skill_name, replicates }. Validates auth, enqueues job, returns 202 with { job_id }. No git_sha or models list — the VM pulls HEAD of both repos and evaluates against every local model configured in .evaluator/models.yaml for which MindRouter has a backend.
    • GET /jobs/<id> — returns job status (queued / running / completed / failed) and result summary.
    • GET /health — unauthenticated liveness probe.
  • Worker (same process or a sidecar queue — decide during impl):

Because the VM pulls HEAD at the start of the job rather than a fixed SHA, there is a small race: if a new commit lands between trigger and pull, the eval captures the newer state. Acceptable tradeoff for keeping the payload tiny; revisit if it causes confusion.
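Since skill_name ends up as a path component on the VM, POST /evaluate probably wants to validate the body before doing anything else. The sketch below is one way to do that with the standard library only. The slug pattern, the `MAX_REPLICATES` ceiling, and the function name are assumptions for illustration and not part of the proposal.

```python
import re

# Assumption: skill names are lowercase slugs such as "sponsor-doc-defaults-udm".
SKILL_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_REPLICATES = 100  # hypothetical ceiling, not taken from the issue


def validate_evaluate_body(body):
    """Return (skill_name, replicates) or raise ValueError for a bad body."""
    if not isinstance(body, dict):
        raise ValueError("body must be a JSON object")
    skill = body.get("skill_name")
    # fullmatch rejects "../" and "/" so the name is safe to join into a path.
    if not isinstance(skill, str) or not SKILL_RE.fullmatch(skill):
        raise ValueError("skill_name must be a lowercase slug")
    reps = body.get("replicates")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(reps, bool) or not isinstance(reps, int):
        raise ValueError("replicates must be an integer")
    if not 1 <= reps <= MAX_REPLICATES:
        raise ValueError("replicates out of range")
    return skill, reps


print(validate_evaluate_body({"skill_name": "sponsor-doc-defaults-udm", "replicates": 10}))
```

A ValueError here would map to a 400 response in whichever framework is chosen.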

2. GitHub Action trigger

  • New workflow that fires on MAJOR/MINOR version bumps (same trigger policy as the Python evaluation runner in #14).
  • For each changed component, send one POST /evaluate with { skill_name, replicates }. One request per skill — the VM handles the full model matrix locally.
  • Payload is ~50 bytes. No repo clone on the runner, no large artifacts shipped. A shell step with curl is the entire action (per the curl-only sketch in the comment thread).
  • Workflow exits within seconds. The only evidence of success a contributor sees is a follow-up PR opened by the webhook service with the updated leaderboard.json.
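The MAJOR/MINOR trigger policy can be sketched as a comparison of two version strings. This assumes components carry MAJOR.MINOR.PATCH versions; where the version is stored and how the workflow reads the old and new values are left open, and both function names are hypothetical.

```python
def bump_kind(old, new):
    """Classify the change from `old` to `new` (both MAJOR.MINOR.PATCH strings)."""
    o = tuple(int(part) for part in old.split("."))
    n = tuple(int(part) for part in new.split("."))
    if len(o) != 3 or len(n) != 3:
        raise ValueError("expected MAJOR.MINOR.PATCH")
    if n <= o:
        return "none"  # unchanged or downgraded: never a trigger
    if n[0] > o[0]:
        return "major"
    if n[1] > o[1]:
        return "minor"
    return "patch"


def should_trigger(old, new):
    """True when the bump should fire POST /evaluate."""
    return bump_kind(old, new) in ("major", "minor")


print(should_trigger("1.2.3", "1.3.0"))  # True
print(should_trigger("1.2.3", "1.2.4"))  # False
```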

3. Shared scoring library

Concurrency

The webhook service runs exactly one evaluation at a time — no queue, no worker pool, single-slot.

  • Idle: accepts a POST /evaluate, pulls the repos immediately, kicks off the eval, returns 202.
  • Busy: a second POST /evaluate returns 409 Conflict with body { busy: true, current_job: { skill, started_at } }.

The corresponding GitHub Action's curl receives a non-2xx response and fails red. No retry. No queue. This is a deliberate design choice — it eliminates the SHA/HEAD race entirely (the git pull happens at receive time, matching the triggering commit if Actions are serialized) and keeps VM state trivial.
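A minimal sketch of the single-slot behaviour, assuming the service keeps its state in memory and guards it with a lock. The class and method names are hypothetical; the status codes and response shapes follow the idle/busy description above.

```python
import threading
import uuid
from datetime import datetime, timezone


class SingleSlot:
    """Hold at most one running evaluation; reject the rest."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = None  # dict describing the running job, or None

    def try_start(self, skill):
        """Return (http_status, response_body) for a POST /evaluate."""
        with self._lock:
            if self._current is not None:
                busy = {"skill": self._current["skill"],
                        "started_at": self._current["started_at"]}
                return 409, {"busy": True, "current_job": busy}
            job_id = uuid.uuid4().hex
            self._current = {
                "job_id": job_id,
                "skill": skill,
                "started_at": datetime.now(timezone.utc).isoformat(),
            }
            return 202, {"job_id": job_id}

    def finish(self):
        """Free the slot once the eval ends, whether it succeeded or failed."""
        with self._lock:
            self._current = None


slot = SingleSlot()
first_status, first_body = slot.try_start("sponsor-doc-defaults-udm")
second_status, second_body = slot.try_start("another-skill")
slot.finish()
third_status, _ = slot.try_start("another-skill")
print(first_status, second_status, third_status)  # 202 409 202
```

The worker would need to call `finish()` in a `finally` block so a crashed eval does not leave the service permanently busy.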

Cost of this model: if two version-bumping commits land within the same eval window (each eval takes roughly 10 minutes), the second commit's Action fails and that commit is never evaluated unless someone re-triggers.

Mitigations (pick one during implementation, optional):

  • workflow_dispatch for manual re-trigger (see the dedicated section below — it's a first-class entrypoint, not an afterthought).
  • Nightly sweeper workflow that scans the last 24h of merged commits, identifies MAJOR/MINOR bumps whose components have no matching leaderboard.json entry, and re-fires the webhook for each.
  • curl --retry with exponential backoff. Not recommended: the Action stays running during retries, which reintroduces the timeout risk the webhook design is meant to avoid.
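The core of the nightly sweeper option is a set difference between recent bumps and what the leaderboard already records. A rough sketch, assuming each leaderboard entry exposes the skill name under a hypothetical "skill" key alongside the prompt_commit field from #14's schema:

```python
def missing_evals(bumps, leaderboard_entries):
    """Return bumps with no leaderboard entry for the same skill and commit.

    bumps: iterable of (skill_name, commit_sha) for MAJOR/MINOR bumps merged
        in the last 24h.
    leaderboard_entries: iterable of dicts. "skill" is a hypothetical key
        name; "prompt_commit" is the field referenced for #14's schema.
    """
    evaluated = {(e["skill"], e["prompt_commit"]) for e in leaderboard_entries}
    return [tuple(b) for b in bumps if tuple(b) not in evaluated]


# Illustrative data only.
bumps = [("skill-a", "sha1"), ("skill-b", "sha2")]
entries = [{"skill": "skill-a", "prompt_commit": "sha1"}]
print(missing_evals(bumps, entries))  # [('skill-b', 'sha2')]
```

Each returned pair would become one re-fired POST /evaluate, subject to the same busy-rejection as any other trigger.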

Reproducibility: triggering SHA in the payload

Even with busy-rejection eliminating the queue race, two small things motivate passing the triggering commit SHA in the payload:

  1. Manual workflow_dispatch triggers that target a specific historical commit.
  2. The leaderboard's prompt_commit field (from #14's schema) should name the exact commit evaluated, not "whatever HEAD was when the VM pulled."

Updated payload (optional field):

{
  "skill_name": "sponsor-doc-defaults-udm",
  "replicates": 10,
  "git_sha": "abc123..."   // optional — defaults to HEAD at pull time
}

When git_sha is present the VM does git fetch && git checkout <sha> (detached HEAD, read-only eval), writes its result, then returns to main before the commit-back step on a fresh branch. When absent, the VM does git pull --ff-only on main and records whatever SHA that resolved to. Payload grows from ~50 bytes to ~90 bytes — still trivial.
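The two checkout paths can be expressed as a small planner that returns the git commands to run, which keeps the branching logic testable without touching a real clone. This is a sketch under assumptions: the function name is hypothetical, the requirement for a full 40-character SHA is an extra safety check not stated in the proposal, and the final `git rev-parse HEAD` is one way to record the SHA that was actually evaluated.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")  # assumption: SHA-1 object names only


def checkout_plan(repo_dir, git_sha=None):
    """Return the git argv lists to run before the eval, in order."""
    git = ["git", "-C", repo_dir]
    if git_sha is None:
        # No SHA in the payload: fast-forward main and evaluate its HEAD.
        steps = [git + ["checkout", "main"], git + ["pull", "--ff-only"]]
    else:
        if not FULL_SHA.fullmatch(git_sha):
            raise ValueError("git_sha must be a full 40-character hex SHA")
        # Pinned SHA: fetch, then check out the commit as a detached HEAD.
        steps = [git + ["fetch"], git + ["checkout", "--detach", git_sha]]
    # Either way, record the commit that was actually evaluated.
    steps.append(git + ["rev-parse", "HEAD"])
    return steps


for argv in checkout_plan("prompt-library"):
    print(" ".join(argv))
```

The worker would run each list with `subprocess.run(argv, check=True)` and store the output of the last command as prompt_commit.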

Manual trigger (workflow_dispatch)

The trigger workflow must expose a workflow_dispatch entrypoint so any contributor can re-fire an eval from the Actions tab without needing a new commit. This is the primary recovery path when an Action fails red because the VM was busy.

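Inputs:

  • skill_name (required, string): component to evaluate; must match components/<name>/.
  • replicates (optional, string, default "10"): replicate runs per fixture.
  • git_sha (optional, string): commit SHA to evaluate; defaults to main HEAD.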

Sketch:

on:
  workflow_dispatch:
    inputs:
      skill_name:
        description: Component to evaluate (must match components/<name>/)
        required: true
        type: string
      replicates:
        description: Replicate runs per fixture
        required: false
        default: "10"
        type: string
      git_sha:
        description: Commit SHA to evaluate (defaults to main HEAD)
        required: false
        type: string
  # automatic trigger for MAJOR/MINOR version bumps lives alongside this

The curl step assembles the request body from the inputs:

      - name: Trigger webhook
        run: |
          # Inputs are passed via env, not inlined, so their contents cannot inject shell.
          curl -fsS -X POST "https://eval.internal.ai4ra/evaluate" \
            -H "Authorization: Bearer ${MINDROUTER_EVAL_WEBHOOK_TOKEN}" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
                  --arg skill "$SKILL_NAME" \
                  --argjson replicates "$REPLICATES" \
                  --arg sha "$GIT_SHA" \
                  '{skill_name: $skill, replicates: $replicates} + (if $sha == "" then {} else {git_sha: $sha} end)')"
        env:
          MINDROUTER_EVAL_WEBHOOK_TOKEN: ${{ secrets.MINDROUTER_EVAL_WEBHOOK_TOKEN }}
          SKILL_NAME: ${{ inputs.skill_name }}
          REPLICATES: ${{ inputs.replicates || 10 }}
          GIT_SHA: ${{ inputs.git_sha }}

Use cases covered:

  • Recovery: an earlier Action failed red because the VM was busy; manually dispatch with the same skill_name to re-run.
  • Historical eval: evaluate an older commit by passing git_sha.
  • Bumping replicates on demand: increase replicates for a component whose reproducibility is suspect without changing any defaults.
  • Smoke-testing the webhook: dispatch against any skill after redeploying the VM service.

Auth

  • Shared secret between GitHub Actions and the webhook service. Store in GitHub Actions secret MINDROUTER_EVAL_WEBHOOK_TOKEN.
  • Webhook service verifies Authorization: Bearer <token> header. Constant-time compare.
  • HTTPS required. If the VM's public port doesn't already have TLS, use a cloudflared tunnel, Tailscale Funnel, or a Let's Encrypt cert — do not run plain HTTP.
  • Optional hardening: IP allowlist restricted to GitHub Actions IP ranges (refresh periodically from the GitHub meta API, api.github.com/meta).
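The bearer check with a constant-time compare might look like the following. `hmac.compare_digest` is the standard-library primitive for this; the function name and the way the header value reaches it are assumptions that depend on the framework chosen.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Check an `Authorization: Bearer <token>` header value in constant time."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode("utf-8"),
                               expected_token.encode("utf-8"))


print(is_authorized("Bearer s3cret", "s3cret"))  # True
print(is_authorized("Bearer wrong", "s3cret"))   # False
```

The expected token would come from the VM's environment or a secrets file, never from the repo.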

Commit-back mechanism

Two viable shapes — decide before implementation:

  • Option A (direct push to main): fastest, minimal plumbing. Webhook uses a deploy key or fine-grained PAT with repo contents: write. Downside: no review gate; bad leaderboard entries ship immediately.
  • Option B (PR): webhook pushes to a branch eval-results/<skill>-<git_sha>-<timestamp> and opens a PR tagged for auto-merge or with a skip-CI label. Cleaner audit trail; slight lag. This is probably the right choice given that the Pages dashboard in #15 (leaderboard.json rendered as a Plotly.js static site on GitHub Pages) will render whatever lands on main.

Recommend Option B.
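For Option B, the branch name can be built from the values the worker already has. In this sketch the 12-character SHA prefix and the compact UTC timestamp format are assumptions, chosen so the result contains only characters that are valid in a git ref name; the function name is hypothetical.

```python
from datetime import datetime, timezone


def results_branch(skill, git_sha, now=None):
    """Build eval-results/<skill>-<git_sha>-<timestamp>."""
    now = now or datetime.now(timezone.utc)
    # Assumed format: no colons or spaces, since git rejects them in ref names.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"eval-results/{skill}-{git_sha[:12]}-{stamp}"


fixed = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(results_branch("sponsor-doc-defaults-udm", "abc123def4567890", fixed))
# eval-results/sponsor-doc-defaults-udm-abc123def456-20250102T030405Z
```

Including the timestamp keeps repeated evals of the same skill and commit from colliding on one branch.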

Relationship to other issues

Acceptance criteria

  • Webhook service deployed to internal VM and reachable on the public port with HTTPS + token auth.
  • Successful end-to-end run: GitHub Action sends { skill_name, replicates } → webhook pulls both repos → runs evaluator against at least one MindRouter-served model → opens PR with updated leaderboard.json → PR merges cleanly.
  • Shared scoring library extracted (common module between tools/evaluate_skill.py and the webhook worker).
  • workflow_dispatch entrypoint wired with skill_name (required), replicates (optional, default 10), and git_sha (optional) inputs.
  • Auth token rotation documented.
  • Service logs retained (location + retention policy — decide).
  • README updated with the webhook endpoint URL, auth model, and how to add a new local model via .evaluator/models.yaml.

Open questions

  • Where does the webhook service live in this repo? Options: new top-level dir (services/eval-webhook/), separate repo, or internal-only repo. Recommend in-repo for traceability.
  • Queue/worker backing: single-process in-memory queue (simple, crash-loses-state) vs. lightweight broker (Redis/SQLite). Start in-memory; revisit when concurrency becomes real.
  • Concurrency limit: how many simultaneous evaluations can the VM handle without swamping MindRouter? Probably a config knob, default 1.
  • Failure reporting: when a webhook run fails, how does the contributor find out? Options: GitHub Issue opened by the service, PR comment on the triggering PR, Slack webhook. Pick one.
  • Service account / bot identity for the commits back: dedicated GitHub App vs. machine-user PAT. App is cleaner; PAT is faster to set up.
