Summary
The evaluator in #14 assumes GitHub-hosted runners calling provider APIs directly. That works for cloud-hosted models (Anthropic, OpenAI, Google) but not for our local models served by MindRouter, which sits behind a VPN. Additionally, running all models inside a GitHub Actions job is constrained by the 6-hour job timeout: fine for most cloud models, potentially tight for larger local-model matrices.
Proposal: stand up a small authenticated web service on our internal VM (the one with a publicly-reachable port) that the GitHub Action calls as a webhook trigger. The service runs the evaluator asynchronously against MindRouter-served local models, then commits and pushes leaderboard.json updates back to the repo when finished. The GitHub Action fires and exits in seconds — no timeout exposure.
Architecture
The VM already holds local clones of both AI4RA/prompt-library and AI4RA/evaluation-data-sets. The webhook payload is therefore trivially small — just a skill name and a replicate count — because the VM has everything else it needs on disk.
Key property: almost nothing moves over the wire. Prompts, fixtures, ground truth, and schemas never cross the webhook boundary — the VM reads them from its own clones. The GitHub Action sends ~50 bytes of JSON; the VM replies with a job ID.
Components
1. Webhook service (new, lives on internal VM)
Small FastAPI / Flask / Axum app — language is an implementation detail.
Endpoints:
POST /evaluate — body: { skill_name, replicates }. Validates auth, enqueues job, returns 202 with { job_id }. No git_sha or models list — the VM pulls HEAD of both repos and evaluates against every local model configured in .evaluator/models.yaml for which MindRouter has a backend.
GET /jobs/<id> — returns job status (queued / running / completed / failed) and result summary.
GET /health — unauthenticated liveness probe.
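The endpoint contract above can be sketched framework-free, so it reads the same whether the service ends up on FastAPI, Flask, or something else. This is a minimal sketch, not the service itself: `JobStore`, `handle_evaluate`, `handle_job_status`, the `job-N` ID format, and the 400/404 responses are all assumptions made for illustration; only the 202 response, the `{ job_id }` body, and the four status values come from the text.

```python
import itertools
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class JobStatus(str, Enum):
    # The four states named for GET /jobs/<id>.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    skill_name: str
    replicates: int
    status: JobStatus = JobStatus.QUEUED
    summary: Optional[dict] = None


class JobStore:
    """Hypothetical in-memory job registry (state is lost on restart)."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}
        self._ids = itertools.count(1)

    def create(self, skill_name: str, replicates: int) -> Job:
        job = Job(f"job-{next(self._ids)}", skill_name, replicates)
        self._jobs[job.job_id] = job
        return job

    def get(self, job_id: str) -> Optional[Job]:
        return self._jobs.get(job_id)


def handle_evaluate(store: JobStore, body: dict) -> Tuple[int, dict]:
    """POST /evaluate: validate the body, register the job, answer 202."""
    skill = body.get("skill_name")
    replicates = body.get("replicates")
    if not isinstance(skill, str) or not skill:
        return 400, {"error": "skill_name must be a non-empty string"}
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(replicates, bool) or not isinstance(replicates, int) or replicates < 1:
        return 400, {"error": "replicates must be a positive integer"}
    job = store.create(skill, replicates)
    return 202, {"job_id": job.job_id}


def handle_job_status(store: JobStore, job_id: str) -> Tuple[int, dict]:
    """GET /jobs/<id>: report status and result summary."""
    job = store.get(job_id)
    if job is None:
        return 404, {"error": "unknown job"}
    return 200, {"job_id": job.job_id, "status": job.status.value, "summary": job.summary}
```

Keeping the handlers as plain functions that return `(status_code, body)` makes the contract testable without binding to a web framework early.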
Worker (same process or a sidecar queue — decide during impl):
git -C <prompt-library> pull --ff-only and same for the evaluation-data-sets clone.
For each (case, replicate, model) combination, calls MindRouter.
Writes leaderboard.json entries locally, commits with a service account on a new branch eval-results/<skill>-<timestamp>, pushes, and opens a PR via gh pr create.
Because the VM pulls HEAD at the start of the job rather than a fixed SHA, there is a small race: if a new commit lands between trigger and pull, the eval captures the newer state. Acceptable tradeoff for keeping the payload tiny; revisit if it causes confusion.
2. GitHub Action trigger
For each changed component, send one POST /evaluate with { skill_name, replicates }. One request per skill — the VM handles the full model matrix locally.
Payload is ~50 bytes. No repo clone on the runner, no large artifacts shipped. A shell step with curl is the entire action (per the curl-only sketch in the comment thread).
Workflow exits within seconds. The only evidence of success a contributor sees is a follow-up PR opened by the webhook service with the updated leaderboard.json.
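3. Shared scoring library
Treat the shared scoring library as the contract: both execution paths (tools/evaluate_skill.py for cloud models on GitHub-hosted runners, and the webhook worker for local models) must produce identical leaderboard.json entries for identical inputs.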
Concurrency
The webhook service runs exactly one evaluation at a time — no queue, no worker pool, single-slot.
Idle: accepts a POST /evaluate, pulls the repos immediately, kicks off the eval, returns 202.
Busy: a second POST /evaluate returns 409 Conflict with body { busy: true, current_job: { skill, started_at } }.
The corresponding GitHub Action's curl receives a non-2xx response and fails red. No retry. No queue. This is a deliberate design choice — it eliminates the SHA/HEAD race entirely (the git pull happens at receive time, matching the triggering commit if Actions are serialized) and keeps VM state trivial.
Cost of this model: if two version-bumping commits land within the eval window (~10 minutes each), the second one's Action fails and that commit never gets evaluated unless someone re-triggers.
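The single-slot behaviour described above can be sketched with one lock-protected field. This is a sketch under assumptions: the class name `SingleSlot`, its method names, and the UUID job ID are hypothetical; the 202 and 409 responses and the `{ busy, current_job: { skill, started_at } }` body follow the text.

```python
import threading
import uuid
from datetime import datetime, timezone
from typing import Optional, Tuple


class SingleSlot:
    """Admit at most one evaluation at a time; reject the rest with 409."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current: Optional[dict] = None

    def try_start(self, skill: str) -> Tuple[int, dict]:
        # The lock makes the check-and-set atomic across request threads.
        with self._lock:
            if self._current is not None:
                return 409, {"busy": True, "current_job": dict(self._current)}
            self._current = {
                "skill": skill,
                "started_at": datetime.now(timezone.utc).isoformat(),
            }
            return 202, {"job_id": uuid.uuid4().hex}

    def finish(self) -> None:
        # Call from the worker when the eval completes or fails.
        with self._lock:
            self._current = None
```

The worker would need to call `finish()` in a `finally` block; otherwise a crashed evaluation leaves the slot permanently busy.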
Mitigations (pick one during implementation, optional):
workflow_dispatch for manual re-trigger (see the dedicated section below — it's a first-class entrypoint, not an afterthought).
Nightly sweeper workflow that scans the last 24h of merged commits, identifies MAJOR/MINOR bumps whose components have no matching leaderboard.json entry, and re-fires the webhook for each.
curl --retry with exponential backoff. Not recommended: the Action stays running during retry, reintroducing the timeout risk that the whole webhook design was meant to avoid.
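The core of the nightly sweeper option is a set difference between version-bumping commits and commits that already have a leaderboard entry. The sketch below assumes data shapes that the issue does not specify: each bump is a dict with `skill_name` and `git_sha`, and each leaderboard entry carries `skill_name` and `prompt_commit`. The function name and the default replicate count are also hypothetical.

```python
from typing import Dict, Iterable, List


def find_unevaluated(
    bumps: Iterable[Dict[str, str]],
    leaderboard_entries: Iterable[Dict[str, str]],
    replicates: int = 10,
) -> List[dict]:
    """Return one webhook payload per MAJOR/MINOR bump with no matching entry."""
    # Assumed entry shape: {"skill_name": ..., "prompt_commit": ...}.
    evaluated = {(e["skill_name"], e["prompt_commit"]) for e in leaderboard_entries}
    payloads = []
    for bump in bumps:
        if (bump["skill_name"], bump["git_sha"]) not in evaluated:
            payloads.append(
                {
                    "skill_name": bump["skill_name"],
                    "replicates": replicates,
                    "git_sha": bump["git_sha"],
                }
            )
    return payloads
```

Because the service is single-slot, a sweeper built this way would have to send the resulting payloads one at a time and wait for each job to finish before sending the next.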
Reproducibility: triggering SHA in the payload
Even with busy-rejection eliminating the queue race, two small things motivate passing the triggering commit SHA in the payload:
Manual workflow_dispatch triggers that target a specific historical commit.
The prompt_commit field (from #14's schema) should name the exact commit evaluated, not "whatever HEAD was when the VM pulled."
Updated payload (optional field):
{
"skill_name": "sponsor-doc-defaults-udm",
"replicates": 10,
"git_sha": "abc123..." // optional; defaults to HEAD at pull time
}
When git_sha is present the VM does git fetch && git checkout <sha> (detached HEAD, read-only eval), writes its result, then returns to main before the commit-back step on a fresh branch. When absent, the VM does git pull --ff-only on main and records whatever SHA that resolved to. Payload grows from ~50 bytes to ~90 bytes — still trivial.
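The two checkout paths can be sketched as a function that returns the git invocations instead of running them, which keeps the branching logic easy to test. The function names and the repo path are hypothetical; the git commands are the ones named in the text, and the no-SHA path assumes the clone is already on main.

```python
from typing import List, Optional


def checkout_commands(repo_path: str, git_sha: Optional[str]) -> List[List[str]]:
    """Commands to run before the eval, as argv lists suitable for subprocess.run."""
    git = ["git", "-C", repo_path]
    if git_sha:
        # Checking out a raw SHA leaves the clone in detached-HEAD state.
        return [git + ["fetch"], git + ["checkout", git_sha]]
    # No SHA given: fast-forward main (assumes the clone is currently on main).
    return [git + ["pull", "--ff-only"]]


def restore_commands(repo_path: str, git_sha: Optional[str]) -> List[List[str]]:
    """Commands to run after the eval and before the commit-back step."""
    if git_sha:
        return [["git", "-C", repo_path, "checkout", "main"]]
    return []


def resolved_sha_command(repo_path: str) -> List[str]:
    """Command whose output is the SHA to record as the evaluated commit."""
    return ["git", "-C", repo_path, "rev-parse", "HEAD"]
```

Running `resolved_sha_command` after either path gives the worker one place to capture the evaluated commit, regardless of whether `git_sha` was supplied.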
Manual trigger (workflow_dispatch)
The trigger workflow must expose a workflow_dispatch entrypoint so any contributor can re-fire an eval from the Actions tab without needing a new commit. This is the primary recovery path when an Action fails red because the VM was busy.
Inputs:
skill_name (required, string): must match a components/<name>/ folder.
replicates (optional, integer, default 10): same meaning as #14.
git_sha (optional, string): the commit to check out on the VM. Defaults to current main HEAD when omitted.
Sketch:
on:
  workflow_dispatch:
    inputs:
      skill_name:
        description: Component to evaluate (must match components/<name>/)
        required: true
        type: string
      replicates:
        description: Replicate runs per fixture
        required: false
        default: "10"
        type: string
      git_sha:
        description: Commit SHA to evaluate (defaults to main HEAD)
        required: false
        type: string
  # automatic trigger for MAJOR/MINOR version bumps lives alongside this
The body curl assembles the payload from the inputs:
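As a rough Python equivalent of that payload assembly (a sketch only, not the workflow's actual step): workflow_dispatch inputs arrive as strings, so replicates is converted to an integer, and git_sha is omitted when empty so the VM falls back to HEAD. The function name `build_payload` is hypothetical.

```python
import json
from typing import Optional


def build_payload(skill_name: str, replicates: str = "10", git_sha: Optional[str] = None) -> bytes:
    """Assemble the POST /evaluate body from workflow_dispatch-style string inputs."""
    payload = {"skill_name": skill_name, "replicates": int(replicates)}
    if git_sha:
        # Only include the optional field when a SHA was actually supplied.
        payload["git_sha"] = git_sha
    # Compact separators keep the body as small as possible.
    return json.dumps(payload, separators=(",", ":")).encode("utf-8")
```

An unset optional input reaches the workflow as an empty string, which is why the check is on truthiness and not on `None` alone.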
Use cases covered:
Recovery: an earlier Action failed red because the VM was busy; manually dispatch with the same skill_name to re-run.
Historical eval: evaluate an older commit by passing git_sha.
Bumping replicates on demand: increase replicates for a component whose reproducibility is suspect without changing any defaults.
Smoke-testing the webhook: dispatch against any skill after redeploying the VM service.
Auth
Shared secret between GitHub Actions and the webhook service. Store in GitHub Actions secret MINDROUTER_EVAL_WEBHOOK_TOKEN.
Webhook service verifies Authorization: Bearer <token> header. Constant-time compare.
HTTPS required. If the VM's public port doesn't already have TLS, use a cloudflared tunnel, Tailscale Funnel, or a Let's Encrypt cert — do not run plain HTTP.
Optional hardening: IP allowlist to GitHub Actions IP ranges (rotate with github/meta periodically).
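The bearer-token check with a constant-time compare can be sketched with the standard library's `hmac.compare_digest`. The function name and the decision to treat the scheme case-insensitively are assumptions; how the expected token is loaded on the VM is left open.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header value."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented.encode("utf-8"), expected_token.encode("utf-8"))
```

Rejecting an empty `expected_token` up front guards against a misconfigured deployment where the secret failed to load and any empty header would otherwise match.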
Commit-back mechanism
Two viable shapes — decide before implementation:
Option A (direct push to main): fastest, minimal plumbing. Webhook uses a deploy key or fine-grained PAT with repo contents: write. Downside: no review gate; bad leaderboard entries ship immediately.
Option B (PR): webhook pushes to a branch eval-results/<skill>-<git_sha>-<timestamp> and opens a PR tagged for auto-merge or with a skip-CI label. Cleaner audit trail; slight lag. This is probably the right choice given that the Pages dashboard in #15 will render whatever lands on main.
Recommend Option B.
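Option B's commit-back step can be sketched as a branch-name helper plus the command sequence the worker would run. Several details here are assumptions, not decisions from the issue: the 7-character short SHA, the timestamp format, the commit message, the PR title and body, and the function names. The git and `gh pr create` flags used are standard ones.

```python
from datetime import datetime, timezone
from typing import List


def result_branch(skill: str, git_sha: str, now: datetime) -> str:
    """Build eval-results/<skill>-<git_sha>-<timestamp> (short SHA is an assumption)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"eval-results/{skill}-{git_sha[:7]}-{stamp}"


def commit_back_commands(repo_path: str, branch: str, skill: str) -> List[List[str]]:
    """argv lists for committing leaderboard.json and opening the PR."""
    git = ["git", "-C", repo_path]
    message = f"eval: update leaderboard for {skill}"
    return [
        git + ["checkout", "-b", branch],
        git + ["add", "leaderboard.json"],
        git + ["commit", "-m", message],
        git + ["push", "-u", "origin", branch],
        # gh must run inside the clone (cwd=repo_path) to target the right repo.
        ["gh", "pr", "create", "--base", "main", "--head", branch,
         "--title", message, "--body", "Automated evaluation results."],
    ]
```

Including a timestamp in the branch name means re-evaluating the same commit never collides with an earlier, still-open results branch.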
Relationship to other issues
evaluation-data-sets#1: the webhook service reads fixtures from that repo; no new convention needed.
Acceptance criteria
Webhook service deployed to internal VM and reachable on the public port with HTTPS + token auth.
Successful end-to-end run: GitHub Action sends { skill_name, replicates } → webhook pulls both repos → runs evaluator against at least one MindRouter-served model → opens PR with updated leaderboard.json → PR merges cleanly.
Shared scoring library extracted (common module between tools/evaluate_skill.py and the webhook worker).
workflow_dispatch entrypoint wired with skill_name (required), replicates (optional, default 10), and git_sha (optional) inputs.
Auth token rotation documented.
Service logs retained (location + retention policy — decide).
README updated with the webhook endpoint URL, auth model, and how to add a new local model via .evaluator/models.yaml.
Open questions
Where does the webhook service live in this repo? Options: new top-level dir (services/eval-webhook/), separate repo, or internal-only repo. Recommend in-repo for traceability.
Queue/worker backing: single-process in-memory queue (simple, crash-loses-state) vs. lightweight broker (Redis/SQLite). Start in-memory; revisit when concurrency becomes real.
Concurrency limit: how many simultaneous evaluations can the VM handle without swamping MindRouter? Probably a config knob, default 1.
Failure reporting: when a webhook run fails, how does the contributor find out? Options: GitHub Issue opened by the service, PR comment on the triggering PR, Slack webhook. Pick one.
Service account / bot identity for the commits back: dedicated GitHub App vs. machine-user PAT. App is cleaner; PAT is faster to set up.