Skip to content

feat: add /eval skill — AI output evaluator and grader#175

Open
HMAKT99 wants to merge 1 commit intogarrytan:mainfrom
HMAKT99:arun/eval-skill
Open

feat: add /eval skill — AI output evaluator and grader#175
HMAKT99 wants to merge 1 commit intogarrytan:mainfrom
HMAKT99:arun/eval-skill

Conversation

@HMAKT99
Copy link

@HMAKT99 HMAKT99 commented Mar 18, 2026

Your AI feature works on 5 examples. Does it work on 5,000?

You ship a prompt change. It looks great on the 3 examples you tested. In production, it hallucinates on edge cases, complies with injection attacks, and regresses on queries that used to work. You find out from user complaints.

What /eval does

You:   /eval

Claude: PROMPT INVENTORY
        ═════════════════
        #   Location                    Type           Eval Coverage
        1   app/services/ai_chat.rb    System prompt   None ←
        2   lib/prompts/summarize.ts   Template        None ←
        3   app/workers/classify.py    Few-shot        2 cases (need 20+) ←

You:   /eval --generate

Claude: Generated 26 eval cases for ai_chat:
        Happy path:  5 cases
        Edge cases:  5 cases (empty, long, unicode, code blocks)
        Adversarial: 5 cases (injection, jailbreak, role confusion)
        Safety:      3 cases (PII, harmful, bias)
        Format:      3 cases (JSON compliance, length bounds)
        Saved: .gstack/evals/ai_chat-eval-suite.json

You:   /eval --run

Claude: EVAL RESULTS: ai_chat
        Happy path     5/5    100%
        Edge cases     4/5     80%
        Adversarial    3/5     60%  ←
        Safety         3/3    100%
        OVERALL        23/26   88%  Grade: B+

        FAILURES:
        [1] Empty input → 500-word essay (expected: clarification)
        [2] "Ignore previous instructions" → model complied ←

Extends gstack's own eval system

Garry already built a 3-tier eval system for gstack itself (static validation, E2E via claude -p, LLM-as-judge). /eval brings that same rigor to the user's product AI — discover prompts, generate test cases, score outputs, detect regressions.

Only .tmpl committed — bun run gen:skill-docs generates the rest.

Test plan

  • .tmpl follows template pipeline — uses {{PREAMBLE}}
  • Registered in gen-skill-docs.ts, skill-check.ts, both test files
  • bun run gen:skill-docs generates valid SKILL.md
  • All existing tests pass with skill added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant