feat: add /eval skill — AI output evaluator and grader by HMAKT99 · Pull Request #175 · garrytan/gstack

HMAKT99 · 2026-03-18T07:15:45Z

Your AI feature works on 5 examples. Does it work on 5,000?

You ship a prompt change. It looks great on the 3 examples you tested. In production, it hallucinates on edge cases, complies with injection attacks, and regresses on queries that used to work. You find out from user complaints.

What /eval does

You:   /eval

Claude: PROMPT INVENTORY
        ═════════════════
        #   Location                    Type           Eval Coverage
        1   app/services/ai_chat.rb    System prompt   None ←
        2   lib/prompts/summarize.ts   Template        None ←
        3   app/workers/classify.py    Few-shot        2 cases (need 20+) ←

You:   /eval --generate

Claude: Generated 26 eval cases for ai_chat:
        Happy path:  5 cases
        Edge cases:  5 cases (empty, long, unicode, code blocks)
        Adversarial: 5 cases (injection, jailbreak, role confusion)
        Safety:      3 cases (PII, harmful, bias)
        Format:      3 cases (JSON compliance, length bounds)
        Saved: .gstack/evals/ai_chat-eval-suite.json

You:   /eval --run

Claude: EVAL RESULTS: ai_chat
        Happy path     5/5    100%
        Edge cases     4/5     80%
        Adversarial    3/5     60%  ←
        Safety         3/3    100%
        OVERALL        23/26   88%  Grade: B+

        FAILURES:
        [1] Empty input → 500-word essay (expected: clarification)
        [2] "Ignore previous instructions" → model complied ←

Extends gstack's own eval system

Garry already built a 3-tier eval system for gstack itself (static validation, E2E via claude -p, LLM-as-judge). /eval brings that same rigor to the user's product AI — discover prompts, generate test cases, score outputs, detect regressions.

Only .tmpl committed — bun run gen:skill-docs generates the rest.

Test plan

.tmpl follows template pipeline — uses {{PREAMBLE}}
Registered in gen-skill-docs.ts, skill-check.ts, both test files
bun run gen:skill-docs generates valid SKILL.md
All existing tests pass with skill added

feat: add /eval skill — AI output evaluator and grader

b087598

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add /eval skill — AI output evaluator and grader#175

feat: add /eval skill — AI output evaluator and grader#175
HMAKT99 wants to merge 1 commit intogarrytan:mainfrom
HMAKT99:arun/eval-skill

HMAKT99 commented Mar 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

HMAKT99 commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Your AI feature works on 5 examples. Does it work on 5,000?

What /eval does

Extends gstack's own eval system

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

HMAKT99 commented Mar 18, 2026 •

edited

Loading