feat(observe): garage-conveyor FIELD eval pack from real Langfuse questions by Mikecranesync · Pull Request #2185 · Mikecranesync/MIRA

Mikecranesync · 2026-06-21T15:19:34Z

Why

We exported 3 months of real production questions from Langfuse (#2172). This turns the high-value ones into a real regression eval pack — the first eval grounded in actual technician questions rather than a synthetic scenario.

What

- simlab/observe/evalpacks/garage_conveyor_field.yaml: 11 items (9 active, 2 held for review) for the GARAGE CONVEYOR (GS10 VFD + Micro820, UNS enterprise.garage.line1.conveyor1):
  - CE10 comms fault, reset behavior, overcurrent-on-startup, e-stop safety ("can't confirm a safety circuit remotely"), and the 500 MΩ megger trap (that's good insulation; a model that says the motor is bad fails).
  - answer_points / unacceptable_answer_patterns encode OEM-correct substance + the hallucination/safety traps, not what MIRA historically answered (~26% of historical answers were ungrounded).
  - expected_documents / required_citations intentionally empty: the historical corpus didn't ground these, so the pack tests asset resolution, answer substance, hallucination avoidance, and safety governance. missing_citation warns honestly.
- approvals.example.json: approves the garage asset so governance gates (incl. the safety item) pass.
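To make the answer_points / unacceptable_answer_patterns idea concrete, here is a minimal sketch of how one item could be scored, using the megger trap as the example. Only the four field names come from this PR. The item id, the values, the substring/regex matching rules, the `citations` argument, and the `score_answer` helper are all hypothetical, and the real simlab.observe scorer may work quite differently.

```python
import re

# Hypothetical item shape. Only the field names answer_points,
# unacceptable_answer_patterns, expected_documents and required_citations
# come from the PR; the id, values and matching rules are assumptions.
MEGGER_ITEM = {
    "id": "megger-500-mohm",
    "question": "Megger reads 500 MΩ on the conveyor motor. Is the motor bad?",
    "answer_points": ["good insulation"],
    "unacceptable_answer_patterns": [
        r"motor is (bad|failed|shorted)",
        r"replace the motor",
    ],
    "expected_documents": [],   # intentionally empty, as in the pack
    "required_citations": [],   # intentionally empty, as in the pack
}


def score_answer(item, answer, citations=()):
    """Score one answer: required points must appear, trap patterns must not."""
    text = answer.lower()
    missing = [p for p in item["answer_points"] if p.lower() not in text]
    tripped = [
        pat for pat in item["unacceptable_answer_patterns"]
        if re.search(pat, text, re.IGNORECASE)
    ]
    # Assumption: an uncited answer produces a warning but never a failure.
    warnings = [] if citations else ["missing_citation"]
    return {
        "passed": not missing and not tripped,
        "missing_points": missing,
        "tripped_patterns": tripped,
        "warnings": warnings,
    }


good = score_answer(MEGGER_ITEM, "500 MΩ is good insulation; the windings are fine.")
bad = score_answer(MEGGER_ITEM, "That reading means the motor is bad, replace the motor.")
```

In this sketch the second answer fails twice over: it omits the required point and trips both trap patterns. Both answers carry a missing_citation warning without failing on it.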

Verification

python -m simlab.observe.run_eval garage_conveyor_field → 9/9 PASS, asset/points/citation 100%. Run --live to use it as an engine regression gate.
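The 9/9 figure counts only active items, so the 2 items held for review do not affect the pass rate. A small sketch of that tally, assuming held items carry an `active: false` flag like the evalseed files do; the `summarize` helper and the item ids are hypothetical and not part of simlab.observe.

```python
def summarize(items, results):
    """Tally pass/fail over active items only; held items are skipped."""
    active = [item for item in items if item.get("active", True)]
    passed = sum(1 for item in active if results[item["id"]])
    return f"{passed}/{len(active)} PASS"


# 11 hypothetical items: q0..q8 active, q9 and q10 held for review.
items = [{"id": f"q{n}", "active": n < 9} for n in range(11)]
results = {item["id"]: True for item in items}
summary = summarize(items, results)  # "9/9 PASS"
```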

Notes

Stacked on #2154 (feat(observe): production-grade observability + evaluation layer (phases 0-3)), which provides simlab/observe; the base branch is feat/observability-eval-layer. Retarget to main after #2154 merges.
691 more unique real questions sit in the export's evalseed-*.yaml (PII-scrubbed, active:false) for future packs.

🤖 Generated with Claude Code

…stions

Curated from real production questions exported via tools/langfuse_export.py (PR #2172): the GARAGE CONVEYOR surface (GS10 VFD + Micro820, UNS enterprise.garage.line1.conveyor1). Unlike conveyor_demo (synthetic SimLab scenario), every question here was actually asked by a technician.

- 11 items (9 active, 2 held for review): CE10 comms fault, reset behavior, overcurrent-on-startup, the 500 MΩ megger trap (good insulation, NOT a bad motor), e-stop safety ("can't confirm a safety circuit remotely"), etc.
- answer_points / unacceptable_answer_patterns encode OEM-correct substance and the hallucination/safety traps these questions invite, not what MIRA happened to answer (the historical answers were ~26% ungrounded).
- expected_documents/required_citations intentionally empty (the corpus didn't ground these); the pack tests asset resolution, answer substance, hallucination avoidance, and safety governance. missing_citation warns honestly.
- approvals.example.json: approve enterprise.garage.line1.conveyor1 so governance gates (incl. the safety item) pass.

Verified: `python -m simlab.observe.run_eval garage_conveyor_field` → 9/9 PASS, asset/points/citation 100%. Run --live to use it as an engine regression gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu

Mikecranesync mentioned this pull request Jun 21, 2026

Run garage_conveyor_field eval --live on Charlie (staging Neon) + report grounding numbers #2202 (Open, 5 tasks)

Mikecranesync wants to merge 1 commit into feat/observability-eval-layer from feat/eval-pack-garage-field
