
feat(observe): garage-conveyor FIELD eval pack from real Langfuse questions #2185

Open
Mikecranesync wants to merge 1 commit into feat/observability-eval-layer from feat/eval-pack-garage-field


Conversation

@Mikecranesync

Owner

Why

We exported 3 months of real production questions from Langfuse (#2172). This turns the high-value ones into a real regression eval pack — the first eval grounded in actual technician questions rather than a synthetic scenario.

What

simlab/observe/evalpacks/garage_conveyor_field.yaml — 11 items (9 active, 2 held for review) for the GARAGE CONVEYOR (GS10 VFD + Micro820, UNS enterprise.garage.line1.conveyor1):

  • CE10 comms fault, reset behavior, overcurrent-on-startup, e-stop safety ("can't confirm a safety circuit remotely"), and the 500 MΩ megger trap (500 MΩ is a good insulation reading, so a model that says the motor is bad fails).
  • answer_points / unacceptable_answer_patterns encode OEM-correct substance + the hallucination/safety traps, not what MIRA historically answered (~26% of historical answers were ungrounded).
  • expected_documents / required_citations intentionally empty — the historical corpus didn't ground these, so the pack tests asset resolution, answer substance, hallucination avoidance, and safety governance. missing_citation warns honestly.
  • approvals.example.json: approves the garage asset so governance gates (incl. the safety item) pass.
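To make the answer_points / unacceptable_answer_patterns idea concrete, here is a minimal sketch of how one item such as the megger trap might be scored. It is not the simlab.observe implementation: only the four field names come from this PR, while the item id, the question wording, the regexes, the `score_answer` helper, and the "warn but don't fail" handling of missing_citation are assumptions made for illustration.

```python
import re

# Hypothetical item: only the field names answer_points,
# unacceptable_answer_patterns, expected_documents and required_citations
# come from the PR text. The id, question wording and regexes are assumptions.
MEGGER_ITEM = {
    "id": "megger_500_mohm",
    "question": "Megger reads 500 MΩ on the conveyor motor. Is the motor bad?",
    "answer_points": [r"good insulation"],
    "unacceptable_answer_patterns": [
        r"motor (is|looks|must be) (bad|failed|shorted)",
        r"replace the motor",
    ],
    "expected_documents": [],   # intentionally empty, as in the pack
    "required_citations": [],   # intentionally empty, as in the pack
}


def score_answer(item, answer, citations=()):
    """Pass only if every answer point is hit and no trap pattern matches."""
    text = answer.lower()
    points_hit = [p for p in item["answer_points"] if re.search(p, text)]
    traps_hit = [
        p for p in item["unacceptable_answer_patterns"] if re.search(p, text)
    ]
    # Assumed semantics: an uncited answer warns but does not fail the item.
    warnings = [] if citations else ["missing_citation"]
    passed = len(points_hit) == len(item["answer_points"]) and not traps_hit
    return {"passed": passed, "traps_hit": traps_hit, "warnings": warnings}
```

Under these assumptions, an answer that calls 500 MΩ good insulation passes with a missing_citation warning, while an answer that tells the technician to replace the motor fails on the trap patterns regardless of what else it says.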

Verification

`python -m simlab.observe.run_eval garage_conveyor_field` → 9/9 PASS, asset/points/citation 100%. Run --live to use it as an engine regression gate.

Notes

🤖 Generated with Claude Code

…stions

Curated from real production questions exported via tools/langfuse_export.py
(PR #2172) — the GARAGE CONVEYOR surface (GS10 VFD + Micro820, UNS
enterprise.garage.line1.conveyor1). Unlike conveyor_demo (synthetic SimLab
scenario), every question here was actually asked by a technician.

- 11 items (9 active, 2 held for review): CE10 comms fault, reset behavior,
  overcurrent-on-startup, the 500 MΩ megger trap (good insulation, NOT a bad
  motor), e-stop safety ("can't confirm a safety circuit remotely"), etc.
- answer_points / unacceptable_answer_patterns encode OEM-correct substance and
  the hallucination/safety traps these questions invite — not what MIRA happened
  to answer (the historical answers were ~26% ungrounded).
- expected_documents/required_citations intentionally empty (the corpus didn't
  ground these); the pack tests asset resolution, answer substance, hallucination
  avoidance, and safety governance. missing_citation warns honestly.
- approvals.example.json: approve enterprise.garage.line1.conveyor1 so governance
  gates (incl. the safety item) pass.
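
The approval step above could look roughly like the sketch below. The real schema of approvals.example.json and the gate logic are not shown in this commit, so the `approved_assets` key, the `governance_gate` helper, and the pass/warn/blocked outcomes are all hypothetical. Only the UNS path enterprise.garage.line1.conveyor1 is taken from the text.

```python
import json

# Hypothetical approvals file shape; the commit only says the file approves
# enterprise.garage.line1.conveyor1. The "approved_assets" key is an assumption.
APPROVALS_JSON = '{"approved_assets": ["enterprise.garage.line1.conveyor1"]}'


def governance_gate(asset_path, approvals, safety_item=False):
    """Return "pass" for approved assets; otherwise block safety items."""
    approved = asset_path in set(approvals.get("approved_assets", []))
    if approved:
        return "pass"
    # Assumed policy: unapproved safety items are blocked, others only warn.
    return "blocked" if safety_item else "warn"


approvals = json.loads(APPROVALS_JSON)
print(governance_gate("enterprise.garage.line1.conveyor1", approvals, safety_item=True))
```

The point of the sketch is only that the safety item cannot pass until the asset appears in the approvals file, which is why the example file ships with the pack.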

Verified: `python -m simlab.observe.run_eval garage_conveyor_field` → 9/9 PASS,
asset/points/citation 100%. Run --live to use it as an engine regression gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu