test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture #2256
…-KB fixture vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1 (< expected Q2) — producing 11/11 spurious FSM-only failures in the #1948 triage even though content/citation checks pass. Add skip_fsm_check: true (the grader's purpose-built flag for exactly this class) so the fixture validates honest out-of-KB behavior via expected_keywords + citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml. Closes follow-up (c) from the #1948 triage (#2222). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🤖 AI Code Review
Review by: groq (llama-3.3-70b-versatile)
Review of PR
🔴 IMPORTANT: Security vulnerabilities
No security vulnerabilities were found in the provided diff.
🔴 IMPORTANT: Missing error handling on network/IO operations
No network/IO operations were found in the provided diff.
🟡 WARNING: Logic bugs or incorrect assumptions
The introduction of
🟡 WARNING: Missing input validation at API boundaries
No API boundaries were found in the provided diff.
🔵 SUGGESTION: Code quality improvements, naming, maintainability
The addition of a comment explaining the reason for
✅ GOOD: Noteworthy good practices found
The use of clear and descriptive comments in the YAML file is a good practice, making it easier to understand the purpose of the fixture and the expected behavior. The reference to issue #2222 provides additional context for the change.
Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
MIRA staging gate: ✅ PASS
Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in
Rubric:
Version Gate requires a /VERSION bump for non-doc code changes (the .yaml fixture counts). Patch bump (test/fixture fix) + CHANGELOG note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🤖 AI Code Review
Review by: groq (llama-3.3-70b-versatile)
Review of PR: test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture
🔴 IMPORTANT: Security vulnerabilities
No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.
🔴 IMPORTANT: Missing error handling on network/IO operations
No network/IO operations were found in the provided diff that are missing error handling.
🟡 WARNING: Logic bugs or incorrect assumptions
The added
🟡 WARNING: Missing input validation at API boundaries
No API boundaries were found in the provided diff that are missing input validation.
🔵 SUGGESTION: Code quality improvements, naming, maintainability
The added comment in
✅ GOOD: Noteworthy good practices found
The diff includes a clear and concise description of the problem, fix, and context in the
Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
…04-skip-fsm # Conflicts: # VERSION # docs/CHANGELOG.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…04-skip-fsm # Conflicts: # VERSION
🤖 AI Code Review
Review by: groq (llama-3.3-70b-versatile)
Review of the Pull Request
🔴 IMPORTANT: Security vulnerabilities
No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.
🔴 IMPORTANT: Missing error handling
No missing error handling on network/IO operations was found in the provided diff. However, the diff only includes changes to documentation and a test fixture, so the scope for errors is limited.
🟡 WARNING: Logic bugs or incorrect assumptions
The introduction of
🟡 WARNING: Missing input validation
No missing input validation at API boundaries was found in the provided diff. Since the changes are limited to documentation and a test fixture, there are no API boundaries to validate.
🔵 SUGGESTION: Code quality improvements
The comments in
✅ GOOD: Noteworthy good practices
The use of a clear and descriptive commit message and the inclusion of a detailed explanation in the
Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
What
Adds skip_fsm_check: true to tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml.
Why
This is follow-up (c) from the #1948 eval-flakiness triage (detail: PR #2222 / docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md).
vfd_danfoss_04 is a stochastic out-of-KB edge case: the user asks about a nonexistent VLT FC 360 (real model is FC 350). The engine's Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips between Q1, Q2, and DIAGNOSIS from run to run. The grader's cp_reached_state passes at Q2 or beyond but fails at Q1, so the fixture logged 11/11 spurious FSM-only failures in the triage even though keyword + citation checks pass.
The grader already has a purpose-built flag for exactly this class:
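The grader's real implementation is not part of this PR, so the following is only a minimal sketch of how a skip flag could gate an FSM-depth checkpoint. The function name check_reached_state is a hypothetical stand-in for cp_reached_state, and the state ordering is an assumption inferred from the three states named above.

```python
from typing import Optional

# Assumed ordering, inferred only from the states named in this PR.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def check_reached_state(final_state: str, fixture: dict) -> Optional[bool]:
    """Hypothetical stand-in for the grader's cp_reached_state checkpoint.

    Returns True/False for pass/fail, or None when the check is skipped.
    """
    if fixture.get("skip_fsm_check", False):
        return None  # the fixture opted out of FSM-depth grading
    expected = fixture["expected_final_state"]
    # Pass when the run ended at the expected state or anywhere beyond it.
    return STATE_ORDER.index(final_state) >= STATE_ORDER.index(expected)


before = {"expected_final_state": "Q2"}
after = {"expected_final_state": "Q2", "skip_fsm_check": True}

for state in STATE_ORDER:
    print(state, check_reached_state(state, before), check_reached_state(state, after))
```

Under these assumptions, a run that stops at Q1 fails the unflagged fixture, while the flagged fixture is never graded on FSM depth, whichever terminal state the LLM-driven gate happens to produce.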
This fixture's honest out-of-KB behavior is validated by expected_keywords (manual, searching, documentation, …) + citation groundedness, not FSM depth. The fix mirrors the existing precedent 04_yaskawa_out_of_kb.yaml (Q1 + skip_fsm_check: true + max_turns: 3).
Scope
One-line fixture change (+ comment). No engine/grader code change. The deeper non-determinism is tracked separately as follow-up (a) — eval-suite determinism via record/replay.
Verification
yaml.safe_load parses; skip_fsm_check=True, expected_final_state=Q2, max_turns=3.
skip_fsm_check convention (04_yaskawa_out_of_kb.yaml).
🤖 Generated with Claude Code
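The parse check listed under Verification could be scripted roughly as follows. This is a sketch, not the command that was actually run: verify_fixture is a hypothetical helper, and the dict below stands in for the result of yaml.safe_load on the fixture file, using only the field values quoted in this PR.

```python
def verify_fixture(fixture: dict) -> list:
    """Collect human-readable problems; an empty list means the fields match."""
    expected = {
        "skip_fsm_check": True,
        "expected_final_state": "Q2",
        "max_turns": 3,
    }
    problems = []
    for key, want in expected.items():
        got = fixture.get(key)
        # Compare type as well as value so the string "true" does not pass for True.
        if type(got) is not type(want) or got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Stand-in for the parsed fixture (assumed shape; other fixture keys omitted).
parsed = {
    "skip_fsm_check": True,
    "expected_final_state": "Q2",
    "max_turns": 3,
}

print(verify_fixture(parsed))
```

With PyYAML installed, the stand-in dict would be replaced by yaml.safe_load applied to the contents of tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml. The strict type comparison is a design choice to catch a quoted "true" in the YAML, which would parse as a string, not a boolean.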