test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture #2256

Merged
Mikecranesync merged 5 commits into main from fix/eval-vfd-danfoss-04-skip-fsm
Jun 22, 2026

@Mikecranesync

Owner

What

Adds skip_fsm_check: true to tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml.

Why

This is follow-up (c) from the #1948 eval-flakiness triage (detail: PR #2222 / docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md).

vfd_danfoss_04 is a stochastic out-of-KB edge case: the user asks about a nonexistent VLT FC 360 (the real model is FC 350). The engine's Q→DIAGNOSIS gate transition depends on LLM output and is therefore non-deterministic, so the terminal FSM state flips between Q1, Q2, and DIAGNOSIS from run to run. The grader's cp_reached_state check passes at Q2 or beyond but fails at Q1, so the fixture logged 11/11 spurious FSM-only failures in the triage even though the keyword and citation checks pass.
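The failure mode can be illustrated with a minimal sketch. It assumes the grader compares the terminal state against the expected one on a linear ordering Q1 < Q2 < DIAGNOSIS. The `STATE_ORDER` list and the function body are hypothetical stand-ins, not the code in tests/eval/grader.py.

```python
# Hypothetical linear ordering of FSM states; the real engine may define more.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def cp_reached_state(final_state: str, expected_state: str) -> bool:
    """Pass when the run ended at the expected state or any later one."""
    return STATE_ORDER.index(final_state) >= STATE_ORDER.index(expected_state)


# The same fixture (expected Q2) graded against each terminal state seen in the triage.
outcomes = {state: cp_reached_state(state, "Q2") for state in STATE_ORDER}
print(outcomes)  # {'Q1': False, 'Q2': True, 'DIAGNOSIS': True}
```

Under this assumption the verdict depends only on where the run happens to stop, which is why identical content can pass on one run and fail on the next.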

The grader already has a purpose-built flag for exactly this class:

skip_fsm_check: for stochastic out-of-KB scenarios where FSM state is non-deterministic but content correctness (honesty) is validated by cp_keyword_match and cp_citation_groundedness. — tests/eval/grader.py:110

This fixture's honest-out-of-KB behavior is validated by expected_keywords (manual, searching, documentation, …) + citation groundedness — not FSM depth. The fix mirrors the existing precedent 04_yaskawa_out_of_kb.yaml (Q1 + skip_fsm_check: true + max_turns: 3).
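A rough sketch of how a grader could honor the flag, for illustration only. The `grade` function, the `STATE_ORDER` list, and the "any keyword matches" rule are assumptions; the real logic lives in tests/eval/grader.py and may differ. The citation groundedness check is omitted here.

```python
from typing import Any

# Hypothetical linear ordering of FSM states.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def grade(fixture: dict[str, Any], final_state: str, reply: str) -> dict[str, bool]:
    """Always run the content check; run the FSM check unless the fixture opts out."""
    checks: dict[str, bool] = {}
    reply_lower = reply.lower()
    # Assumption: a single matching keyword is enough to pass.
    checks["cp_keyword_match"] = any(
        kw.lower() in reply_lower for kw in fixture["expected_keywords"]
    )
    if not fixture.get("skip_fsm_check", False):
        checks["cp_reached_state"] = STATE_ORDER.index(final_state) >= STATE_ORDER.index(
            fixture["expected_final_state"]
        )
    return checks


fixture = {
    "expected_final_state": "Q2",
    "expected_keywords": ["manual", "searching", "documentation"],
    "skip_fsm_check": True,
}
reply = "I could not find that model in the documentation I have."
print(grade(fixture, "Q1", reply))  # {'cp_keyword_match': True}
```

With the flag set, a run that stops at Q1 is no longer graded on FSM depth, while the honesty signal carried by the keywords is still enforced.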

Scope

One-line fixture change (+ comment). No engine/grader code change. The deeper non-determinism is tracked separately as follow-up (a) — eval-suite determinism via record/replay.

Verification

  • yaml.safe_load parses; skip_fsm_check=True, expected_final_state=Q2, max_turns=3.
  • Matches the established skip_fsm_check convention (04_yaskawa_out_of_kb.yaml).
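The parse check above could be scripted roughly as follows. This is a sketch, not the command actually run for this PR: `load_fixture` and `check_fixture` are hypothetical helper names, and the demo uses a stand-in dict so it runs without the repository or PyYAML installed.

```python
def load_fixture(path: str) -> dict:
    """Parse a fixture file. Requires PyYAML, imported lazily."""
    import yaml

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def check_fixture(data: dict) -> None:
    """Assert the three values listed in the verification notes."""
    assert data["skip_fsm_check"] is True
    assert data["expected_final_state"] == "Q2"
    assert data["max_turns"] == 3


# Stand-in for the parsed file; in the repository this would be
# load_fixture("tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml").
parsed = {"skip_fsm_check": True, "expected_final_state": "Q2", "max_turns": 3}
check_fixture(parsed)
```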

🤖 Generated with Claude Code

…-KB fixture

vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks
about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is
LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS
run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1
(< expected Q2) — producing 11/11 spurious FSM-only failures in the #1948
triage even though content/citation checks pass.

Add skip_fsm_check: true (the grader's purpose-built flag for exactly this
class) so the fixture validates honest out-of-KB behavior via expected_keywords
+ citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml.

Closes follow-up (c) from the #1948 triage (#2222).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff.

🔴 IMPORTANT: Missing error handling on network/IO operations

No network/IO operations were found in the provided diff.

🟡 WARNING: Logic bugs or incorrect assumptions

The introduction of skip_fsm_check: true may be hiding underlying issues with the FSM logic. It is assumed that the non-deterministic behavior is due to LLM-non-determinism, but it may be worth investigating the root cause of this behavior (tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml, line 9).

🟡 WARNING: Missing input validation at API boundaries

No API boundaries were found in the provided diff.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

The addition of a comment explaining the reason for skip_fsm_check: true is helpful, but it may be worth considering adding a more descriptive name or a separate configuration file for these types of edge cases (tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml, line 7).

✅ GOOD: Noteworthy good practices found

The use of clear and descriptive comments in the YAML file is a good practice, making it easier to understand the purpose of the fixture and the expected behavior. The reference to issue #2222 provides additional context for the change.


Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
To trigger self-fix: run bash scripts/pr_self_fix.sh 2256 locally, or add the auto-fix label to this PR (or run /autofix-pr from a Claude Code session)

@github-actions

github-actions Bot commented Jun 22, 2026

MIRA staging gate — ✅ PASS

Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in docs/specs/mira-answer-quality-standard.md. Skipped questions (embed sidecar unavailable, etc.) are excluded from pass/fail math; the run fails closed if >50% are skipped.

  • mean of means: 4.95 (pass threshold: 3.5, scored over 15/15)
  • questions passed: 15 / 15
  • skipped (harness): 0
  • below mean 3.0: 0 (max allowed: 2)
  • hard fails: 0
  • full run logs
| id | category | g | c | a | s | t | mean | note |
|---|---|---|---|---|---|---|---|---|
| oem-model-fault-powerflex-f004 | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-only-no-fault-sew | oem_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| symptom-no-oem-abbrev | symptom_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-grinding | uns_gate | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-arc-flash | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| greeting-hygiene | greeting | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| session-followup | followup | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| photo-less-ocr-claim | no_photo | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| off-topic-redirect | off_topic | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| cmms-context-followup | cmms_context | 4 | 4 | 5 | 5 | 5 | 4.60 | |
| oem-fault-variant-lowercase | oem_model_fault | 5 | 4 | 5 | 5 | 5 | 4.80 | |
| cross-oem-confusion | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-unknown-fault-admit | oem_unknown_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-loto-explicit | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-no-line | uns_gate | 5 | 4 | 5 | 5 | 5 | 4.80 | |
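The pass/fail math described in this report can be reproduced with a short sketch. The thresholds come from the report itself. The function name, the `>=` comparison against the pass threshold, and the rule that any hard fail blocks the gate are assumptions, not the staging gate's actual code.

```python
from statistics import mean

MEAN_THRESHOLD = 3.5     # "pass threshold: 3.5"
LOW_MEAN_CUTOFF = 3.0    # "below mean 3.0"
MAX_LOW_MEANS = 2        # "max allowed: 2"
MAX_SKIP_FRACTION = 0.5  # fail closed if more than 50% are skipped


def gate_passes(question_means: list[float], skipped: int = 0, hard_fails: int = 0) -> bool:
    """Apply the gate rules to per-question means; skipped questions are not scored."""
    total = len(question_means) + skipped
    if total == 0 or skipped / total > MAX_SKIP_FRACTION:
        return False  # fail closed: too little was actually graded
    low = sum(1 for m in question_means if m < LOW_MEAN_CUTOFF)
    # Assumption: any hard fail blocks the gate.
    return mean(question_means) >= MEAN_THRESHOLD and low <= MAX_LOW_MEANS and hard_fails == 0


# Per-question means from the table, in row order.
means = [5.00] * 9 + [4.60, 4.80, 5.00, 5.00, 5.00, 4.80]
print(round(mean(means), 2), gate_passes(means))  # 4.95 True
```

The per-row mean is the plain average of the five dimension scores, for example (4 + 4 + 5 + 5 + 5) / 5 = 4.60 for cmms-context-followup.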

Rubric: docs/specs/mira-answer-quality-standard.md · Spec: docs/specs/staging-environment-spec.md

Version Gate requires a /VERSION bump for non-doc code changes (the .yaml
fixture counts). Patch bump (test/fixture fix) + CHANGELOG note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR: test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.

🔴 IMPORTANT: Missing error handling on network/IO operations

No network/IO operations were found in the provided diff that are missing error handling.

🟡 WARNING: Logic bugs or incorrect assumptions

The added skip_fsm_check: true flag in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (line 10) assumes that the Q→DIAGNOSIS gate transition is LLM-non-deterministic. This assumption should be verified to ensure it is correct. Additionally, the comment mentions that this change mirrors the existing 04_yaskawa_out_of_kb.yaml precedent, but it would be beneficial to include a reference to this precedent in the code for clarity.

🟡 WARNING: Missing input validation at API boundaries

No API boundaries were found in the provided diff that are missing input validation.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

The added comment in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (lines 7-9) is informative, but it would be more readable if it were formatted as a multi-line comment. Consider using a consistent naming convention for variables and functions. The skip_fsm_check flag is clearly explained, but a brief summary of its purpose could be added to the tests/eval/grader.py file (line 110) for context.

✅ GOOD: Noteworthy good practices found

The diff includes a clear and concise description of the problem, fix, and context in the docs/CHANGELOG.md file. The use of a specific flag (skip_fsm_check) to handle a particular edge case is a good practice. The code is well-organized, and the changes are focused on resolving a specific issue.



Mike Harper and others added 3 commits June 22, 2026 19:23
…04-skip-fsm

# Conflicts:
#	VERSION
#	docs/CHANGELOG.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of the Pull Request

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.

🔴 IMPORTANT: Missing error handling

No missing error handling on network/IO operations was found in the provided diff. However, it's essential to note that the diff only includes changes to documentation and a test fixture, so the scope for errors is limited.

🟡 WARNING: Logic bugs or incorrect assumptions

The introduction of skip_fsm_check: true in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (line 10) may be a logic bug if not properly validated. The comment above it explains that this is a stochastic out-of-KB edge case, but it's crucial to ensure that this flag is not overused and that the test is still effectively validating the desired behavior.

🟡 WARNING: Missing input validation

No missing input validation at API boundaries was found in the provided diff. Since the changes are limited to documentation and a test fixture, there are no API boundaries to validate.

🔵 SUGGESTION: Code quality improvements

The comments in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (lines 7-9) are descriptive and explain the purpose of the test fixture. However, it might be helpful to include a brief summary of the changes made in this PR in the docs/CHANGELOG.md file, in addition to the detailed explanation.

✅ GOOD: Noteworthy good practices

The use of a clear and descriptive commit message and the inclusion of a detailed explanation in the docs/CHANGELOG.md file are good practices. The diff is also well-organized, and the changes are easy to follow. The introduction of skip_fsm_check: true is a reasonable solution to the problem described, mirroring the existing 04_yaskawa_out_of_kb.yaml precedent.



@Mikecranesync Mikecranesync enabled auto-merge (squash) June 22, 2026 23:47
@Mikecranesync Mikecranesync merged commit 84ceedd into main Jun 22, 2026
19 checks passed
