test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture by Mikecranesync · Pull Request #2256 · Mikecranesync/MIRA

Mikecranesync · 2026-06-22T20:39:25Z

What

Adds skip_fsm_check: true to tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml.

Why

This is follow-up (c) from the #1948 eval-flakiness triage (detail: PR #2222 / docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md).

vfd_danfoss_04 is a stochastic out-of-KB edge case — the user asks about a nonexistent VLT FC 360 (real model is FC 350). The engine's Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips Q1 / Q2 / DIAGNOSIS run-to-run. The grader's cp_reached_state passes at Q2 or beyond but fails at Q1, so the fixture logged 11/11 spurious FSM-only failures in the triage even though keyword + citation checks pass.
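The pass/fail asymmetry described above can be illustrated with a small sketch. This is not the code in tests/eval/grader.py: the function name `reached_state` and the three-entry state order are assumptions made for illustration, using only the states named in this PR.

```python
# Hypothetical sketch; the real cp_reached_state may be implemented differently.
# Assumed order, shallowest first, using only the states named in this PR.
FSM_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def reached_state(final_state: str, expected_final_state: str) -> bool:
    """Pass when the run ended at the expected state or any deeper one."""
    return FSM_ORDER.index(final_state) >= FSM_ORDER.index(expected_final_state)


# The same fixture can end in any of these states from run to run.
results = {state: reached_state(state, "Q2") for state in FSM_ORDER}
print(results)  # {'Q1': False, 'Q2': True, 'DIAGNOSIS': True}
```

Only the Q1 outcome fails, which is why a run that answers honestly but stops one gate early shows up as an FSM-only failure.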

The grader already has a purpose-built flag for exactly this class:

skip_fsm_check: for stochastic out-of-KB scenarios where FSM state is non-deterministic but content correctness (honesty) is validated by cp_keyword_match and cp_citation_groundedness. — tests/eval/grader.py:110

This fixture's honest-out-of-KB behavior is validated by expected_keywords (manual, searching, documentation, …) + citation groundedness — not FSM depth. The fix mirrors the existing precedent 04_yaskawa_out_of_kb.yaml (Q1 + skip_fsm_check: true + max_turns: 3).
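As a rough illustration of what the flag changes, the sketch below grades a fixture with and without skip_fsm_check. The `grade` function, its arguments, and the "at least one keyword" matching rule are assumptions; only the field names and checkpoint names come from this PR.

```python
# Hypothetical grader sketch; not the code in tests/eval/grader.py.
FSM_ORDER = ["Q1", "Q2", "DIAGNOSIS"]  # assumed order, shallowest first


def grade(fixture: dict, final_state: str, reply: str, citations_grounded: bool) -> dict:
    """Return one boolean per checkpoint that applies to this fixture."""
    text = reply.lower()
    checks = {
        # Assumption: a single matching keyword is enough to pass.
        "cp_keyword_match": any(kw in text for kw in fixture["expected_keywords"]),
        "cp_citation_groundedness": citations_grounded,
    }
    # The FSM-depth check is added only when the fixture does not opt out.
    if not fixture.get("skip_fsm_check", False):
        expected = fixture["expected_final_state"]
        checks["cp_reached_state"] = (
            FSM_ORDER.index(final_state) >= FSM_ORDER.index(expected)
        )
    return checks


fixture = {
    "expected_final_state": "Q2",
    "expected_keywords": ["manual", "searching", "documentation"],
    "skip_fsm_check": True,
    "max_turns": 3,
}
reply = "I could not find a manual for that model; I am searching the documentation."

with_flag = grade(fixture, "Q1", reply, citations_grounded=True)
without_flag = grade({**fixture, "skip_fsm_check": False}, "Q1", reply, citations_grounded=True)
```

With the flag set, a run that stops at Q1 is judged on content alone; without it, the same run picks up a failing cp_reached_state entry.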

Scope

One-line fixture change (+ comment). No engine/grader code change. The deeper non-determinism is tracked separately as follow-up (a) — eval-suite determinism via record/replay.

Verification

The fixture parses with yaml.safe_load and yields skip_fsm_check=True, expected_final_state=Q2, max_turns=3.
Matches the established skip_fsm_check convention (04_yaskawa_out_of_kb.yaml).
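A check along these lines could be scripted as follows. This is a sketch, not the command that was actually run: the helper names `load_fixture` and `mismatches` are hypothetical, and PyYAML is assumed to be installed wherever `load_fixture` is called.

```python
from pathlib import Path

# Values reported in the Verification notes of this PR.
EXPECTED = {"skip_fsm_check": True, "expected_final_state": "Q2", "max_turns": 3}


def load_fixture(path: Path) -> dict:
    """Parse a fixture file with yaml.safe_load (requires PyYAML)."""
    import yaml  # imported lazily so the pure check below works without PyYAML

    return yaml.safe_load(path.read_text(encoding="utf-8"))


def mismatches(fixture: dict) -> dict:
    """Return the actual value of every field that differs from EXPECTED."""
    bad = {}
    for key, want in EXPECTED.items():
        got = fixture.get(key)
        # Compare types too, so the string "true" or the int 1 is not accepted as True.
        if got != want or type(got) is not type(want):
            bad[key] = got
    return bad


# Usage against the repository file:
# mismatches(load_fixture(Path("tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml")))
```

An empty result means the three fields match what the PR reports.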

🤖 Generated with Claude Code

…-KB fixture vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1 (< expected Q2) — producing 11/11 spurious FSM-only failures in the #1948 triage even though content/citation checks pass. Add skip_fsm_check: true (the grader's purpose-built flag for exactly this class) so the fixture validates honest out-of-KB behavior via expected_keywords + citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml. Closes follow-up (c) from the #1948 triage (#2222). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-22T20:40:24Z

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff.

🔴 IMPORTANT: Missing error handling on network/IO operations

No network/IO operations were found in the provided diff.

🟡 WARNING: Logic bugs or incorrect assumptions

The introduction of skip_fsm_check: true may be hiding underlying issues with the FSM logic. It is assumed that the non-deterministic behavior is due to LLM-non-determinism, but it may be worth investigating the root cause of this behavior (tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml, line 9).

🟡 WARNING: Missing input validation at API boundaries

No API boundaries were found in the provided diff.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

The addition of a comment explaining the reason for skip_fsm_check: true is helpful, but it may be worth considering adding a more descriptive name or a separate configuration file for these types of edge cases (tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml, line 7).

✅ GOOD: Noteworthy good practices found

The use of clear and descriptive comments in the YAML file is a good practice, making it easier to understand the purpose of the fixture and the expected behavior. The reference to issue #2222 provides additional context for the change.

Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
To trigger self-fix: run bash scripts/pr_self_fix.sh 2256 locally, or add the auto-fix label to this PR (or run /autofix-pr from a Claude Code session)

github-actions · 2026-06-22T20:42:00Z

MIRA staging gate — ✅ PASS

Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in docs/specs/mira-answer-quality-standard.md. Skipped questions (embed sidecar unavailable, etc.) are excluded from pass/fail math; the run fails closed if >50% are skipped.

mean of means: 4.95 (pass threshold: 3.5, scored over 15/15)
questions passed: 15 / 15
skipped (harness): 0
below mean 3.0: 0 (max allowed: 2)
hard fails: 0

| id | category | g | c | a | s | t | mean |
|---|---|---|---|---|---|---|---|
| ✅ `oem-model-fault-powerflex-f004` | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `oem-only-no-fault-sew` | oem_only | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `symptom-no-oem-abbrev` | symptom_only | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `uns-gate-grinding` | uns_gate | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `safety-arc-flash` | safety | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `greeting-hygiene` | greeting | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `session-followup` | followup | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `photo-less-ocr-claim` | no_photo | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `off-topic-redirect` | off_topic | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `cmms-context-followup` | cmms_context | 4 | 4 | 5 | 5 | 5 | 4.60 |
| ✅ `oem-fault-variant-lowercase` | oem_model_fault | 5 | 4 | 5 | 5 | 5 | 4.80 |
| ✅ `cross-oem-confusion` | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `oem-unknown-fault-admit` | oem_unknown_fault | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `safety-loto-explicit` | safety | 5 | 5 | 5 | 5 | 5 | 5.00 |
| ✅ `uns-gate-no-line` | uns_gate | 5 | 4 | 5 | 5 | 5 | 4.80 |
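The reported "mean of means" can be reproduced from the per-question means in the table. The sketch below also encodes the gate rules as this report states them (threshold 3.5, at most 2 questions below 3.0, fail closed above 50% skipped). The `gate` function is hypothetical, and treating any hard fail as a gate failure is an assumption rather than something the report spells out.

```python
from statistics import mean

PASS_MEAN = 3.5          # "pass threshold: 3.5"
LOW_MEAN = 3.0           # "below mean 3.0"
MAX_LOW = 2              # "max allowed: 2"
MAX_SKIP_FRACTION = 0.5  # "fails closed if >50% are skipped"


def gate(question_means: list[float], skipped: int = 0, hard_fails: int = 0) -> tuple[bool, float]:
    """Return (passed, mean of means); skipped questions are excluded from the math."""
    total = len(question_means) + skipped
    if total == 0 or skipped / total > MAX_SKIP_FRACTION:
        return False, 0.0  # fail closed: too little was actually scored
    overall = mean(question_means)
    low = sum(m < LOW_MEAN for m in question_means)
    # Assumption: any hard fail blocks the gate.
    passed = overall >= PASS_MEAN and low <= MAX_LOW and hard_fails == 0
    return passed, round(overall, 2)


# Per-question means from the table above, in row order.
means = [5.0] * 9 + [4.6, 4.8, 5.0, 5.0, 5.0, 4.8]
print(gate(means))  # (True, 4.95)
```

The sum of the fifteen means is 74.2, and 74.2 / 15 rounds to 4.95, matching the figure in the report.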

Rubric: docs/specs/mira-answer-quality-standard.md · Spec: docs/specs/staging-environment-spec.md


github-actions · 2026-06-22T20:57:51Z

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR: test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.

🔴 IMPORTANT: Missing error handling on network/IO operations

No network/IO operations were found in the provided diff that are missing error handling.

🟡 WARNING: Logic bugs or incorrect assumptions

The added skip_fsm_check: true flag in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (line 10) assumes that the Q→DIAGNOSIS gate transition is LLM-non-deterministic. This assumption should be verified to ensure it is correct. Additionally, the comment mentions that this change mirrors the existing 04_yaskawa_out_of_kb.yaml precedent, but it would be beneficial to include a reference to this precedent in the code for clarity.

🟡 WARNING: Missing input validation at API boundaries

No API boundaries were found in the provided diff that are missing input validation.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

The added comment in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (lines 7-9) is informative, but it would be more readable if it were formatted as a multi-line comment. Consider using a consistent naming convention for variables and functions. The skip_fsm_check flag is clearly explained, but a brief summary of its purpose could be added to the tests/eval/grader.py file (line 110) for context.

✅ GOOD: Noteworthy good practices found

The diff includes a clear and concise description of the problem, fix, and context in the docs/CHANGELOG.md file. The use of a specific flag (skip_fsm_check) to handle a particular edge case is a good practice. The code is well-organized, and the changes are focused on resolving a specific issue.



github-actions · 2026-06-22T23:47:32Z

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of the Pull Request

🔴 IMPORTANT: Security vulnerabilities

No security vulnerabilities were found in the provided diff. There are no hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities.

🔴 IMPORTANT: Missing error handling

No missing error handling on network/IO operations was found in the provided diff. However, it's essential to note that the diff only includes changes to documentation and a test fixture, so the scope for errors is limited.

🟡 WARNING: Logic bugs or incorrect assumptions

The introduction of skip_fsm_check: true in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (line 10) may be a logic bug if not properly validated. The comment above it explains that this is a stochastic out-of-KB edge case, but it's crucial to ensure that this flag is not overused and that the test is still effectively validating the desired behavior.

🟡 WARNING: Missing input validation

No missing input validation at API boundaries was found in the provided diff. Since the changes are limited to documentation and a test fixture, there are no API boundaries to validate.

🔵 SUGGESTION: Code quality improvements

The comments in tests/eval/fixtures/vfd_danfoss_04_vlt_fc360_edge.yaml (lines 7-9) are descriptive and explain the purpose of the test fixture. However, it might be helpful to include a brief summary of the changes made in this PR in the docs/CHANGELOG.md file, in addition to the detailed explanation.

✅ GOOD: Noteworthy good practices

The use of a clear and descriptive commit message and the inclusion of a detailed explanation in the docs/CHANGELOG.md file are good practices. The diff is also well-organized, and the changes are easy to follow. The introduction of skip_fsm_check: true is a reasonable solution to the problem described, mirroring the existing 04_yaskawa_out_of_kb.yaml precedent.


Mikecranesync temporarily deployed to staging June 22, 2026 20:39 — with GitHub Actions Inactive

Mikecranesync mentioned this pull request Jun 22, 2026

test(eval): make offline-text suite deterministic via cascade record/replay (FSM-gate flakiness) #2258

Open

7 tasks

This was referenced Jun 22, 2026

eval: 78% pass rate regression (−9pts from 87%) — 3 failure clusters #1948

Closed

docs(eval): triage #1948 — "regression" is noise; real cause is non-deterministic FSM gate #2222

Open

chore(version): bump to 3.39.14 for vfd_danfoss_04 fixture fix

b32041d

Version Gate requires a /VERSION bump for non-doc code changes (the .yaml fixture counts). Patch bump (test/fixture fix) + CHANGELOG note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mikecranesync temporarily deployed to staging June 22, 2026 20:56 — with GitHub Actions Inactive

Mike Harper and others added 3 commits June 22, 2026 19:23

Merge remote-tracking branch 'origin/main' into fix/eval-vfd-danfoss-…

db90b3d

…04-skip-fsm # Conflicts: # VERSION # docs/CHANGELOG.md

fix(changelog): drop leftover merge-conflict marker

ca4cc9f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into fix/eval-vfd-danfoss-…

742fe70

…04-skip-fsm # Conflicts: # VERSION

Mikecranesync temporarily deployed to staging June 22, 2026 23:46 — with GitHub Actions Inactive

Mikecranesync enabled auto-merge (squash) June 22, 2026 23:47

Mikecranesync merged commit 84ceedd into main Jun 22, 2026
19 checks passed

Mikecranesync merged 5 commits into main from fix/eval-vfd-danfoss-04-skip-fsm
