docs(eval): triage #1948 — "regression" is noise; real cause is non-deterministic FSM gate #2222

Open
Mikecranesync wants to merge 1 commit into main from docs/eval-1948-triage

@Mikecranesync

Owner

Summary

Triage of #1948 ("eval: 78% pass rate regression −9pts from 87%, 3 failure clusters"), selected as the top actionable beta-readiness item after #2152 (prod outage) was found recovered and #2112 (security leak) already fixed by #2127.

The "−9pt regression" is a false alarm. Current main re-runs at 49/57 (85%) — healthy, and above the "regressed" 45/57. The offline suite swings 38%–87% run-to-run; #1948 caught a low-water-mark sample.
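
The noise argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the eval tooling: the helper names `pass_rate` and `within_noise_band` are hypothetical, and the band endpoints are simply the 38% to 87% range quoted in this summary.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of eval cases that passed."""
    return passed / total


def within_noise_band(rate: float, low: float, high: float) -> bool:
    """True when a pass rate falls inside the historically observed range."""
    return low <= rate <= high


# Run-to-run range of the offline suite, as quoted in the summary (assumed inclusive).
NOISE_LOW, NOISE_HIGH = 0.38, 0.87

flagged = pass_rate(45, 57)  # the run #1948 reported as a regression
current = pass_rate(49, 57)  # fresh re-run on current main

for label, rate in (("flagged", flagged), ("current", current)):
    verdict = "inside" if within_noise_band(rate, NOISE_LOW, NOISE_HIGH) else "outside"
    print(f"{label}: {rate:.3f} is {verdict} the observed band")
```

Both runs land inside the band, so neither one on its own is evidence of a real change. The exact values are 45/57 ≈ 78.9% and 49/57 ≈ 86.0%; the figures quoted in this thread are truncated to whole percents.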

What's actually going on

Using the 11 committed historical scorecards + a fresh current-main run, the real persistent failures narrow to ~4 cases (not 12):

| Case | Now | Real issue |
| --- | --- | --- |
| vfd_mitsu_03_a700 | ✓ PASS | already fixed since baseline |
| pf525_f004_02 | ✗ FSM only | ends Q2 not DIAGNOSIS (content otherwise good) |
| asset_change_mid_session_08 | ✗ FSM only | ends Q1 |
| vfd_danfoss_04 | ✗ FSM only | fixture/grader mismatch (expected_final_state: Q2) |
| gs3_ground_fault_14 | ✗ KeyKW only | citation/keyword miss (Nemotron-404 aggravated) |
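
One way to separate persistent failures from one-off flakes across a set of scorecards is to count how often each case fails. The sketch below is illustrative only: the scorecard shape (a dict of case id to pass/fail), the `min_fail_fraction` threshold, and the case ids `case_a`, `case_b`, `case_c` are all assumptions, not the format or data of the committed scorecards.

```python
from collections import Counter
from typing import Dict, List

Scorecard = Dict[str, bool]  # case id -> passed? (assumed shape)


def failure_counts(scorecards: List[Scorecard]) -> Counter:
    """Count, per case, how many runs it failed in."""
    counts: Counter = Counter()
    for card in scorecards:
        for case_id, passed in card.items():
            if not passed:
                counts[case_id] += 1
    return counts


def persistent_failures(scorecards: List[Scorecard], min_fail_fraction: float = 0.5) -> List[str]:
    """Cases that fail in at least `min_fail_fraction` of the runs."""
    total_runs = len(scorecards)
    counts = failure_counts(scorecards)
    return sorted(
        case_id
        for case_id, failed in counts.items()
        if failed / total_runs >= min_fail_fraction
    )


# Made-up runs: case_a always fails, case_b fails once, case_c never fails.
runs = [
    {"case_a": False, "case_b": True, "case_c": True},
    {"case_a": False, "case_b": False, "case_c": True},
    {"case_a": False, "case_b": True, "case_c": True},
    {"case_a": False, "case_b": True, "case_c": True},
]

print(persistent_failures(runs))
```

With the default threshold only `case_a` is reported; `case_b` fails in a single run and is treated as noise. The threshold is a judgment call and would need tuning against the real run history.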

Root cause: the engine's Q→DIAGNOSIS gate transition is LLM-non-deterministic — identical scripted pf525_f004_02 turns reach DIAGNOSIS in one run and stall at Q2 in another. (An earlier draft wrongly blamed the synthetic-user driver; that path is off by default — corrected in the doc.) The eval has a record/replay determinism seam (llm_replay.py) but runs live because the replay store is .gitignored/absent.
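
To illustrate why a record/replay seam removes this kind of flakiness, here is a minimal sketch of the general technique. It is not the API of `llm_replay.py`: the `ReplayLLM` class, its modes, the in-memory store, and the `flaky_llm` stand-in are all hypothetical.

```python
import hashlib
import random
from typing import Callable, Dict


class ReplayLLM:
    """Wraps an LLM call so a run can be recorded once and replayed exactly."""

    def __init__(self, live_call: Callable[[str], str], store: Dict[str, str], mode: str):
        if mode not in ("record", "replay"):
            raise ValueError(f"unknown mode: {mode}")
        self.live_call = live_call
        self.store = store
        self.mode = mode

    @staticmethod
    def _key(prompt: str) -> str:
        # Identical prompts map to the same key, so replay is a pure lookup.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def __call__(self, prompt: str) -> str:
        key = self._key(prompt)
        if self.mode == "replay":
            if key not in self.store:
                raise KeyError("no recorded response for this prompt; re-record the store")
            return self.store[key]
        # Record mode: call the live model and remember what it said.
        response = self.live_call(prompt)
        self.store[key] = response
        return response


def flaky_llm(prompt: str) -> str:
    """Stand-in for a live model whose gate decision varies between calls."""
    return random.choice(["Q2", "DIAGNOSIS"])


store: Dict[str, str] = {}
recorded = ReplayLLM(flaky_llm, store, mode="record")("scripted pf525_f004_02 turn")

replayed = ReplayLLM(flaky_llm, store, mode="replay")
outcomes = {replayed("scripted pf525_f004_02 turn") for _ in range(20)}
print(outcomes == {recorded})
```

In replay mode the wrapped model is never called, so the terminal state for a scripted case is fixed by whatever was recorded. A real store would presumably be persisted to disk and committed, which is exactly the piece reported as missing here.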

Separate infra finding: the Nemotron reranker is 404-down on every retrieval (integrate.api.nvidia.com/v1/ranking), silently degrading grounding — deserves its own issue.

Recommendation

Reframe #1948 from "engine regression" → three scoped follow-ups:

  • (a) eval determinism via record/replay (highest leverage — makes the suite able to detect a real regression)
  • (b) Nemotron-404 reranker outage (new infra issue)
  • (c) vfd_danfoss_04 fixture fix

Full detail: docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md.

Docs-only; no code change.

…eterministic FSM gate

Investigation of #1948 ("eval 78% regression −9pts, 3 clusters"). Findings:

- The −9pt "regression" is within the eval's noise band: committed real scorecards
  swing 38%–87% run-to-run. Current main re-run = 49/57 (85%), healthy and ABOVE
  the "regressed" 45/57. #1948 caught a low-water-mark run.
- Real signal (from 11 historical runs + a current-main re-run): ~4 persistent
  cases, not 12. One (vfd_mitsu_03) already fixed since baseline. Three fail only
  the FSM checkpoint; one (gs3_ground_fault_14) fails only KeyKW.
- Root cause = LLM-non-deterministic Q→DIAGNOSIS gate transition: identical
  SCRIPTED pf525_f004_02 turns reach DIAGNOSIS in one run, stall at Q2 in another.
  (Corrected an earlier wrong draft that blamed the synthetic-user driver — that
  path is off by default; the default suite uses scripted turns.)
- The eval has a record/replay determinism seam (llm_replay.py) but runs live
  because the replay store is .gitignored/absent → highest-leverage fix.
- Separate infra finding: the Nemotron reranker is 404-down on every retrieval,
  degrading citation/keyword grounding (deserves its own issue).

Recommends reframing #1948 from "engine regression" to: (a) eval determinism via
record/replay, (b) Nemotron-404 reranker outage, (c) vfd_danfoss_04 fixture fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Mikecranesync

Owner Author

Triage acted on — #1948 closed as not planned (false regression; current main healthy at 85%). The three genuine follow-ups this doc recommends are now filed:

This PR remains the canonical reference doc for the investigation.

Mikecranesync added a commit that referenced this pull request Jun 22, 2026
…-KB fixture (#2256)

* test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture

vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks
about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is
LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS
run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1
(< expected Q2) — producing 11/11 spurious FSM-only failures in the #1948
triage even though content/citation checks pass.

Add skip_fsm_check: true (the grader's purpose-built flag for exactly this
class) so the fixture validates honest out-of-KB behavior via expected_keywords
+ citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml.

Closes follow-up (c) from the #1948 triage (#2222).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(version): bump to 3.39.14 for vfd_danfoss_04 fixture fix

Version Gate requires a /VERSION bump for non-doc code changes (the .yaml
fixture counts). Patch bump (test/fixture fix) + CHANGELOG note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(changelog): drop leftover merge-conflict marker

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Mike Harper <bravonode@FactoryLM-Bravo.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
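
For readers unfamiliar with the grader, the effect of `skip_fsm_check` described in the commit above can be sketched roughly as follows. This is an assumption-laden illustration, not the grader's code: the function name `check_reached_state`, the three-state ordering, and the dict-shaped fixture are hypothetical, modeled only on the behavior the commit message describes (landing at Q1 fails when Q2 is expected).

```python
from typing import Any, Dict

# Assumed ordering of FSM depth, shallowest first. The real FSM may have more states.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def check_reached_state(fixture: Dict[str, Any], final_state: str) -> bool:
    """Pass if the run reached at least the expected state, or if the check is skipped."""
    if fixture.get("skip_fsm_check", False):
        # Fixture opts out of FSM depth; other checks (keywords, citations) still apply.
        return True
    expected = fixture["expected_final_state"]
    return STATE_ORDER.index(final_state) >= STATE_ORDER.index(expected)


before = {"expected_final_state": "Q2"}
after = {"expected_final_state": "Q2", "skip_fsm_check": True}

for state in STATE_ORDER:
    print(state, check_reached_state(before, state), check_reached_state(after, state))
```

Without the flag, a run that stalls at Q1 fails the checkpoint even when its content is fine. With the flag set, the terminal state no longer affects the result, which matches the intent of validating the out-of-KB fixture on keywords and citations instead.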