ci: regression suite — difficulty contract + perf budgets#27
Merged
Conversation
Adds packages/logic-grid/bench/regression.test.ts with strict assertions, plus a separate \`regression\` job in ci.yml that runs after \`check\` so flaky perf doesn't block normal PR signals. Three layers of checks: 1. Difficulty contract — directly tied to EASY_TYPES / MEDIUM_TYPES from src/difficulty.ts (+ deduce-based expert promotion). Catches the generator silently leaking a higher-tier constraint into a lower tier (or vice versa). 50 puzzles per difficulty. 2. Constraint diversity — across 50 hard puzzles, each of the four hard-only types (between, not_between, not_next_to, exact_distance) must appear at least once; catches silent type-dropouts. Plus a no-single-type > 80% sanity check at medium/hard/expert (easy is excluded — its 2-type set means dominance is structural). 3. Perf budgets — median of N runs per grid size, calibrated 50-100× above current real values to absorb GitHub-runner variance while still flagging an order-of-magnitude regression. The CI job pins to Node 24 only (single perf shape) and \`needs: check\` so we don't burn bench time when basic tests are red.
Deploying with
|
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs |
logic-grid | be755ec | Commit Preview URL Branch Preview URL |
Apr 29 2026, 02:07 PM |
Six review fixes: - Doc-comment said "Two layers" but enumerated three. Now says "Three". - HARD_ONLY_TYPES no longer hardcoded in the test. Lifted to difficulty.ts alongside ALL_CONSTRAINT_TYPES, derived from it. ALL_CONSTRAINT_TYPES uses a Record<ConstraintType, true> so a new ConstraintType variant is a TypeScript error rather than silent stale config. - Surface seed in assertion messages so a 1-of-50 contract failure tells you which puzzle to reproduce. - Drop the redundant explicit warm-up generate() call. Earlier describe blocks already do hundreds of generations before perf budgets run; added one line of comment in case the order changes later. - Comment the upper-middle-vs-strict-median quirk so it doesn't read as a bug. Note that runs is the first lever to pull if it gets noisy. - Document on ci.yml why \`needs: check\` waits for the FULL Node matrix (intentional — all supported Nodes green before bench), so an apparent regression-job hang is debuggable.
ALL_CONSTRAINT_TYPES was an exported intermediate that nothing outside this file used (the Record was the actual exhaustiveness check; the array form was just scaffolding). Inline the Record into the HARD_ONLY_TYPES expression with `satisfies` to keep the compile-time check without leaking a half-public symbol.
Five fixes: 1. Single source of truth for difficulty tiers. difficulty.ts now uses a TYPE_TIER Record<ConstraintType, "easy" | "medium" | "hard"> as the only place where the tier of each constraint type is decided. EASY_TYPES, MEDIUM_TYPES, and HARD_ONLY_TYPES are derived from it. Adding a new ConstraintType is a TS error in the Record until its tier is decided (no more "did I remember to update MEDIUM_TYPES too?"). 2. Tighten the diversity test name. "every documented hard-only constraint type is reachable from hard generation" → "every hard-only type appears across 50 seeds at 4×4 hard", with a comment pointing the next debugger at the two real failure modes (dropped type vs distribution shift). 3. Local JIT warm-up. perf-budgets describe now has its own beforeAll warm-up call instead of relying on earlier describes. Re-ordering or splitting the file no longer risks a cold-start 3×3 flake. 4. Easy/medium/hard tests now also assert NOT expert. A regression that silently promotes everything to expert (e.g. a deduce change) is now caught. Extracted isExpertSolution() helper to keep the assertions tight. 5. Split regression into its own workflow with paths gating. ci.yml now contains only check + build (runs on every PR). regression.yml fires only when packages/logic-grid/** or its workflow file change. Docs-only PRs no longer pay the bench cost or risk perf-budget flakes.
…ession.yml Two intentional consequences of splitting regression into its own workflow: 1. paths: filter + required-check is a footgun. GitHub skips the workflow on filtered PRs, and a skipped required check counts as missing, not passed — blocks merge. Don't make this required without adding a signal-success-on-skip companion job. 2. Lost the old needs: check gate. Regression now runs concurrently with the matrix check; basic-test failures don't short-circuit bench cost. Trade-off: parallelism + independent perf signal vs occasional wasted runs. Worth it at this PR volume. No behavior change — comments only.
… helpers
EASY_TYPES (tier-only) and MEDIUM_TYPES (cumulative) had asymmetric
semantics that the names hid; HARD_ONLY_TYPES added a third shape.
Replace all three with two self-documenting helpers derived from a
single TYPE_TIER record:
typesAtTier("medium") → just medium-tier types (no easy)
typesUpToTier("medium") → easy + medium tiers (cumulative)
Both shapes are load-bearing — typesUpToTier in classifyByTypes and
filterByDifficulty (asking "is this type allowed at difficulty X"),
typesAtTier in the regression bench (asking "did each tier-X type
appear"). Now the call site picks the shape it actually wants.
Also factored a TIER_RANK array so classifyByTypes is a single max-rank
loop instead of two flag booleans, and exported a ConstraintTier type
to surface the easy/medium/hard string union for the helpers.
Added unit tests covering both helpers — coverage stays at 100%.
Internal refactor: not in index.ts surface, no breaking change for
external consumers (the constants weren't re-exported there either).
generate() calls filterByDifficulty(), which calls typesUpToTier() — so the previous implementation allocated a fresh Set on every generation. Hot path. Move both lookups behind module-level Records so each call is a single property access. Function shape unchanged for callers.
Five fixes — ubuntu-latest stays (50-100× headroom absorbs runner drift; pinning trades a real maintenance burden for hypothetical risk). - typesAtTier / typesUpToTier now return ReadonlySet — TS prevents callers from mutating the shared module-state instance. - Both helpers normalized to ReadonlySet (was: array vs Set asymmetry). Test asserts use [...set] so the shape change is invisible to readers. - Bench-test locals renamed (allowedAtEasy / allowedAtMedium / hardOnly) so a future grep for legacy EASY_TYPES / MEDIUM_TYPES / HARD_ONLY_TYPES doesn't land on what looks like a live reference. - Hot-path comment in difficulty.ts reworded — generate() calls filterByDifficulty once per puzzle, not once per constraint. - PR body updated separately via gh pr edit to match the actual workflow split (regression.yml is its own file, no needs: check anymore).
Two clear fixes plus a pushback on the third.
1. filterByDifficulty's null-branch comment now mentions expert. The
branch fires for both \"hard\" and \"expert\" (anything not easy/medium);
the old comment only said \"hard allows all types\".
2. Reword the expert contract test name. \"fail to fully deduce\" reads
ambiguously — could be \"test failed\" rather than \"puzzle requires
backtracking\". New name is unambiguous about the subject (the puzzle).
3. Skip pinning the perf-budget test to a specific difficulty. The
existing call \`generate({ size, categories, seed })\` measures real
generate() perf with default options — which is what users hit. Pinning
to \"hard\" trades that signal for catching only hard-specific regressions
AND invalidates the existing calibration. Added one line of comment
explaining the deliberate no-difficulty choice instead.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a regression suite that enforces the difficulty contract (
typesUpToTier/typesAtTierfromsrc/difficulty.ts), checks constraint-type diversity, and pins a generous perf budget per grid size. Lives in its own workflow file so docs-only PRs don't pay the bench cost or risk perf-budget flakes.Three layers of checks (in
bench/regression.test.ts)1. Difficulty contract (50 puzzles per difficulty)
easypuzzles: every type is allowed at the easy tier; deduction doesn't quietly require contradictionmediumpuzzles: every type is allowed at medium with at least one beyond easy; deduction stays within mediumhardpuzzles: at least one type beyond medium; deduction stays within hardexpertpuzzles: deduction either has acontradictionstep orcomplete === falseEvery assertion includes the failing seed in its message so a 1-of-50 failure is reproducible without binary search.
2. Constraint diversity
3. Perf budgets (median of N runs per size)
Real measured values are 1-18ms; budgets are ~50-100× to absorb shared-runner variance while still catching an order-of-magnitude regression. JIT warmed locally via
beforeAllso reordering describe blocks can't flake the 3×3 budget.Wiring
regression.ymlworkflow onubuntu-latestwith Node 24 (single shape on purpose). Image pinning was considered and rejected: the 50-100× headroom in the perf budgets absorbs runner drift, so the pin would trade a real maintenance cost (deprecated images, missed updates) for a hypothetical risk.paths:filter so it only runs whenpackages/logic-grid/**or its workflow file change. Two consequences documented inline in the file:Regression / regressiona required status check while thepaths:filter is in place — GitHub skips → "missing" → blocks merge.needs: checkshort-circuit; regression now runs concurrently with the matrix check. Conscious tradeoff at this PR volume.EASY_TYPES/MEDIUM_TYPES/HARD_ONLY_TYPESexports withtypesAtTier(tier)andtypesUpToTier(tier)helpers derived from a singleTYPE_TIERRecord. Adding a newConstraintTypeis a TypeScript error in the Record until its tier is decided. Both helpers return memoizedReadonlySet<ConstraintType>(compile-time immutability + zero allocation per call). 100% coverage maintained.