ci: regression suite — difficulty contract + perf budgets by antonstefer · Pull Request #27 · antonstefer/logic-grid

antonstefer · 2026-04-29T09:01:48Z

Summary

Adds a regression suite that enforces the difficulty contract (typesUpToTier / typesAtTier from src/difficulty.ts), checks constraint-type diversity, and pins a generous perf budget per grid size. Lives in its own workflow file so docs-only PRs don't pay the bench cost or risk perf-budget flakes.

Three layers of checks (in `bench/regression.test.ts`)

1. Difficulty contract (50 puzzles per difficulty)

- easy puzzles: every type is allowed at the easy tier; deduction doesn't quietly require contradiction
- medium puzzles: every type is allowed at medium with at least one beyond easy; deduction stays within medium
- hard puzzles: at least one type beyond medium; deduction stays within hard
- expert puzzles: deduction either has a contradiction step or complete === false

Every assertion includes the failing seed in its message so a 1-of-50 failure is reproducible without binary search.
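
A minimal sketch of the seed-in-message idea, under stated assumptions: `Puzzle`, `checkAllowedTypes` and `checkContract` are hypothetical names invented here, and the real suite works against `generate()` and `typesUpToTier()` rather than these stand-ins. It only illustrates how each failure can carry the seed needed to reproduce it.

```typescript
// Hypothetical stand-ins so the sketch is self-contained; the real suite
// uses the package's own puzzle and constraint types.
type ConstraintType = string;
interface Puzzle {
  seed: number;
  constraintTypes: ConstraintType[];
}

/** Returns a message naming the seed, or null when every type is allowed. */
function checkAllowedTypes(
  puzzle: Puzzle,
  allowed: ReadonlySet<ConstraintType>,
  tier: string,
): string | null {
  const leaked = puzzle.constraintTypes.filter((t) => !allowed.has(t));
  return leaked.length === 0
    ? null
    : `seed ${puzzle.seed}: ${tier} puzzle uses disallowed type(s) ${leaked.join(", ")}`;
}

/** Checks seeds 0..count-1 and collects one message per failing seed. */
function checkContract(
  makePuzzle: (seed: number) => Puzzle,
  allowed: ReadonlySet<ConstraintType>,
  tier: string,
  count = 50,
): string[] {
  const failures: string[] = [];
  for (let seed = 0; seed < count; seed++) {
    const message = checkAllowedTypes(makePuzzle(seed), allowed, tier);
    if (message !== null) failures.push(message);
  }
  return failures;
}
```

With a message of this shape, a single failing puzzle out of 50 can be regenerated directly from the reported seed.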

2. Constraint diversity

- Across 50 hard puzzles, each hard-tier type must appear at least once. The test name is explicit that this is a "seeds 0..49 at 4×4" claim, not a reachability proof: a distribution shift fails the same way a dropped type does.
- No single constraint type > 80% of clues at medium/hard/expert. Easy excluded (2-type tier means dominance is structural).
- 4×4 medium puzzles have 3-15 clues (sanity bound).
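
The two diversity checks could be sketched roughly as below. `dominantShare` and `missingTypes` are hypothetical helper names, and whether the real test computes the 80% share per puzzle or over a pooled batch is not stated here, so the sketch simply takes whatever list of clue types it is given.

```typescript
/** Most frequent constraint type in a list of clue types and its share (0..1). */
function dominantShare(
  clueTypes: readonly string[],
): { type: string; share: number } | null {
  if (clueTypes.length === 0) return null;
  const counts = new Map<string, number>();
  for (const t of clueTypes) counts.set(t, (counts.get(t) ?? 0) + 1);
  let bestType = "";
  let bestCount = 0;
  for (const [type, n] of counts) {
    if (n > bestCount) {
      bestType = type;
      bestCount = n;
    }
  }
  return { type: bestType, share: bestCount / clueTypes.length };
}

/** Required types that never appear anywhere in a batch of puzzles. */
function missingTypes(
  required: ReadonlySet<string>,
  batch: readonly (readonly string[])[],
): string[] {
  const seen = new Set<string>();
  for (const puzzleTypes of batch) {
    for (const t of puzzleTypes) seen.add(t);
  }
  return [...required].filter((t) => !seen.has(t));
}
```

A test would then assert `share <= 0.8` for the dominance check and an empty `missingTypes` result for the hard-tier coverage check.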

3. Perf budgets (median of N runs per size)

| Size | Budget | Runs |
| --- | --- | --- |
| 3×3 | 30ms | 10 |
| 4×4 | 50ms | 10 |
| 5×5 | 100ms | 10 |
| 6×6 | 250ms | 5 |
| 7×7 | 600ms | 5 |
| 8×8 | 1200ms | 5 |

Real measured values are 1-18ms; budgets are ~50-100× to absorb shared-runner variance while still catching an order-of-magnitude regression. JIT warmed locally via beforeAll so reordering describe blocks can't flake the 3×3 budget.
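
A rough sketch of a median-of-N budget check. The budget numbers are the ones in the table above; `upperMedian`, `medianRuntimeMs` and the injectable `now` clock are hypothetical names for illustration. The median here takes the upper-middle sample for an even run count, which is one reading of the "upper-middle" quirk mentioned in the commit notes, not a confirmed copy of the real code. `Date.now` has only millisecond resolution, so a real bench would likely want a finer clock.

```typescript
interface Budget {
  size: number;
  budgetMs: number;
  runs: number;
}

// Values from the PR's budget table.
const BUDGETS: readonly Budget[] = [
  { size: 3, budgetMs: 30, runs: 10 },
  { size: 4, budgetMs: 50, runs: 10 },
  { size: 5, budgetMs: 100, runs: 10 },
  { size: 6, budgetMs: 250, runs: 5 },
  { size: 7, budgetMs: 600, runs: 5 },
  { size: 8, budgetMs: 1200, runs: 5 },
];

/** Upper-middle median: for an even count, the higher of the two middle samples. */
function upperMedian(samples: readonly number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)]!;
}

/** Times `work` for `runs` iterations and returns the median duration in ms. */
function medianRuntimeMs(
  work: () => void,
  runs: number,
  now: () => number = Date.now, // injectable so the sketch can be tested
): number {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    work();
    samples.push(now() - start);
  }
  return upperMedian(samples);
}
```

Using a median rather than a mean keeps one slow run on a shared runner from tripping the budget on its own.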

Wiring

- New regression.yml workflow on ubuntu-latest with Node 24 (single shape on purpose). Image pinning was considered and rejected: the 50-100× headroom in the perf budgets absorbs runner drift, so the pin would trade a real maintenance cost (deprecated images, missed updates) for a hypothetical risk.
- paths: filter so it only runs when packages/logic-grid/** or its workflow file changes. Two consequences documented inline in the file:
  1. Don't make Regression / regression a required status check while the paths: filter is in place: GitHub skips the workflow on filtered PRs, and a skipped required check counts as missing rather than passed, which blocks the merge.
  2. The split removed the old needs: check short-circuit; regression now runs concurrently with the matrix check. Conscious tradeoff at this PR volume.
- Difficulty-tier source-of-truth refactor: replaced the asymmetric EASY_TYPES / MEDIUM_TYPES / HARD_ONLY_TYPES exports with typesAtTier(tier) and typesUpToTier(tier) helpers derived from a single TYPE_TIER Record. Adding a new ConstraintType is a TypeScript error in the Record until its tier is decided. Both helpers return memoized ReadonlySet<ConstraintType> (compile-time immutability + zero allocation per call). 100% coverage maintained.
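
As an illustration of the TYPE_TIER approach, here is a minimal sketch under stated assumptions. The four hard-tier names are the ones listed in this PR; the easy and medium names (`easy_a`, `easy_b`, `medium_a`) are placeholders invented for the sketch, and `collect`, `AT_TIER` and `UP_TO_TIER` are hypothetical internals, not necessarily how difficulty.ts is written.

```typescript
// Hard-tier names come from the PR text. The easy/medium names are
// placeholders for this sketch, not the package's real constraint types.
type ConstraintType =
  | "easy_a"
  | "easy_b"
  | "medium_a"
  | "between"
  | "not_between"
  | "not_next_to"
  | "exact_distance";
type ConstraintTier = "easy" | "medium" | "hard";

// Single source of truth: leaving out a ConstraintType key here is a
// compile error, so a new type cannot be added without choosing its tier.
const TYPE_TIER: Record<ConstraintType, ConstraintTier> = {
  easy_a: "easy",
  easy_b: "easy",
  medium_a: "medium",
  between: "hard",
  not_between: "hard",
  not_next_to: "hard",
  exact_distance: "hard",
};

const TIERS: readonly ConstraintTier[] = ["easy", "medium", "hard"];

function collect(keep: (tier: ConstraintTier) => boolean): ReadonlySet<ConstraintType> {
  const out = new Set<ConstraintType>();
  for (const type of Object.keys(TYPE_TIER) as ConstraintType[]) {
    if (keep(TYPE_TIER[type])) out.add(type);
  }
  return out;
}

// Built once at module load, so each helper call is a property lookup.
const AT_TIER: Record<ConstraintTier, ReadonlySet<ConstraintType>> = {
  easy: collect((t) => t === "easy"),
  medium: collect((t) => t === "medium"),
  hard: collect((t) => t === "hard"),
};
const UP_TO_TIER: Record<ConstraintTier, ReadonlySet<ConstraintType>> = {
  easy: collect((t) => TIERS.indexOf(t) <= 0),
  medium: collect((t) => TIERS.indexOf(t) <= 1),
  hard: collect(() => true),
};

/** Types that belong to exactly this tier. */
function typesAtTier(tier: ConstraintTier): ReadonlySet<ConstraintType> {
  return AT_TIER[tier];
}

/** Types allowed at this tier: this tier plus every tier below it. */
function typesUpToTier(tier: ConstraintTier): ReadonlySet<ConstraintType> {
  return UP_TO_TIER[tier];
}
```

Returning `ReadonlySet` means callers cannot call `add` or `delete` on the shared instance without a type error, which is what makes handing out the same precomputed set safe.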

Adds packages/logic-grid/bench/regression.test.ts with strict assertions, plus a separate `regression` job in ci.yml that runs after `check` so flaky perf doesn't block normal PR signals.

Three layers of checks:

1. Difficulty contract: directly tied to EASY_TYPES / MEDIUM_TYPES from src/difficulty.ts (+ deduce-based expert promotion). Catches the generator silently leaking a higher-tier constraint into a lower tier (or vice versa). 50 puzzles per difficulty.
2. Constraint diversity: across 50 hard puzzles, each of the four hard-only types (between, not_between, not_next_to, exact_distance) must appear at least once; catches silent type-dropouts. Plus a no-single-type > 80% sanity check at medium/hard/expert (easy is excluded: its 2-type set means dominance is structural).
3. Perf budgets: median of N runs per grid size, calibrated 50-100× above current real values to absorb GitHub-runner variance while still flagging an order-of-magnitude regression.

The CI job pins to Node 24 only (single perf shape) and `needs: check` so we don't burn bench time when basic tests are red.

cloudflare-workers-and-pages · 2026-04-29T09:01:52Z

Deploying with Cloudflare Workers

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | logic-grid | `be755ec` | Apr 29 2026, 02:07 PM |

Six review fixes:

- Doc-comment said "Two layers" but enumerated three. Now says "Three".
- HARD_ONLY_TYPES no longer hardcoded in the test. Lifted to difficulty.ts alongside ALL_CONSTRAINT_TYPES, derived from it. ALL_CONSTRAINT_TYPES uses a Record<ConstraintType, true> so a new ConstraintType variant is a TypeScript error rather than silent stale config.
- Surface seed in assertion messages so a 1-of-50 contract failure tells you which puzzle to reproduce.
- Drop the redundant explicit warm-up generate() call. Earlier describe blocks already do hundreds of generations before perf budgets run; added one line of comment in case the order changes later.
- Comment the upper-middle-vs-strict-median quirk so it doesn't read as a bug. Note that runs is the first lever to pull if it gets noisy.
- Document on ci.yml why `needs: check` waits for the FULL Node matrix (intentional: all supported Nodes green before bench), so an apparent regression-job hang is debuggable.

ALL_CONSTRAINT_TYPES was an exported intermediate that nothing outside this file used (the Record was the actual exhaustiveness check; the array form was just scaffolding). Inline the Record into the HARD_ONLY_TYPES expression with `satisfies` to keep the compile-time check without leaking a half-public symbol.

Five fixes:

1. Single source of truth for difficulty tiers. difficulty.ts now uses a TYPE_TIER Record<ConstraintType, "easy" | "medium" | "hard"> as the only place where the tier of each constraint type is decided. EASY_TYPES, MEDIUM_TYPES, and HARD_ONLY_TYPES are derived from it. Adding a new ConstraintType is a TS error in the Record until its tier is decided (no more "did I remember to update MEDIUM_TYPES too?").
2. Tighten the diversity test name. "every documented hard-only constraint type is reachable from hard generation" → "every hard-only type appears across 50 seeds at 4×4 hard", with a comment pointing the next debugger at the two real failure modes (dropped type vs distribution shift).
3. Local JIT warm-up. perf-budgets describe now has its own beforeAll warm-up call instead of relying on earlier describes. Re-ordering or splitting the file no longer risks a cold-start 3×3 flake.
4. Easy/medium/hard tests now also assert NOT expert. A regression that silently promotes everything to expert (e.g. a deduce change) is now caught. Extracted isExpertSolution() helper to keep the assertions tight.
5. Split regression into its own workflow with paths gating. ci.yml now contains only check + build (runs on every PR). regression.yml fires only when packages/logic-grid/** or its workflow file change. Docs-only PRs no longer pay the bench cost or risk perf-budget flakes.
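
For item 4, a helper along these lines would match the expert rule stated in the PR body (a contradiction step, or `complete === false`). This is a guess at the shape: `DeductionStep`, `DeductionResult`, the `technique` field and the `"contradiction"` label are assumptions made for the sketch; only `complete` and the helper name appear in this PR.

```typescript
// Assumed shape of a deduction result. Only `complete` is named in the PR;
// `steps` and `technique` are hypothetical field names.
interface DeductionStep {
  technique: string;
}
interface DeductionResult {
  steps: DeductionStep[];
  complete: boolean;
}

/** True when solving needed a contradiction step or deduction did not finish. */
function isExpertSolution(result: DeductionResult): boolean {
  return (
    !result.complete ||
    result.steps.some((step) => step.technique === "contradiction")
  );
}
```

The easy, medium and hard contract tests can then assert `isExpertSolution(...) === false`, so a change that silently promotes every puzzle to expert fails those tiers.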

…ession.yml

Two intentional consequences of splitting regression into its own workflow:

1. paths: filter + required-check is a footgun. GitHub skips the workflow on filtered PRs, and a skipped required check counts as missing, not passed, which blocks merge. Don't make this required without adding a signal-success-on-skip companion job.
2. Lost the old needs: check gate. Regression now runs concurrently with the matrix check; basic-test failures don't short-circuit bench cost. Trade-off: parallelism + independent perf signal vs occasional wasted runs. Worth it at this PR volume.

No behavior change, comments only.

… helpers

EASY_TYPES (tier-only) and MEDIUM_TYPES (cumulative) had asymmetric semantics that the names hid; HARD_ONLY_TYPES added a third shape. Replace all three with two self-documenting helpers derived from a single TYPE_TIER record:

- typesAtTier("medium") → just medium-tier types (no easy)
- typesUpToTier("medium") → easy + medium tiers (cumulative)

Both shapes are load-bearing: typesUpToTier in classifyByTypes and filterByDifficulty (asking "is this type allowed at difficulty X"), typesAtTier in the regression bench (asking "did each tier-X type appear"). Now the call site picks the shape it actually wants.

Also factored a TIER_RANK array so classifyByTypes is a single max-rank loop instead of two flag booleans, and exported a ConstraintTier type to surface the easy/medium/hard string union for the helpers.

Added unit tests covering both helpers; coverage stays at 100%. Internal refactor: not in index.ts surface, no breaking change for external consumers (the constants weren't re-exported there either).
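
The max-rank loop might look roughly like the sketch below. The constraint-type names other than `between` are placeholders, the reduced `TYPE_TIER` exists only to keep the example self-contained, and returning "easy" for an empty list is an assumption rather than confirmed behavior of classifyByTypes.

```typescript
type ConstraintTier = "easy" | "medium" | "hard";
// Placeholder type names for the sketch, apart from "between".
type ConstraintType = "easy_a" | "medium_a" | "between";

const TYPE_TIER: Record<ConstraintType, ConstraintTier> = {
  easy_a: "easy",
  medium_a: "medium",
  between: "hard",
};

// A tier's position in this array is its rank.
const TIER_RANK: readonly ConstraintTier[] = ["easy", "medium", "hard"];

/** Classifies a set of constraint types by the highest tier any of them needs. */
function classifyByTypes(types: readonly ConstraintType[]): ConstraintTier {
  let maxRank = 0; // assumption: an empty list classifies as "easy"
  for (const type of types) {
    const rank = TIER_RANK.indexOf(TYPE_TIER[type]);
    if (rank > maxRank) maxRank = rank;
  }
  return TIER_RANK[maxRank]!;
}
```

Compared with one boolean flag per tier, the loop needs no change when a tier is added: extending `TIER_RANK` and `TYPE_TIER` is enough.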

generate() calls filterByDifficulty(), which calls typesUpToTier() — so the previous implementation allocated a fresh Set on every generation. Hot path. Move both lookups behind module-level Records so each call is a single property access. Function shape unchanged for callers.

Five fixes; ubuntu-latest stays (50-100× headroom absorbs runner drift; pinning trades a real maintenance burden for hypothetical risk).

- typesAtTier / typesUpToTier now return ReadonlySet, so TS prevents callers from mutating the shared module-state instance.
- Both helpers normalized to ReadonlySet (was: array vs Set asymmetry). Test asserts use [...set] so the shape change is invisible to readers.
- Bench-test locals renamed (allowedAtEasy / allowedAtMedium / hardOnly) so a future grep for legacy EASY_TYPES / MEDIUM_TYPES / HARD_ONLY_TYPES doesn't land on what looks like a live reference.
- Hot-path comment in difficulty.ts reworded: generate() calls filterByDifficulty once per puzzle, not once per constraint.
- PR body updated separately via gh pr edit to match the actual workflow split (regression.yml is its own file, no needs: check anymore).

Two clear fixes plus a pushback on the third.

1. filterByDifficulty's null-branch comment now mentions expert. The branch fires for both "hard" and "expert" (anything not easy/medium); the old comment only said "hard allows all types".
2. Reword the expert contract test name. "fail to fully deduce" reads ambiguously: it could be "test failed" rather than "puzzle requires backtracking". New name is unambiguous about the subject (the puzzle).
3. Skip pinning the perf-budget test to a specific difficulty. The existing call `generate({ size, categories, seed })` measures real generate() perf with default options, which is what users hit. Pinning to "hard" trades that signal for catching only hard-specific regressions AND invalidates the existing calibration. Added one line of comment explaining the deliberate no-difficulty choice instead.

antonstefer added 8 commits April 29, 2026 11:09

antonstefer merged commit 529c7f5 into main Apr 29, 2026
5 checks passed

antonstefer deleted the ci/regression-suite branch April 29, 2026 14:10

antonstefer merged 9 commits into main from ci/regression-suite
