ci: regression suite — difficulty contract + perf budgets#27

Merged
antonstefer merged 9 commits into main from ci/regression-suite
Apr 29, 2026

@antonstefer antonstefer commented Apr 29, 2026

Summary

Adds a regression suite that enforces the difficulty contract (typesUpToTier / typesAtTier from src/difficulty.ts), checks constraint-type diversity, and pins a generous perf budget per grid size. Lives in its own workflow file so docs-only PRs don't pay the bench cost or risk perf-budget flakes.

Three layers of checks (in bench/regression.test.ts)

1. Difficulty contract (50 puzzles per difficulty)

  • easy puzzles: every type is allowed at the easy tier; deduction doesn't quietly require contradiction
  • medium puzzles: every type is allowed at medium with at least one beyond easy; deduction stays within medium
  • hard puzzles: at least one type beyond medium; deduction stays within hard
  • expert puzzles: deduction either has a contradiction step or complete === false

Every assertion includes the failing seed in its message so a 1-of-50 failure is reproducible without binary search.
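
A minimal sketch of how a per-seed contract check can carry the seed in its failure message. The `Puzzle` shape, the `contractFailures` helper, and the injected `generate` callback are hypothetical stand-ins for illustration, not the code in bench/regression.test.ts.

```typescript
// Hypothetical shapes; the real generate() and puzzle types live in packages/logic-grid.
type Tier = "easy" | "medium" | "hard";
interface Puzzle {
  constraints: { type: string }[];
}

// Checks one tier across seeds 0..count-1 and returns one message per failing seed.
function contractFailures(
  generate: (seed: number) => Puzzle,
  allowed: ReadonlySet<string>,
  tier: Tier,
  count = 50,
): string[] {
  const failures: string[] = [];
  for (let seed = 0; seed < count; seed++) {
    const leaked = generate(seed)
      .constraints.map((c) => c.type)
      .filter((t) => !allowed.has(t));
    if (leaked.length > 0) {
      // The seed goes in the message so the failing puzzle can be regenerated directly.
      failures.push(`seed=${seed} (${tier}): disallowed types ${leaked.join(", ")}`);
    }
  }
  return failures;
}
```

In a test runner the returned array would typically be asserted empty, so the message names the exact seed to replay.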

2. Constraint diversity

  • Across 50 hard puzzles, each hard-tier type must appear at least once. The test name is explicit that this is a "seeds 0..49 at 4×4" claim, not a reachability proof: a dropped type and a distribution shift produce identical failures.
  • No single constraint type may exceed 80% of clues at medium/hard/expert. Easy is excluded (a 2-type tier means dominance is structural).
  • 4×4 medium puzzles have 3-15 clues (sanity bound).
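
The dominance check above could be sketched as follows. `maxTypeShare` is a hypothetical helper name, and the real test may aggregate clues differently (per puzzle or across all 50).

```typescript
// Returns the largest share (0..1) that any single constraint type holds among the clues.
function maxTypeShare(types: readonly string[]): number {
  if (types.length === 0) return 0;
  const counts: Record<string, number> = {};
  let max = 0;
  for (const t of types) {
    const n = (counts[t] ?? 0) + 1;
    counts[t] = n;
    if (n > max) max = n;
  }
  return max / types.length;
}

// The rule fails only when one type strictly exceeds 80% of the clues.
function isDominated(types: readonly string[]): boolean {
  return maxTypeShare(types) > 0.8;
}
```

With this reading, a puzzle where one type holds exactly 80% of the clues still passes.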

3. Perf budgets (median of N runs per size)

Size   Budget   Runs
3×3    30ms     10
4×4    50ms     10
5×5    100ms    10
6×6    250ms    5
7×7    600ms    5
8×8    1200ms   5

Real measured values are 1-18ms; the budgets sit roughly 50-100× above them to absorb shared-runner variance while still catching an order-of-magnitude regression. The JIT is warmed inside the perf-budgets describe block via beforeAll, so reordering describe blocks can't flake the 3×3 budget.
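
A rough sketch of the median-of-N measurement, assuming the upper-middle median that the review commits mention. `upperMedian`, `medianMs`, and the injectable `now` clock are hypothetical names; `Date.now` is only a coarse default, and a real bench would likely use a higher-resolution timer.

```typescript
// Budgets and run counts copied from the table above.
const BUDGETS = [
  { size: 3, budgetMs: 30, runs: 10 },
  { size: 4, budgetMs: 50, runs: 10 },
  { size: 5, budgetMs: 100, runs: 10 },
  { size: 6, budgetMs: 250, runs: 5 },
  { size: 7, budgetMs: 600, runs: 5 },
  { size: 8, budgetMs: 1200, runs: 5 },
];

// Upper-middle median: for an even count this picks the higher of the two
// middle samples instead of averaging them.
function upperMedian(samples: readonly number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)]!;
}

// Times fn `runs` times with the given clock and returns the median duration in ms.
function medianMs(fn: () => void, runs: number, now: () => number = Date.now): number {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    samples.push(now() - start);
  }
  return upperMedian(samples);
}
```

Taking the median rather than the mean keeps a single slow run on a shared runner from failing the budget.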

Wiring

  • New regression.yml workflow on ubuntu-latest with Node 24 (single shape on purpose). Image pinning was considered and rejected: the 50-100× headroom in the perf budgets absorbs runner drift, so the pin would trade a real maintenance cost (deprecated images, missed updates) for a hypothetical risk.
  • paths: filter so it only runs when packages/logic-grid/** or its workflow file change. Two consequences documented inline in the file:
    1. Don't make Regression / regression a required status check while the paths: filter is in place. GitHub skips the workflow on filtered PRs, and a skipped required check counts as missing rather than passed, which blocks the merge.
    2. The split removed the old needs: check short-circuit; regression now runs concurrently with the matrix check. Conscious tradeoff at this PR volume.
  • Difficulty-tier source-of-truth refactor: replaced the asymmetric EASY_TYPES / MEDIUM_TYPES / HARD_ONLY_TYPES exports with typesAtTier(tier) and typesUpToTier(tier) helpers derived from a single TYPE_TIER Record. Adding a new ConstraintType is a TypeScript error in the Record until its tier is decided. Both helpers return memoized ReadonlySet<ConstraintType> (compile-time immutability + zero allocation per call). 100% coverage maintained.
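
A minimal sketch of the tier refactor described in the last bullet. Only the four hard type names come from this PR; `easy_a`, `easy_b`, and `medium_a` are placeholders, and `buildSets`, `AT_TIER`, and `UP_TO_TIER` are hypothetical names rather than the contents of src/difficulty.ts.

```typescript
type ConstraintTier = "easy" | "medium" | "hard";
// Placeholder names for easy/medium; the hard names are the ones listed in the commits.
type ConstraintType =
  | "easy_a" | "easy_b" | "medium_a"
  | "between" | "not_between" | "not_next_to" | "exact_distance";

// Single source of truth: leaving a ConstraintType out of this Record is a compile error.
const TYPE_TIER: Record<ConstraintType, ConstraintTier> = {
  easy_a: "easy", easy_b: "easy", medium_a: "medium",
  between: "hard", not_between: "hard", not_next_to: "hard", exact_distance: "hard",
};

const TIERS: readonly ConstraintTier[] = ["easy", "medium", "hard"];
const ALL_TYPES = Object.keys(TYPE_TIER) as ConstraintType[];

// Builds one set per tier; `include` decides membership from the two tier ranks.
function buildSets(include: (typeRank: number, tierRank: number) => boolean) {
  const out = {} as Record<ConstraintTier, ReadonlySet<ConstraintType>>;
  TIERS.forEach((tier, tierRank) => {
    out[tier] = new Set(ALL_TYPES.filter((t) => include(TIERS.indexOf(TYPE_TIER[t]), tierRank)));
  });
  return out;
}

// Built once at module load, so each helper call is a property lookup with no allocation.
const AT_TIER = buildSets((typeRank, tierRank) => typeRank === tierRank);
const UP_TO_TIER = buildSets((typeRank, tierRank) => typeRank <= tierRank);

function typesAtTier(tier: ConstraintTier): ReadonlySet<ConstraintType> {
  return AT_TIER[tier];
}
function typesUpToTier(tier: ConstraintTier): ReadonlySet<ConstraintType> {
  return UP_TO_TIER[tier];
}
```

Returning `ReadonlySet` means a caller that tries `typesUpToTier("easy").add(...)` fails to compile, which protects the shared instances.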

Adds packages/logic-grid/bench/regression.test.ts with strict assertions,
plus a separate `regression` job in ci.yml that runs after `check` so flaky
perf doesn't block normal PR signals.

Three layers of checks:

1. Difficulty contract — directly tied to EASY_TYPES / MEDIUM_TYPES from
   src/difficulty.ts (+ deduce-based expert promotion). Catches the
   generator silently leaking a higher-tier constraint into a lower tier
   (or vice versa). 50 puzzles per difficulty.

2. Constraint diversity — across 50 hard puzzles, each of the four
   hard-only types (between, not_between, not_next_to, exact_distance)
   must appear at least once; catches silent type-dropouts. Plus a
   no-single-type > 80% sanity check at medium/hard/expert (easy is
   excluded — its 2-type set means dominance is structural).

3. Perf budgets — median of N runs per grid size, calibrated 50-100×
   above current real values to absorb GitHub-runner variance while
   still flagging an order-of-magnitude regression.

The CI job pins to Node 24 only (single perf shape) and `needs: check`
so we don't burn bench time when basic tests are red.

cloudflare-workers-and-pages Bot commented Apr 29, 2026

Deploying with Cloudflare Workers

Status: ✅ Deployment successful!
Name: logic-grid
Latest commit: be755ec
Updated (UTC): Apr 29 2026, 02:07 PM

Six review fixes:

- Doc-comment said "Two layers" but enumerated three. Now says "Three".
- HARD_ONLY_TYPES no longer hardcoded in the test. Lifted to difficulty.ts
  alongside ALL_CONSTRAINT_TYPES, derived from it. ALL_CONSTRAINT_TYPES
  uses a Record<ConstraintType, true> so a new ConstraintType variant is
  a TypeScript error rather than silent stale config.
- Surface seed in assertion messages so a 1-of-50 contract failure tells
  you which puzzle to reproduce.
- Drop the redundant explicit warm-up generate() call. Earlier describe
  blocks already do hundreds of generations before perf budgets run;
  added one line of comment in case the order changes later.
- Comment the upper-middle-vs-strict-median quirk so it doesn't read as
  a bug. Note that runs is the first lever to pull if it gets noisy.
- Document on ci.yml why `needs: check` waits for the FULL Node matrix
  (intentional: all supported Nodes green before bench), so an
  apparent regression-job hang is debuggable.
ALL_CONSTRAINT_TYPES was an exported intermediate that nothing outside this
file used (the Record was the actual exhaustiveness check; the array form
was just scaffolding). Inline the Record into the HARD_ONLY_TYPES expression
with `satisfies` to keep the compile-time check without leaking a
half-public symbol.
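
One way the inlined Record with `satisfies` could look at this point in the history (requires TypeScript 4.9 or later). The type names other than the four hard ones are placeholders, and the `MEDIUM_TYPES` contents are assumed for illustration.

```typescript
// Abbreviated, partly hypothetical union; only the hard names come from the PR text.
type ConstraintType =
  | "easy_a" | "medium_a"
  | "between" | "not_between" | "not_next_to" | "exact_distance";

// Assumed cumulative set (easy + medium), as the later commits describe MEDIUM_TYPES.
const MEDIUM_TYPES: ReadonlySet<ConstraintType> = new Set<ConstraintType>(["easy_a", "medium_a"]);

// `satisfies` keeps the exhaustiveness check (a missing or misspelled key is a
// compile error) without exporting the Record as its own symbol.
const HARD_ONLY_TYPES = (
  Object.keys({
    easy_a: true,
    medium_a: true,
    between: true,
    not_between: true,
    not_next_to: true,
    exact_distance: true,
  } satisfies Record<ConstraintType, true>) as ConstraintType[]
).filter((t) => !MEDIUM_TYPES.has(t));
```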
Five fixes:

1. Single source of truth for difficulty tiers. difficulty.ts now uses
   a TYPE_TIER Record<ConstraintType, "easy" | "medium" | "hard"> as the
   only place where the tier of each constraint type is decided. EASY_TYPES,
   MEDIUM_TYPES, and HARD_ONLY_TYPES are derived from it. Adding a new
   ConstraintType is a TS error in the Record until its tier is decided
   (no more "did I remember to update MEDIUM_TYPES too?").

2. Tighten the diversity test name. "every documented hard-only constraint
   type is reachable from hard generation" → "every hard-only type appears
   across 50 seeds at 4×4 hard", with a comment pointing the next debugger
   at the two real failure modes (dropped type vs distribution shift).

3. Local JIT warm-up. perf-budgets describe now has its own beforeAll
   warm-up call instead of relying on earlier describes. Re-ordering or
   splitting the file no longer risks a cold-start 3×3 flake.

4. Easy/medium/hard tests now also assert NOT expert. A regression that
   silently promotes everything to expert (e.g. a deduce change) is now
   caught. Extracted isExpertSolution() helper to keep the assertions tight.

5. Split regression into its own workflow with paths gating. ci.yml now
   contains only check + build (runs on every PR). regression.yml fires
   only when packages/logic-grid/** or its workflow file change. Docs-only
   PRs no longer pay the bench cost or risk perf-budget flakes.
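
For fix 4, the isExpertSolution() helper might look roughly like this. The `DeductionResult` shape and the `"contradiction"` technique label are assumptions; the rule itself follows the expert contract in the PR summary (a contradiction step, or complete === false).

```typescript
// Hypothetical result shape; the real deduce() output may differ.
interface DeductionResult {
  complete: boolean;
  steps: { technique: string }[];
}

// Expert means deduction either needed a contradiction step or did not finish.
function isExpertSolution(result: DeductionResult): boolean {
  return !result.complete || result.steps.some((s) => s.technique === "contradiction");
}
```

The easy/medium/hard tests can then assert `isExpertSolution(result) === false`, which catches a change that silently promotes every puzzle to expert.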
…ession.yml

Two intentional consequences of splitting regression into its own workflow:

1. paths: filter + required-check is a footgun. GitHub skips the workflow
   on filtered PRs, and a skipped required check counts as missing, not
   passed — blocks merge. Don't make this required without adding a
   signal-success-on-skip companion job.

2. Lost the old needs: check gate. Regression now runs concurrently with
   the matrix check; basic-test failures don't short-circuit bench cost.
   Trade-off: parallelism + independent perf signal vs occasional wasted
   runs. Worth it at this PR volume.

No behavior change — comments only.
… helpers

EASY_TYPES (tier-only) and MEDIUM_TYPES (cumulative) had asymmetric
semantics that the names hid; HARD_ONLY_TYPES added a third shape.
Replace all three with two self-documenting helpers derived from a
single TYPE_TIER record:

  typesAtTier("medium")  → just medium-tier types (no easy)
  typesUpToTier("medium") → easy + medium tiers (cumulative)

Both shapes are load-bearing — typesUpToTier in classifyByTypes and
filterByDifficulty (asking "is this type allowed at difficulty X"),
typesAtTier in the regression bench (asking "did each tier-X type
appear"). Now the call site picks the shape it actually wants.

Also factored a TIER_RANK array so classifyByTypes is a single max-rank
loop instead of two flag booleans, and exported a ConstraintTier type
to surface the easy/medium/hard string union for the helpers.
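
A sketch of the single max-rank loop, under the assumption that TIER_RANK is an ordered array of tiers. The `ConstraintType` union here is abbreviated with placeholder easy/medium names, and the empty-input result is a guess at the real behavior.

```typescript
type ConstraintTier = "easy" | "medium" | "hard";
// Abbreviated, partly hypothetical union for illustration.
type ConstraintType = "easy_a" | "medium_a" | "between" | "exact_distance";

const TYPE_TIER: Record<ConstraintType, ConstraintTier> = {
  easy_a: "easy",
  medium_a: "medium",
  between: "hard",
  exact_distance: "hard",
};

// Position in this array is the tier's rank.
const TIER_RANK: readonly ConstraintTier[] = ["easy", "medium", "hard"];

// A puzzle's tier by types is the highest tier among the types it uses.
// An empty list classifies as "easy" in this sketch.
function classifyByTypes(types: readonly ConstraintType[]): ConstraintTier {
  let maxRank = 0;
  for (const t of types) {
    const rank = TIER_RANK.indexOf(TYPE_TIER[t]);
    if (rank > maxRank) maxRank = rank;
  }
  return TIER_RANK[maxRank]!;
}
```

Compared with separate flag booleans, adding a fourth tier here would only mean extending the array and the Record.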

Added unit tests covering both helpers — coverage stays at 100%.

Internal refactor: not in index.ts surface, no breaking change for
external consumers (the constants weren't re-exported there either).
generate() calls filterByDifficulty(), which calls typesUpToTier() — so
the previous implementation allocated a fresh Set on every generation.
Hot path. Move both lookups behind module-level Records so each call is
a single property access. Function shape unchanged for callers.
Five fixes — ubuntu-latest stays (50-100× headroom absorbs runner drift;
pinning trades a real maintenance burden for hypothetical risk).

- typesAtTier / typesUpToTier now return ReadonlySet — TS prevents
  callers from mutating the shared module-state instance.
- Both helpers normalized to ReadonlySet (was: array vs Set asymmetry).
  Test asserts use [...set] so the shape change is invisible to readers.
- Bench-test locals renamed (allowedAtEasy / allowedAtMedium / hardOnly)
  so a future grep for legacy EASY_TYPES / MEDIUM_TYPES / HARD_ONLY_TYPES
  doesn't land on what looks like a live reference.
- Hot-path comment in difficulty.ts reworded — generate() calls
  filterByDifficulty once per puzzle, not once per constraint.
- PR body updated separately via gh pr edit to match the actual workflow
  split (regression.yml is its own file, no needs: check anymore).
Two clear fixes plus a pushback on the third.

1. filterByDifficulty's null-branch comment now mentions expert. The
   branch fires for both "hard" and "expert" (anything not easy/medium);
   the old comment only said "hard allows all types".

2. Reword the expert contract test name. "fail to fully deduce" reads
   ambiguously: it could mean "test failed" rather than "puzzle requires
   backtracking". New name is unambiguous about the subject (the puzzle).

3. Skip pinning the perf-budget test to a specific difficulty. The
   existing call `generate({ size, categories, seed })` measures real
   generate() perf with default options, which is what users hit. Pinning
   to "hard" trades that signal for catching only hard-specific regressions
   AND invalidates the existing calibration. Added one line of comment
   explaining the deliberate no-difficulty choice instead.
@antonstefer antonstefer merged commit 529c7f5 into main Apr 29, 2026
5 checks passed
@antonstefer antonstefer deleted the ci/regression-suite branch April 29, 2026 14:10