perf(ci): shard backend tests + fix per-test full-population reruns (44min → 11.5min)#57

Merged: Taleef7 merged 5 commits into main from perf/ci-backend-test-sharding on Jun 3, 2026

Taleef7 commented Jun 3, 2026

Problem

CI took ~44 min on every push/PR. The entire time was the backend ./gradlew test step — frontend is ~50s and E2E is manual. Per-class timing showed the cost was concentrated in a few integration tests that re-ran a full-population CQL evaluation (~70s) in @BeforeEach, once per test method.

Changes

1. Fix the per-test waste (the real bug)

  • EvidenceAccessIntegrationTest: ran a full population 14× (1022s) for tests that only need a case to exist and filter audit by their own upload id. Now one shared run via @BeforeAll + @TestInstance(PER_CLASS) → 71s.
  • CaseFlowRerunIntegrationTest: ran a full population 5× (422s); each test targets a distinct outcome-type case with non-overlapping mutations → one shared run → 146s.
  • ScopedRun / CaseUpsert / Major1 intentionally left unchanged — their reruns are the behavior under test (idempotency, scoped-run parity, audit invariants) and need per-test isolation.
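
The effect of hoisting the setup can be modelled without any test framework. The sketch below is illustrative only: `runFullPopulation()` is a hypothetical stand-in for the ~70s evaluation, and the two loops merely imitate what JUnit does for @BeforeEach versus @BeforeAll + @TestInstance(PER_CLASS). None of these names come from the codebase.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

final class SharedSetupSketch {
    // Counts invocations of the expensive step (hypothetical stand-in for the
    // full-population CQL evaluation; not the project's real API).
    static int populationRuns = 0;

    static String runFullPopulation() {
        populationRuns++;
        return "population-run-" + populationRuns;
    }

    // Models @BeforeEach: every test method pays for its own population run.
    static int runsWithPerMethodSetup(List<Consumer<String>> tests) {
        populationRuns = 0;
        for (Consumer<String> test : tests) {
            String run = runFullPopulation();
            test.accept(run);
        }
        return populationRuns;
    }

    // Models @BeforeAll + @TestInstance(PER_CLASS): one run shared by all tests.
    // Only safe when no test mutates state that another test reads.
    static int runsWithSharedSetup(List<Consumer<String>> tests) {
        populationRuns = 0;
        String sharedRun = runFullPopulation();
        for (Consumer<String> test : tests) {
            test.accept(sharedRun);
        }
        return populationRuns;
    }

    public static void main(String[] args) {
        Consumer<String> noop = run -> { };
        List<Consumer<String>> tests = Collections.nCopies(14, noop);
        System.out.println("per-method runs: " + runsWithPerMethodSetup(tests));
        System.out.println("shared runs:     " + runsWithSharedSetup(tests));
    }
}
```

With 14 test methods the per-method variant pays for the expensive step 14 times and the shared variant once, which is roughly in line with the reported drop from 1022s to 71s. The precondition in the comment is why the classes whose reruns are the behavior under test were left alone.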

2. Shard across parallel runners

  • Backend job is now an 8-way matrix; build.gradle.kts assigns each test class to a shard by a stable hash (Test.include(Spec)), forks 4-wide within a shard (1.5g heap cap), and only shard 0 writes the Gradle cache.
  • Added a per-class timing diagnostic step for future balancing.
  • Local runs (no shard env) are unchanged.
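
A minimal sketch of stable hash-based shard assignment, written in plain Java rather than the Gradle Kotlin DSL. The PR does not show which hash build.gradle.kts uses, so `String.hashCode()` over the relative path is an assumption here, and the class name, method names, and example paths are all hypothetical. In the real build the shard index and total come from TEST_SHARD_INDEX and TEST_SHARD_TOTAL.

```java
import java.util.Arrays;
import java.util.List;

final class ShardSketch {
    // Maps a '/'-separated relative class path to a shard in [0, shardTotal).
    // String.hashCode() is specified by the Java API, so the result is the
    // same on every runner; floorMod keeps negative hashes in range.
    static int shardOf(String relativePath, int shardTotal) {
        return Math.floorMod(relativePath.hashCode(), shardTotal);
    }

    // The predicate a given shard would apply to each candidate test class.
    static boolean includedInShard(String relativePath, int shardIndex, int shardTotal) {
        return shardOf(relativePath, shardTotal) == shardIndex;
    }

    // Per-shard class counts; their sum must equal the number of classes,
    // which is the "no tests dropped" check.
    static int[] shardCounts(List<String> relativePaths, int shardTotal) {
        int[] counts = new int[shardTotal];
        for (String path : relativePaths) {
            counts[shardOf(path, shardTotal)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        List<String> classes = Arrays.asList(
                "com/example/EvidenceAccessIntegrationTest.class",
                "com/example/CaseFlowRerunIntegrationTest.class",
                "com/example/ScopedRunIntegrationTest.class");
        System.out.println(Arrays.toString(shardCounts(classes, 8)));
    }
}
```

Because every path maps to exactly one shard, the union of all shards is the full suite. The assignment ignores test duration, so heavy classes can still cluster on one shard, which is what the per-class timing step is meant to expose.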

Result

  • 44 min → 11m30s (≈3.8× faster), all 239 tests pass (verified the shard counts sum to 239 — no tests dropped).
  • Remaining ceiling is ScopedRunIntegrationTest (~635s); a single class runs in one fork, so going under ~10 min would require splitting that class — deferred intentionally.

Note on merging

Merging to main will (a) run the new sharded CI and (b) trigger the standard MIE deploy (now working). The deploy is idempotent, so the redeploy is harmless.

🤖 Generated with Claude Code

Taleef and others added 5 commits June 3, 2026 19:44
The backend `./gradlew test` step is the entire CI bottleneck (~44 min),
dominated by CQL-heavy integration tests (cqf-fhir-cr evaluations across the
synthetic population plus historical-run seeding), previously run ~2-way
parallel on a single runner. Frontend is ~50s and E2E is manual.

Split the suite across a 6-way matrix, each runner executing a deterministic
hash-based subset of test classes (union of all shards = full suite). Within a
shard tests still fork 2-way. Only shard 0 writes the shared Gradle cache to
avoid concurrent-write contention. build.gradle.kts gains overridable
GRADLE_TEST_FORKS and the TEST_SHARD_TOTAL/TEST_SHARD_INDEX selection; with no
shard env (local runs) the full suite runs unchanged.

Also adds perf/** to CI push triggers so this branch self-verifies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ble)

setCandidateClassFiles is not a valid setter in Gradle 9.4.1. Use the Test
task's PatternFilterable include(Spec<FileTreeElement>) predicate — the
documented mechanism for filtering candidate test classes — to assign each
class to a shard by its '/'-separated relative path hash. Directories pass so
the tree is traversed; the classpath is unaffected so @Nested discovery and
class loading still work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 6-shard run cut wall-clock 44->25 min but stayed bottlenecked by one
lopsided shard (hash distribution clustered heavy CQL integration classes).
Increase to 8 shards and fork 4-wide within each (ubuntu-latest = 4 vCPU) so
clustered heavy classes overlap; cap per-fork heap at 1.5g so 4 JVMs + their
Postgres containers fit the runner. Add an always-on step that prints per-class
suite durations, so if balance is still uneven we can move to time-weighted
bin-packing with real data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion tests

The slowest classes weren't a sharding problem — they re-ran a full-population
CQL evaluation (~70s) in @BeforeEach, once per test method:

- EvidenceAccessIntegrationTest: 14 tests x full run = 1022s. The evidence
  access/role tests only need a case to exist and filter audit by their own
  upload id, so they share one population run via @BeforeAll +
  @TestInstance(PER_CLASS). ~1022s -> ~90s.
- CaseFlowRerunIntegrationTest: 5 tests x full run = 422s. Each test targets a
  distinct outcome-type case with non-overlapping mutations, so one shared run
  is sufficient. ~422s -> ~140s.

ScopedRun/CaseUpsert/Major1 are intentionally left as-is: their reruns are the
behavior under test (idempotency, rerun-to-verify, empty-table historical seed)
and need per-test isolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

vercel Bot commented Jun 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
workwell-measure-studio Ready Ready Preview, Comment Jun 3, 2026 4:54pm

Taleef7 self-assigned this Jun 3, 2026
Taleef7 merged commit e64d18b into main Jun 3, 2026
30 checks passed
Taleef7 deleted the perf/ci-backend-test-sharding branch June 3, 2026 17:09