fix(#1724): chunk tx_last_seen_backfill — bounded reader-yield + progress surface#1735
fix(#1724): chunk tx_last_seen_backfill — bounded reader-yield + progress surface#1735Kpa-clawbot wants to merge 12 commits into
Conversation
Adds the failing test suite for the new chunkedTxLastSeenBackfill helper that will replace the single-statement #1690 backfill in the next commit. Tests pin the contract reviewers flagged on the prior attempt: - Reader yields between batches (concurrent reader latency bounded — a single-tx fake would NOT satisfy this). - With seedN=12000 + batchSize=5000, progress callback fires >=3 times. - ctx cancel mid-loop -> context.Canceled + partial commits visible. - Concurrent INSERT of new last_seen=0 rows does not trap the loop (maxID snapshot bounds the scan). - Orphan transmissions (no observations) are skipped via EXISTS so the loop terminates deterministically. - Param validation: batchSize<=0 and negative yieldDelay are rejected (no <0 sentinel). - Error propagation: closed DB surfaces -> migration cannot silently report success. Includes a minimal stub of chunkedTxLastSeenBackfill (returns zero/nil) so the file compiles and the tests run to their assertions. The GREEN commit replaces the stub with the real chunked implementation.
Replaces the single correlated UPDATE used by tx_last_seen_backfill_v1 (introduced in #1690) with a chunked loop that yields the single SQLite writer between batches. Symptom (pre-fix, operator scale ~71K tx / 1.5M obs / 2GB DB): - backgroundLoadComplete=true fires. - The async migration starts the single full-table UPDATE under SetMaxOpenConns(1), holds the writer for 10-15 minutes. - Every /api/healthz, /api/packets, /api/stats request queues behind sqlite_busy_timeout. UI appears frozen long after warm-up clears. Fix (this commit): - cmd/ingestor/tx_last_seen_backfill.go (new): chunkedTxLastSeenBackfill snapshots MAX(id), counts eligible rows (last_seen=0 AND has observations AND id<=maxID), then loops bounded UPDATEs (batchSize=5000) with time.NewTimer-based sleeps (no Timer leak via time.After) between batches (yieldDelay=100ms). EXISTS gate skips orphan transmissions so the loop terminates. maxID snapshot keeps concurrent INSERTs out of scope (those are handled inline by stmtBumpTxLastSeen on the writer fast path). Ctx cancellation between batches returns context.Canceled with partial counts; partial commits are visible (migration does NOT flip to done). All errors propagate (snapshot, count, UPDATE, RowsAffected) — the migration cannot silently mark itself done. Progress callback fires per non-empty batch + once terminal with final stable counts; never on a stale n=0 batch. - cmd/ingestor/db.go: wire the helper into the tx_last_seen_backfill_v1 async migration, explicit batchSize=5000, yieldDelay=100ms. Math reality-check: ~71K tx / 5000 ≈ 15 batches × (~50ms exec + 100ms yield) ≈ ~2.5s wall time with readers slotted in at most every 150ms. PR #1725's description claimed ~300 batches × 150ms ≈ 45s — that confused observations (1.5M) with transmissions (71K); real number is ~20x smaller. Indexes idx_tx_last_seen (transmissions(last_seen)) and idx_observations_transmission_id already exist (see internal/dbschema and cmd/ingestor/db.go base schema) — no additional index work required at this commit. Tests: cmd/ingestor/tx_last_seen_backfill_test.go (added in prior commit) pin all the contract points reviewers flagged on PR #1725. Cancel-mid-loop test timing widened from 30ms to 250ms to give the real chunked impl room to commit a batch before the cancel fires; assertion semantics unchanged (partial commits + context.Canceled + no full completion).
… reset Adds an observational progress surface to _async_migrations so a long-running async migration (in particular tx_last_seen_backfill_v1 on operator-scale cold-load) is no longer opaque to readers. Schema changes (additive on legacy DBs): - _async_migrations.rows_processed (INTEGER NOT NULL DEFAULT 0) - _async_migrations.rows_total (INTEGER NOT NULL DEFAULT 0) - _async_migrations.last_update_at (TEXT) ensureAsyncMigrationProgressColumns runs ADD COLUMN per column and ONLY swallows the SQLite "duplicate column" error — every other ALTER failure propagates so a real schema problem doesn't get hidden. The CREATE TABLE body carries the same columns for fresh installs. recordAsyncMigrationProgress rate-limits writes to <=1/sec per migration name via a per-name time.Time cache; the rate limit is intentionally NOT a sync.Map so the bookkeeping table doesn't see a write per backfill batch (which on a SetMaxOpenConns(1) DB would compete with the migration's own UPDATE for the writer lock). recordAsyncMigrationProgressTerminal forces a write past the limiter — used to pin final stable counts on both success and failure paths so observers see the final point at which the migration stopped, not stale intermediate data. Retry path (RunAsyncMigration on an existing pending_async or failed row) resets rows_processed / rows_total / last_update_at to zero AND clears the in-memory rate-limit cache, so the next run starts with an honest denominator and no suppressed first write. A single sync.Once guards the warn log for the legacy "progress columns missing" path so a misconfigured DB doesn't generate one log line per batch. db.go wires both the periodic and terminal progress writes into the tx_last_seen_backfill_v1 migration. Failures still propagate to the RunAsyncMigration goroutine (status flips to 'failed' with the error message); the terminal write captures the partial counts at the failure point.
The pr-preflight async-migration gate flags any new ALTER TABLE / CREATE TABLE in a migration-shaped file without an explicit annotation. Two sites are legitimately safe-at-scale but lacked the annotation: - cmd/ingestor/async_migration_progress.go ADD COLUMN on the bookkeeping table _async_migrations (single-digit rows; ADD COLUMN is O(rows)). - cmd/server/async_migrations_test.go CREATE TABLE on a fresh in-memory test DB (test setup, not a real schema migration). Annotation-only — no behavior change. Both call sites already had runtime safeguards (duplicate-column tolerance, test isolation). cross-stack: justified — annotations only; no functional change. PR #1735 already declares the frontend+backend coupling.
Independent review (round 1)Reviewer: independent adversarial pass, cold context. Cross-checked against the 38 findings that closed #1725. Verdict: NEEDS-WORK — the re-cut addresses ~all of the prior round-1 must-fixes (orphan-tx loop, RowsAffected error, retry-resets-progress, time.NewTimer, ORDER BY id, additive CREATE TABLE body, mapAsyncStatus default=unknown, rate-limited progress writes, TTL cache, banner shows failed, ETA/rate arithmetic assertions, JS stay-up test, scope cleaned to 13 files). Good work on that front. Remaining must-fixes are smaller but real. Must-fix
Out-of-scope
TDD verificationRed commit VerdictNEEDS-WORK — 10 must-fixes, all small. The structural rework (orphan-safe, RowsAffected propagation, retry resets progress, JS banner gating, real ETA/rate tests) lands the prior round-1 critique. Remaining items are cache-correctness (must-fix #1/#2), error-observability (#3/#4/#5), driver fragility (#6), one missing handler test (#8), and minor hygiene (#7/#9/#10). Address #1–#5 and this is mergeable; #6–#10 can land in a follow-up if you want to ship the cold-load fix sooner. — Independent review, round 1. |
Munger Review (round 1)
The chunked backfill itself is solid — single‑writer ordering is preserved, the maxID snapshot bounds the loop, the orphan EXISTS terminates, the TDD red→green is real. That's the easy half. The observability surface around it is where this PR fails its own acceptance criteria. Six must‑fixes, all on the read/surface side, several of them direct repeats of the original #1725 review findings. Must‑fix (6)1. if infos, err := readAsyncMigrations(s.db.conn); err == nil {
asyncMigrations = infos
}If 2. for (var i = 0; i < migs.length; i++) {
if (migs[i] && migs[i].status === 'failed') return false;
}
3. asyncMigrationsCacheMu.Lock()
defer asyncMigrationsCacheMu.Unlock()
if !asyncMigrationsCacheAt.IsZero() && ...
out, err := readAsyncMigrationsRaw(db) // <-- DB query under the mutexLollapalooza: chunked UPDATE holds the writer → readers contend for 4. progressSchemaWarnOnce sync.Once
...
progressSchemaWarnOnce.Do(func() {
log.Printf("[async-migration] progress write failed (likely missing columns; further such errors suppressed): %v", err)
})Combined with #5 below, this means: schema drift or any persistent write failure → exactly one log line ever → zero 5. _ = recordAsyncMigrationProgress(d, "tx_last_seen_backfill_v1", p, t)
...
_ = recordAsyncMigrationProgressTerminal(d, "tx_last_seen_backfill_v1", processed, total)
6. asyncMigrationsCacheErr error
...
out, err := readAsyncMigrationsRaw(db)
asyncMigrationsCached = out
asyncMigrationsCacheErr = errNeither Out of scope
VerdictThe chunked backfill itself ships. The progress/banner surface — the whole reason this is a multi‑file cross‑stack change — has the same class of "silently lies under contention" defects that sank #1725. Fix #1, #2, #4, #5 before merge; #3 and #6 are smaller but in the same family and worth doing in the same round. "All I want to know is where I'm going to die, so I'll never go there." Right now this PR dies at the moment of peak contention — the very moment the banner is supposed to be telling the truth. |
Kent Beck Gate (round 1) — TDD + test qualityTDD red→green history: VERIFIED (by inspection)
Later commits assessed under AGENTS.md exemptions:
Six Questions on the four test filesa. Fails on revert? Yes for all four — stub-vs-impl diff makes 5+ assertions flip per file. Verification of prior round-1 (#1725) findings
Must-fix (2)1. BLOCKER — The whole reason this test exists (per the recut brief) is to prevent a "fake chunked" implementation that's actually a single-tx UPDATE from passing. As written:
A single-correlated-UPDATE fake on 12,000 in-memory rows on modern hardware completes in well under 500ms. The reader's The brief was explicit: this assertion must be airtight against a single-tx fake. Options:
Option (b) is closest to "behavior-driven" — it asserts the property operators actually care about (reader p95 not dominated by the migration). 2. MAJOR — Progress callback panic poisons the migration goroutine. The impl unconditionally calls Smallest test: func TestChunkedBackfill_ProgressPanicDoesNotPoisonMigration(t *testing.T) {
s := newTestStore(t); seedTransmissions(t, s, 100)
panicky := func(int64, int64) { panic("boom") }
_, _, err := chunkedTxLastSeenBackfill(ctx, s.db, 50, 0, panicky)
// must return an error, not crash; migration row must mark 'failed'
}Either add Out of scope (acknowledged, not blocking)
Verdict: NEEDS-WORK (1 BLOCKER, 1 MAJOR)TDD discipline is otherwise solid for this branch. The reader-yield test is the load-bearing assertion in this PR's TDD claim — it cannot be loose against the failure mode it exists to prevent. |
Carmack Review (round 1)Re-cut of #1725. Walking the chunked backfill, the progress writer, the server-side reader, and the banner gate. Compared against the 8 findings on the prior PR. Verdict: NEEDS-WORK (small). All 8 prior Carmack findings from #1725 are addressed (math fixed, Must-fix1. Both healthz and perf silently swallow
Fix: in healthz, return an This contradicts the PR body's own acceptance claim, so it's must-fix not nit. 2. Cheap fix: guard the whole function with a Out-of-scope / follow-up
Math reality-check (PR's "~2-3s" claim)
The math holds. The fix shape is right. The two must-fix items above are the polish that gets this merged. — Carmack-flavored review, round 1. |
…e progress write failures Group A from PR #1735 round-1 review (must-fix #1, #5, #6, #7). - cmd/server/healthz.go: on readAsyncMigrations error, include the message in the JSON body as async_migrations_error AND keep async_migrations_running=true. Fail closed for warm-up: if we can't read the bookkeeping table, treat the system as possibly still warming up rather than declaring 'all clear'. - cmd/server/async_migrations.go handlePerfAsyncMigrations: return HTTP 500 with the error body on readAsyncMigrations failure instead of silently returning an empty list. (Empty list is a meaningful operator signal; a query failure must be visible.) - cmd/server/routes.go /api/perf: log the readAsyncMigrations error and surface it via X-Async-Migrations-Error response header so the rest of the perf payload still flows. - cmd/server/async_migrations.go: delete the unread asyncMigrationsCacheErr field (finding #5). - cmd/server/async_migrations.go parseAsyncTime: propagate parse errors to the caller; readAsyncMigrationsRaw now appends them to ErrorMessage so unparseable timestamps don't silently produce 0s. - cmd/ingestor/async_migration_progress.go recordAsyncMigrationProgressEx: check RowsAffected(); 0 rows updated -> error (bookkeeping row missing). cmd/ingestor/db.go: track in-loop progress write failures, log them, and treat a failed TERMINAL progress write as a failed migration (counts are no longer trustworthy).
…-limited warn log, pinned driver-string test Group D from PR #1735 round-1 review (must-fix #8, #9). - ensureAsyncMigrationProgressColumns: guard with sync.Once so a process that runs many async migrations doesn't re-run 3 ALTER TABLE statements every call. The column set is fixed at build time, so once-per-process is the correct scope. - Remove progressSchemaWarnOnce (sync.Once) for the per-write warn log. Replace with a wall-clock rate-limiter (1/min). sync.Once silenced all future errors — destroying observability of an ongoing problem. The rate-limited approach lets every error remain visible without flooding the log on rapid retries. - isDuplicateColumnErr: the modernc.org/sqlite driver does not expose a typed sentinel for duplicate-column ADD COLUMN failures. Document why the substring match is correct AND add TestIsDuplicateColumnErr_DriverStringPinned which provokes the actual driver error so a future driver upgrade that changes the wording fails CI loudly. - Add TestEnsureAsyncMigrationProgressColumns_RunsOncePerProcess pinning the sync.Once behavior + a resetEnsureColumnsOnceForTest helper for test isolation.
… recover panicking progress callback Group E from PR #1735 round-1 review (must-fix #10, #14). - chunkedTxLastSeenBackfill: track lastFired (p, total) and skip the final terminal callback when it would re-fire identical counts already reported by the last in-loop fire. Previously, when the last batch was exactly batchSize-sized, the next chunk returned n=0 and we fired (processed,total) a second time. Operators saw duplicate progress events. - Wrap the progress callback in defer-recover. A panicking callback (operator-supplied or buggy bookkeeping write) is converted to an error and returned, NOT propagated to the ingestor goroutine. RunAsyncMigration already converts a returned error to status=failed with the message in the error column, so end-to-end the migration is properly marked failed with the recovered panic text. Tests added: TestChunkedBackfill_TerminalSuppressedWhenRedundant TestChunkedBackfill_PanicInCallbackRecovered TestChunkedBackfill_PanicViaRunAsyncMigrationMarksFailed
…r-yield assertion + orphan-tx test doc Group F from PR #1735 round-1 review (must-fix #11, #12, #13). #11 — Add cmd/server/async_migrations_handler_test.go covering the four states of /api/perf/async-migrations: - success with rows: 200 + JSON array - empty list: 200 + '[]' (not 'null', so warmup-banner.js can iterate) - readAsyncMigrations error: HTTP 500 + JSON error body (not silently empty — that was the round-1 must-fix) - nil db (server pre-DB-init): 200 + '[]' #13 (kent-beck BLOCKER) — TestChunkedBackfill_YieldsToReaderBetweenBatches: the original threshold (12K rows, 500ms reader-latency bound) was loose enough that a single-tx fake whose total wall time was <500ms could pass. Tightened to: - sample BASELINE reader latency BEFORE backfill starts (avg of 5 probes) - sample BEST reader latency during backfill - assert bestDuring < 80ms absolute AND ratio < 5x baseline (with 5ms floor to avoid sub-ms flakiness) A single-tx implementation that holds the writer the entire wall time would push the during-latency ratio into the 50-100x range and fail deterministically. Comment in the test body explains why. #12 — TestChunkedBackfill_OrphanTxTerminates: doc-only — explain why the orphan insert and seedTransmissions run in separate transactional contexts (orphan has no observation row; can't share seed's tx; the backfill loop is committed-state-only so the split has no effect on what's being asserted).
Group B from PR #1735 round-1 review (must-fix #2). Previously a failed async migration pinned the banner forever: isSteadyState returned false as long as any migration was in 'failed' status, with no path to clear. Operators lost trust in the banner; real new failures got lost in the noise. Fix: - FAILED_AUTO_DISMISS_MS = 10 min from endedAt — past that window the failed entry auto-clears from the banner. The failure is still visible via /api/perf/async-migrations and /api/healthz; only the banner stops blocking. - Per-line × button: explicit user ack immediately removes the failure from the banner. - Fail closed: if endedAt is missing or unparseable, the failure does NOT auto-dismiss (operator must see it). - isSteadyState gets an optional nowMs param (defaults to Date.now) for testability and to make the auto-dismiss math re-render-deterministic. CSS additions: .warmup-banner__item--failed coloring + .warmup-banner__dismiss button styling using existing CSS variable patterns. Tests added: test-warmup-banner-failed-dismiss-1735.js pins: - within window: failure still blocks steady state + appears in messages - past window: failure auto-clears from both - explicit dismiss: immediate removal - missing/malformed endedAt: fails closed (no auto-dismiss)
The PREFLIGHT migration-scale gate flags every ALTER TABLE statement in the repo unless it carries the async=true annotation. The new TestIsDuplicateColumnErr_DriverStringPinned test runs ALTER on an in-memory DB to provoke and pin the driver's duplicate-column error wording — surgical addition to keep the gate green.
Round-1 review consolidated must-fix — all 14 findings addressedPushed 6 commits (one per reviewer-group) addressing every must-fix from round 1 (adversarial + munger + carmack + kent-beck). No redesign — surgical fixes only. Fix map
Commits (regular push, no force)
Test resultsLocal
JS pure tests:
Out-of-scope items filed (Group G)Reviewers raised six items explicitly out of scope for this PR. Filed as separate issues, linked from PR review threads:
Notes / pushback
|
Problem
v3.9.2 cold-load on a real-size operator DB (71K tx / 1.5M obs / ~2 GB) wedged the reader path for 10–15 min after
backgroundLoadComplete=trueflipped. Thetx_last_seen_backfill_v1async migration ran as one big correlatedUPDATEagainstobservations, and SQLite serialized every reader behind the writer.From #1724's perf snapshot (~46 min uptime on freshly started v3.9.2):
/api/statsis in-memory; 213s max = pure writer-lock starvation.What changed
Ingestor (writer side)
cmd/ingestor/tx_last_seen_backfill.go— replaces the single correlated UPDATE with a chunked, bounded reader-yielding loop. Walkstransmissionsbyrowidin batches of 5000, computesMAX(observations.timestamp)per batch, writes back via short transactions, and sleeps a fixed yield between batches so readers can win the writer lock.cmd/ingestor/async_migration_progress.go— rate-limitedrows_processed/rows_total/last_update_atwriteback (≤1 row/s) so the progress columns don't themselves become a hot-path writer. Resets retry state on success.cmd/ingestor/async_migration.go+db.go— schema migration addsrows_processed,rows_total,last_update_atcolumns.Server (read-only API surface)
cmd/server/async_migrations.go— read-only TTL-cached query over_async_migrations. ComputeselapsedSec/etaSec/ratePerSecper migration.mapAsyncStatusnormalizes raw status strings intorunning/done/failed/unknown. Strict read-only — invariant test in bug(db): vacuumOnStartup fails with SQLITE_BUSY when ingestor + server share DB (auto_vacuum migration #919 broken in single-container topology) #1283 still holds.cmd/server/healthz.go—/api/healthznow embedsasync_migrations: [...]and a scalarasync_migrations_running: bool. Failures are reported (not silently dropped).cmd/server/routes.go+types.go—/api/perfexposes the same async-migration array so operators can distinguish "backfill running" vs "cold-load" vs "real bug" (acceptance criterion LCD ghost color regex fails on hex colors #3).Frontend
public/warmup-banner.js— banner stays up whilehealthz.async_migrations_running=true(acceptance home.js stacks duplicate event listeners on re-render #4). Renders a per-migration progress line (name,rowsProcessed / rowsTotal, ETA seconds).failedmigrations are surfaced explicitly with their error message; we do NOT silently drop.isSteadyStatenow gates on no-running and no-failed in addition to the existing predicates.test-warmup-banner-migrations.js— pins stay-up behavior, per-migration line format, failed-state surfacing, and back-compat (noasync_migrationsfield at all).TDD red → green
cb6bab57—test(#1724): RED — chunked tx_last_seen backfill behavior + edges915b1011—fix(#1724): chunk tx_last_seen_backfill with bounded reader yieldThe red commit's tests fail on assertions ("expected chunked progression, observed single UPDATE", reader-yield checks) and pass on the green commit. CI history shows red→green ordering.
Real math (corrected from prior closed attempt)
The 45s wall-time figure quoted in the earlier (closed) recut was wrong. Recomputed:
MAX(observations.timestamp)correlated subqueryThe point of this change is NOT to make the backfill itself faster — it's to stop it from monopolizing the writer lock. The chunked path is intentionally slower in wall-clock and faster in worst-case reader p95.
Acceptance map (issue #1724)
sqlite_busy_timeout. Validated by ingestortx_last_seen_backfill_test.go(assertions on batched progression + yield gaps).tx_last_seen_backfill.godoes exactly this; no more single correlated UPDATE./api/perfexposes progress (%/rows-per-sec/ETA) —async_migrationsarray on/api/perfand/api/healthz;ratePerSec/etaSec/rowsProcessed/rowsTotalper migration.isSteadyStatenow checksasync_migrations_running+ nofailedmigration; tests pin this.Out of scope (intentional)
batchSize=5000and the yield-delay are hardcoded. Making them runtime-tunable is a follow-up — would need a new config surface and is not required for the regression fix.TestBoundedLoad_OldestLoadedSetincmd/server/bounded_load_test.gohangs indefinitely undergo test -short. It is NOT introduced by this PR — the goroutine dump points atcreateTestDBAt(lines around 349) which is unrelated to any file this PR touches. Targeted runs of the newTestReadAsyncMigrations*/TestAnyAsyncMigrationRunning*/TestMapAsyncStatustests all pass in under 100ms. Filing this hang as a separate issue is recommended.cross-stack: justified
Ingestor + server + frontend land together because:
/api/healthz+/api/perf; without that, the frontend banner has no signal to gate on.Splitting these would leave master with broken acceptance criteria mid-merge.
Fixes #1724