test(ublk): rapid property-based state-machine tests #27
…ycle and isolation
Adds pgregory.net/rapid v1.3.0 as a test dependency and a new
TestRapidStateMachine in ublk/rapid_integration_test.go.
The state machine drives random sequences of create/write/read/fsync/
close actions against up to two live ublk devices and an in-process
shadow model. Invariants checked after every action:
1. Read returns bytes from the most recent Write (per device).
2. Bytes written to device A never appear at the same offset on
device B (cross-device isolation).
3. Close terminates within a 5 s timer (a hang in del_gendisk would
otherwise deadlock the test rather than report a failure).
4. Close is idempotent — a second call must not panic or hang.
User fds on /dev/ublkbN are closed before Device.Close (AGENTS.md
fd-close-before-Close discipline) so del_gendisk does not block.
Adds a make target test-rapid for filtered local iteration. The new
test runs as part of the existing test-integration CI job; no new CI
job is needed. TODO.md's "Property-based / model-based state machine
tests (rapid)" item is marked done with a pointer to the new file.
The rapidMaxCreates cap was unconditional, so once a Run hit it AND the last live device was closed, every subsequent action was skipped (create because of the cap; read/write/fsync/close because of 'no live devices'). rapid then fails the Run with 'can't find a valid (non-skipped) action'.

Reproduced on CI: --- FAIL: TestRapidStateMachine after ~30s of churn that closed all devices and tried to grow past the cap.

Lift the cap when len(live) == 0 so create is always available as a recovery path. The cap still bounds runtime in the common case where at least one device is alive.
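
The guard described above might look something like this sketch. The function name `canCreate` and the constant names are hypothetical, and the values 2 and 16 are taken from the PR description on the assumption that 16 is the rapidMaxCreates cap.

```go
package main

import "fmt"

const (
	maxLive    = 2  // at most two devices alive at once
	maxCreates = 16 // assumed per-Run cap on total creates
)

// canCreate reports whether the create action may run, given the number of
// live devices and the number of creates already performed in this Run.
func canCreate(live, created int) bool {
	if live >= maxLive {
		return false
	}
	if live == 0 {
		// Recovery path: with no live device every other action skips,
		// so create must stay available even past the cap.
		return true
	}
	return created < maxCreates
}

func main() {
	fmt.Println(canCreate(1, 3))  // under both caps
	fmt.Println(canCreate(1, 16)) // cap reached, one device still alive
	fmt.Println(canCreate(0, 16)) // cap reached, nothing alive
}
```

The three calls print `true`, `false`, `true`: the cap only applies while some other action is still able to run.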
## Summary

The Azure-hosted apt mirror (`azure.archive.ubuntu.com`) periodically returns `Temporary failure resolving`. This was observed today on PRs #24, #26, and #27, all blocked by the same DNS blip while installing `linux-modules-extra`.

This PR wraps the `apt-get update` + install in a 5-attempt exponential backoff (5s, 10s, 20s, 40s, 80s; ~155s worst case before giving up). The fast path (`modprobe ublk_drv` succeeding because the module is already present in the runner image) is unchanged.

No code changes; CI workflow only.

## Test plan

- [ ] CI run for this branch loads ublk_drv and runs integration tests cleanly.
- [ ] If the apt mirror flakes during the run, the retry should absorb it; logs will show `attempt N failed; sleeping...`.
**Stacked on:** #27 (`tests/rapid-state-machine`).

Adds `TestRapidLinearizability` (in `ublk/porcupine_integration_test.go`), a porcupine-driven linearizability checker for concurrent reads and writes against a single ublk device.

## Why this catches things rapid alone doesn't

`TestRapidStateMachine` (PR #27) checks a per-operation invariant: after every action, the device's bytes match the model's shadow. That's sufficient when the test driver issues actions sequentially, which it does, because rapid state machines are sequential by construction.

This PR addresses a different question: **can the global real-time history of concurrent ops be explained by some valid sequential ordering?** That's linearizability, the same property Jepsen checks for distributed databases. A history can pass the per-operation invariant on every read taken in isolation and still be non-linearizable, e.g. if two concurrent writes' effects are observed in inconsistent orders by later reads.

## Implementation choice: Option B

Both options outlined in the spec were on the table:

- **Option A** would have instrumented `TestRapidStateMachine` to record an operation history. The problem: rapid drives the actions strictly sequentially, so the history is trivially linearizable and the porcupine check is pure overhead.
- **Option B** (chosen): a standalone test in its own file driving a concurrent worker pool. The rapid state machine is preserved untouched as the per-operation invariant checker; this test is the global-ordering checker. Cleaner separation of concerns and avoids forcing PR #27's test into a shape it doesn't want.

## The model

One register per 4 KiB block (`map[int]uint64`, mapping block index → most-recent stamp). Each write embeds a unique 8-byte stamp (from a global atomic counter starting at 1) at the start of its 4 KiB block; reads recover the stamp from the bytes returned. Reads/writes are constrained to a single block at a time so each op is atomic from the model's perspective.

Stamp 0 is reserved for "never written", which matches the all-zero bytes the device returns for unwritten blocks.

## The workload

- Single 256 KiB device (64 blocks of 4 KiB).
- Default 4 concurrent goroutines × 50 ops each (200 ops total).
- Each op:
  - `Call = time.Now()` recorded immediately before the syscall.
  - Issues `unix.Pread`/`unix.Pwrite` against `/dev/ublkbN` with `O_DIRECT` and `alignedBuf`.
  - `Return = time.Now()` recorded immediately after.
  - Appended to a mutex-protected `[]porcupine.Operation`.
- After the workload phase: `porcupine.CheckOperationsVerbose(model, history, 30s)`. Illegal histories are rendered to an HTML visualization via `porcupine.VisualizePath` and the test fails with the path logged. `Unknown` (timeout) is logged as a soft pass with guidance to either shrink the history or grow the budget.

Tunables: `UBLK_LINZ_OPS` (default 200) and `UBLK_LINZ_WORKERS` (default 4).

## Tooling

- `make test-linz` runs only this test against an integration-tagged binary.
- No new CI job needed: the existing `test-integration` job already runs every `//go:build integration` test in `./ublk/`. Verified in `.github/workflows/ci.yml`.
- `TODO.md` "Linearizability checking" bullet replaced with a `(done)` summary.
- Pinned dependency: `github.com/anishathalye/porcupine v1.1.0`.

## fd-close-before-Close discipline

Per `AGENTS.md`: the user fd opened on `/dev/ublkbN` is closed before `dev.Close()` (in a `t.Cleanup`), otherwise `del_gendisk` blocks waiting for the open ref to drop. Documented inline.

## Test plan

- [x] `go vet ./...`
- [x] `golangci-lint run ./...` → `0 issues.`
- [x] `go test -count=1 -race ./ublk/uring/ ./ublk/`
- [x] `go test -c -tags=integration -o /tmp/ublk.test ./ublk/` compiles
- [x] `gofmt -l .` empty
- [x] `go mod tidy -diff` clean
- [x] CI's `test-integration` job exercises `TestRapidLinearizability` end-to-end on a host with `ublk_drv` + root.
- [ ] (Optional) Manual sanity check: `make test-linz` on a kernel host; should pass and log `linearizable: 200 ops checked in <Xs>`.

## Forbidden checks

- No library code touched (`ublk.go`, `worker.go`, `device.go`, `types.go`, `ublk/uring/*` untouched).
- PR #1 (`fuzz_test.go`) and PR #2 (`chaos_integration_test.go`) files not modified.
- No `t.Skip` for missing root or kernel.
- `TestRapidStateMachine` and the rest of PR #27 untouched (additive only).
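
The stamp encoding and the per-block register model described under "The model" can be sketched with the standard library alone. Several things here are assumptions rather than the PR's code: the little-endian byte order, the `op` type, and the `step` function, which mutates its state for brevity whereas a real porcupine model's step function would return a new state.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const blockSize = 4096

// encodeStamp returns one block whose first 8 bytes carry the stamp.
// Little-endian is an assumption; the PR does not state the byte order.
func encodeStamp(stamp uint64) []byte {
	buf := make([]byte, blockSize)
	binary.LittleEndian.PutUint64(buf[:8], stamp)
	return buf
}

// decodeStamp recovers the stamp from a block read back from the device.
// An all-zero (never written) block decodes to stamp 0.
func decodeStamp(block []byte) uint64 {
	return binary.LittleEndian.Uint64(block[:8])
}

// op is a simplified operation record: a write carries the stamp it stored,
// a read carries the stamp it observed.
type op struct {
	write bool
	block int
	stamp uint64
}

// step applies one op to the register state (block index -> latest stamp)
// and reports whether the op is legal in that state.
func step(state map[int]uint64, o op) bool {
	if o.write {
		state[o.block] = o.stamp
		return true
	}
	return state[o.block] == o.stamp // a missing key reads as 0: "never written"
}

func main() {
	state := map[int]uint64{}
	history := []op{
		{write: false, block: 3, stamp: decodeStamp(make([]byte, blockSize))},
		{write: true, block: 3, stamp: 1},
		{write: false, block: 3, stamp: decodeStamp(encodeStamp(1))},
	}
	for i, o := range history {
		fmt.Println(i, step(state, o))
	}
}
```

This only replays one sequential ordering; the linearizability checker's job is to search for such an ordering that is consistent with the recorded call and return times of concurrent ops.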

## Summary

Adds `pgregory.net/rapid` v1.3.0 as a test dependency and a new `TestRapidStateMachine` integration test in `ublk/rapid_integration_test.go`.

The state machine uses rapid's `t.Repeat` API to generate pseudo-random sequences of five actions:

- `create`: `ublk.New` a fresh device (capped at 2 live, 16 total per Run, ~10 ms each on a warm host)
- `write`: `pwrite(O_DIRECT)` random data at a block-aligned offset on a randomly chosen live device, then update the per-device shadow
- `read`: `pread(O_DIRECT)` and assert bytes equal the per-device shadow
- `fsync`: `unix.Fsync` on a live device
- `close`: full Close cycle. Close the user fd first, run `dev.Close()` under a 5 s timer, then call `dev.Close()` a second time and assert it does not hang or error

A separate empty-key handler runs after every action and probes one block on every live device against the shadow.
Constraints (kept tight to make each Run cheap):

- {512, 4096, 8192}, offsets always block-aligned

## Invariants checked

- `Read` returns bytes from the most recent `Write` at that range (per device).
- `Close` terminates within 5 s (`closeWithDeadline` runs `Close` in a goroutine and times out via `select`; a hang in `del_gendisk` would otherwise deadlock rather than report a shrinkable failure).
- `Close` is idempotent: a second call to `Close` must not panic or hang (mirrors `TestCloseIdempotent`).

## fd-close-before-Close discipline
Per `AGENTS.md`: any test that opens `/dev/ublkbN` must close that fd before calling `dev.Close()`, otherwise `del_gendisk` blocks indefinitely. The state machine's `closeDevice` action does `unix.Close(fd)` first, and the per-Run `cleanup` (defer) does the same for any devices left alive at the end of a Run.

## Why this is distinct from `TestTortureRandomIO`

`TestTortureRandomIO` is a long-running soak with fixed structure: N workers on disjoint regions of one device, doing random I/O within their region for the duration of the run. It does not exercise lifecycle transitions (only one device, no close mid-stream), and when it fails it gives you the failing op as-is, with no minimization.

`TestRapidStateMachine` generates arbitrary command sequences including lifecycle transitions (create / close mid-stream, multiple devices), and rapid's automatic shrinking reduces any failing sequence to a minimal reproducer. That's the primary value: a 1000-action failing case shrinks to (typically) a handful of actions you can stare at.

## Tooling
- `make test-rapid`: builds the integration binary and runs only `TestRapid*`. For local iteration on a shrunk failing case.
- Runs as part of the existing `test-integration` CI job; no new CI job needed.
- `-rapid.checks=N` or `RAPID_CHECKS=N`. See `go test -args -h` for the full list.

## Test plan

- `go vet ./...`: clean
- `golangci-lint run ./...`: `0 issues.`
- `go test -count=1 -race ./ublk/uring/ ./ublk/`: pass
- `go test -c -tags=integration -o /tmp/ublk.test ./ublk/`: compiles
- `gofmt -l .`: empty
- `go mod tidy`: clean
- `ublk_drv`: deferred to CI's `test-integration` job (requires kernel module + privilege not available in the dev sandbox).