test(ublk): porcupine linearizability on concurrent reads/writes#28
Conversation
…ycle and isolation
Adds pgregory.net/rapid v1.3.0 as a test dependency and a new
TestRapidStateMachine in ublk/rapid_integration_test.go.
The state machine drives random sequences of create/write/read/fsync/
close actions against up to two live ublk devices and an in-process
shadow model. Invariants checked after every action:
1. Read returns bytes from the most recent Write (per device).
2. Bytes written to device A never appear at the same offset on
device B (cross-device isolation).
3. Close terminates within a 5 s timer (a hang in del_gendisk would
otherwise deadlock the test rather than report a failure).
4. Close is idempotent — a second call must not panic or hang.
User fds on /dev/ublkbN are closed before Device.Close (AGENTS.md
fd-close-before-Close discipline) so del_gendisk does not block.
Adds a make target test-rapid for filtered local iteration. The new
test runs as part of the existing test-integration CI job; no new CI
job is needed. TODO.md's "Property-based / model-based state machine
tests (rapid)" item is marked done with a pointer to the new file.
PR SummaryLow Risk Overview Updates Reviewed by Cursor Bugbot for commit 94bf7da. Bugbot is set up for automated code reviews on this repo. Configure here. |
The rapidMaxCreates cap was unconditional, so once a Run hit it AND the last live device was closed, every subsequent action skipped (create for the cap, read/write/fsync/close for 'no live devices'). rapid then fails the Run with 'can't find a valid (non-skipped) action'. Reproduced on CI: --- FAIL: TestRapidStateMachine after ~30s of churn that closed all devices and tried to grow past the cap. Lift the cap when len(live) == 0 so create is always available as a recovery path. The cap still bounds runtime in the common case where at least one device is alive.
e6f5a16 to
45d783e
Compare
… workload Adds TestRapidLinearizability — a concurrent worker-pool harness that records a wall-clock history of pread/pwrite ops against a single device and feeds it to anishathalye/porcupine to decide whether the history admits a valid sequential explanation. The model is one register per 4 KiB block (map[int]uint64, block → most-recent stamp). Each write embeds a unique 8-byte stamp at the start of its block; reads recover it. Default workload: 4 workers, 200 ops total, 256 KiB device (64 blocks). Tunable via UBLK_LINZ_OPS and UBLK_LINZ_WORKERS. Implemented in a separate file rather than instrumenting the existing TestRapidStateMachine because rapid drives a strictly sequential state machine — a sequential history is trivially linearizable. The new test runs alongside the rapid one in the existing test-integration CI job (same `integration` build tag), and ships a `make test-linz` target for iterating on the model or workload in isolation.
…ility Bugbot pointed out that `TestRapidLinearizability` was being captured by the existing `test-rapid` Makefile target's `-test.run=TestRapid` regex even though the test does not use the rapid library. The target's stated purpose is "run only the rapid property-based state-machine tests" for quick iteration on shrunk failures, so having porcupine sneak in adds up to 60s of unrelated work to that fast loop. Rename the function (and its doc comment) so it no longer matches the TestRapid prefix, and update the `test-linz` Makefile target's `-test.run` regex accordingly. Reference to TestRapidStateMachine inside the doc comment is intentional — it is the sequential companion test, still named that way, and porcupine remains its "global ordering" follow-up. Reported by Cursor Bugbot on PR #28.
45d783e to
5698aca
Compare
Keep the porcupine target and dependency alongside the newer rapid test support so the branch matches current main and merges cleanly.
Bring in the merged uring fuzz helpers alongside the porcupine and rapid test targets so the branch stays conflict-free on the current main.
Bring in the merged chaos test helpers alongside the porcupine, rapid, and fuzz helpers so the branch stays conflict-free on the current main.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.
Reviewed by Cursor Bugbot for commit 94bf7da. Configure here.
| // fail — porcupine is NP-hard and false-timeouts are real. | ||
| t.Logf("linearizability check TIMED OUT after %v (history len=%d) "+ | ||
| "— increase UBLK_LINZ_OPS to keep histories small, or "+ | ||
| "raise the checkTimeout in this test", checkTimeout, len(history)) |
There was a problem hiding this comment.
Timeout guidance says "increase" but means "decrease"
Low Severity
The timeout log message says "increase UBLK_LINZ_OPS to keep histories small" but UBLK_LINZ_OPS controls the total number of operations — increasing it makes the history larger, making timeouts more likely. The guidance is inverted and would lead a developer to worsen the problem. It needs to say "decrease".
Reviewed by Cursor Bugbot for commit 94bf7da. Configure here.


Stacked on: #27 (
tests/rapid-state-machine).Adds
TestRapidLinearizability(inublk/porcupine_integration_test.go) — a porcupine-driven linearizability checker for concurrent reads and writes against a single ublk device.Why this catches things rapid alone doesn't
TestRapidStateMachine(PR #27) checks a per-operation invariant: after every action, the device's bytes match the model's shadow. That's sufficient when the test driver issues actions sequentially — which it does, because rapid state machines are sequential by construction.This PR addresses a different question: can the global real-time history of concurrent ops be explained by some valid sequential ordering? That's linearizability, the same property Jepsen checks for distributed databases. A history can pass the per-operation invariant on every read taken in isolation and still be non-linearizable — e.g. if two concurrent writes' effects are observed in inconsistent orders by later reads.
Implementation choice — Option B
Both options outlined in the spec were on the table:
TestRapidStateMachineto record an operation history. The problem: rapid drives the actions strictly sequentially, so the history is trivially linearizable and the porcupine check is pure overhead.The model
One register per 4 KiB block (
map[int]uint64— block index → most-recent stamp). Each write embeds a unique 8-byte stamp (from a global atomic counter starting at 1) at the start of its 4 KiB block; reads recover the stamp from the bytes returned. Reads/writes are constrained to a single block at a time so each op is atomic from the model's perspective. Stamp 0 is reserved for "never written", which matches the all-zero bytes the device returns for unwritten blocks.The workload
Call = time.Now()recorded immediately before the syscall.unix.Pread/unix.Pwriteagainst/dev/ublkbNwithO_DIRECTandalignedBuf.Return = time.Now()recorded immediately after.[]porcupine.Operation.porcupine.CheckOperationsVerbose(model, history, 30s). Illegal histories are rendered to a HTML visualization viaporcupine.VisualizePathand the test fails with the path logged.Unknown(timeout) is logged as a soft pass with guidance to either shrink the history or grow the budget.Tunables:
UBLK_LINZ_OPS(default 200) andUBLK_LINZ_WORKERS(default 4).Tooling
make test-linzruns only this test against an integration-tagged binary.test-integrationjob already runs every//go:build integrationtest in./ublk/. Verified in.github/workflows/ci.yml.TODO.md"Linearizability checking" bullet replaced with a(done)summary.github.com/anishathalye/porcupine v1.1.0.fd-close-before-Close discipline
Per
AGENTS.md: the user fd opened on/dev/ublkbNis closed beforedev.Close()(in at.Cleanup), otherwisedel_gendiskblocks waiting for the open ref to drop. Documented inline.Test plan
go vet ./...golangci-lint run ./...→0 issues.go test -count=1 -race ./ublk/uring/ ./ublk/go test -c -tags=integration -o /tmp/ublk.test ./ublk/compilesgofmt -l .emptygo mod tidy -diffcleantest-integrationjob exercisesTestRapidLinearizabilityend-to-end on a host withublk_drv+ root.make test-linzon a kernel host; should pass and loglinearizable: 200 ops checked in <Xs>.Forbidden checks
ublk.go,worker.go,device.go,types.go,ublk/uring/*untouched).fuzz_test.go) and PR ci: bump codecov/codecov-action from 4 to 5 #2 (chaos_integration_test.go) files not modified.t.Skipfor missing root or kernel.TestRapidStateMachineand the rest of PR test(ublk): rapid property-based state-machine tests #27 untouched (additive only).