
test(ublk): porcupine linearizability on concurrent reads/writes #28

Merged
ValentaTomas merged 9 commits into main from tests/porcupine-linearizability
May 4, 2026

@ValentaTomas ValentaTomas commented May 2, 2026

Stacked on: #27 (tests/rapid-state-machine).

Adds TestPorcupineLinearizability (in ublk/porcupine_integration_test.go), a porcupine-driven linearizability checker for concurrent reads and writes against a single ublk device.

Why this catches things rapid alone doesn't

TestRapidStateMachine (PR #27) checks a per-operation invariant: after every action, the device's bytes match the model's shadow. That's sufficient when the test driver issues actions sequentially — which it does, because rapid state machines are sequential by construction.

This PR addresses a different question: can the global real-time history of concurrent ops be explained by some valid sequential ordering? That's linearizability, the same property Jepsen checks for distributed databases. A history can pass the per-operation invariant on every read taken in isolation and still be non-linearizable — e.g. if two concurrent writes' effects are observed in inconsistent orders by later reads.
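To make the distinction concrete, here is a toy brute-force checker for a single register. It is only an illustration of the property, not porcupine's algorithm; the `op` type and the logical timestamps are made up for this sketch. In the second history each read returns a value that some write really stored, yet no sequential order explains both reads.

```go
package main

import "fmt"

// op is one completed operation on a single register (hypothetical type).
// call/ret are logical timestamps; a write stores val, a read observed val.
type op struct {
	write     bool
	val       uint64
	call, ret int
}

// minimal reports whether no other pending op returned before ops[i] was
// called, i.e. placing ops[i] next does not violate real-time order.
func minimal(ops []op, used []bool, i int) bool {
	for j, o := range ops {
		if j != i && !used[j] && o.ret < ops[i].call {
			return false
		}
	}
	return true
}

// linearizable tries every total order that respects real-time order and
// register semantics (a read must return the latest written value, 0 if none).
func linearizable(ops []op) bool {
	used := make([]bool, len(ops))
	var search func(done int, reg uint64) bool
	search = func(done int, reg uint64) bool {
		if done == len(ops) {
			return true
		}
		for i, o := range ops {
			if used[i] || !minimal(ops, used, i) {
				continue
			}
			if !o.write && o.val != reg {
				continue // this read cannot be linearized at this point
			}
			next := reg
			if o.write {
				next = o.val
			}
			used[i] = true
			if search(done+1, next) {
				return true
			}
			used[i] = false
		}
		return false
	}
	return search(0, 0)
}

func main() {
	// Two overlapping writes, then two reads that agree with each other.
	good := []op{{true, 1, 0, 10}, {true, 2, 0, 10}, {false, 2, 11, 12}, {false, 2, 13, 14}}
	// Same writes, but later reads see 1 and then 2 with no write in between.
	bad := []op{{true, 1, 0, 10}, {true, 2, 0, 10}, {false, 1, 11, 12}, {false, 2, 13, 14}}
	fmt.Println(linearizable(good), linearizable(bad)) // true false
}
```

The search is exponential in the worst case, which is why the real check runs under a time budget.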

Implementation choice — Option B

Both options outlined in the spec were on the table:

  • Option A would have instrumented TestRapidStateMachine to record an operation history. The problem: rapid drives the actions strictly sequentially, so the history is trivially linearizable and the porcupine check is pure overhead.
  • Option B (chosen): a standalone test in its own file driving a concurrent worker pool. The rapid state machine is preserved untouched as the per-operation invariant checker; this test is the global-ordering checker. This keeps the concerns cleanly separated and avoids forcing the test from PR #27 (test(ublk): rapid property-based state-machine tests) into a shape it doesn't want.

The model

One register per 4 KiB block (map[int]uint64 — block index → most-recent stamp). Each write embeds a unique 8-byte stamp (from a global atomic counter starting at 1) at the start of its 4 KiB block; reads recover the stamp from the bytes returned. Reads/writes are constrained to a single block at a time so each op is atomic from the model's perspective. Stamp 0 is reserved for "never written", which matches the all-zero bytes the device returns for unwritten blocks.
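A minimal sketch of the stamp scheme and the per-block register model described above. The helper names (`nextStamp`, `encodeBlock`, `decodeStamp`, `registers`) are hypothetical, and little-endian encoding is an assumption; the PR text does not state the byte order.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sync/atomic"
)

const blockSize = 4096 // one model register per 4 KiB block

// stampCounter starts at 0 so the first stamp handed out is 1;
// stamp 0 stays reserved for "never written".
var stampCounter uint64

func nextStamp() uint64 { return atomic.AddUint64(&stampCounter, 1) }

// encodeBlock builds a 4 KiB payload with the stamp in its first 8 bytes.
// Little-endian is an assumption made for this sketch.
func encodeBlock(stamp uint64) []byte {
	buf := make([]byte, blockSize)
	binary.LittleEndian.PutUint64(buf[:8], stamp)
	return buf
}

// decodeStamp recovers the stamp from bytes returned by a read.
// An all-zero (unwritten) block decodes to 0.
func decodeStamp(block []byte) uint64 {
	return binary.LittleEndian.Uint64(block[:8])
}

// registers is the sequential model: block index -> most recent stamp.
type registers map[int]uint64

func (r registers) write(block int, stamp uint64) { r[block] = stamp }

// readOK reports whether an observed stamp matches the model state.
// A missing key yields 0, which matches the "never written" stamp.
func (r registers) readOK(block int, observed uint64) bool {
	return r[block] == observed
}

func main() {
	model := registers{}
	s := nextStamp()
	payload := encodeBlock(s)
	model.write(3, decodeStamp(payload))
	fmt.Println(model.readOK(3, s), model.readOK(4, 0))
}
```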

The workload

  • Single 256 KiB device (64 blocks of 4 KiB).
  • Default 4 concurrent goroutines × 50 ops each (200 ops total).
  • Each op:
    • Call = time.Now() recorded immediately before the syscall.
    • Issues unix.Pread/unix.Pwrite against /dev/ublkbN with O_DIRECT and alignedBuf.
    • Return = time.Now() recorded immediately after.
    • Appended to a mutex-protected []porcupine.Operation.
  • After the workload phase: porcupine.CheckOperationsVerbose(model, history, 30s). Illegal histories are rendered to an HTML visualization via porcupine.VisualizePath and the test fails with the path logged. Unknown (timeout) is logged as a soft pass with guidance to either shrink the history or grow the budget.
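The recording loop can be sketched as follows. This is a simplified stand-in, not the test's code: an in-memory `fakeDevice` replaces the O_DIRECT pread/pwrite calls against /dev/ublkbN, and the `record` type replaces porcupine.Operation. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// record is a simplified stand-in for one history entry.
type record struct {
	worker    int
	write     bool
	block     int
	stamp     uint64 // stamp written, or stamp observed by a read
	call, ret time.Time
}

// history is the mutex-protected slice all workers append to.
type history struct {
	mu  sync.Mutex
	ops []record
}

func (h *history) add(r record) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.ops = append(h.ops, r)
}

// fakeDevice replaces the real block device for this sketch.
type fakeDevice struct {
	mu     sync.Mutex
	blocks map[int]uint64
}

func (d *fakeDevice) write(block int, stamp uint64) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.blocks[block] = stamp
}

func (d *fakeDevice) read(block int) uint64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.blocks[block]
}

// runWorkload starts the workers and returns the recorded history.
func runWorkload(dev *fakeDevice, workers, opsPerWorker, blocks int) []record {
	var h history
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < opsPerWorker; i++ {
				seq := w*opsPerWorker + i
				r := record{worker: w, block: seq % blocks, write: i%2 == 0}
				r.call = time.Now() // immediately before the operation
				if r.write {
					r.stamp = uint64(seq + 1) // unique and never 0
					dev.write(r.block, r.stamp)
				} else {
					r.stamp = dev.read(r.block)
				}
				r.ret = time.Now() // immediately after
				h.add(r)
			}
		}(w)
	}
	wg.Wait()
	return h.ops
}

func main() {
	dev := &fakeDevice{blocks: map[int]uint64{}}
	ops := runWorkload(dev, 4, 50, 64)
	fmt.Println("recorded ops:", len(ops))
}
```

Taking the timestamps as close to the operation as possible matters: a wider recorded interval makes more operations look concurrent, which weakens what the checker can rule out.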

Tunables: UBLK_LINZ_OPS (default 200) and UBLK_LINZ_WORKERS (default 4).
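One way to read these tunables, as a sketch only: the helper names are hypothetical, and falling back to the default on a missing, non-numeric, or non-positive value is an assumption about the desired behavior rather than something the PR states.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parsePositive returns the parsed value, or def when the variable is
// unset, not a number, or not positive (assumed fallback policy).
func parsePositive(raw string, set bool, def int) int {
	if !set {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	return n
}

// envInt looks up name in the environment and applies parsePositive.
func envInt(name string, def int) int {
	raw, set := os.LookupEnv(name)
	return parsePositive(raw, set, def)
}

func main() {
	ops := envInt("UBLK_LINZ_OPS", 200)
	workers := envInt("UBLK_LINZ_WORKERS", 4)
	fmt.Printf("ops=%d workers=%d\n", ops, workers)
}
```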

Tooling

  • make test-linz runs only this test against an integration-tagged binary.
  • No new CI job needed: the existing test-integration job already runs every //go:build integration test in ./ublk/. Verified in .github/workflows/ci.yml.
  • TODO.md "Linearizability checking" bullet replaced with a (done) summary.
  • Pinned dependency: github.com/anishathalye/porcupine v1.1.0.

fd-close-before-Close discipline

Per AGENTS.md: the user fd opened on /dev/ublkbN is closed before dev.Close() (in a t.Cleanup), otherwise del_gendisk blocks waiting for the open ref to drop. Documented inline.
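The ordering constraint can be illustrated with fakes. In this sketch `fakeDevice.Close` returns an error where the real teardown would block, and `cleanupStack` mimics the last-added, first-called order of testing.T.Cleanup. All types here are hypothetical and do not reflect the repository's API.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeDevice counts open user fds, standing in for the kernel's open ref.
type fakeDevice struct {
	openRefs int
	closed   bool
}

type fakeFD struct {
	dev    *fakeDevice
	closed bool
}

func (d *fakeDevice) Open() *fakeFD {
	d.openRefs++
	return &fakeFD{dev: d}
}

func (f *fakeFD) Close() {
	if !f.closed {
		f.closed = true
		f.dev.openRefs--
	}
}

// Close fails in this sketch where the real device teardown would hang.
func (d *fakeDevice) Close() error {
	if d.openRefs > 0 {
		return errors.New("device still has open user fds")
	}
	d.closed = true
	return nil
}

// cleanupStack runs callbacks last-added first, like testing.T.Cleanup.
type cleanupStack []func()

func (s *cleanupStack) add(f func()) { *s = append(*s, f) }

func (s *cleanupStack) run() {
	for i := len(*s) - 1; i >= 0; i-- {
		(*s)[i]()
	}
}

func main() {
	dev := &fakeDevice{}
	var cleanups cleanupStack
	var closeErr error

	// Registered first, so it runs last.
	cleanups.add(func() { closeErr = dev.Close() })

	fd := dev.Open()
	// Registered second, so the fd is closed before the device.
	cleanups.add(fd.Close)

	cleanups.run()
	fmt.Println("device closed:", dev.closed, "error:", closeErr)
}
```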

Test plan

  • go vet ./...
  • golangci-lint run ./... (0 issues)
  • go test -count=1 -race ./ublk/uring/ ./ublk/
  • go test -c -tags=integration -o /tmp/ublk.test ./ublk/ (compiles)
  • gofmt -l . (empty output)
  • go mod tidy -diff (clean)
  • CI's test-integration job exercises TestPorcupineLinearizability end-to-end on a host with ublk_drv + root.
  • (Optional) Manual sanity check: make test-linz on a kernel host; should pass and log linearizable: 200 ops checked in <Xs>.

Forbidden checks

…ycle and isolation

Adds pgregory.net/rapid v1.3.0 as a test dependency and a new
TestRapidStateMachine in ublk/rapid_integration_test.go.

The state machine drives random sequences of create/write/read/fsync/
close actions against up to two live ublk devices and an in-process
shadow model. Invariants checked after every action:

  1. Read returns bytes from the most recent Write (per device).
  2. Bytes written to device A never appear at the same offset on
     device B (cross-device isolation).
  3. Close terminates within a 5 s timer (a hang in del_gendisk would
     otherwise deadlock the test rather than report a failure).
  4. Close is idempotent — a second call must not panic or hang.
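Invariant 3 relies on a common Go pattern: run the call in a goroutine and race it against a timer, so a hang becomes a reported failure. A minimal sketch, with hypothetical names and a short demo timeout in place of the 5 s used by the test:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errCloseTimeout = errors.New("close did not return before the deadline")

// demoTimeout is shortened for the example; the test described above uses 5 s.
const demoTimeout = 50 * time.Millisecond

// closeWithTimeout runs closeFn in a goroutine and gives up after d.
// The channel is buffered so the goroutine can still finish and exit
// if the caller has already timed out.
func closeWithTimeout(closeFn func() error, d time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- closeFn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errCloseTimeout
	}
}

// slowClose returns a fake Close that takes d to complete.
func slowClose(d time.Duration) func() error {
	return func() error {
		time.Sleep(d)
		return nil
	}
}

func main() {
	fmt.Println(closeWithTimeout(slowClose(0), demoTimeout))
	fmt.Println(closeWithTimeout(slowClose(10*demoTimeout), demoTimeout))
}
```

Note that a timed-out close keeps running in the background; the pattern only turns a hang into a test failure, it does not cancel the stuck call.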

User fds on /dev/ublkbN are closed before Device.Close (AGENTS.md
fd-close-before-Close discipline) so del_gendisk does not block.

Adds a make target test-rapid for filtered local iteration. The new
test runs as part of the existing test-integration CI job; no new CI
job is needed. TODO.md's "Property-based / model-based state machine
tests (rapid)" item is marked done with a pointer to the new file.
@ValentaTomas ValentaTomas requested a review from arkamar as a code owner May 2, 2026 00:19

cursor Bot commented May 2, 2026

PR Summary

Low Risk
Low risk because changes are limited to tests/tooling and a new test-only dependency, with no production code changes. Main risk is longer or flaky integration runs due to timing- and concurrency-sensitive checking.

Overview
Adds an integration-tagged TestPorcupineLinearizability that drives concurrent pread/pwrite workloads against a ublk device, records a real-time history, and checks it with github.com/anishathalye/porcupine (with HTML visualization on failure and a soft-pass on checker timeout).

Updates go.mod/go.sum to include Porcupine, adds make test-linz to run only this test, and removes the corresponding now-completed linearizability TODO entry.

Reviewed by Cursor Bugbot for commit 94bf7da. Bugbot is set up for automated code reviews on this repo.

Comment thread ublk/porcupine_integration_test.go Outdated
The rapidMaxCreates cap was unconditional, so once a Run hit it AND
the last live device was closed, every subsequent action skipped
(create for the cap, read/write/fsync/close for 'no live devices').
rapid then fails the Run with 'can't find a valid (non-skipped)
action'. Reproduced on CI: --- FAIL: TestRapidStateMachine after
~30s of churn that closed all devices and tried to grow past the
cap.

Lift the cap when len(live) == 0 so create is always available as a
recovery path. The cap still bounds runtime in the common case
where at least one device is alive.
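The guard change amounts to the following predicate. This is a sketch: `maxCreates` stands in for rapidMaxCreates with an assumed value, and the real action may apply further conditions (such as the limit on live devices) that are not shown.

```go
package main

import "fmt"

// maxCreates stands in for rapidMaxCreates; the value 8 is an assumption.
const maxCreates = 8

// canCreate reports whether the create action should be offered.
// With no live devices, create is always allowed so the run can recover;
// otherwise the cap on total creates applies.
func canCreate(created, live int) bool {
	if live == 0 {
		return true
	}
	return created < maxCreates
}

func main() {
	fmt.Println(canCreate(maxCreates, 0)) // cap reached, nothing live: allowed
	fmt.Println(canCreate(maxCreates, 1)) // cap reached, one live: skipped
}
```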
@ValentaTomas ValentaTomas force-pushed the tests/porcupine-linearizability branch from e6f5a16 to 45d783e Compare May 2, 2026 00:39
… workload

Adds TestRapidLinearizability — a concurrent worker-pool harness that
records a wall-clock history of pread/pwrite ops against a single
device and feeds it to anishathalye/porcupine to decide whether the
history admits a valid sequential explanation.

The model is one register per 4 KiB block (map[int]uint64, block →
most-recent stamp). Each write embeds a unique 8-byte stamp at the
start of its block; reads recover it. Default workload: 4 workers,
200 ops total, 256 KiB device (64 blocks). Tunable via UBLK_LINZ_OPS
and UBLK_LINZ_WORKERS.

Implemented in a separate file rather than instrumenting the existing
TestRapidStateMachine because rapid drives a strictly sequential
state machine — a sequential history is trivially linearizable. The
new test runs alongside the rapid one in the existing test-integration
CI job (same `integration` build tag), and ships a `make test-linz`
target for iterating on the model or workload in isolation.
…ility

Bugbot pointed out that `TestRapidLinearizability` was being captured
by the existing `test-rapid` Makefile target's `-test.run=TestRapid`
regex even though the test does not use the rapid library. The
target's stated purpose is "run only the rapid property-based
state-machine tests" for quick iteration on shrunk failures, so
having porcupine sneak in adds up to 60s of unrelated work to that
fast loop.

Rename the function (and its doc comment) so it no longer matches
the TestRapid prefix, and update the `test-linz` Makefile target's
`-test.run` regex accordingly. Reference to TestRapidStateMachine
inside the doc comment is intentional — it is the sequential
companion test, still named that way, and porcupine remains its
"global ordering" follow-up.

Reported by Cursor Bugbot on PR #28.
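The capture happens because the run filter is treated as an unanchored regular expression for top-level test names. A small check with the standard regexp package shows the effect; `selected` is a hypothetical helper that approximates the matching, not the go tool's own code.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected approximates how a -test.run pattern picks top-level tests:
// an unanchored regexp match against the test function name.
func selected(pattern, testName string) bool {
	return regexp.MustCompile(pattern).MatchString(testName)
}

func main() {
	for _, name := range []string{
		"TestRapidStateMachine",
		"TestRapidLinearizability",     // old name, caught by the prefix
		"TestPorcupineLinearizability", // new name, no longer caught
	} {
		fmt.Printf("%-30s %v\n", name, selected("TestRapid", name))
	}
}
```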
@ValentaTomas ValentaTomas force-pushed the tests/porcupine-linearizability branch from 45d783e to 5698aca Compare May 2, 2026 00:51
Base automatically changed from tests/rapid-state-machine to main May 4, 2026 17:44
@ValentaTomas ValentaTomas enabled auto-merge (squash) May 4, 2026 17:46
Keep the porcupine target and dependency alongside the newer rapid test support so the branch matches current main and merges cleanly.
Bring in the merged uring fuzz helpers alongside the porcupine and rapid test targets so the branch stays conflict-free on the current main.
Bring in the merged chaos test helpers alongside the porcupine, rapid, and fuzz helpers so the branch stays conflict-free on the current main.
@ValentaTomas ValentaTomas merged commit 998c5d1 into main May 4, 2026
13 checks passed
@ValentaTomas ValentaTomas deleted the tests/porcupine-linearizability branch May 4, 2026 17:58
@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.


// fail — porcupine is NP-hard and false-timeouts are real.
t.Logf("linearizability check TIMED OUT after %v (history len=%d) "+
"— increase UBLK_LINZ_OPS to keep histories small, or "+
"raise the checkTimeout in this test", checkTimeout, len(history))

Timeout guidance says "increase" but means "decrease"

Low Severity

The timeout log message says "increase UBLK_LINZ_OPS to keep histories small" but UBLK_LINZ_OPS controls the total number of operations — increasing it makes the history larger, making timeouts more likely. The guidance is inverted and would lead a developer to worsen the problem. It needs to say "decrease".
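One possible corrected wording, sketched as a standalone helper so it can be run outside the test. `timeoutGuidance` and `demoCheckTimeout` are hypothetical names; in the test itself the string would go straight into the existing t.Logf call.

```go
package main

import (
	"fmt"
	"time"
)

// demoCheckTimeout mirrors the 30 s budget mentioned in the PR description.
const demoCheckTimeout = 30 * time.Second

// timeoutGuidance builds the log message with the direction fixed:
// fewer operations means a smaller history and a cheaper check.
func timeoutGuidance(checkTimeout time.Duration, historyLen int) string {
	return fmt.Sprintf("linearizability check TIMED OUT after %v (history len=%d); "+
		"decrease UBLK_LINZ_OPS to keep histories small, or "+
		"raise the checkTimeout in this test", checkTimeout, historyLen)
}

func main() {
	fmt.Println(timeoutGuidance(demoCheckTimeout, 200))
}
```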


2 participants