test(ublk): rapid property-based state-machine tests by ValentaTomas · Pull Request #27 · e2b-dev/ublk-go

ValentaTomas · 2026-05-02T00:11:05Z

Summary

Adds pgregory.net/rapid v1.3.0 as a test dependency and a new TestRapidStateMachine integration test in ublk/rapid_integration_test.go.

The state machine uses rapid's t.Repeat API to generate pseudo-random sequences of five actions:

create — ublk.New a fresh device (capped at 2 live, 16 total per Run, ~10 ms each on a warm host)
write — pwrite(O_DIRECT) random data at a block-aligned offset on a randomly chosen live device, then update the per-device shadow
read — pread(O_DIRECT) and assert bytes equal the per-device shadow
fsync — unix.Fsync on a live device
close — full Close cycle: close the user fd first, run dev.Close() under a 5 s timer, then call dev.Close() a second time and assert it does not hang or error

A separate empty-key handler runs after every action and probes one block on every live device against the shadow.

Constraints (kept tight to make each Run cheap):

block size 4096, device size 2 MiB
write/read lengths drawn from {512, 4096, 8192}, offsets always block-aligned
max 2 devices alive simultaneously, 16 creates per Run

Invariants checked

A Read returns bytes from the most recent Write at that range (per device).
Bytes written to device A never appear at the same offset on device B (cross-device isolation — falls out of (1) since each device has an independent shadow that the per-action invariant probes against).
Close terminates within 5 s (closeWithDeadline runs Close in a goroutine and times out via select — a hang in del_gendisk would otherwise deadlock rather than report a shrinkable failure).
Close is idempotent — a second call to Close must not panic or hang (mirrors TestCloseIdempotent).

fd-close-before-Close discipline

Per AGENTS.md: any test that opens /dev/ublkbN must close that fd before calling dev.Close(), otherwise del_gendisk blocks indefinitely. The state machine's closeDevice action does unix.Close(fd) first, and the per-Run cleanup (defer) does the same for any devices left alive at the end of a Run.

Why this is distinct from `TestTortureRandomIO`

TestTortureRandomIO is a long-running soak with fixed structure — N workers on disjoint regions of one device, doing random I/O within their region for the duration of the run. It does not exercise lifecycle transitions (only one device, no close mid-stream), and when it fails it gives you the failing op as-is — no minimization.

TestRapidStateMachine generates arbitrary command sequences including lifecycle transitions (create / close mid-stream, multiple devices), and rapid's automatic shrinking reduces any failing sequence to a minimal reproducer. That's the primary value: a 1000-action failing case shrinks to (typically) a handful of actions you can stare at.

Tooling

make test-rapid — builds the integration binary and runs only TestRapid*. For local iteration on a shrunk failing case.
The test runs as part of the existing test-integration CI job; no new CI job needed.
Tunable via standard rapid flags / env vars: -rapid.checks=N or RAPID_CHECKS=N. See go test -args -h for the full list.

Test plan

go vet ./... — clean
golangci-lint run ./... — 0 issues.
go test -count=1 -race ./ublk/uring/ ./ublk/ — pass
go test -c -tags=integration -o /tmp/ublk.test ./ublk/ — compiles
gofmt -l . — empty
go mod tidy — clean
Integration run on host with root + ublk_drv: deferred to CI's test-integration job (requires kernel module + privilege not available in the dev sandbox).

…ycle and isolation Adds pgregory.net/rapid v1.3.0 as a test dependency and a new TestRapidStateMachine in ublk/rapid_integration_test.go. The state machine drives random sequences of create/write/read/fsync/ close actions against up to two live ublk devices and an in-process shadow model. Invariants checked after every action: 1. Read returns bytes from the most recent Write (per device). 2. Bytes written to device A never appear at the same offset on device B (cross-device isolation). 3. Close terminates within a 5 s timer (a hang in del_gendisk would otherwise deadlock the test rather than report a failure). 4. Close is idempotent — a second call must not panic or hang. User fds on /dev/ublkbN are closed before Device.Close (AGENTS.md fd-close-before-Close discipline) so del_gendisk does not block. Adds a make target test-rapid for filtered local iteration. The new test runs as part of the existing test-integration CI job; no new CI job is needed. TODO.md's "Property-based / model-based state machine tests (rapid)" item is marked done with a pointer to the new file.

cursor · 2026-05-02T00:11:09Z

PR Summary

Medium Risk
Adds a randomized, property-based integration test that runs as root against real ublk devices; this can increase CI runtime/flakiness and exercises device create/close paths that can hang if assumptions change.

Overview
Adds pgregory.net/rapid as a dependency and introduces a new TestRapidStateMachine integration test that generates/shrinks random sequences of create/write/read/fsync/close operations while checking a per-device shadow model, cross-device isolation, bounded Close() completion, and Close() idempotency.

Extends developer tooling with make test-rapid to run only the new TestRapid* tests, and updates TODO.md by removing the now-completed rapid testing item.

^{Reviewed by Cursor Bugbot for commit ef6a719. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 02ac51e. Configure here.}

The rapidMaxCreates cap was unconditional, so once a Run hit it AND the last live device was closed, every subsequent action skipped (create for the cap, read/write/fsync/close for 'no live devices'). rapid then fails the Run with 'can't find a valid (non-skipped) action'. Reproduced on CI: --- FAIL: TestRapidStateMachine after ~30s of churn that closed all devices and tried to grow past the cap. Lift the cap when len(live) == 0 so create is always available as a recovery path. The cap still bounds runtime in the common case where at least one device is alive.

…s (rapid)" entry

## Summary The Azure-hosted apt mirror (\`azure.archive.ubuntu.com\`) returns \`Temporary failure resolving\` periodically — observed today on PRs #24, #26, and #27, all blocked by the same DNS blip while installing \`linux-modules-extra\`. This PR wraps the \`apt-get update\` + install in a 5-attempt exponential backoff (5s, 10s, 20s, 40s, 80s — ~155s worst case before giving up). The fast path (\`modprobe ublk_drv\` succeeding because the module is already present in the runner image) is unchanged. No code changes; CI workflow only. ## Test plan - [ ] CI run for this branch loads ublk_drv and runs integration tests cleanly. - [ ] If the apt mirror flakes during the run, the retry should absorb it; logs will show \`attempt N failed; sleeping...\`.

**Stacked on:** #27 (`tests/rapid-state-machine`). Adds `TestRapidLinearizability` (in `ublk/porcupine_integration_test.go`) — a porcupine-driven linearizability checker for concurrent reads and writes against a single ublk device. ## Why this catches things rapid alone doesn't `TestRapidStateMachine` (PR #27) checks a per-operation invariant: after every action, the device's bytes match the model's shadow. That's sufficient when the test driver issues actions sequentially — which it does, because rapid state machines are sequential by construction. This PR addresses a different question: **can the global real-time history of concurrent ops be explained by some valid sequential ordering?** That's linearizability, the same property Jepsen checks for distributed databases. A history can pass the per-operation invariant on every read taken in isolation and still be non-linearizable — e.g. if two concurrent writes' effects are observed in inconsistent orders by later reads. ## Implementation choice — Option B Both options outlined in the spec were on the table: - **Option A** would have instrumented `TestRapidStateMachine` to record an operation history. The problem: rapid drives the actions strictly sequentially, so the history is trivially linearizable and the porcupine check is pure overhead. - **Option B** (chosen): a standalone test in its own file driving a concurrent worker pool. The rapid state machine is preserved untouched as the per-operation invariant checker; this test is the global-ordering checker. Cleaner separation of concerns and avoids forcing PR #27's test into a shape it doesn't want. ## The model One register per 4 KiB block (`map[int]uint64` — block index → most-recent stamp). Each write embeds a unique 8-byte stamp (from a global atomic counter starting at 1) at the start of its 4 KiB block; reads recover the stamp from the bytes returned. Reads/writes are constrained to a single block at a time so each op is atomic from the model's perspective. Stamp 0 is reserved for "never written", which matches the all-zero bytes the device returns for unwritten blocks. ## The workload - Single 256 KiB device (64 blocks of 4 KiB). - Default 4 concurrent goroutines × 50 ops each (200 ops total). - Each op: - `Call = time.Now()` recorded immediately before the syscall. - Issues `unix.Pread`/`unix.Pwrite` against `/dev/ublkbN` with `O_DIRECT` and `alignedBuf`. - `Return = time.Now()` recorded immediately after. - Appended to a mutex-protected `[]porcupine.Operation`. - After the workload phase: `porcupine.CheckOperationsVerbose(model, history, 30s)`. Illegal histories are rendered to a HTML visualization via `porcupine.VisualizePath` and the test fails with the path logged. `Unknown` (timeout) is logged as a soft pass with guidance to either shrink the history or grow the budget. Tunables: `UBLK_LINZ_OPS` (default 200) and `UBLK_LINZ_WORKERS` (default 4). ## Tooling - `make test-linz` runs only this test against an integration-tagged binary. - No new CI job needed: the existing `test-integration` job already runs every `//go:build integration` test in `./ublk/`. Verified in `.github/workflows/ci.yml`. - `TODO.md` "Linearizability checking" bullet replaced with a `(done)` summary. - Pinned dependency: `github.com/anishathalye/porcupine v1.1.0`. ## fd-close-before-Close discipline Per `AGENTS.md`: the user fd opened on `/dev/ublkbN` is closed before `dev.Close()` (in a `t.Cleanup`), otherwise `del_gendisk` blocks waiting for the open ref to drop. Documented inline. ## Test plan - [x] `go vet ./...` - [x] `golangci-lint run ./...` → `0 issues.` - [x] `go test -count=1 -race ./ublk/uring/ ./ublk/` - [x] `go test -c -tags=integration -o /tmp/ublk.test ./ublk/` compiles - [x] `gofmt -l .` empty - [x] `go mod tidy -diff` clean - [x] CI's `test-integration` job exercises `TestRapidLinearizability` end-to-end on a host with `ublk_drv` + root. - [ ] (Optional) Manual sanity check: `make test-linz` on a kernel host; should pass and log `linearizable: 200 ops checked in <Xs>`. ## Forbidden checks - No library code touched (`ublk.go`, `worker.go`, `device.go`, `types.go`, `ublk/uring/*` untouched). - PR #1 (`fuzz_test.go`) and PR #2 (`chaos_integration_test.go`) files not modified. - No `t.Skip` for missing root or kernel. - `TestRapidStateMachine` and the rest of PR #27 untouched (additive only).

ValentaTomas requested a review from arkamar as a code owner May 2, 2026 00:11

ValentaTomas mentioned this pull request May 2, 2026

test(ublk): porcupine linearizability on concurrent reads/writes #28

Merged

8 tasks

cursor Bot reviewed May 2, 2026

View reviewed changes

Comment thread ublk/rapid_integration_test.go

ValentaTomas mentioned this pull request May 2, 2026

ci: retry apt-get for ublk_drv install (Azure mirror flake) #29

Merged

2 tasks

ValentaTomas enabled auto-merge (squash) May 2, 2026 00:24

ValentaTomas added 2 commits May 1, 2026 17:39

docs(TODO): drop now-implemented "Property-based / state machine test…

ef6a719

…s (rapid)" entry

matthewlouisbrockman approved these changes May 4, 2026

View reviewed changes

ValentaTomas merged commit 197e045 into main May 4, 2026
12 checks passed

ValentaTomas deleted the tests/rapid-state-machine branch May 4, 2026 17:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(ublk): rapid property-based state-machine tests#27

test(ublk): rapid property-based state-machine tests#27
ValentaTomas merged 3 commits into
mainfrom
tests/rapid-state-machine

ValentaTomas commented May 2, 2026 •

edited

Loading

Uh oh!

cursor Bot commented May 2, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ValentaTomas commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Invariants checked

fd-close-before-Close discipline

Why this is distinct from TestTortureRandomIO

Tooling

Test plan

Uh oh!

cursor Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ValentaTomas commented May 2, 2026 •

edited

Loading

Why this is distinct from `TestTortureRandomIO`

cursor Bot commented May 2, 2026 •

edited

Loading