Skip to content

test(ublk): rapid property-based state-machine tests#27

Merged
ValentaTomas merged 3 commits into
mainfrom
tests/rapid-state-machine
May 4, 2026
Merged

test(ublk): rapid property-based state-machine tests#27
ValentaTomas merged 3 commits into
mainfrom
tests/rapid-state-machine

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 2, 2026

Summary

Adds pgregory.net/rapid v1.3.0 as a test dependency and a new TestRapidStateMachine integration test in ublk/rapid_integration_test.go.

The state machine uses rapid's t.Repeat API to generate pseudo-random sequences of five actions:

  • createublk.New a fresh device (capped at 2 live, 16 total per Run, ~10 ms each on a warm host)
  • writepwrite(O_DIRECT) random data at a block-aligned offset on a randomly chosen live device, then update the per-device shadow
  • readpread(O_DIRECT) and assert bytes equal the per-device shadow
  • fsyncunix.Fsync on a live device
  • close — full Close cycle: close the user fd first, run dev.Close() under a 5 s timer, then call dev.Close() a second time and assert it does not hang or error

A separate empty-key handler runs after every action and probes one block on every live device against the shadow.

Constraints (kept tight to make each Run cheap):

  • block size 4096, device size 2 MiB
  • write/read lengths drawn from {512, 4096, 8192}, offsets always block-aligned
  • max 2 devices alive simultaneously, 16 creates per Run

Invariants checked

  1. A Read returns bytes from the most recent Write at that range (per device).
  2. Bytes written to device A never appear at the same offset on device B (cross-device isolation — falls out of (1) since each device has an independent shadow that the per-action invariant probes against).
  3. Close terminates within 5 s (closeWithDeadline runs Close in a goroutine and times out via select — a hang in del_gendisk would otherwise deadlock rather than report a shrinkable failure).
  4. Close is idempotent — a second call to Close must not panic or hang (mirrors TestCloseIdempotent).

fd-close-before-Close discipline

Per AGENTS.md: any test that opens /dev/ublkbN must close that fd before calling dev.Close(), otherwise del_gendisk blocks indefinitely. The state machine's closeDevice action does unix.Close(fd) first, and the per-Run cleanup (defer) does the same for any devices left alive at the end of a Run.

Why this is distinct from TestTortureRandomIO

TestTortureRandomIO is a long-running soak with fixed structure — N workers on disjoint regions of one device, doing random I/O within their region for the duration of the run. It does not exercise lifecycle transitions (only one device, no close mid-stream), and when it fails it gives you the failing op as-is — no minimization.

TestRapidStateMachine generates arbitrary command sequences including lifecycle transitions (create / close mid-stream, multiple devices), and rapid's automatic shrinking reduces any failing sequence to a minimal reproducer. That's the primary value: a 1000-action failing case shrinks to (typically) a handful of actions you can stare at.

Tooling

  • make test-rapid — builds the integration binary and runs only TestRapid*. For local iteration on a shrunk failing case.
  • The test runs as part of the existing test-integration CI job; no new CI job needed.
  • Tunable via standard rapid flags / env vars: -rapid.checks=N or RAPID_CHECKS=N. See go test -args -h for the full list.

Test plan

  • go vet ./... — clean
  • golangci-lint run ./...0 issues.
  • go test -count=1 -race ./ublk/uring/ ./ublk/ — pass
  • go test -c -tags=integration -o /tmp/ublk.test ./ublk/ — compiles
  • gofmt -l . — empty
  • go mod tidy — clean
  • Integration run on host with root + ublk_drv: deferred to CI's test-integration job (requires kernel module + privilege not available in the dev sandbox).

…ycle and isolation

Adds pgregory.net/rapid v1.3.0 as a test dependency and a new
TestRapidStateMachine in ublk/rapid_integration_test.go.

The state machine drives random sequences of create/write/read/fsync/
close actions against up to two live ublk devices and an in-process
shadow model. Invariants checked after every action:

  1. Read returns bytes from the most recent Write (per device).
  2. Bytes written to device A never appear at the same offset on
     device B (cross-device isolation).
  3. Close terminates within a 5 s timer (a hang in del_gendisk would
     otherwise deadlock the test rather than report a failure).
  4. Close is idempotent — a second call must not panic or hang.

User fds on /dev/ublkbN are closed before Device.Close (AGENTS.md
fd-close-before-Close discipline) so del_gendisk does not block.

Adds a make target test-rapid for filtered local iteration. The new
test runs as part of the existing test-integration CI job; no new CI
job is needed. TODO.md's "Property-based / model-based state machine
tests (rapid)" item is marked done with a pointer to the new file.
@ValentaTomas ValentaTomas requested a review from arkamar as a code owner May 2, 2026 00:11
@cursor
Copy link
Copy Markdown

cursor Bot commented May 2, 2026

PR Summary

Medium Risk
Adds a randomized, property-based integration test that runs as root against real ublk devices; this can increase CI runtime/flakiness and exercises device create/close paths that can hang if assumptions change.

Overview
Adds pgregory.net/rapid as a dependency and introduces a new TestRapidStateMachine integration test that generates/shrinks random sequences of create/write/read/fsync/close operations while checking a per-device shadow model, cross-device isolation, bounded Close() completion, and Close() idempotency.

Extends developer tooling with make test-rapid to run only the new TestRapid* tests, and updates TODO.md by removing the now-completed rapid testing item.

Reviewed by Cursor Bugbot for commit ef6a719. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 02ac51e. Configure here.

Comment thread ublk/rapid_integration_test.go
The rapidMaxCreates cap was unconditional, so once a Run hit it AND
the last live device was closed, every subsequent action skipped
(create for the cap, read/write/fsync/close for 'no live devices').
rapid then fails the Run with 'can't find a valid (non-skipped)
action'. Reproduced on CI: --- FAIL: TestRapidStateMachine after
~30s of churn that closed all devices and tried to grow past the
cap.

Lift the cap when len(live) == 0 so create is always available as a
recovery path. The cap still bounds runtime in the common case
where at least one device is alive.
ValentaTomas added a commit that referenced this pull request May 4, 2026
## Summary

The Azure-hosted apt mirror (\`azure.archive.ubuntu.com\`) returns
\`Temporary failure resolving\` periodically — observed today on PRs
#24, #26, and #27, all blocked by the same DNS blip while installing
\`linux-modules-extra\`.

This PR wraps the \`apt-get update\` + install in a 5-attempt
exponential backoff (5s, 10s, 20s, 40s, 80s — ~155s worst case before
giving up). The fast path (\`modprobe ublk_drv\` succeeding because the
module is already present in the runner image) is unchanged.

No code changes; CI workflow only.

## Test plan

- [ ] CI run for this branch loads ublk_drv and runs integration tests
cleanly.
- [ ] If the apt mirror flakes during the run, the retry should absorb
it; logs will show \`attempt N failed; sleeping...\`.
@ValentaTomas ValentaTomas merged commit 197e045 into main May 4, 2026
12 checks passed
@ValentaTomas ValentaTomas deleted the tests/rapid-state-machine branch May 4, 2026 17:44
ValentaTomas added a commit that referenced this pull request May 4, 2026
**Stacked on:** #27 (`tests/rapid-state-machine`).

Adds `TestRapidLinearizability` (in
`ublk/porcupine_integration_test.go`) — a porcupine-driven
linearizability checker for concurrent reads and writes against a single
ublk device.

## Why this catches things rapid alone doesn't

`TestRapidStateMachine` (PR #27) checks a per-operation invariant: after
every action, the device's bytes match the model's shadow. That's
sufficient when the test driver issues actions sequentially — which it
does, because rapid state machines are sequential by construction.

This PR addresses a different question: **can the global real-time
history of concurrent ops be explained by some valid sequential
ordering?** That's linearizability, the same property Jepsen checks for
distributed databases. A history can pass the per-operation invariant on
every read taken in isolation and still be non-linearizable — e.g. if
two concurrent writes' effects are observed in inconsistent orders by
later reads.

## Implementation choice — Option B

Both options outlined in the spec were on the table:

- **Option A** would have instrumented `TestRapidStateMachine` to record
an operation history. The problem: rapid drives the actions strictly
sequentially, so the history is trivially linearizable and the porcupine
check is pure overhead.
- **Option B** (chosen): a standalone test in its own file driving a
concurrent worker pool. The rapid state machine is preserved untouched
as the per-operation invariant checker; this test is the global-ordering
checker. Cleaner separation of concerns and avoids forcing PR #27's test
into a shape it doesn't want.

## The model

One register per 4 KiB block (`map[int]uint64` — block index →
most-recent stamp). Each write embeds a unique 8-byte stamp (from a
global atomic counter starting at 1) at the start of its 4 KiB block;
reads recover the stamp from the bytes returned. Reads/writes are
constrained to a single block at a time so each op is atomic from the
model's perspective. Stamp 0 is reserved for "never written", which
matches the all-zero bytes the device returns for unwritten blocks.

## The workload

- Single 256 KiB device (64 blocks of 4 KiB).
- Default 4 concurrent goroutines × 50 ops each (200 ops total).
- Each op:
  - `Call = time.Now()` recorded immediately before the syscall.
- Issues `unix.Pread`/`unix.Pwrite` against `/dev/ublkbN` with
`O_DIRECT` and `alignedBuf`.
  - `Return = time.Now()` recorded immediately after.
  - Appended to a mutex-protected `[]porcupine.Operation`.
- After the workload phase: `porcupine.CheckOperationsVerbose(model,
history, 30s)`. Illegal histories are rendered to a HTML visualization
via `porcupine.VisualizePath` and the test fails with the path logged.
`Unknown` (timeout) is logged as a soft pass with guidance to either
shrink the history or grow the budget.

Tunables: `UBLK_LINZ_OPS` (default 200) and `UBLK_LINZ_WORKERS` (default
4).

## Tooling

- `make test-linz` runs only this test against an integration-tagged
binary.
- No new CI job needed: the existing `test-integration` job already runs
every `//go:build integration` test in `./ublk/`. Verified in
`.github/workflows/ci.yml`.
- `TODO.md` "Linearizability checking" bullet replaced with a `(done)`
summary.
- Pinned dependency: `github.com/anishathalye/porcupine v1.1.0`.

## fd-close-before-Close discipline

Per `AGENTS.md`: the user fd opened on `/dev/ublkbN` is closed before
`dev.Close()` (in a `t.Cleanup`), otherwise `del_gendisk` blocks waiting
for the open ref to drop. Documented inline.

## Test plan

- [x] `go vet ./...`
- [x] `golangci-lint run ./...` → `0 issues.`
- [x] `go test -count=1 -race ./ublk/uring/ ./ublk/`
- [x] `go test -c -tags=integration -o /tmp/ublk.test ./ublk/` compiles
- [x] `gofmt -l .` empty
- [x] `go mod tidy -diff` clean
- [x] CI's `test-integration` job exercises `TestRapidLinearizability`
end-to-end on a host with `ublk_drv` + root.
- [ ] (Optional) Manual sanity check: `make test-linz` on a kernel host;
should pass and log `linearizable: 200 ops checked in <Xs>`.

## Forbidden checks

- No library code touched (`ublk.go`, `worker.go`, `device.go`,
`types.go`, `ublk/uring/*` untouched).
- PR #1 (`fuzz_test.go`) and PR #2 (`chaos_integration_test.go`) files
not modified.
- No `t.Skip` for missing root or kernel.
- `TestRapidStateMachine` and the rest of PR #27 untouched (additive
only).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants