test(ublk): chaos backend with partial failure rates and latency#26
Conversation
…ction Adds a chaosBackend wrapper (test-only, behind //go:build integration) that wraps any Backend and probabilistically returns unix.EIO and/or inserts uniform random [0, MaxDelay] latency before delegating. The configuration is mutable behind a mutex and backed by a deterministic math/rand/v2 PRNG seeded per-test so failures reproduce. Three integration tests: - TestChaosErrorsPropagateAsEIO: 50% write / 50% read error rate drives ~200 direct-IO block ops; asserts every non-nil error is EIO and observed error fraction lies in [0.30, 0.70] so the wrapper is demonstrably active. - TestChaosCloseTerminatesUnderLatency: 0% errors, MaxDelay=50ms, two concurrent IO goroutines for ~1s; asserts dev.Close() returns within 10s once user fds are closed (AGENTS.md fd-close-before-Close discipline). Mirrors TestCloseAfterBackendErrors. - TestChaosRecovery: 100% write error rate for N writes (each must return EIO), then flips the wrapper to passthrough and asserts N write+read roundtrips return the exact bytes written, verifying no residual corruption from the error phase. Distinct from fault_integration_test.go, which only uses fully-on or fully-off failure modes with no latency injection. Chaos exercises partial failure rates and latency, the realistic failure mode for remote or unreliable storage backends. Tooling: - Makefile: add test-chaos target (runs -test.run=TestChaos on the integration binary) and register it in .PHONY. - TODO.md: replace the "Probabilistic chaos backend" bullet with a "(done)" summary referencing the new file. - No CI changes needed: the existing test-integration job already picks up all //go:build integration tests in ./ublk/. No new module dependencies, no changes to library code.
PR SummaryLow Risk Overview Updates the Makefile with a Reviewed by Cursor Bugbot for commit 9d18e23. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit cccab5f. Configure here.
Previously WriteAt/ReadAt read writeErrorRate / readErrorRate without holding the mutex while setRates writes them under it. The race detector would have caught this as soon as TestChaosRecovery called setRates concurrently with in-flight IO. Move the rate read into sampleDecision (already under the mutex) via a chaosOp tag, so the entire decision is taken atomically. Spotted by Cursor Bugbot on #26.
## Summary The Azure-hosted apt mirror (\`azure.archive.ubuntu.com\`) returns \`Temporary failure resolving\` periodically — observed today on PRs #24, #26, and #27, all blocked by the same DNS blip while installing \`linux-modules-extra\`. This PR wraps the \`apt-get update\` + install in a 5-attempt exponential backoff (5s, 10s, 20s, 40s, 80s — ~155s worst case before giving up). The fast path (\`modprobe ublk_drv\` succeeding because the module is already present in the runner image) is unchanged. No code changes; CI workflow only. ## Test plan - [ ] CI run for this branch loads ublk_drv and runs integration tests cleanly. - [ ] If the apt mirror flakes during the run, the retry should absorb it; logs will show \`attempt N failed; sleeping...\`.
matthewlouisbrockman
left a comment
There was a problem hiding this comment.
g2g after the merge conflicts
Keep the new rapid-test helpers alongside the chaos test helpers and preserve both completed TODO removals so the branch matches current main cleanly.
Bring in the merged uring fuzz helpers alongside the chaos and rapid test targets so the branch stays conflict-free on the current main.

Summary
Adds a
chaosBackendtest wrapper (integration build tag) that wraps anyBackendand probabilistically returnsunix.EIOand/or injects uniform random[0, MaxDelay]latency before delegating to the wrapped implementation. Configuration is mutable behind a mutex, backed by a deterministicmath/rand/v2PRNG seeded per-test for reproducibility, with atomic counters forWrites,Reads,WriteErrs,ReadErrs.No library code changes; this is purely a test-side addition.
Tests
TestChaosErrorsPropagateAsEIO— 50% write / 50% read error rate, no latency; drives ~200 direct-IO block ops and asserts every non-nil error isEIOand the observed error fraction lands in[0.30, 0.70]so the wrapper is demonstrably active.TestChaosCloseTerminatesUnderLatency— 0% errors,MaxDelay = 50ms, two concurrent IO goroutines for ~1s; assertsdev.Close()returns within 10s after user fds are closed (respects AGENTS.md fd-close-before-Close discipline; mirrorsTestCloseAfterBackendErrors).TestChaosRecovery— 100% write error rate for N writes (each must return EIO), then flips the wrapper to passthrough and asserts N write+read roundtrips return the exact bytes written, catching any residual corruption from the error phase.Why distinct from
fault_integration_test.goThe existing fault tests only exercise fully-on or fully-off failure modes with a static config and no latency injection. Chaos adds partial failure rates and per-call latency — the realistic failure mode for remote or unreliable storage backends — and validates clean-shutdown and recovery semantics in that regime.
Tooling
Makefile: newtest-chaostarget runs only-test.run=TestChaosagainst the integration binary; added to.PHONY.TODO.md: replaced the "Probabilistic chaos backend" bullet with a(done)summary referencingublk/chaos_integration_test.go.test-integrationjob already builds the integration binary and runs all tests under theintegrationbuild tag, so the new tests are picked up automatically.Test plan
go vet ./...golangci-lint run ./...→ 0 issuesgo test -count=1 -race ./ublk/uring/ ./ublk/(chaos tests skipped without integration tag)go test -c -tags=integration -o /tmp/ublk.test ./ublk/(compile-only; CI runs it as root)gofmt -l .emptygo mod tidy -diffcleantest-integrationjob passes on both amd64 and arm64