Skip to content

test(ublk): chaos backend with partial failure rates and latency#26

Merged
ValentaTomas merged 5 commits into
mainfrom
tests/chaos-backend
May 4, 2026
Merged

test(ublk): chaos backend with partial failure rates and latency#26
ValentaTomas merged 5 commits into
mainfrom
tests/chaos-backend

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

@ValentaTomas ValentaTomas commented May 2, 2026

Summary

Adds a chaosBackend test wrapper (integration build tag) that wraps any Backend and probabilistically returns unix.EIO and/or injects uniform random [0, MaxDelay] latency before delegating to the wrapped implementation. Configuration is mutable behind a mutex, backed by a deterministic math/rand/v2 PRNG seeded per-test for reproducibility, with atomic counters for Writes, Reads, WriteErrs, ReadErrs.

No library code changes; this is purely a test-side addition.

Tests

  • TestChaosErrorsPropagateAsEIO — 50% write / 50% read error rate, no latency; drives ~200 direct-IO block ops and asserts every non-nil error is EIO and the observed error fraction lands in [0.30, 0.70] so the wrapper is demonstrably active.
  • TestChaosCloseTerminatesUnderLatency — 0% errors, MaxDelay = 50ms, two concurrent IO goroutines for ~1s; asserts dev.Close() returns within 10s after user fds are closed (respects AGENTS.md fd-close-before-Close discipline; mirrors TestCloseAfterBackendErrors).
  • TestChaosRecovery — 100% write error rate for N writes (each must return EIO), then flips the wrapper to passthrough and asserts N write+read roundtrips return the exact bytes written, catching any residual corruption from the error phase.

Why distinct from fault_integration_test.go

The existing fault tests only exercise fully-on or fully-off failure modes with a static config and no latency injection. Chaos adds partial failure rates and per-call latency — the realistic failure mode for remote or unreliable storage backends — and validates clean-shutdown and recovery semantics in that regime.

Tooling

  • Makefile: new test-chaos target runs only -test.run=TestChaos against the integration binary; added to .PHONY.
  • TODO.md: replaced the "Probabilistic chaos backend" bullet with a (done) summary referencing ublk/chaos_integration_test.go.
  • No CI changes: the existing test-integration job already builds the integration binary and runs all tests under the integration build tag, so the new tests are picked up automatically.

Test plan

  • go vet ./...
  • golangci-lint run ./... → 0 issues
  • go test -count=1 -race ./ublk/uring/ ./ublk/ (chaos tests skipped without integration tag)
  • go test -c -tags=integration -o /tmp/ublk.test ./ublk/ (compile-only; CI runs it as root)
  • gofmt -l . empty
  • go mod tidy -diff clean
  • CI test-integration job passes on both amd64 and arm64

…ction

Adds a chaosBackend wrapper (test-only, behind //go:build integration)
that wraps any Backend and probabilistically returns unix.EIO and/or
inserts uniform random [0, MaxDelay] latency before delegating. The
configuration is mutable behind a mutex and backed by a deterministic
math/rand/v2 PRNG seeded per-test so failures reproduce.

Three integration tests:

- TestChaosErrorsPropagateAsEIO: 50% write / 50% read error rate
  drives ~200 direct-IO block ops; asserts every non-nil error is EIO
  and observed error fraction lies in [0.30, 0.70] so the wrapper is
  demonstrably active.
- TestChaosCloseTerminatesUnderLatency: 0% errors, MaxDelay=50ms, two
  concurrent IO goroutines for ~1s; asserts dev.Close() returns within
  10s once user fds are closed (AGENTS.md fd-close-before-Close
  discipline). Mirrors TestCloseAfterBackendErrors.
- TestChaosRecovery: 100% write error rate for N writes (each must
  return EIO), then flips the wrapper to passthrough and asserts N
  write+read roundtrips return the exact bytes written, verifying no
  residual corruption from the error phase.

Distinct from fault_integration_test.go, which only uses fully-on or
fully-off failure modes with no latency injection. Chaos exercises
partial failure rates and latency, the realistic failure mode for
remote or unreliable storage backends.

Tooling:
- Makefile: add test-chaos target (runs -test.run=TestChaos on the
  integration binary) and register it in .PHONY.
- TODO.md: replace the "Probabilistic chaos backend" bullet with a
  "(done)" summary referencing the new file.
- No CI changes needed: the existing test-integration job already
  picks up all //go:build integration tests in ./ublk/.

No new module dependencies, no changes to library code.
@ValentaTomas ValentaTomas requested a review from arkamar as a code owner May 2, 2026 00:08
@cursor
Copy link
Copy Markdown

cursor Bot commented May 2, 2026

PR Summary

Low Risk
Test-only changes, but the new root/O_DIRECT integration tests could add flakiness or longer runtimes in environments with different kernel/IO behavior.

Overview
Adds an integration-tagged chaos test wrapper that injects probabilistic EIO and per-call latency around a Backend, plus new integration tests covering error propagation, Device.Close() termination under latency, and recovery after disabling injected failures.

Updates the Makefile with a test-chaos target to run only TestChaos*, and removes the corresponding TODO item now that the chaos backend tests exist.

Reviewed by Cursor Bugbot for commit 9d18e23. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cccab5f. Configure here.

Comment thread ublk/chaos_integration_test.go Outdated
Previously WriteAt/ReadAt read writeErrorRate / readErrorRate
without holding the mutex while setRates writes them under it. The
race detector would have caught this as soon as TestChaosRecovery
called setRates concurrently with in-flight IO. Move the rate read
into sampleDecision (already under the mutex) via a chaosOp tag, so
the entire decision is taken atomically. Spotted by Cursor Bugbot
on #26.
ValentaTomas added a commit that referenced this pull request May 4, 2026
## Summary

The Azure-hosted apt mirror (\`azure.archive.ubuntu.com\`) returns
\`Temporary failure resolving\` periodically — observed today on PRs
#24, #26, and #27, all blocked by the same DNS blip while installing
\`linux-modules-extra\`.

This PR wraps the \`apt-get update\` + install in a 5-attempt
exponential backoff (5s, 10s, 20s, 40s, 80s — ~155s worst case before
giving up). The fast path (\`modprobe ublk_drv\` succeeding because the
module is already present in the runner image) is unchanged.

No code changes; CI workflow only.

## Test plan

- [ ] CI run for this branch loads ublk_drv and runs integration tests
cleanly.
- [ ] If the apt mirror flakes during the run, the retry should absorb
it; logs will show \`attempt N failed; sleeping...\`.
Copy link
Copy Markdown

@matthewlouisbrockman matthewlouisbrockman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

g2g after the merge conflicts

Keep the new rapid-test helpers alongside the chaos test helpers and preserve both completed TODO removals so the branch matches current main cleanly.
Bring in the merged uring fuzz helpers alongside the chaos and rapid test targets so the branch stays conflict-free on the current main.
@ValentaTomas ValentaTomas merged commit 9739b2a into main May 4, 2026
13 checks passed
@ValentaTomas ValentaTomas deleted the tests/chaos-backend branch May 4, 2026 17:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants