test_sync.py is a 20-test validation suite that exercises the Lightbits
snapshot diff API and the block-copy pipeline built on top of it. It is the
primary tool for verifying that the changed-blocks path works correctly across
a range of realistic workloads, edge cases, and production failure modes.
- What it validates
- Prerequisites
- Environment setup
- Finding volume UUIDs and block devices
- Running the tests
- Test descriptions and expected outcomes
- Volume size budget
- Output format
- Troubleshooting
The suite verifies two orthogonal properties of the changed-blocks pipeline:
API accuracy — the diff endpoint (GET /api/v2/projects/{project}/snapshots/diff/{snapshotUUID})
returns lbaRanges that correctly describe which 4 KiB sectors changed between two
snapshots. Each test asserts that:
- Regions that were written appear in the diff.
- Regions that were not written do not appear in the diff.
- Bitmaps at sub-chunk boundaries encode the correct partial bit counts.
- Paginated responses are fully stitched (all pages fetched, no ranges dropped).
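As a rough illustration of how a client might address the diff endpoint, the sketch below builds the request URL. The helper name `diff_url` is hypothetical, and passing `baseSnapshotUUID` as a query parameter is an assumption; the text above only gives the path and the parameter name.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def diff_url(endpoint: str, project: str, snapshot_uuid: str,
             base_snapshot_uuid: Optional[str] = None) -> str:
    """Build the snapshot-diff URL; omit the base snapshot for a full scan."""
    path = "/api/v2/projects/{}/snapshots/diff/{}".format(
        quote(project, safe=""), quote(snapshot_uuid, safe=""))
    url = endpoint.rstrip("/") + path
    if base_snapshot_uuid:
        # ASSUMPTION: the base snapshot is supplied as a query parameter.
        url += "?" + urlencode({"baseSnapshotUUID": base_snapshot_uuid})
    return url
```

Calling `diff_url(endpoint, "default", snap2, snap1)` would then describe an incremental diff, while omitting the last argument describes a full scan.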
Content integrity — after calling copy_ranges() to replay the diff onto a
target volume, the sha256 checksum of every written region must match between source
and target. Checksums are computed with external dd | sha256sum invocations directly
on the block devices to avoid any in-process caching effect.
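A minimal sketch of the external checksum step, assuming GNU `dd` and `sha256sum` are on the PATH. The helper names are hypothetical, and the `iflag=direct` and `status=none` options are this sketch's choice rather than something the suite is known to use.

```python
import shlex
import subprocess


def checksum_cmd(device: str, offset_mb: int, length_mb: int) -> str:
    """Return a `dd | sha256sum` pipeline for one region of a block device."""
    dd = ["dd", f"if={device}", "bs=1M", f"skip={offset_mb}",
          f"count={length_mb}", "iflag=direct", "status=none"]
    return " ".join(shlex.quote(arg) for arg in dd) + " | sha256sum"


def region_sha256(device: str, offset_mb: int, length_mb: int) -> str:
    """Run the pipeline and return the hex digest (first field of the output).

    Note: without `pipefail`, check=True only sees the exit status of
    sha256sum, so a failing dd is not detected by this sketch.
    """
    out = subprocess.run(checksum_cmd(device, offset_mb, length_mb),
                         shell=True, check=True,
                         capture_output=True, text=True)
    return out.stdout.split()[0]
```

A test would compare `region_sha256(source_device, 200, 50)` with `region_sha256(target_device, 200, 50)` after the sync.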
| Requirement | Notes |
|---|---|
| AlmaLinux 9.7 VM running | virsh start alma9; see README.md in this directory |
| SPDK nvmf_tgt active | Run bash /images/vms/alma_flawless/setup_spdk_host.sh before booting the VM |
| NVMe-oF drives connected inside the VM | nvme list inside the VM should show /dev/nvme1n1–/dev/nvme4n1 |
| Lightbits cluster installed and healthy | Node-manager and API server running; lbcli cluster list returns Healthy |
| Two volumes on the same cluster | Source and target volumes, each ≥ 2 GB (/dev/nvme* inside the VM) |
| Python 3.8+ on the system running the tests | python3 --version |
| sync.py in the same directory | test_sync.py imports it as _lb |
| Root or device-level read/write access | dd writes directly to block devices |
Set up the environment for each cluster under test: start the NVMe-oF target, then boot the VM.
Next, verify that the NVMe devices are present. For example, inside the VM or on the cluster node, run:
nvme list
# Expected output: /dev/nvme1n1 through /dev/nvme4n1

The suite reads the JWT from the LIGHTBITS_JWT environment variable or the
--jwt CLI flag. The token is stored in /images/vms/alma_flawless/.env:

export LIGHTBITS_JWT=$(grep '^jwt=' /images/vms/alma_flawless/.env | cut -d= -f2-)

Or pass it inline with --jwt "$JWT".
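The token lookup described above could be sketched as follows. The function name `resolve_jwt` is hypothetical, and the precedence (CLI flag first, then environment) is an assumption; the text only says that both sources are accepted.

```python
import argparse
import os


def resolve_jwt(argv, environ=os.environ):
    """Return the JWT from --jwt if given, else from LIGHTBITS_JWT."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--jwt", default=None)
    # parse_known_args ignores the suite's other flags in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    # ASSUMPTION: an explicit --jwt overrides the environment variable.
    token = args.jwt or environ.get("LIGHTBITS_JWT")
    if not token:
        raise SystemExit("no JWT: set LIGHTBITS_JWT or pass --jwt")
    return token
```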
curl -sk -H "Authorization: Bearer $LIGHTBITS_JWT" \
https://192.168.122.247/api/v2/projects/default/volumes \
| python3 -m json.tool | grep -E '"name"|"UUID"|"state"'

ssh root@alma9
lbcli volume list

# List NVMe namespaces and their NQN identifiers:
nvme list -v
# Or correlate via sysfs:
for dev in /dev/nvme*n1; do
nqn=$(cat /sys/class/nvme/$(basename ${dev%n1})/subsysnqn 2>/dev/null)
echo "$dev $nqn"
done

The NQN contains the volume UUID in the Lightbits naming scheme
(nqn.2016-01.com.lightbitslabs:uuid:<volume-uuid>).
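Given that naming scheme, a small helper can pull the volume UUID out of an NQN string read from sysfs. This is only a sketch; the function name `volume_uuid_from_nqn` is hypothetical and the pattern assumes the canonical 8-4-4-4-12 UUID layout.

```python
import re
from typing import Optional

# Matches nqn.2016-01.com.lightbitslabs:uuid:<volume-uuid>
_NQN_RE = re.compile(
    r"^nqn\.2016-01\.com\.lightbitslabs:uuid:"
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$")


def volume_uuid_from_nqn(nqn: str) -> Optional[str]:
    """Return the lowercase volume UUID, or None if the NQN does not match."""
    match = _NQN_RE.match(nqn.strip())
    return match.group(1).lower() if match else None
```

Matching the returned UUID against the volume list gives the device to pass as --source-device or --target-device.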
All examples assume the JWT is exported as LIGHTBITS_JWT.
python3 /root/flawless/test_sync.py \
--endpoint https://192.168.122.247 \
--source-uuid 29051d6a-e968-4dc7-aa9a-bd382af936d1 \
--target-uuid 04aa8885-4b1e-4299-a5eb-59a3ef1cd3c3 \
--source-device /dev/nvme0n1 \
--target-device /dev/nvme0n2

python3 /root/flawless/test_sync.py \
--endpoint https://192.168.122.247 \
--source-uuid ... --target-uuid ... \
--source-device /dev/nvme0n1 --target-device /dev/nvme0n2 \
--test scattered_writes

python3 /root/flawless/test_sync.py ... --keep-snapshots

python3 /root/flawless/test_sync.py ... --verbose

python3 /root/flawless/test_sync.py ... --project my-project

| Code | Meaning |
|---|---|
| 0 | All tests passed |
| 1 | One or more tests failed |
| 130 | Interrupted by Ctrl-C |
These tests verify the fundamental correctness of the diff API and the copy pipeline in straightforward scenarios.
What it does: Takes two snapshots back-to-back with no writes in between.
Validates: The diff API returns zero lbaRanges when nothing has changed.
This is the baseline correctness check — a false-positive here means the API
is non-deterministic or leaking state from other volumes.
Expected outcome: PASS with note diff returned 0 lbaRange(s).
What it does: Writes 50 MB of random data, takes one snapshot, and calls
the diff API with no baseSnapshotUUID (full scan).
Validates:
- The full-scan diff reports ≥ 50 MB of set LBAs (every written sector is tracked).
- After syncing to the target, sha256(source[0:50 MB]) == sha256(target[0:50 MB]).
Expected outcome: PASS. The diff will typically report more than 50 MB because
prior test runs have written to the volume — a full scan sees all historically written
LBAs, not just those written in this test.
What it does: Writes 50 MB at offset 0, takes snap1. Writes 50 MB at offset
200 MB, takes snap2. Diffs snap2 vs snap1.
Validates:
- The diff reports the new region (200–250 MB).
- The diff does not report the already-snapshotted region (0–50 MB).
- After syncing the incremental diff, sha256(200–250 MB) matches on both devices.
Expected outcome: PASS. This is the core incremental-diff correctness test —
the API must not re-report regions that were already captured in the base snapshot.
What it does: Writes random data pattern A at 300 MB → snap1. Writes random
data pattern B at the same offset → snap2. Diffs snap2 vs snap1.
Validates:
- The overwritten region (300–330 MB) is reported in the diff.
- After syncing, the target has the new content (pattern B), not the old.
Expected outcome: PASS. Verifies that in-place overwrites are detected and
that the diff reflects the most recent state, not a merge.
What it does: Writes 1 MB at each of 10 offsets with 9 MB gaps between them (total footprint: 400–490 MB). Diffs against a baseline snapshot taken before the writes.
Validates:
- All 10 written chunks appear in the diff.
- The 9 MB gaps between chunks do not appear in the diff (sparse write accuracy).
- Total reported set-LBA bytes ≈ 10 MB (not 90 MB), confirming the API is not reporting surrounding unwritten sectors.
Expected outcome: PASS. The gap assertion has a ±0.5 MB tolerance per chunk
for 64-LBA bitmap alignment overhead.
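The sparse-write assertions can be sketched as interval arithmetic over byte extents. Everything here is illustrative: the function names are hypothetical, reported ranges are assumed to be non-overlapping `(start, end)` byte intervals, and the 0.5 MB tolerance is applied per gap as one possible reading of the rule above.

```python
MB = 1024 * 1024


def scattered_layout(base_mb=400, chunks=10, chunk_mb=1, gap_mb=9):
    """Return (written, gaps) as lists of (start, end) byte intervals."""
    stride = chunk_mb + gap_mb
    written = [((base_mb + i * stride) * MB,
                (base_mb + i * stride + chunk_mb) * MB)
               for i in range(chunks)]
    gaps = [(written[i][1], written[i + 1][0]) for i in range(chunks - 1)]
    return written, gaps


def overlap(a, b):
    """Number of bytes shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def check_sparse(reported, written, gaps, tol_bytes=MB // 2):
    """True if every chunk is fully reported and no gap exceeds the tolerance."""
    for chunk in written:
        if sum(overlap(r, chunk) for r in reported) < chunk[1] - chunk[0]:
            return False  # part of a written chunk is missing from the diff
    for gap in gaps:
        if sum(overlap(r, gap) for r in reported) > tol_bytes:
            return False  # the diff reports unwritten sectors beyond alignment
    return True
```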
What it does: Writes 80 MB, takes a snapshot, performs a full sync (no base), then checksums the 80 MB region on both devices.
Validates: End-to-end byte-level integrity of the copy pipeline: diff API →
iter_lba_runs bitmap decoding → copy_ranges block device I/O → target disk.
Expected outcome: PASS. A mismatch here typically means a seek offset
calculation error in copy_ranges or truncated I/O.
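To make the decoding stage of that pipeline concrete, here is a minimal sketch of turning per-chunk bitmaps into contiguous LBA runs and then into byte extents. It is not the suite's `iter_lba_runs`; the function names and the `(chunk_start_lba, bitmap)` input shape are hypothetical, and the bit order (bit i marks LBA chunk start + i) is an assumption.

```python
LBA_SIZE = 4096      # bytes per LBA (4 KiB sectors)
CHUNK_LBAS = 64      # LBAs covered by one 64-bit bitmap


def lba_runs(chunks):
    """Yield (start_lba, lba_count) runs from (chunk_start_lba, bitmap) pairs."""
    run_start, run_len = None, 0
    for chunk_start, bitmap in sorted(chunks):
        for bit in range(CHUNK_LBAS):
            if not (bitmap >> bit) & 1:
                continue
            lba = chunk_start + bit
            if run_len and run_start + run_len == lba:
                run_len += 1          # extends the current run
            else:
                if run_len:
                    yield run_start, run_len
                run_start, run_len = lba, 1
    if run_len:
        yield run_start, run_len


def byte_extent(start_lba, lba_count):
    """Convert an LBA run into (byte_offset, byte_length) for seek and copy."""
    return start_lba * LBA_SIZE, lba_count * LBA_SIZE
```

Merging runs across chunk boundaries keeps the number of seeks low; an off-by-one in `byte_extent` is the kind of seek-offset error this test is meant to expose.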
What it does: Two rounds of writes at adjacent offsets (600 MB and 650 MB), each followed by a snapshot and an incremental sync.
Validates:
- Round 1: full sync, region 1 checksums match.
- Round 2: incremental sync (snap2 vs snap1), region 2 checksums match.
- Region 1 is undisturbed after the round-2 incremental sync (no spurious writes to target).
Expected outcome: PASS. The third assertion catches a class of bugs where the
sync pipeline over-applies writes and clobbers already-correct data on the target.
What it does: Writes a single 150 MB contiguous block at offset 700 MB, diffs, syncs, and checksums.
Validates:
- The diff reports ≥ 150 MB of set LBAs covering the written region.
- The diff ranges are contiguous (zero gaps between adjacent sorted lbaRanges) — a non-contiguous result for a sequential write would indicate API fragmentation.
- After sync, sha256 of the 150 MB region matches.
Expected outcome: PASS. The contiguity check will surface any case where the
API splits a contiguous region into disjoint chunks unexpectedly.
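The contiguity check amounts to sorting the ranges and looking for holes between neighbours. A sketch, assuming each range has already been reduced to a hypothetical `(start_lba, lba_count)` pair:

```python
def gaps_between(ranges):
    """Return [(gap_start_lba, gap_len)] for holes between sorted ranges."""
    ordered = sorted(ranges)
    gaps = []
    for (start, count), (next_start, _next_count) in zip(ordered, ordered[1:]):
        end = start + count
        if next_start > end:
            gaps.append((end, next_start - end))
    return gaps
```

For the 150 MB write at offset 700 MB (LBA 179200, 38400 LBAs of 4 KiB), a contiguous diff yields an empty list.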
What it does: Writes data A → snap1. Zero-fills the same region → snap2
(simulates deletion). Writes data B → snap3. Syncs and checksums after each step.
Validates:
- The diff detects the zero-fill (deletion): region reported in snap2 vs snap1.
- After syncing snap2, the target has zeros, not data A.
- The diff detects the rewrite after zero-fill: region reported in snap3 vs snap2.
- After syncing snap3, the target has data B.
Expected outcome: PASS. Exercises the delete/rewrite lifecycle that databases
and object stores trigger frequently.
What it does: Creates 5 rounds of 15 MB writes at adjacent offsets, one snapshot per round. Walks the chain incrementally (snap0→snap1→…→snap5), syncing each step.
Validates:
- Each step syncs approximately 15 MB (within ±10%).
- Total bytes synced ≈ 75 MB (5 × 15 MB), confirming no over-copying or under-copying.
- All 5 round regions checksum-match on target after the chain walk.
Expected outcome: PASS. A ±10% tolerance on total bytes synced allows for
64-LBA bitmap alignment overhead without false-positives.
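The tolerance arithmetic can be sketched as below; the helper names are hypothetical. With 64-LBA alignment, a step can pick up at most a few hundred KiB of extra sectors, which is well inside 10% of 15 MB.

```python
MB = 1024 * 1024


def within_tolerance(actual, expected, rel_tol):
    """True if actual is within +/- rel_tol (a fraction) of expected."""
    return abs(actual - expected) <= rel_tol * expected


def check_chain(step_bytes, per_step=15 * MB, rel_tol=0.10):
    """Check each step and the chain total against the expected sizes."""
    total_ok = within_tolerance(sum(step_bytes),
                                per_step * len(step_bytes), rel_tol)
    steps_ok = all(within_tolerance(b, per_step, rel_tol) for b in step_bytes)
    return total_ok and steps_ok
```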
What it does: Writes 15 MB, takes a snapshot, syncs it twice with identical API parameters, and compares the target content after each sync.
Validates:
- Both syncs produce identical target content.
- The second sync does not corrupt the target.
- The range count is identical on both calls (API is deterministic).
Expected outcome: PASS. Idempotency is required for safe retry logic in
production sync pipelines.
What it does: Writes 1 MB (256 LBAs) starting 32 LBAs past a 64-LBA chunk boundary (LBA 250912 = 250880 + 32).
Validates:
- The diff returns exactly 5 chunk ranges (partial first, 3 full middle, partial last).
- The first chunk has exactly 32 set bits (bits 32–63).
- The last chunk has exactly 32 set bits (bits 0–31).
- After sync, sha256(LBA 250912 for 256 LBAs) matches on both devices.
Expected outcome: PASS. This is the most granular bitmap correctness test —
it verifies that the 64-bit dataBitMap field is interpreted correctly at both
the leading and trailing edges of a non-aligned write.
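The expected bitmaps for this write can be derived directly from the start LBA and length. The sketch below assumes that bit i of a chunk's bitmap marks LBA chunk start + i, which matches the bit positions listed above; the function name is hypothetical.

```python
CHUNK_LBAS = 64


def expected_chunks(start_lba, lba_count):
    """Return [(chunk_start_lba, bitmap)] for a write of lba_count LBAs."""
    end = start_lba + lba_count
    first_chunk = start_lba - start_lba % CHUNK_LBAS
    chunks = []
    for chunk_start in range(first_chunk, end, CHUNK_LBAS):
        lo = max(start_lba, chunk_start) - chunk_start          # first set bit
        hi = min(end, chunk_start + CHUNK_LBAS) - chunk_start   # one past last
        chunks.append((chunk_start, ((1 << (hi - lo)) - 1) << lo))
    return chunks
```

For `expected_chunks(250912, 256)` this gives five chunks starting at LBA 250880: a first bitmap of 0xFFFFFFFF00000000, three full bitmaps, and a last bitmap of 0x00000000FFFFFFFF.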
What it does: Creates 3 snapshots in quick succession (sequentially, waiting for
Available state before creating the next). Checks that consecutive diffs between
them are empty. Deletes all 3 and confirms they are absent from the listing.
Validates:
- The API can handle rapid sequential snapshot creation.
- Diffs between snapshots with no intervening writes return 0 ranges.
- DELETE /api/v2/projects/{project}/snapshots/{uuid} removes snapshots from the listing.
Expected outcome: PASS. The sequential creation constraint is a Lightbits
limitation: only one snapshot per volume can be in Creating state at a time.
What it does: Creates two snapshots, deletes the base (snap1), then attempts
to diff snap2 vs the now-deleted snap1.
Validates: The diff API returns an error (HTTP 404 or 400) rather than silently
returning incorrect data. The test asserts that a RuntimeError is raised and that
its message contains a recognizable indicator of the missing snapshot.
Expected outcome: PASS with a note like correctly raised RuntimeError: 404 Not Found.
A silent success here (returning ranges without error) would be a critical API bug.
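The shape of that assertion might look like the sketch below. `diff_fn` stands in for whatever client call performs the diff, and the indicator strings are assumptions about what a recognizable error message contains.

```python
def assert_diff_rejected(diff_fn, snap_uuid, deleted_base_uuid,
                         indicators=("404", "400", "not found")):
    """Pass only if diffing against a deleted base raises a RuntimeError."""
    try:
        ranges = diff_fn(snap_uuid, deleted_base_uuid)
    except RuntimeError as exc:
        message = str(exc).lower()
        if not any(token in message for token in indicators):
            raise AssertionError(f"unrecognized error: {exc}") from exc
        return f"correctly raised RuntimeError: {exc}"
    # Reaching this point means the API answered as if the base still existed.
    raise AssertionError(
        f"diff silently succeeded with {len(ranges)} range(s)")
```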
What it does: Writes 100 × 1 MB chunks at 3 MB stride (300 MB total footprint),
generating ≥ 400 lbaRanges. Collects the full diff across all pages.
Validates:
- All pages are fetched (pagination is transparent to the caller).
- The stitched result covers all 100 written chunks (no chunks dropped between pages).
- Total set-LBA bytes ≥ 100 MB (95% threshold for alignment overhead).
- After sync, all 100 chunk checksums match.
Expected outcome: PASS. If the cluster returns enough ranges to require multiple
pages, the note will say pagination exercised: N pages ✓. If everything fits in one
page (small cluster), the test still passes — it just notes that pagination was not
triggered.
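Page stitching can be sketched as a loop that keeps fetching until no continuation token is returned. The `fetch_page` callable and its `(ranges, next_token)` return shape are hypothetical; the actual response fields are not described here.

```python
def collect_all_ranges(fetch_page):
    """Fetch every page and return (all_ranges, page_count).

    fetch_page(token) must return (ranges, next_token), where next_token
    is falsy on the last page. The first call receives token=None.
    """
    ranges, pages, token = [], 0, None
    while True:
        page_ranges, token = fetch_page(token)
        ranges.extend(page_ranges)
        pages += 1
        if not token:
            return ranges, pages
```

A page count above 1 corresponds to the "pagination exercised" note; a count of 1 means everything fit in a single page.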
What it does: Writes 35 MB at offset 1330 MB → snap1 → full sync to target.
Then overwrites only the middle 15 MB (1340–1355 MB) → snap2. Incremental sync.
Validates:
- The diff reports only the middle 15 MB region.
- The leading 10 MB (1330–1340 MB) is not in the diff.
- The trailing 10 MB (1355–1365 MB) is not in the diff.
- After incremental sync, the full 35 MB region on the target matches the source (including the unchanged leading/trailing portions from the first sync).
Expected outcome: PASS. This is the canonical test for partial-region update
accuracy — if the API over-reports, the trailing/leading assertions will fail.
What it does: 3 rounds of 20 MB writes at adjacent offsets (1370, 1390, 1410 MB),
one snapshot per round. Diffs snap3 directly vs snap0, skipping snap1 and snap2.
Validates:
- The non-adjacent diff (snap3 vs snap0) reports all 3 regions (all writes since snap0).
- Skipping intermediate snapshots is equivalent to the cumulative effect of the chain.
- After syncing via the non-adjacent diff, all 3 checksums match.
Expected outcome: PASS. This verifies that the API's base-snapshot parameter
correctly computes the cumulative delta over multiple skipped snapshots, not just the
delta from the immediately preceding snapshot.
What it does: 10-step incremental chain, 10 MB per round at non-overlapping offsets (1435–1535 MB). Walks the full chain incrementally (s0→s1→…→s10).
Validates:
- Total bytes synced across 10 steps is within ±5% of total bytes written (100 MB).
- All 10 round regions checksum-match on the target after the walk.
Expected outcome: PASS. The tight ±5% budget on total bytes is intentional —
at 10 steps, any systematic over-copying will compound and exceed the threshold.
What it does: Takes baseline snap0. Writes batch A (15 MB at 1540 MB) → snap1. Writes batch B → snap2. Then:
- Diffs snap1 vs snap0: must contain A, must NOT contain B.
- Diffs snap2 vs snap1: must contain B, must NOT contain A.
Validates: Point-in-time isolation — each snapshot only captures data written before it was taken. Writes that occurred after a snapshot must not appear in that snapshot's diff.
Expected outcome: PASS. A failure here means the snapshot does not enforce
point-in-time semantics, which would break all incremental sync correctness guarantees.
What it does: Writes 200 MB at offset 1575 MB, takes a snapshot, times the full sync (diff API + block copy), and asserts throughput ≥ 10 MB/s.
Validates:
- The NVMe-oF path and copy pipeline sustain at least 10 MB/s end-to-end.
- After sync, the 200 MB region checksums match.
- Throughput (in MB/s) is recorded in the test notes for trend tracking.
Expected outcome: PASS with a note like in 18.3s → 10.9 MB/s. The 10 MB/s
floor catches severe regressions (e.g., SPDK misconfiguration, NVMe-oF reconnect storms,
or copy-loop bugs) while leaving headroom for a shared, lightly-tuned VM environment.
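The throughput figure in the note is simply bytes copied divided by wall-clock time. A sketch of that arithmetic, with hypothetical helper names:

```python
MB = 1024 * 1024


def throughput_mb_s(bytes_copied, seconds):
    """End-to-end throughput in MB/s for a timed sync."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_copied / MB / seconds


def throughput_note(bytes_copied, seconds, floor_mb_s=10.0):
    """Return (passed, note) using the 10 MB/s floor by default."""
    rate = throughput_mb_s(bytes_copied, seconds)
    return rate >= floor_mb_s, f"in {seconds:.1f}s → {rate:.1f} MB/s"
```

For 200 MB synced in 18.3 s this reproduces the example note: 200 / 18.3 ≈ 10.9 MB/s, just above the floor.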
All 20 tests write to dedicated, non-overlapping byte regions. No test re-uses another test's region. The worst-case cumulative footprint on the source volume is:
| Phase | Tests | Footprint |
|---|---|---|
| Phase 1 (sanity) | 1–7 | ≤ 601 MB |
| Phase 2 (edge cases) | 8–13 | ≤ 281 MB |
| Phase 3 (resilience) | 14–20 | ≤ 530 MB |
| Total | 1–20 | ≤ 1412 MB (≈ 1.41 GB) |
Each volume should be at least 2 GB to accommodate the test footprint plus filesystem and metadata overhead.
══════════════════════════════════════════════════════════════════════
Lightbits sync — sanity tests (20 tests)
endpoint https://192.168.122.247
source vol-src (/dev/nvme0n1)
target vol-dst (/dev/nvme0n2)
══════════════════════════════════════════════════════════════════════
[1/20] no_change_diff .......................... PASS (3.2s)
[2/20] full_scan_accuracy ...................... PASS (12.8s)
...
[20/20] sync_throughput_baseline ............... PASS (28.4s)
══════════════════════════════════════════════════════════════════════
TEST RESULT DURATION
──────────────────────────────────────────────────────────────────────
no_change_diff PASS 3.2s
· diff returned 0 lbaRange(s)
full_scan_accuracy PASS 12.8s
· wrote 50 MB at 0 MB src=3f8a2c1d7b4e9a02…
· full scan: 47 ranges / 2 page(s) / 12800 LBAs (50.0 MB reported)
· synced 47 ranges / 50.0 MB
· src=3f8a2c1d7b4e9a02… dst=3f8a2c1d7b4e9a02…
...
──────────────────────────────────────────────────────────────────────
20/20 passed | ALL PASSED | total 483.7s
══════════════════════════════════════════════════════════════════════
- Per-test notes appear under failing tests by default; --verbose shows them for passing tests too.
- A spinning braille donut (⠋⠙⠹…) animates on the current line while each test runs.
- Ctrl-C exits immediately (SIGKILL to the subprocess process group).
Ensure sync.py is in /root/flawless/ and that test_sync.py is run from the
same directory or with an absolute path.
ls /root/flawless/sync.py

Set the token:

export LIGHTBITS_JWT=$(grep '^jwt=' /images/vms/alma_flawless/.env | cut -d= -f2-)

This happens if snapshots from a previous run were not cleaned up. The suite embeds a per-run timestamp in every snapshot name to avoid this, but if the VM was reverted to a snapshot that includes old test snapshots, you may need to delete them manually:
# List all test snapshots:
curl -sk -H "Authorization: Bearer $LIGHTBITS_JWT" \
https://192.168.122.247/api/v2/projects/default/snapshots \
| python3 -m json.tool | grep '"name"' | grep tsync
# Delete by UUID:
curl -sk -X DELETE -H "Authorization: Bearer $LIGHTBITS_JWT" \
https://192.168.122.247/api/v2/projects/default/snapshots/<UUID>

Lightbits enforces a per-volume limit of one snapshot in Creating state at a time.
This should not occur in the test suite (all snapshot creation uses snap_take(), which
waits for Available before returning), but if it does, wait a few seconds and retry.
See the full troubleshooting section in README.md for the root cause and fix.
If test_sync_throughput_baseline fails with throughput below 10 MB/s:
- Check that SPDK is running: ps aux | grep nvmf_tgt
- Check NVMe-oF drives are connected inside the VM: nvme list
- Check for other heavy I/O on the host: iostat -x 1 3
- If the VM is freshly booted, wait ~30 seconds for the NVMe-oF connection to stabilize.