roiyz-lb/async_validation

Lightbits Changed-Blocks API — Validation Suite

test_sync.py is a 20-test validation suite that exercises the Lightbits snapshot diff API and the block-copy pipeline built on top of it. It is the primary tool for verifying that the changed-blocks path works correctly across a range of realistic workloads, edge cases, and production failure modes.


Table of Contents

  1. What it validates
  2. Prerequisites
  3. Environment setup
  4. Finding volume UUIDs and block devices
  5. Running the tests
  6. Test descriptions and expected outcomes
  7. Volume size budget
  8. Output format
  9. Troubleshooting

1. What it validates

The suite verifies two orthogonal properties of the changed-blocks pipeline:

API accuracy — the diff endpoint (GET /api/v2/projects/{project}/snapshots/diff/{snapshotUUID}) returns lbaRanges that correctly describe which 4 KiB sectors changed between two snapshots. Each test asserts that:

  • Regions that were written appear in the diff.
  • Regions that were not written do not appear in the diff.
  • Bitmaps at sub-chunk boundaries encode the correct partial bit counts.
  • Paginated responses are fully stitched (all pages fetched, no ranges dropped).

Content integrity — after calling copy_ranges() to replay the diff onto a target volume, the sha256 checksum of every written region must match between source and target. Checksums are computed with external dd | sha256sum invocations directly on the block devices to avoid any in-process caching effect.
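As a rough illustration of the external-checksum approach, the sketch below pipes dd into sha256sum from Python. The function name region_sha256 and the exact dd flags are assumptions made for this example; the flags used by test_sync.py are not shown here and may differ (for instance, it may use direct I/O).

```python
import subprocess


def region_sha256(device: str, offset_mb: int, length_mb: int) -> str:
    """Return the sha256 hex digest of a region, computed by dd | sha256sum.

    Hypothetical helper: reads length_mb MiB starting offset_mb MiB into
    the device (or file) and hashes the bytes in external processes.
    """
    dd = subprocess.Popen(
        ["dd", f"if={device}", "bs=1M", f"skip={offset_mb}",
         f"count={length_mb}", "status=none"],
        stdout=subprocess.PIPE,
    )
    # sha256sum reads dd's output until EOF and prints "<digest>  -".
    sha = subprocess.run(
        ["sha256sum"], stdin=dd.stdout, stdout=subprocess.PIPE, check=True
    )
    dd.stdout.close()
    if dd.wait() != 0:
        raise RuntimeError(f"dd failed on {device}")
    return sha.stdout.decode().split()[0]
```

A content-integrity assertion then reduces to comparing region_sha256(source_device, off, n) with region_sha256(target_device, off, n).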


2. Prerequisites

| Requirement | Notes |
| --- | --- |
| AlmaLinux 9.7 VM running | virsh start alma9; see README.md in this directory |
| SPDK nvmf_tgt active | Run bash /images/vms/alma_flawless/setup_spdk_host.sh before booting the VM |
| NVMe-oF drives connected inside the VM | nvme list inside the VM should show /dev/nvme1n1 through /dev/nvme4n1 |
| Lightbits cluster installed and healthy | Node-manager and API server running; lbcli cluster list returns Healthy |
| Two volumes on the same cluster | Source and target volumes, each ≥ 2 GB (/dev/nvme* inside the VM) |
| Python 3.8+ on the system running the tests | python3 --version |
| sync.py in the same directory | test_sync.py imports it as _lb |
| Root or device-level read/write access | dd writes directly to block devices |

3. Environment setup

Step 1 — Set up cluster environment

Start the SPDK NVMe-oF target on the host, then boot the VM (or VMs, if more than one cluster is involved). The target must be running before the VM boots; the commands are listed under Prerequisites.

Step 2 — Verify target cluster storage devices

Inside the VM (or on the cluster node), confirm that the NVMe devices are present and available:

nvme list
# Expected output: /dev/nvme1n1 through /dev/nvme4n1

Step 3 — Set the JWT token

The suite reads the JWT from the LIGHTBITS_JWT environment variable or the --jwt CLI flag. The token is stored in /images/vms/alma_flawless/.env:

export LIGHTBITS_JWT=$(grep '^jwt=' /images/vms/alma_flawless/.env | cut -d= -f2-)

Or pass it inline with --jwt "$JWT".
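The lookup described above could be sketched as follows. This is a minimal illustration, not the suite's actual argument handling: the function name resolve_jwt is hypothetical, and giving --jwt precedence over LIGHTBITS_JWT is an assumption.

```python
import argparse
import os


def resolve_jwt(argv, environ=os.environ):
    """Return the JWT from --jwt if given, else from LIGHTBITS_JWT.

    Assumption: the CLI flag wins over the environment variable.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--jwt", default=None)
    # Other flags (--endpoint, --source-uuid, ...) are ignored here.
    args, _unknown = parser.parse_known_args(argv)
    token = args.jwt or environ.get("LIGHTBITS_JWT")
    if not token:
        raise SystemExit("JWT or LIGHTBITS_JWT required")
    return token.strip()
```

Stripping whitespace guards against a trailing newline when the token is read from a file or a command substitution.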


4. Finding volume UUIDs and block devices

List volumes via the REST API

curl -sk -H "Authorization: Bearer $LIGHTBITS_JWT" \
  https://192.168.122.247/api/v2/projects/default/volumes \
  | python3 -m json.tool | grep -E '"name"|"UUID"|"state"'

List volumes via lbcli (inside the VM)

ssh root@alma9
lbcli volume list

Map a volume UUID to its block device (inside the VM)

# List NVMe namespaces and their NQN identifiers:
nvme list -v

# Or correlate via sysfs:
for dev in /dev/nvme*n1; do
    nqn=$(cat /sys/class/nvme/$(basename ${dev%n1})/subsysnqn 2>/dev/null)
    echo "$dev  $nqn"
done

The NQN contains the volume UUID in the Lightbits naming scheme (nqn.2016-01.com.lightbitslabs:uuid:<volume-uuid>).
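If the correlation needs to be scripted, the UUID can be pulled out of the NQN with a small parser. The sketch below assumes the naming scheme quoted above holds exactly; uuid_from_nqn is a hypothetical helper, not part of sync.py.

```python
import re
from typing import Optional

# Matches nqn.2016-01.com.lightbitslabs:uuid:<volume-uuid>
_NQN_RE = re.compile(
    r"^nqn\.2016-01\.com\.lightbitslabs:uuid:"
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})$"
)


def uuid_from_nqn(nqn: str) -> Optional[str]:
    """Return the lowercase UUID embedded in the NQN, or None if it does not match."""
    match = _NQN_RE.match(nqn.strip())
    return match.group(1).lower() if match else None
```

Feeding it the contents of each subsysnqn file gives a device-to-UUID mapping that can be compared against the --source-uuid and --target-uuid values.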


5. Running the tests

All examples assume the JWT is exported as LIGHTBITS_JWT.

Run the full suite

python3 /root/flawless/test_sync.py \
    --endpoint   https://192.168.122.247 \
    --source-uuid 29051d6a-e968-4dc7-aa9a-bd382af936d1 \
    --target-uuid 04aa8885-4b1e-4299-a5eb-59a3ef1cd3c3 \
    --source-device /dev/nvme0n1 \
    --target-device /dev/nvme0n2

Run a single named test

python3 /root/flawless/test_sync.py \
    --endpoint https://192.168.122.247 \
    --source-uuid ... --target-uuid ... \
    --source-device /dev/nvme0n1 --target-device /dev/nvme0n2 \
    --test scattered_writes

Keep snapshots after the run (for manual inspection)

python3 /root/flawless/test_sync.py ... --keep-snapshots

Show per-test notes for passing tests

python3 /root/flawless/test_sync.py ... --verbose

Override the Lightbits project

python3 /root/flawless/test_sync.py ... --project my-project

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | All tests passed |
| 1 | One or more tests failed |
| 130 | Interrupted by Ctrl-C |

6. Test descriptions and expected outcomes

Phase 1 — Sanity (tests 1–7)

These tests verify the fundamental correctness of the diff API and the copy pipeline in straightforward scenarios.


Test 1 — no_change_diff

What it does: Takes two snapshots back-to-back with no writes in between.

Validates: The diff API returns zero lbaRanges when nothing has changed. This is the baseline correctness check — a false-positive here means the API is non-deterministic or leaking state from other volumes.

Expected outcome: PASS with note diff returned 0 lbaRange(s).


Test 2 — full_scan_accuracy

What it does: Writes 50 MB of random data, takes one snapshot, and calls the diff API with no baseSnapshotUUID (full scan).

Validates:

  • The full-scan diff reports ≥ 50 MB of set LBAs (every written sector is tracked).
  • After syncing to the target, sha256(source[0:50 MB]) == sha256(target[0:50 MB]).

Expected outcome: PASS. The diff will typically report more than 50 MB because prior test runs have written to the volume — a full scan sees all historically written LBAs, not just those written in this test.


Test 3 — incremental_new_region

What it does: Writes 50 MB at offset 0, takes snap1. Writes 50 MB at offset 200 MB, takes snap2. Diffs snap2 vs snap1.

Validates:

  • The diff reports the new region (200–250 MB).
  • The diff does not report the already-snapshotted region (0–50 MB).
  • After syncing the incremental diff, sha256(200–250 MB) matches on both devices.

Expected outcome: PASS. This is the core incremental-diff correctness test — the API must not re-report regions that were already captured in the base snapshot.


Test 4 — overwrite_same_region

What it does: Writes random data pattern A at 300 MB → snap1. Writes random data pattern B at the same offset → snap2. Diffs snap2 vs snap1.

Validates:

  • The overwritten region (300–330 MB) is reported in the diff.
  • After syncing, the target has the new content (pattern B), not the old.

Expected outcome: PASS. Verifies that in-place overwrites are detected and that the diff reflects the most recent state, not a merge.


Test 5 — scattered_writes

What it does: Writes 1 MB at each of 10 offsets with 9 MB gaps between them (total footprint: 400–490 MB). Diffs against a baseline snapshot taken before the writes.

Validates:

  • All 10 written chunks appear in the diff.
  • The 9 MB gaps between chunks do not appear in the diff (sparse write accuracy).
  • Total reported set-LBA bytes ≈ 10 MB (not 90 MB), confirming the API is not reporting surrounding unwritten sectors.

Expected outcome: PASS. The gap assertion has a ±0.5 MB tolerance per chunk for 64-LBA bitmap alignment overhead.


Test 6 — sync_integrity

What it does: Writes 80 MB, takes a snapshot, performs a full sync (no base), then checksums the 80 MB region on both devices.

Validates: End-to-end byte-level integrity of the copy pipeline: diff API → iter_lba_runs bitmap decoding → copy_ranges block device I/O → target disk.

Expected outcome: PASS. A mismatch here typically means a seek offset calculation error in copy_ranges or truncated I/O.


Test 7 — incremental_sync_integrity

What it does: Two rounds of writes at adjacent offsets (600 MB and 650 MB), each followed by a snapshot and an incremental sync.

Validates:

  • Round 1: full sync, region 1 checksums match.
  • Round 2: incremental sync (snap2 vs snap1), region 2 checksums match.
  • Region 1 is undisturbed after the round-2 incremental sync (no spurious writes to target).

Expected outcome: PASS. The third assertion catches a class of bugs where the sync pipeline over-applies writes and clobbers already-correct data on the target.


Phase 2 — Edge cases & stress (tests 8–13)


Test 8 — large_sequential

What it does: Writes a single 150 MB contiguous block at offset 700 MB, diffs, syncs, and checksums.

Validates:

  • The diff reports ≥ 150 MB of set LBAs covering the written region.
  • The diff ranges are contiguous (zero gaps between adjacent sorted lbaRanges) — a non-contiguous result for a sequential write would indicate API fragmentation.
  • After sync, sha256 of the 150 MB region matches.

Expected outcome: PASS. The contiguity check will surface any case where the API splits a contiguous region into disjoint chunks unexpectedly.
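A contiguity check of this kind can be sketched as below. The representation of a range as a (start_lba, lba_count) tuple and the name find_gaps are assumptions for illustration; the suite's internal data structures are not shown here.

```python
def find_gaps(ranges):
    """Return (gap_start, gap_end) LBA pairs between sorted ranges.

    ranges: iterable of (start_lba, lba_count) tuples (hypothetical shape).
    An empty result means the ranges cover one contiguous span.
    """
    gaps = []
    prev_end = None
    for start, count in sorted(ranges):
        if prev_end is not None and start > prev_end:
            gaps.append((prev_end, start))
        end = start + count
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


# 150 MB at offset 700 MB, expressed as 64-LBA chunks of 4 KiB LBAs:
# start LBA = 700 * 256, length = 150 * 256 LBAs = 600 chunks.
sequential = [(700 * 256 + i * 64, 64) for i in range(600)]
assert find_gaps(sequential) == []
```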


Test 9 — delete_then_rewrite

What it does: Writes data A → snap1. Zero-fills the same region → snap2 (simulates deletion). Writes data B → snap3. Syncs and checksums after each step.

Validates:

  • The diff detects the zero-fill (deletion): region reported in snap2 vs snap1.
  • After syncing snap2, the target has zeros, not data A.
  • The diff detects the rewrite after zero-fill: region reported in snap3 vs snap2.
  • After syncing snap3, the target has data B.

Expected outcome: PASS. Exercises the delete/rewrite lifecycle that databases and object stores trigger frequently.


Test 10 — rapid_snapshot_chain

What it does: Creates 5 rounds of 15 MB writes at adjacent offsets, one snapshot per round. Walks the chain incrementally (snap0→snap1→…→snap5), syncing each step.

Validates:

  • Each step syncs approximately 15 MB (within ±10%).
  • Total bytes synced ≈ 75 MB (5 × 15 MB), confirming no over-copying or under-copying.
  • All 5 round regions checksum-match on target after the chain walk.

Expected outcome: PASS. A ±10% tolerance on total bytes synced allows for 64-LBA bitmap alignment overhead without false-positives.
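The tolerance checks used here and in long_chain_walk boil down to a relative-error comparison. A minimal sketch, assuming 1 MB means 2^20 bytes (consistent with 1 MB = 256 LBAs of 4 KiB); within_tolerance is a hypothetical name:

```python
MB = 1024 * 1024


def within_tolerance(actual_bytes: int, expected_bytes: int, tolerance: float) -> bool:
    """True if actual_bytes is within +/- tolerance (a fraction) of expected_bytes."""
    return abs(actual_bytes - expected_bytes) <= tolerance * expected_bytes


# 5 rounds x 15 MB = 75 MB expected; 76 MB synced is inside the 10% band,
# 83 MB (about 10.7% over) is outside it.
assert within_tolerance(76 * MB, 75 * MB, 0.10)
assert not within_tolerance(83 * MB, 75 * MB, 0.10)
```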


Test 11 — idempotent_sync

What it does: Writes 15 MB, takes a snapshot, syncs it twice with identical API parameters, and compares the target content after each sync.

Validates:

  • Both syncs produce identical target content.
  • The second sync does not corrupt the target.
  • The range count is identical on both calls (API is deterministic).

Expected outcome: PASS. Idempotency is required for safe retry logic in production sync pipelines.


Test 12 — boundary_alignment

What it does: Writes 1 MB (256 LBAs) starting 32 LBAs past a 64-LBA chunk boundary (LBA 250912 = 250880 + 32, where 250880 = 3920 × 64).

Validates:

  • The diff returns exactly 5 chunk ranges (partial first, 3 full middle, partial last).
  • The first chunk has exactly 32 set bits (bits 32–63).
  • The last chunk has exactly 32 set bits (bits 0–31).
  • After sync, sha256(LBA 250912 for 256 LBAs) matches on both devices.

Expected outcome: PASS. This is the most granular bitmap correctness test — it verifies that the 64-bit dataBitMap field is interpreted correctly at both the leading and trailing edges of a non-aligned write.
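The expected chunk layout for this test can be derived with a few lines of arithmetic. The sketch below assumes that bit i of the 64-bit bitmap corresponds to LBA chunk_start + i, with bit 0 as the least significant bit; that matches the bit positions listed above, but how dataBitMap is serialized in the API response is not shown here. The function names are hypothetical.

```python
CHUNK_LBAS = 64  # LBAs covered by one bitmap


def expected_chunks(start_lba: int, lba_count: int):
    """Return [(chunk_start_lba, bitmap), ...] for a write of lba_count LBAs."""
    end = start_lba + lba_count
    chunk = (start_lba // CHUNK_LBAS) * CHUNK_LBAS
    chunks = []
    while chunk < end:
        lo = max(start_lba, chunk) - chunk          # first set bit in this chunk
        hi = min(end, chunk + CHUNK_LBAS) - chunk   # one past the last set bit
        chunks.append((chunk, ((1 << (hi - lo)) - 1) << lo))
        chunk += CHUNK_LBAS
    return chunks


def set_bit_positions(bitmap: int):
    """Return the indices of the set bits, lowest first."""
    return [i for i in range(CHUNK_LBAS) if (bitmap >> i) & 1]


chunks = expected_chunks(250912, 256)
assert len(chunks) == 5
assert set_bit_positions(chunks[0][1]) == list(range(32, 64))   # partial first
assert set_bit_positions(chunks[-1][1]) == list(range(0, 32))   # partial last
```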


Test 13 — snapshot_lifecycle

What it does: Creates 3 snapshots in quick succession (sequentially, waiting for Available state before creating the next). Checks that consecutive diffs between them are empty. Deletes all 3 and confirms they are absent from the listing.

Validates:

  • The API can handle rapid sequential snapshot creation.
  • Diffs between snapshots with no intervening writes return 0 ranges.
  • DELETE /api/v2/projects/{project}/snapshots/{uuid} removes snapshots from the listing.

Expected outcome: PASS. The sequential creation constraint is a Lightbits limitation: only one snapshot per volume can be in Creating state at a time.


Phase 3 — Resilience & production failure modes (tests 14–20)


Test 14 — deleted_base_snapshot

What it does: Creates two snapshots, deletes the base (snap1), then attempts to diff snap2 vs the now-deleted snap1.

Validates: The diff API returns an error (HTTP 404 or 400) rather than silently returning incorrect data. The test asserts that a RuntimeError is raised and that its message contains a recognizable indicator of the missing snapshot.

Expected outcome: PASS with a note like correctly raised RuntimeError: 404 Not Found. A silent success here (returning ranges without error) would be a critical API bug.


Test 15 — pagination_stress

What it does: Writes 100 × 1 MB chunks at 3 MB stride (300 MB total footprint), generating ≥ 400 lbaRanges. Collects the full diff across all pages.

Validates:

  • All pages are fetched (pagination is transparent to the caller).
  • The stitched result covers all 100 written chunks (no chunks dropped between pages).
  • Total set-LBA bytes ≥ 100 MB (95% threshold for alignment overhead).
  • After sync, all 100 chunk checksums match.

Expected outcome: PASS. If the cluster returns enough ranges to require multiple pages, the note will say pagination exercised: N pages ✓. If everything fits in one page (small cluster), the test still passes — it just notes that pagination was not triggered.
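Page stitching generally follows the loop sketched below. The pagination fields of the diff endpoint are not documented here, so the example hides them behind a hypothetical fetch_page(token) callable that returns (ranges, next_token), with next_token set to None on the last page.

```python
def collect_all_ranges(fetch_page):
    """Fetch every page and return (all_ranges, page_count).

    fetch_page(token) -> (ranges, next_token) is a hypothetical callable;
    the first call receives token=None.
    """
    ranges, token, pages = [], None, 0
    while True:
        page_ranges, token = fetch_page(token)
        ranges.extend(page_ranges)
        pages += 1
        if not token:
            return ranges, pages


# Usage with three fake pages keyed by the token that requests them:
fake = {None: ([1, 2], "a"), "a": ([3], "b"), "b": ([4, 5], None)}
assert collect_all_ranges(lambda t: fake[t]) == ([1, 2, 3, 4, 5], 3)
```

Returning the page count alongside the ranges is what would let a test report whether pagination was actually exercised.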


Test 16 — partial_overwrite

What it does: Writes 35 MB at offset 1330 MB → snap1 → full sync to target. Then overwrites only the middle 15 MB (1340–1355 MB) → snap2. Incremental sync.

Validates:

  • The diff reports only the middle 15 MB region.
  • The leading 10 MB (1330–1340 MB) is not in the diff.
  • The trailing 10 MB (1355–1365 MB) is not in the diff.
  • After incremental sync, the full 35 MB region on the target matches the source (including the unchanged leading/trailing portions from the first sync).

Expected outcome: PASS. This is the canonical test for partial-region update accuracy — if the API over-reports, the trailing/leading assertions will fail.
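The "reported" and "not reported" assertions can be expressed as interval overlaps. A minimal sketch, assuming decoded runs are (start_lba, lba_count) tuples that do not overlap each other, 4 KiB LBAs, and 1 MB = 2^20 bytes; overlap_lbas is a hypothetical helper.

```python
LBA_SIZE = 4096
MB = 1024 * 1024


def overlap_lbas(runs, region_start_mb: int, region_len_mb: int) -> int:
    """Count the LBAs in runs that fall inside the given region."""
    lo = region_start_mb * MB // LBA_SIZE
    hi = lo + region_len_mb * MB // LBA_SIZE
    total = 0
    for start, count in runs:
        total += max(0, min(hi, start + count) - max(lo, start))
    return total


# A diff that reports exactly the overwritten middle 15 MB (1340-1355 MB):
runs = [(1340 * 256, 15 * 256)]
assert overlap_lbas(runs, 1330, 10) == 0          # leading 10 MB untouched
assert overlap_lbas(runs, 1340, 15) == 15 * 256   # middle fully reported
assert overlap_lbas(runs, 1355, 10) == 0          # trailing 10 MB untouched
```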


Test 17 — non_adjacent_diff

What it does: 3 rounds of 20 MB writes at adjacent offsets (1370, 1390, 1410 MB), one snapshot per round. Diffs snap3 directly vs snap0, skipping snap1 and snap2.

Validates:

  • The non-adjacent diff (snap3 vs snap0) reports all 3 regions (all writes since snap0).
  • Skipping intermediate snapshots is equivalent to the cumulative effect of the chain.
  • After syncing via the non-adjacent diff, all 3 checksums match.

Expected outcome: PASS. This verifies that the API's base-snapshot parameter correctly computes the cumulative delta over multiple skipped snapshots, not just the delta from the immediately preceding snapshot.


Test 18 — long_chain_walk

What it does: 10-step incremental chain, 10 MB per round at non-overlapping offsets (1435–1535 MB). Walks the full chain incrementally (s0→s1→…→s10).

Validates:

  • Total bytes synced across 10 steps is within ±5% of total bytes written (100 MB).
  • All 10 round regions checksum-match on the target after the walk.

Expected outcome: PASS. The tight ±5% budget on total bytes is intentional — at 10 steps, any systematic over-copying will compound and exceed the threshold.


Test 19 — write_after_snapshot_isolation

What it does: Takes baseline snap0. Writes batch A (15 MB at 1540 MB) → snap1. Writes batch B to a separate region → snap2. Then:

  • Diffs snap1 vs snap0: must contain A, must NOT contain B.
  • Diffs snap2 vs snap1: must contain B, must NOT contain A.

Validates: Point-in-time isolation — each snapshot only captures data written before it was taken. Writes that occurred after a snapshot must not appear in that snapshot's diff.

Expected outcome: PASS. A failure here means the snapshot does not enforce point-in-time semantics, which would break all incremental sync correctness guarantees.


Test 20 — sync_throughput_baseline

What it does: Writes 200 MB at offset 1575 MB, takes a snapshot, times the full sync (diff API + block copy), and asserts throughput ≥ 10 MB/s.

Validates:

  • The NVMe-oF path and copy pipeline sustain at least 10 MB/s end-to-end.
  • After sync, the 200 MB region checksums match.
  • Throughput (in MB/s) is recorded in the test notes for trend tracking.

Expected outcome: PASS with a note like in 18.3s → 10.9 MB/s. The 10 MB/s floor catches severe regressions (e.g., SPDK misconfiguration, NVMe-oF reconnect storms, or copy-loop bugs) while leaving headroom for a shared, lightly-tuned VM environment.
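The throughput figure is the number of bytes copied divided by wall-clock time. A small sketch of that calculation, using the numbers from the example note; the helper names are hypothetical and 1 MB is taken as 2^20 bytes.

```python
import time

MB = 1024 * 1024


def throughput_mb_s(bytes_copied: int, seconds: float) -> float:
    """Return throughput in MB/s for bytes_copied moved in the given time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_copied / MB / seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


# 200 MB in 18.3 s is roughly 10.9 MB/s, just above the 10 MB/s floor.
rate = throughput_mb_s(200 * MB, 18.3)
assert round(rate, 1) == 10.9 and rate >= 10
```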


7. Volume size budget

All 20 tests write to dedicated, non-overlapping byte regions. No test re-uses another test's region. The worst-case cumulative footprint on the source volume is:

| Phase | Tests | Footprint |
| --- | --- | --- |
| Phase 1 (sanity) | 1–7 | ≤ 601 MB |
| Phase 2 (edge cases) | 8–13 | ≤ 281 MB |
| Phase 3 (resilience) | 14–20 | ≤ 530 MB |
| Total | 1–20 | ≤ 1.41 GB (601 + 281 + 530 = 1,412 MB) |

Each volume should be at least 2 GB to accommodate the test footprint plus filesystem and metadata overhead.


8. Output format

══════════════════════════════════════════════════════════════════════
  Lightbits sync — sanity tests  (20 tests)
  endpoint  https://192.168.122.247
  source    vol-src  (/dev/nvme0n1)
  target    vol-dst  (/dev/nvme0n2)
══════════════════════════════════════════════════════════════════════

  [1/20] no_change_diff .......................... PASS  (3.2s)
  [2/20] full_scan_accuracy ...................... PASS  (12.8s)
  ...
  [20/20] sync_throughput_baseline ............... PASS  (28.4s)

══════════════════════════════════════════════════════════════════════
  TEST                                   RESULT   DURATION
──────────────────────────────────────────────────────────────────────
  no_change_diff                         PASS        3.2s
    · diff returned 0 lbaRange(s)
  full_scan_accuracy                     PASS       12.8s
    · wrote 50 MB at 0 MB  src=3f8a2c1d7b4e9a02…
    · full scan: 47 ranges / 2 page(s) / 12800 LBAs (50.0 MB reported)
    · synced 47 ranges / 50.0 MB
    · src=3f8a2c1d7b4e9a02…  dst=3f8a2c1d7b4e9a02…
  ...
──────────────────────────────────────────────────────────────────────
  20/20 passed  |  ALL PASSED  |  total 483.7s
══════════════════════════════════════════════════════════════════════
  • Per-test notes appear under failing tests by default; --verbose shows them for passing tests too.
  • A spinning braille donut (⠋⠙⠹…) animates on the current line while each test runs.
  • Ctrl-C exits immediately (SIGKILL to the subprocess process group).

9. Troubleshooting

Cannot import sync.py

Ensure sync.py is in /root/flawless/ and that test_sync.py is run from the same directory or with an absolute path.

ls /root/flawless/sync.py

JWT or LIGHTBITS_JWT required

Set the token:

export LIGHTBITS_JWT=$(grep '^jwt=' /images/vms/alma_flawless/.env | cut -d= -f2-)

Another snapshot with same name exists (HTTP 400)

This happens if snapshots from a previous run were not cleaned up. The suite embeds a per-run timestamp in every snapshot name to avoid this, but if the VM was reverted to a snapshot that includes old test snapshots, you may need to delete them manually:

# List all test snapshots:
curl -sk -H "Authorization: Bearer $LIGHTBITS_JWT" \
  https://192.168.122.247/api/v2/projects/default/snapshots \
  | python3 -m json.tool | grep '"name"' | grep tsync

# Delete by UUID:
curl -sk -X DELETE -H "Authorization: Bearer $LIGHTBITS_JWT" \
  https://192.168.122.247/api/v2/projects/default/snapshots/<UUID>

Creating a snapshot when another snapshot is creating is not allowed (HTTP 400)

Lightbits enforces a per-volume limit of one snapshot in Creating state at a time. This should not occur in the test suite (all snapshot creation uses snap_take(), which waits for Available before returning), but if it does, wait a few seconds and retry.

See the full troubleshooting section in README.md for the root cause and fix.
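The wait-for-Available behaviour attributed to snap_take() is a standard polling loop. The sketch below is not the implementation in sync.py; it assumes a hypothetical get_state() callable that returns the snapshot's current state string, and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_available(get_state, timeout_s=60.0, interval_s=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns "Available" or the timeout expires.

    get_state is a hypothetical callable; sleep and clock are injectable
    so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "Available":
            return
        if clock() >= deadline:
            raise TimeoutError(
                f"snapshot still in state {state!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Creating the next snapshot only after this returns keeps at most one snapshot per volume in Creating state.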

Throughput test fails (sync_throughput_baseline)

If sync_throughput_baseline fails with a throughput below 10 MB/s:

  1. Check that SPDK is running: ps aux | grep nvmf_tgt
  2. Check NVMe-oF drives are connected inside the VM: nvme list
  3. Check for other heavy I/O on the host: iostat -x 1 3
  4. If the VM is freshly booted, wait ~30 seconds for the NVMe-oF connection to stabilize.
