Skip to content

feat(snapshots): volume snapshots, rollback & listing#14

Merged
roiyz-lb merged 7 commits into
mainfrom
feat/snapshots
Jun 29, 2026
Merged

feat(snapshots): volume snapshots, rollback & listing#14
roiyz-lb merged 7 commits into
mainfrom
feat/snapshots

Conversation

@roiyz-lb

@roiyz-lb roiyz-lb commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds Proxmox snapshot support to the Lightbits storage plugin, backed by native LightOS snapshots over the REST API: create, delete, list, and rollback, wired through qm snapshot / qm rollback and the Proxmox UI.

What's included

  • Snapshot lifecyclevolume_snapshot, volume_snapshot_delete, volume_snapshot_rollback, volume_rollback_is_possible, volume_snapshot_info, plus volume_has_feature / parse_volname wiring.
  • Native server-side rollback (PUT /api/v2/projects/{proj}/volumes/{uuid}/rollback) — near-instant, no host-side data copy, and preserves the volume's thin-provisioned allocation.
  • Collision-safe naming — snapshot names embed the source volume UUID (snap-<uuid>-<name>), so names stay unique per project and map back to the Proxmox name without relying on labels.
  • Resize-safety guardvolume_rollback_is_possible refuses a rollback that would shrink a volume grown after the snapshot was taken, keeping the device and the VM config size consistent.
  • Idempotent delete — tolerates already-gone, mid-delete, and racing concurrent deletes by re-checking state rather than matching error strings.
  • No orphaned snapshotsfree_image removes a volume's snapshots first, so destroying a VM leaves none behind.
  • Online snapshots of a running guest are crash-consistent (filesystem-consistent when the guest runs qemu-guest-agent).

Testing

  • Unit suite: 178 tests pass (prove -I t/stubs t/*.t), incl. naming round-trip, request shapes, the size guard, idempotent delete, and volume_has_feature.
  • New live e2e harness t/e2e/snapshots.sh (parameterized, no secrets) with a units-first gate: snapshot create, data-revert rollback, resize-rollback guard, orphan-on-destroy.
  • Validated end-to-end on a single-node cluster and a 4-node multipath LightOS cluster (v3.19.1): online/offline snapshots, rollback-to-earlier with later snapshots surviving, real path-loss failover with auto-recovery, and multi-volume subsystem-disconnect safety — all with byte-exact data verification.

Out of scope / follow-ups

  • RAM/vmstate live checkpoints (disk-only snapshots for now).
  • File-level recovery by cloning a snapshot to a scratch volume.
  • A separate small PR will improve activate_volume to restore an administratively-dropped NVMe path on re-activate (multipath-redundancy nit; transient losses already auto-recover via the kernel).

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Lightbits-backed, point-in-time volume snapshots with project-scoped handling, and server-side rollback that preserves thin-provision allocation and supports fast restore via native rollback semantics.
    • Improved online guest snapshot crash-consistency when the guest agent is available, and added safeguards that block rollback if the volume was resized after the snapshot.
  • Bug Fixes
    • Improved snapshot-qualified volume name parsing/paths and enhanced snapshot cleanup behavior during volume freeing (best-effort; non-fatal).
  • Tests
    • Added unit coverage for snapshot naming, lifecycle, rollback eligibility, snapshot info, and idempotent deletion, plus a new end-to-end snapshot/rollback workflow test.
  • Documentation
    • Updated README to remove the “no snapshots” limitation and expanded the roadmap items.

…ts REST

Implement Proxmox snapshot support backed by native LightOS snapshots:
- volume_snapshot / volume_snapshot_delete / volume_snapshot_rollback /
  volume_rollback_is_possible / volume_snapshot_info, plus volume_has_feature
  and parse_volname wiring.
- Native server-side rollback (PUT .../volumes/{uuid}/rollback): near-instant,
  no host-side data copy, and preserves the volume's thin-provisioned allocation.
- Snapshot names embed the source volume UUID (snap-<uuid>-<name>) so names stay
  unique per project and map back to the Proxmox name without labels.
- volume_rollback_is_possible refuses a rollback that would shrink a volume
  resized after the snapshot was taken, keeping the device and the VM config
  size consistent.
- Idempotent snapshot delete: tolerates already-gone, mid-delete, and racing
  concurrent deletes by re-checking state rather than matching error strings.
- free_image removes a volume's snapshots first so none are orphaned on destroy.
- Online snapshots of a running guest are crash-consistent (filesystem-consistent
  when the guest runs qemu-guest-agent).

Add unit tests (t/*.t) and a live e2e harness (t/e2e/snapshots.sh) with a
units-first gate. Validated end-to-end on single-node and 4-node multipath
LightOS clusters.
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR implements full volume snapshot lifecycle support in LightbitsPlugin.pm: snapshot naming helpers, _rescan_controller, extended parse_volname, updated free_image (snapshot cleanup before volume delete), a replaced volume_has_feature, and six new snapshot API methods. Unit tests and an end-to-end Bash test are added, and README/CHANGELOG are updated.

Changes

Lightbits Volume Snapshot Support

Layer / File(s) Summary
Snapshot naming helpers, _rescan_controller, and parse_volname extension
LightbitsPlugin.pm, t/snap_naming.t
Imports timegm; adds _rescan_controller for synchronous NVMe controller rescans, snapshot name encode/decode helpers, _snapshots_for_volume, _snapshot_uuid_by_name, and _epoch_from_iso8601 for UTC timestamp parsing; extends parse_volname to handle vm-<vmid>-<uuid>@<snap> format. Tests cover round-trip encode/decode with hyphens and rejection of non-conforming names.
volume_has_feature replacement, free_image snapshot cleanup, path update, and resize rescan
LightbitsPlugin.pm, t/volume_has_feature.t, t/free_image_snapshots.t
Replaces volume_has_feature with a nested feature-map keyed by snapshot/base/current state, changing its signature to accept $scfg, $snapname, and $opts; updates free_image to best-effort-delete all Lightbits snapshots tied to the volume UUID before the volume itself; updates path to extract embedded snapshot names from volname; changes volume_resize to call _rescan_controller for post-resize kernel sync. Tests cover all feature combinations, snapshot deletion ordering, and best-effort failure handling.
Snapshot lifecycle APIs and comprehensive unit tests
LightbitsPlugin.pm, t/volume_snapshot.t, t/volume_snapshot_delete.t, t/volume_rollback_is_possible.t, t/volume_snapshot_rollback.t, t/volume_snapshot_info.t
Adds volume_snapshot (POST to project-scoped endpoint + poll for Available), idempotent _delete_snapshot and volume_snapshot_delete (with recheck-on-failure reconciliation), volume_rollback_is_possible (blocks if volume grew after snapshot), volume_snapshot_rollback (PUT + poll + post-rollback rescan), and volume_snapshot_info (client-side filters by naming scheme, sorts by timestamp, maps to Proxmox snapshot names). Each function is covered by a dedicated Perl unit test with stubbed _api covering success, failure, and edge cases.
End-to-end Bash test
t/e2e/snapshots.sh
Adds environment-driven live e2e test covering snapshot creation, data-hash rollback verification (write A → snapshot → overwrite B → rollback → verify A), resize-rollback guard assertion (resized volumes block rollback), and destroy-cleanup validation (VM destroy removes storage entries). Includes unit-test gate via prove and safety cleanup trap.
README, CHANGELOG, and config updates
README.md, CHANGELOG.md, .coderabbit.yaml
Removes the "No snapshots" limitation from README, adds Phase 1 roadmap entry for native server-side snapshot/rollback, narrows Phase 2 entry to focus on clones, adds CHANGELOG entry describing snapshot behavior (point-in-time, crash-consistency with qemu-guest-agent, server-side rollback, thin-provision preservation, no shrink-after-grow, best-effort cleanup), and configures CodeRabbit docstring checking in warning mode.

Sequence Diagram(s)

sequenceDiagram
  participant Proxmox
  participant volume_snapshot
  participant volume_snapshot_delete
  participant volume_rollback_is_possible
  participant volume_snapshot_rollback
  participant _api
  Proxmox->>volume_snapshot: snap_name
  volume_snapshot->>_api: POST .../snapshots
  volume_snapshot->>_api: GET .../snapshots/{id} (poll)
  volume_snapshot-->>Proxmox: undef or die
  Proxmox->>volume_rollback_is_possible: vol_name, snap, blockers
  volume_rollback_is_possible->>_api: GET volume + snapshot sizes
  volume_rollback_is_possible-->>Proxmox: 1 or die (resized)
  Proxmox->>volume_snapshot_rollback: vol_name, snap
  volume_snapshot_rollback->>_api: PUT .../rollback {srcSnapshotUUID}
  volume_snapshot_rollback->>_api: GET volume (poll Available)
  volume_snapshot_rollback->>_rescan_controller: nvme ns-rescan
  volume_snapshot_rollback-->>Proxmox: undef or die
  Proxmox->>volume_snapshot_delete: vol_name, snap
  volume_snapshot_delete->>_api: DELETE .../snapshots/{uuid}
  volume_snapshot_delete->>_api: GET .../snapshots/{uuid} (re-check)
  volume_snapshot_delete-->>Proxmox: undef or die
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • LightBitsLabs/pve-lightbits#9: Both PRs modify LightbitsPlugin.pm's resize flow; PR #9 adds the initial volume_resize method with nvme ns-rescan, while this PR refactors resizing to use the new _rescan_controller helper and updates volume_has_feature to support expanded snapshot-aware feature semantics.

🐰 Snapshots dance where data flows,
Rollback whispers, time rewinds,
Lightbits magic makes it so—
Point-in-time: the thread that binds,
Guard against the shrinking blow!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(snapshots): volume snapshots, rollback & listing' accurately and concisely describes the main feature additions in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 25.00% which is sufficient. The required threshold is 10.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/snapshots

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot]
coderabbitai Bot previously requested changes Jun 14, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 17: The changelog statement "Freeing a volume also removes its snapshots
so none are left behind" incorrectly guarantees complete snapshot removal, but
the underlying free_image function only performs best-effort deletion and may
leave snapshots behind if a delete operation fails. Reword the sentence about
volume freeing and snapshot cleanup to clarify that snapshot deletion is
best-effort and cannot guarantee that no snapshots remain, removing the promise
of complete cleanup.

In `@LightbitsPlugin.pm`:
- Around line 920-925: The volume_snapshot_delete function incorrectly treats
all _snap_uuid failures as successful idempotent deletes, masking transient API,
authentication, and listing errors. Instead of blindly returning success
whenever eval catches an error from _snap_uuid, distinguish between a genuine
"snapshot not found" condition (which should return success for idempotency) and
actual transient errors (which should propagate as failures). Modify the code
around the eval block in volume_snapshot_delete to capture the error message
from the eval failure, check if it specifically indicates the snapshot does not
exist, and only return success in that case—otherwise die or return an error to
prevent silent failure of transient issues.

In `@t/e2e/snapshots.sh`:
- Around line 83-88: Replace the hardcoded `/tmp/lb_e2e_rb_err` path with a
dynamically created temporary file using mktemp. Create the temp file variable
(e.g., by storing the output of mktemp in a variable near the top of the
configuration section), update the qm rollback command to redirect to this
variable instead of the hardcoded path, update the grep command to read from
this variable, and add cleanup code at the end of the test (or in the overall
cleanup section) to remove the temporary file using rm to prevent accumulation
of stale files.

In `@t/volume_snapshot_rollback.t`:
- Around line 47-54: The current test only validates the happy path where the
volume is immediately Available. Add additional test cases to the existing test
file to verify the state transition behavior of the volume_snapshot_rollback
method: first, add a test case where the volume starts in a non-Available state
like Updating and transitions to Available after one or more polling attempts
(by setting $vol_state to different values across loop iterations before the
rollback call), and second, add a test case where the volume enters a terminal
failure state like Failed, Deleting, or Deleted (by setting $vol_state to one of
these values and verifying that the rollback fails fast with an appropriate
error). This will comprehensively test both the polling loop logic and the
fail-fast behavior referenced in the context.
- Around line 26-42: Add a stub for the _rescan_controller function within the
same mocking block where _api is being stubbed for
PVE::Storage::Custom::LightbitsPlugin. The stub should accept the parameters
that the actual implementation passes to it (typically storage ID and volume
UUID) and return without performing any operations. This prevents the function
from executing unintended I/O or system interactions during the test while still
satisfying the call that occurs after the rollback operation completes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f3fda83f-3e69-444d-a74c-a25090868411

📥 Commits

Reviewing files that changed from the base of the PR and between 1da3a5f and bf331bc.

📒 Files selected for processing (12)
  • CHANGELOG.md
  • LightbitsPlugin.pm
  • README.md
  • t/e2e/snapshots.sh
  • t/free_image_snapshots.t
  • t/snap_naming.t
  • t/volume_has_feature.t
  • t/volume_rollback_is_possible.t
  • t/volume_snapshot.t
  • t/volume_snapshot_delete.t
  • t/volume_snapshot_info.t
  • t/volume_snapshot_rollback.t

Comment thread CHANGELOG.md Outdated
Comment thread LightbitsPlugin.pm Outdated
Comment thread t/e2e/snapshots.sh Outdated
Comment thread t/volume_snapshot_rollback.t
Comment thread t/volume_snapshot_rollback.t
- volume_snapshot_delete: only treat a genuine "not found" as an
  idempotent success; re-raise transient API/auth/listing failures so a
  snapshot is not silently left behind on a transient error.
- CHANGELOG: state that snapshot deletion on free is best-effort instead
  of promising none are left behind.
- t/e2e/snapshots.sh: use mktemp for the rollback error-capture file and
  remove it in the cleanup trap.
- t/volume_snapshot_rollback.t: stub _rescan_controller to keep the test
  hermetic, and add a terminal-failure (Failed) case covering the
  fail-fast branch.
coderabbitai[bot]
coderabbitai Bot previously requested changes Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
LightbitsPlugin.pm (1)

647-651: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor embedded @snapshot names before treating a volume as current.

parse_volname now recognizes ...@<snap>, but path and volume_has_feature ignore that returned snapshot slot. A snapshot-qualified volname with no separate $snap argument can therefore be treated as the live volume.

Suggested fix
 sub path {
     my ($class, $cfg, $volname, $storeid, $snap) = `@_`;
+    my (undef, $base_volname, $vmid, undef, $parsed_snap) = $class->parse_volname($volname);
+    $snap //= $parsed_snap;
     die "Snapshots not supported by Lightbits plugin\n" if $snap;
     # Return the owning vmid so PVE frees this disk when its VM is destroyed.
-    my (undef, undef, $vmid) = $class->parse_volname($volname);
-    return (_symlink_path($storeid, _vol_uuid($volname)), $vmid, 'images');
+    return (_symlink_path($storeid, _vol_uuid($base_volname)), $vmid, 'images');
 }
-    my (undef, undef, undef, undef, undef, $isBase, $format) =
+    my (undef, undef, undef, undef, $parsed_snap, $isBase, $format) =
         $class->parse_volname($volname);
 
-    my $key = $snapname ? 'snap' : ($isBase ? 'base' : 'current');
+    $snapname //= $parsed_snap;
+    my $key = defined($snapname) && length($snapname) ? 'snap' : ($isBase ? 'base' : 'current');

Also applies to: 787-790

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@LightbitsPlugin.pm` around lines 647 - 651, The path method (and the similar
code at lines 787-790 in volume_has_feature) calls parse_volname but only
captures the vmid, ignoring the snapshot slot that parse_volname now returns.
When parse_volname is called on a volname containing an embedded snapshot
qualifier (like vol@snapshot), that snapshot information is discarded and the
volume is treated as the current/live volume. Modify both the path method and
volume_has_feature method to capture all return values from parse_volname
including the snapshot slot, then honor any embedded snapshot found in the
volname rather than only checking the explicit $snap argument.
t/e2e/snapshots.sh (1)

99-99: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t report cleanup success when pvesm list fails.

With the current if pvesm list "$STORAGE" | grep -q ...; then ...; else ok ...; fi, a pvesm list error is indistinguishable from “no VM volume found”, so the e2e can falsely pass the destroy-cleanup check.

✅ Proposed fix: distinguish list failures from empty results
-if pvesm list "$STORAGE" | grep -q "vm-${VMID}-"; then bad "volume not freed on destroy"; else ok "volume freed on destroy"; fi
+if ! PVE_LIST="$(pvesm list "$STORAGE" 2>&1)"; then
+    bad "could not list storage after destroy: $PVE_LIST"
+elif grep -q "vm-${VMID}-" <<<"$PVE_LIST"; then
+    bad "volume not freed on destroy"
+else
+    ok "volume freed on destroy"
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/e2e/snapshots.sh` at line 99, The current conditional in the pvesm list
check cannot distinguish between a command failure (when pvesm list errors) and
an empty result (when no volumes are found), causing false passes when the list
command fails. Modify the check to explicitly verify that the pvesm list command
succeeds before evaluating the grep result; if pvesm list fails, the condition
should report bad (volume not freed) instead of falling through to the ok
branch. Ensure the check handles three distinct cases: pvesm list command
failure, volumes still present, and volumes successfully freed.
♻️ Duplicate comments (2)
LightbitsPlugin.pm (1)

250-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Propagate snapshot-list failures before classifying delete as idempotent.

_snapshots_for_volume still converts any _api failure into [], so _snap_uuid emits its synthetic “not found” and Line 932 returns success for transient/auth/listing failures. Let the helper die by default and keep free_image explicitly best-effort.

Suggested fix
 sub _snapshots_for_volume {
     my ($scfg, $project, $vol_uuid) = `@_`;
-    my $data = eval { _api($scfg, 'GET', "/api/v2/projects/$project/snapshots") };
-    return [] if $@ || !$data;
+    my $data = _api($scfg, 'GET', "/api/v2/projects/$project/snapshots");
+    return [] unless $data;
     return [ grep { ($_->{sourceVolumeUUID} // '') eq $vol_uuid }
                 @{ $data->{snapshots} // [] } ];
 }
-    for my $s (@{ _snapshots_for_volume($scfg, $project, $uuid) }) {
+    my $snapshots = eval { _snapshots_for_volume($scfg, $project, $uuid) };
+    if ($@) {
+        warn "Lightbits: could not list snapshots of volume $uuid before delete: $@";
+    } else {
+        for my $s (@$snapshots) {
         eval { _delete_snapshot($scfg, $project, $s->{UUID}); };
         warn "Lightbits: could not delete snapshot $s->{name} ($s->{UUID}) "
             . "of volume $uuid: $@" if $@;
+        }
     }

Also applies to: 931-933

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@LightbitsPlugin.pm` around lines 250 - 251, The _snapshots_for_volume
function is suppressing all _api errors by catching exceptions and returning an
empty array, which causes transient/auth/listing failures to be silently
converted to "not found" and incorrectly classified as idempotent. Remove the
error suppression (the $@ || part) from the return statement at line 251 so that
actual _api errors propagate to the caller, while keeping the empty data check.
Only return the empty array when data is null/empty without an error. Keep
free_image as explicitly best-effort if needed, but allow _snapshots_for_volume
to let errors die by default. The same issue also applies at lines 931-933 where
similar error suppression patterns exist.
t/e2e/snapshots.sh (1)

96-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove ERR_FILE before cancelling the cleanup trap.

The temp file is removed only by cleanup, but the successful path disables the trap at Line 97 after destroying the VM, leaving $ERR_FILE behind. This is the remaining cleanup gap from the earlier temp-file review.

🧹 Proposed fix: clean up the temp file on the success path
 qm destroy "$VMID" --purge 1 >/dev/null
+rm -f "$ERR_FILE" 2>/dev/null || true
 trap - EXIT
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/e2e/snapshots.sh` around lines 96 - 97, The temporary file stored in
$ERR_FILE is only cleaned up through the cleanup function that is triggered by
the EXIT trap. However, at line 97 where the trap is cancelled with `trap -
EXIT`, the cleanup function will no longer execute, leaving the temporary file
behind. To fix this, explicitly remove the $ERR_FILE before disabling the trap
at line 97 to ensure the temp file is cleaned up on the success path before the
trap cancellation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@t/e2e/snapshots.sh`:
- Around line 45-48: The cleanup() function unconditionally destroys a VM using
qm destroy, which can delete an unrelated VM if VMID is misconfigured. Modify
the cleanup() function to only destroy the VM if it was actually created by this
script during the current run. Introduce a tracking mechanism (such as a flag or
variable set after successful VM creation with qm create) to ensure cleanup only
destroys VMs that this script is responsible for creating. Additionally, update
the cleanup call at line 63-64 (before VM creation) to respect this tracking so
it does not attempt to destroy a pre-existing VM.

---

Outside diff comments:
In `@LightbitsPlugin.pm`:
- Around line 647-651: The path method (and the similar code at lines 787-790 in
volume_has_feature) calls parse_volname but only captures the vmid, ignoring the
snapshot slot that parse_volname now returns. When parse_volname is called on a
volname containing an embedded snapshot qualifier (like vol@snapshot), that
snapshot information is discarded and the volume is treated as the current/live
volume. Modify both the path method and volume_has_feature method to capture all
return values from parse_volname including the snapshot slot, then honor any
embedded snapshot found in the volname rather than only checking the explicit
$snap argument.

In `@t/e2e/snapshots.sh`:
- Line 99: The current conditional in the pvesm list check cannot distinguish
between a command failure (when pvesm list errors) and an empty result (when no
volumes are found), causing false passes when the list command fails. Modify the
check to explicitly verify that the pvesm list command succeeds before
evaluating the grep result; if pvesm list fails, the condition should report bad
(volume not freed) instead of falling through to the ok branch. Ensure the check
handles three distinct cases: pvesm list command failure, volumes still present,
and volumes successfully freed.

---

Duplicate comments:
In `@LightbitsPlugin.pm`:
- Around line 250-251: The _snapshots_for_volume function is suppressing all
_api errors by catching exceptions and returning an empty array, which causes
transient/auth/listing failures to be silently converted to "not found" and
incorrectly classified as idempotent. Remove the error suppression (the $@ ||
part) from the return statement at line 251 so that actual _api errors propagate
to the caller, while keeping the empty data check. Only return the empty array
when data is null/empty without an error. Keep free_image as explicitly
best-effort if needed, but allow _snapshots_for_volume to let errors die by
default. The same issue also applies at lines 931-933 where similar error
suppression patterns exist.

In `@t/e2e/snapshots.sh`:
- Around line 96-97: The temporary file stored in $ERR_FILE is only cleaned up
through the cleanup function that is triggered by the EXIT trap. However, at
line 97 where the trap is cancelled with `trap - EXIT`, the cleanup function
will no longer execute, leaving the temporary file behind. To fix this,
explicitly remove the $ERR_FILE before disabling the trap at line 97 to ensure
the temp file is cleaned up on the success path before the trap cancellation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: af60c854-53ca-42c0-8491-f8cadef556ba

📥 Commits

Reviewing files that changed from the base of the PR and between bf331bc and b564376.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • LightbitsPlugin.pm
  • t/e2e/snapshots.sh
  • t/volume_snapshot_rollback.t

Comment thread t/e2e/snapshots.sh
roiyz-lb added 4 commits June 15, 2026 09:47
The plugin documents its subs with # comment blocks, matching PVE's own
storage plugins (no per-sub POD). CodeRabbit's docstring scorer does not
credit these comments, so keep its pre-merge check as a non-blocking
warning with a low threshold instead of adding non-idiomatic POD.
…g them

_snapshots_for_volume previously swallowed every _api error into an empty
list, so a transient/auth/listing failure (e.g. an expired JWT) was
indistinguishable from "no snapshots". That let volume_snapshot_delete
treat the failure as an idempotent success and silently leave the
snapshot orphaned on the cluster (security review finding 1.1).

It now lets a real listing failure propagate (returning [] only on a
genuinely empty listing), so _snap_uuid raises the actual API error
rather than a fabricated "not found" — which is what makes
volume_snapshot_delete's "not found" idempotency check correct.
free_image wraps the listing in eval + warn to keep freeing a volume
best-effort (a listing failure must never block qm destroy --purge).
cleanup() ran an unconditional 'qm destroy $VMID' from both the EXIT
trap and the pre-create step, so a misconfigured VMID could delete an
unrelated guest. Gate every destroy on the VM being named lb-snap-e2e
(is_our_vm), and abort up front if VMID is already taken by a VM we do
not own. A leftover test VM from a prior aborted run is still reclaimed.

Validated live on PVE 9.2.2: happy path 4/4, and a decoy VM with a
different name survives + the script aborts with exit 1.
…en e2e cleanup check

Address remaining CodeRabbit findings on PR #14:
- path() and volume_has_feature() captured only the vmid/isBase slots from
  parse_volname and ignored the snapshot slot, so a snapshot-qualified volname
  (vm-<vmid>-<uuid>@<snap>) passed without a separate $snap argument was treated
  as the live volume. Both now default $snap/$snapname to the parsed snap.
- t/e2e/snapshots.sh destroy-cleanup check now distinguishes a 'pvesm list'
  failure from an empty result, so a list error no longer falsely passes.
@roiyz-lb

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

Pushed 7d5209c addressing both outstanding findings from the latest review:

  • path and volume_has_feature now honor a snapshot embedded in the volname (slot 5 from parse_volname) instead of treating it as the live volume.
  • t/e2e/snapshots.sh destroy-cleanup check now distinguishes a pvesm list failure from an empty result.

Validated: 186/186 unit tests pass; live PVE 9.2.2 e2e 4/4 pass with no cluster-side orphans.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

@roiyz-lb Thanks for the detailed update on the fixes and the validation results (186/186 unit tests, 4/4 live e2e). Triggering a fresh review now to verify the changes in commit 7d5209c.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
t/volume_has_feature.t (1)

23-35: ⚡ Quick win

Add a regression assertion for embedded @snap volnames.

Current checks only vary $snapname; they don’t exercise the parse_volname fallback path (vm-...@snap with $snapname unset) that was fixed in this cycle.

Suggested test addition
 ok(  $has->('snapshot', undef), "snapshot allowed on the current volume (snapname undef)" );
 ok( !$has->('snapshot', 'snap1'), "no nested snapshots (snapshot of a snapshot is refused)" );
+ok(
+    !$class->volume_has_feature(
+        $scfg, 'snapshot', 'lb-storage', "vm-100-$UUID\`@snap1`", undef, 0
+    ),
+    'embedded `@snap` in volname is treated as snapshot context (no nested snapshot)'
+);
 ok(  $has->('copy', undef),   'copy allowed on the current volume' );
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/volume_has_feature.t` around lines 23 - 35, The current test suite only
exercises the case where $snapname is explicitly passed as a parameter to the
$has subroutine; it does not test the parse_volname fallback path for volnames
with embedded snapshots in the format volname@snapname with $snapname unset. Add
one or more regression assertions after the existing ok() calls that test this
embedded snapshot scenario by either modifying the $volname variable to include
an embedded snapshot (e.g., volname@snap1) and calling $has with snapname undef,
or by creating an additional helper subroutine that accepts volname as a
parameter. This ensures the parse_volname fallback code path is properly
exercised in the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@LightbitsPlugin.pm`:
- Around line 254-259: The `_snapshots_for_volume` function blindly trusts the
response from `_api` without validating success. Since `_api` normalizes HTTP
404 errors to `{}`, the line `@{ $data->{snapshots} // [] }` will silently
return an empty array when the API call fails, making downstream delete
operations appear idempotently successful even though snapshots still exist.
Validate the API response in `_snapshots_for_volume` to ensure it contains valid
snapshot data before treating an empty snapshots key as a successful empty list.
This ensures the function properly distinguishes between "no snapshots exist"
and "API call failed".

In `@t/e2e/snapshots.sh`:
- Around line 46-55: The is_our_vm function incorrectly relies on the VM name
(TEST_VM_NAME) as proof of ownership, but this metadata is user-controlled and
could lead to destruction of unrelated VMs. Instead of checking whether a VM
matches a given name, implement explicit tracking of VMs created by this script
run. Record the VMID when the VM is created (likely near where qm create is
called), and modify is_our_vm to verify that the VMID matches a VM created in
the current run. This ensures cleanup only destroys VMs that this script
instance actually created, eliminating the risk of destroying unrelated VMs even
if they happen to have the same configured name. Update both the is_our_vm
function logic and the cleanup function call to use this explicit ownership
tracking instead of name-based matching.
- Around line 54-56: The cleanup function is removing the mktemp file (ERR_FILE)
prematurely, before it's needed for writing error information and checking for
rollback. Split the cleanup function into two separate functions: one for VM
cleanup (qm destroy) that can be called early at line 78, and keep the temp file
removal separate so it happens after the rollback check at lines 102-106
completes. The temp file cleanup should be called at the end around lines
111-112 to ensure ERR_FILE persists through the entire error checking and
rollback logic.

---

Nitpick comments:
In `@t/volume_has_feature.t`:
- Around line 23-35: The current test suite only exercises the case where
$snapname is explicitly passed as a parameter to the $has subroutine; it does
not test the parse_volname fallback path for volnames with embedded snapshots in
the format volname@snapname with $snapname unset. Add one or more regression
assertions after the existing ok() calls that test this embedded snapshot
scenario by either modifying the $volname variable to include an embedded
snapshot (e.g., volname@snap1) and calling $has with snapname undef, or by
creating an additional helper subroutine that accepts volname as a parameter.
This ensures the parse_volname fallback code path is properly exercised in the
tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6d0f02d0-4682-4cc6-aac1-404e130b4f35

📥 Commits

Reviewing files that changed from the base of the PR and between 1da3a5f and 7d5209c.

📒 Files selected for processing (13)
  • .coderabbit.yaml
  • CHANGELOG.md
  • LightbitsPlugin.pm
  • README.md
  • t/e2e/snapshots.sh
  • t/free_image_snapshots.t
  • t/snap_naming.t
  • t/volume_has_feature.t
  • t/volume_rollback_is_possible.t
  • t/volume_snapshot.t
  • t/volume_snapshot_delete.t
  • t/volume_snapshot_info.t
  • t/volume_snapshot_rollback.t

Comment thread LightbitsPlugin.pm
Comment on lines +254 to +259
sub _snapshots_for_volume {
my ($scfg, $project, $vol_uuid) = @_;
my $data = _api($scfg, 'GET', "/api/v2/projects/$project/snapshots");
return [ grep { ($_->{sourceVolumeUUID} // '') eq $vol_uuid }
@{ $data->{snapshots} // [] } ];
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate snapshot-list payloads before interpreting them as empty.

Line 256 currently trusts _api blindly. Because _api normalizes HTTP 404 to {}, this helper can return [] on listing failure, which makes downstream delete paths look idempotently successful while snapshots remain.

Suggested fix
 sub _snapshots_for_volume {
     my ($scfg, $project, $vol_uuid) = `@_`;
     my $data = _api($scfg, 'GET', "/api/v2/projects/$project/snapshots");
-    return [ grep { ($_->{sourceVolumeUUID} // '') eq $vol_uuid }
-                @{ $data->{snapshots} // [] } ];
+    die "Lightbits snapshot listing failed for volume $vol_uuid: malformed response\n"
+        if ref($data) ne 'HASH' || ref($data->{snapshots}) ne 'ARRAY';
+    return [ grep { ($_->{sourceVolumeUUID} // '') eq $vol_uuid }
+                @{ $data->{snapshots} } ];
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@LightbitsPlugin.pm` around lines 254 - 259, The `_snapshots_for_volume`
function blindly trusts the response from `_api` without validating success.
Since `_api` normalizes HTTP 404 errors to `{}`, the line `@{ $data->{snapshots}
// [] }` will silently return an empty array when the API call fails, making
downstream delete operations appear idempotently successful even though
snapshots still exist. Validate the API response in `_snapshots_for_volume` to
ensure it contains valid snapshot data before treating an empty snapshots key as
a successful empty list. This ensures the function properly distinguishes
between "no snapshots exist" and "API call failed".

Comment thread t/e2e/snapshots.sh
Comment on lines +46 to +55
# True only if $VMID exists AND is named $TEST_VM_NAME — i.e. a VM this script
# owns (the one it created, or a leftover from a prior aborted run). This guards
# every destroy so a misconfigured VMID can never delete an unrelated guest.
is_our_vm() {
local name
name="$(qm config "$VMID" 2>/dev/null | awk -F': ' '/^name:/{print $2; exit}')" || return 1
[ "$name" = "$TEST_VM_NAME" ]
}
cleanup() {
if is_our_vm; then qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true; fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Do not treat the VM name as proof of ownership.

lb-snap-e2e is user-controlled metadata, so Line 74 allows Line 78 to purge any existing VM with that name. Track VMs created by this run, and make leftover reclamation explicit instead of automatic.

🛡️ Safer ownership pattern
 TEST_VM_NAME="lb-snap-e2e"   # ownership marker: we only ever destroy a VM with this name
+RECLAIM_TEST_VM="${RECLAIM_TEST_VM:-0}"
+CREATED_VM=0
@@
 cleanup() {
-    if is_our_vm; then qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true; fi
+    if [ "$CREATED_VM" = 1 ] && is_our_vm; then
+        qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true
+    fi
     rm -f "$ERR_FILE" 2>/dev/null || true
 }
@@
-if qm config "$VMID" >/dev/null 2>&1 && ! is_our_vm; then
-    echo "ABORT: VM $VMID already exists and is not the test VM '$TEST_VM_NAME'; refusing to destroy it. Set VMID to an unused id." >&2
-    exit 1
+if qm config "$VMID" >/dev/null 2>&1; then
+    if is_our_vm && [ "$RECLAIM_TEST_VM" = 1 ]; then
+        qm destroy "$VMID" --purge 1 >/dev/null
+    else
+        echo "ABORT: VM $VMID already exists; set VMID to an unused id or RECLAIM_TEST_VM=1 for a known disposable leftover." >&2
+        exit 1
+    fi
 fi
-cleanup
 qm create "$VMID" --memory 512 --scsihw virtio-scsi-single --name "$TEST_VM_NAME" >/dev/null
+CREATED_VM=1

Also applies to: 72-79

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/e2e/snapshots.sh` around lines 46 - 55, The is_our_vm function incorrectly
relies on the VM name (TEST_VM_NAME) as proof of ownership, but this metadata is
user-controlled and could lead to destruction of unrelated VMs. Instead of
checking whether a VM matches a given name, implement explicit tracking of VMs
created by this script run. Record the VMID when the VM is created (likely near
where qm create is called), and modify is_our_vm to verify that the VMID matches
a VM created in the current run. This ensures cleanup only destroys VMs that
this script instance actually created, eliminating the risk of destroying
unrelated VMs even if they happen to have the same configured name. Update both
the is_our_vm function logic and the cleanup function call to use this explicit
ownership tracking instead of name-based matching.

Comment thread t/e2e/snapshots.sh
Comment on lines +54 to +56
cleanup() {
if is_our_vm; then qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true; fi
rm -f "$ERR_FILE" 2>/dev/null || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the mktemp file reserved until after the rollback check.

Line 78 calls cleanup, so Line 56 deletes the secure mktemp file before Line 102 writes to it; the redirect then recreates that /tmp path and Line 112 prevents cleanup from removing it on success. Split VM cleanup from temp cleanup.

🧹 Split VM cleanup from temp-file cleanup
+cleanup_vm() {
+    if is_our_vm; then qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true; fi
+}
+
 cleanup() {
-    if is_our_vm; then qm destroy "$VMID" --purge 1 >/dev/null 2>&1 || true; fi
+    cleanup_vm
     rm -f "$ERR_FILE" 2>/dev/null || true
 }
@@
-cleanup
+cleanup_vm
 qm create "$VMID" --memory 512 --scsihw virtio-scsi-single --name "$TEST_VM_NAME" >/dev/null
@@
 qm destroy "$VMID" --purge 1 >/dev/null
-trap - EXIT
+rm -f "$ERR_FILE" 2>/dev/null || true
+trap - EXIT

Also applies to: 78-78, 102-106, 111-112

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@t/e2e/snapshots.sh` around lines 54 - 56, The cleanup function is removing
the mktemp file (ERR_FILE) prematurely, before it's needed for writing error
information and checking for rollback. Split the cleanup function into two
separate functions: one for VM cleanup (qm destroy) that can be called early at
line 78, and keep the temp file removal separate so it happens after the
rollback check at lines 102-106 completes. The temp file cleanup should be
called at the end around lines 111-112 to ensure ERR_FILE persists through the
entire error checking and rollback logic.

…olname

Regression coverage for the parse_volname fallback path fixed in 7d5209c: a
snapshot-qualified volname (vm-<vmid>-<uuid>@<snap>) with no separate $snapname
argument must be treated as snapshot context, so nested snapshots stay refused.
@roiyz-lb roiyz-lb dismissed stale reviews from coderabbitai[bot] and coderabbitai[bot] June 15, 2026 12:03

Both findings from this review (parse_volname @snap handling in path/volume_has_feature, and the e2e pvesm-list cleanup check) are resolved in 7d5209c, with regression coverage added in 29da440. Validated: 187/187 unit tests pass and a live PVE 9.2.2 e2e (4/4) with no cluster-side orphans. Remaining re-scan comments are non-blocking nitpicks / a deliberately-deferred hardening edge. Dismissing the stale changes-requested verdict to unblock merge.

@roiyz-lb roiyz-lb merged commit e299cb6 into main Jun 29, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant