fix(alloc): fail fast when a new volume never becomes Available #10
alloc_image polled 30x for state 'Available' then returned the volid regardless — so a volume that landed in a terminal 'Failed' state (e.g. when the LightOS cluster is unhealthy) was reported as successfully created, and the problem only surfaced later, cryptically, when activate_volume could not determine the volume's NSID. Track the last observed state and die with the volume name, UUID, and state if it hits a terminal failure (Failed/Deleting/Deleted) or never reaches Available within the timeout. Add t/alloc_failfast.t covering the happy path, terminal-failure, and timeout cases (no-op sleep so the poll runs instantly).
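The poll behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the plugin's actual code: `wait_for_available`, the `$get_state` callback, the `attempts`/`delay` options, and the error wording are all hypothetical stand-ins. Only the 30-attempt count and the terminal states come from the description.

```perl
use strict;
use warnings;

# Hypothetical helper, not the plugin's real code. $get_state stands in for
# whatever call fetches the volume's current state from the cluster.
sub wait_for_available {
    my ($name, $uuid, $get_state, %opt) = @_;
    my $attempts = $opt{attempts} // 30;   # poll count taken from the PR text
    my $delay    = $opt{delay}    // 1;    # assumed poll interval, in seconds
    my $last     = 'unknown';              # last state observed so far

    for my $attempt (1 .. $attempts) {
        $last = $get_state->() // 'unknown';
        return 1 if $last eq 'Available';
        # Terminal states: polling any further cannot help, so stop now.
        die "volume '$name' ($uuid) entered terminal state '$last'\n"
            if $last =~ /^(?:Failed|Deleting|Deleted)$/;
        sleep $delay if $delay;
    }
    # Never reached Available: report the last state instead of returning.
    die "volume '$name' ($uuid) never became Available (last state '$last')\n";
}

# Usage with canned state sequences instead of a live cluster.
my @states = ('Creating', 'Creating', 'Available');
print "ok\n"
    if wait_for_available('vm-100-disk-0', 'uuid-1', sub { shift @states }, delay => 0);

my $ok = eval { wait_for_available('vm-100-disk-0', 'uuid-1', sub { 'Failed' }, delay => 0); 1 };
print "died: $@" unless $ok;
```

Tracking `$last` outside the loop is what lets the timeout message name the state the volume was stuck in, rather than only saying that the wait expired.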
📝 Walkthrough

`alloc_image` now polls a created Lightbits volume, returns only on state `Available`, fails immediately on terminal states (`Failed`, `Deleting`, `Deleted`), and times out with the last-seen state reported. A new test and CHANGELOG entry validate and document this behavior.

Changes

Failure-fast volume allocation
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@t/alloc_failfast.t`:
- Around line 46-51: Add tests for the other terminal states by duplicating the
existing Failed-state test block in t/alloc_failfast.t and changing $get_state
to 'Deleting' and 'Deleted' respectively; keep the same call to
$class->alloc_image('lb-storage', $scfg, 100, 'raw', undef, 1048576) and the
same assertions: ok(!$ok, ...) and like($@, qr/\Q$UUID\E/, ...) and like($@,
qr/state 'Deleting'/ or qr/state 'Deleted'/) so the suite verifies alloc_image
dies and the error message includes the volume UUID and the terminal state for
each: 'Failed', 'Deleting', and 'Deleted'.
alloc_image treats Failed/Deleting/Deleted as terminal, but the test only exercised Failed. Add Deleting and Deleted cases (same die + UUID + state assertions) so all terminal states the code branches on are covered.
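One way to cover all three terminal states without duplicating the block is a table-driven loop. The sketch below is an illustration under stated assumptions, not the contents of `t/alloc_failfast.t`: `fake_alloc` is a hypothetical stand-in for the real `$class->alloc_image('lb-storage', $scfg, 100, 'raw', undef, 1048576)` call, and the UUID and error wording are made up. It also shows one common way to get a no-op `sleep`, by overriding `CORE::GLOBAL::sleep` before the polling code is compiled.

```perl
use strict;
use warnings;

# Replace the built-in sleep before any polling code is compiled, so the
# poll loop runs without real delays.
BEGIN {
    no warnings 'once';
    *CORE::GLOBAL::sleep = sub { 0 };
}

use Test::More;

my $UUID = '00000000-0000-0000-0000-000000000001';   # made-up UUID
my $get_state;                                        # state the fake cluster reports

# Hypothetical stand-in for alloc_image's poll loop; the real test would
# call the plugin method against a mocked API instead.
sub fake_alloc {
    my $state = 'unknown';
    for (1 .. 30) {
        $state = $get_state;
        return 'vm-100-disk-0' if $state eq 'Available';
        die "volume $UUID: state '$state'\n"
            if $state =~ /^(?:Failed|Deleting|Deleted)$/;
        sleep 1;   # resolved to the no-op override installed above
    }
    die "volume $UUID: timed out in state '$state'\n";
}

# Same three assertions for every terminal state the code branches on.
for my $terminal (qw(Failed Deleting Deleted)) {
    $get_state = $terminal;
    my $ok  = eval { fake_alloc(); 1 };
    my $err = $@;   # capture before any other call can touch $@
    ok(!$ok, "dies when the volume is $terminal");
    like($err, qr/\Q$UUID\E/, "error names the UUID ($terminal)");
    like($err, qr/state '\Q$terminal\E'/, "error names state $terminal");
}

done_testing();
```

Looping over the state names keeps the assertions identical for each case, so a new terminal state only needs one more entry in the list.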
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@LightbitsPlugin.pm`:
- Around line 585-592: The resize-poll loop in the block using _api($scfg,
'GET', "/api/v2/volumes/$uuid?projectName=$project") should detect terminal
volume states (e.g., 'Failed', 'Deleting') each iteration and fail fast instead
of sleeping until timeout; after obtaining $vol and setting $cur and $state in
the for my $attempt loop, add a check for those terminal states and bail out
immediately (die or return an error) with a clear message referencing $uuid and
$state, mirroring the fail-fast behavior used by alloc_image.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
🛑 Comments failed to post (1)

LightbitsPlugin.pm (1)

585-592: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider failing fast on terminal volume states during resize polling.

The resize polling loop currently waits up to 2 minutes before reporting a failure, even if the volume enters a terminal state like `Failed` or `Deleting`. For consistency with `alloc_image` (lines 432-434) and better user experience, consider checking for terminal states in each iteration and failing immediately.

Suggested enhancement

```diff
 for my $attempt (1..60) {
     my $vol = _api($scfg, 'GET', "/api/v2/volumes/$uuid?projectName=$project");
     $cur = int($vol->{size} // 0);
     $state = $vol->{state} // '';
     last if $cur >= $bytes && $state eq 'Available';
+    die "Lightbits volume $uuid resize failed on the cluster (state '$state')\n"
+        if $state =~ /^(Failed|Deleting|Deleted)$/i;
     sleep 2;
 }
```

This would report resize failures immediately rather than after the 2-minute timeout, matching the fail-fast behavior now present in `alloc_image`.

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```perl
my ($cur, $state) = (0, '');
for my $attempt (1..60) {
    my $vol = _api($scfg, 'GET', "/api/v2/volumes/$uuid?projectName=$project");
    $cur = int($vol->{size} // 0);
    $state = $vol->{state} // '';
    last if $cur >= $bytes && $state eq 'Available';
    die "Lightbits volume $uuid resize failed on the cluster (state '$state')\n"
        if $state =~ /^(Failed|Deleting|Deleted)$/i;
    sleep 2;
}
```
Summary

`alloc_image` polled 30× for state `Available` and then returned the volid regardless of the final state. If a volume landed in a terminal `Failed` state (e.g. when the LightOS cluster is unhealthy), it was reported as successfully created, and the failure only surfaced later, and cryptically, when `activate_volume` couldn't determine the volume's NSID (`Cannot determine NSID for volume ...`).

This change makes the poll loop fail fast with a clear, actionable error:

- it dies as soon as the volume reaches a terminal state (`Failed`/`Deleting`/`Deleted`), and
- it dies if the volume never reaches `Available` within the timeout, reporting the last observed state.

Why now

Observed live: a backing-storage blip put the test LightOS cluster into `state: Error`, new volumes provisioned as `Failed`/`nsid=0`, and `qm` operations failed with the unhelpful NSID error instead of a clear "volume creation failed on the cluster" message at create time.

Testing

- `t/alloc_failfast.t`: happy path (Available → volid), terminal failure (Failed → dies), and timeout (stuck Creating → dies). No-op `sleep` so it runs instantly. Full suite: 65 tests pass.
- `pvesm alloc` → `list` → `free` happy path confirmed, no regression.

Notes

- `feat/volume-resize` (PR "feat(resize): support growing Lightbits volumes via qm resize" #9); branched off `main`. The same silent-timeout class was also fixed in that PR's `volume_resize`.
- The `Failed` volume stays on the cluster for the operator to inspect/clean (no auto-delete, to avoid masking the original failure).

Summary by CodeRabbit
Bug Fixes
Documentation
Tests