Skip to content

fix(alloc): fail fast when a new volume never becomes Available#10

Merged
roiyz-lb merged 3 commits into
mainfrom
feat/alloc-failfast
Jun 11, 2026
Merged

fix(alloc): fail fast when a new volume never becomes Available#10
roiyz-lb merged 3 commits into
mainfrom
feat/alloc-failfast

Conversation

@roiyz-lb

@roiyz-lb roiyz-lb commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

alloc_image polled 30× for state Available and then returned the volid regardless of the final state. If a volume landed in a terminal Failed state — e.g. when the LightOS cluster is unhealthy — it was reported as successfully created, and the failure only surfaced later (and cryptically) when activate_volume couldn't determine the volume's NSID (Cannot determine NSID for volume ...).

This change makes the poll loop fail fast with a clear, actionable error:

  • dies immediately on a terminal cluster state (Failed/Deleting/Deleted), and
  • dies on timeout if the volume never reaches Available,
  • including the volume name, UUID, and last-seen state in the message.

Why now

Observed live: a backing-storage blip put the test LightOS cluster into state: Error, new volumes provisioned as Failed/nsid=0, and qm operations failed with the unhelpful NSID error instead of a clear "volume creation failed on the cluster" message at create time.

Testing

  • New t/alloc_failfast.t: happy path (Available → volid), terminal failure (Failed → dies), and timeout (stuck Creating → dies). No-op sleep so it runs instantly. Full suite: 65 tests pass.
  • Integration on PVE 9.2.2 + LightOS (healthy): pvesm alloclistfree happy path confirmed, no regression.

Notes

  • Independent of feat/volume-resize (PR feat(resize): support growing Lightbits volumes via qm resize #9); branched off main. The same silent-timeout class was also fixed in that PR's volume_resize.
  • Leaves any already-Failed volume on the cluster for the operator to inspect/clean (no auto-delete, to avoid masking the original failure).

Summary by CodeRabbit

  • Bug Fixes

    • Volume creation now fails fast with clear errors when a new volume enters terminal states or never becomes Available, avoiding unusable IDs and subsequent attach-time errors.
  • Documentation

    • Changelog updated to describe the improved volume creation error reporting.
  • Tests

    • New tests verify fail-fast behavior for terminal states and timeout/no-convergence scenarios.

alloc_image polled 30x for state 'Available' then returned the volid
regardless — so a volume that landed in a terminal 'Failed' state (e.g.
when the LightOS cluster is unhealthy) was reported as successfully
created, and the problem only surfaced later, cryptically, when
activate_volume could not determine the volume's NSID.

Track the last observed state and die with the volume name, UUID, and
state if it hits a terminal failure (Failed/Deleting/Deleted) or never
reaches Available within the timeout.

Add t/alloc_failfast.t covering the happy path, terminal-failure, and
timeout cases (no-op sleep so the poll runs instantly).
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e3299862-6c55-45ad-88d7-c9508250b65f

📥 Commits

Reviewing files that changed from the base of the PR and between 295d3cd and 5f40a55.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • LightbitsPlugin.pm

📝 Walkthrough

Walkthrough

alloc_image now polls a created Lightbits volume, returns only on state Available, fails immediately on terminal states (Failed, Deleting, Deleted), and times out with the last-seen state reported. A new test and CHANGELOG entry validate and document this behavior.

Changes

Failure-fast volume allocation

Layer / File(s) Summary
alloc_image state-checking and error reporting
LightbitsPlugin.pm, CHANGELOG.md
After volume creation, polling now captures the cluster-reported state each iteration, returns early on Available, dies immediately on terminal states (Failed, Deleting, Deleted), and dies with timeout message including the last-seen state if retries exhaust.
alloc_failfast test suite
t/alloc_failfast.t
Test infrastructure stubs plugin API responses and overrides sleep to control and verify volume lifecycle. Tests assert: success when volume reaches Available; failure-fast when volume enters Failed, Deleting, or Deleted; and timeout with last-state reporting when volume remains Creating.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(alloc): fail fast when a new volume never becomes Available' clearly and specifically summarizes the main change: the alloc_image function now fails fast instead of silently timing out when volumes don't become available.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@t/alloc_failfast.t`:
- Around line 46-51: Add tests for the other terminal states by duplicating the
existing Failed-state test block in t/alloc_failfast.t and changing $get_state
to 'Deleting' and 'Deleted' respectively; keep the same call to
$class->alloc_image('lb-storage', $scfg, 100, 'raw', undef, 1048576) and the
same assertions: ok(!$ok, ...) and like($@, qr/\Q$UUID\E/, ...) and like($@,
qr/state 'Deleting'/ or qr/state 'Deleted'/) so the suite verifies alloc_image
dies and the error message includes the volume UUID and the terminal state for
each: 'Failed', 'Deleting', and 'Deleted'.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 49758c9c-7c4c-49c8-b71d-eb689e6ee2f4

📥 Commits

Reviewing files that changed from the base of the PR and between f9a342d and 5f792ce.

📒 Files selected for processing (3)
  • CHANGELOG.md
  • LightbitsPlugin.pm
  • t/alloc_failfast.t

Comment thread t/alloc_failfast.t
alloc_image treats Failed/Deleting/Deleted as terminal, but the test only
exercised Failed. Add Deleting and Deleted cases (same die + UUID + state
assertions) so all terminal states the code branches on are covered.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@LightbitsPlugin.pm`:
- Around line 585-592: The resize-poll loop in the block using _api($scfg,
'GET', "/api/v2/volumes/$uuid?projectName=$project") should detect terminal
volume states (e.g., 'Failed', 'Deleting') each iteration and fail fast instead
of sleeping until timeout; after obtaining $vol and setting $cur and $state in
the for my $attempt loop, add a check for those terminal states and bail out
immediately (die or return an error) with a clear message referencing $uuid and
$state, mirroring the fail-fast behavior used by alloc_image.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e3299862-6c55-45ad-88d7-c9508250b65f

📥 Commits

Reviewing files that changed from the base of the PR and between 295d3cd and 5f40a55.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • LightbitsPlugin.pm

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@LightbitsPlugin.pm`:
- Around line 585-592: The resize-poll loop in the block using _api($scfg,
'GET', "/api/v2/volumes/$uuid?projectName=$project") should detect terminal
volume states (e.g., 'Failed', 'Deleting') each iteration and fail fast instead
of sleeping until timeout; after obtaining $vol and setting $cur and $state in
the for my $attempt loop, add a check for those terminal states and bail out
immediately (die or return an error) with a clear message referencing $uuid and
$state, mirroring the fail-fast behavior used by alloc_image.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e3299862-6c55-45ad-88d7-c9508250b65f

📥 Commits

Reviewing files that changed from the base of the PR and between 295d3cd and 5f40a55.

📒 Files selected for processing (2)
  • CHANGELOG.md
  • LightbitsPlugin.pm
🛑 Comments failed to post (1)
LightbitsPlugin.pm (1)

585-592: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider failing fast on terminal volume states during resize polling.

The resize polling loop currently waits up to 2 minutes before reporting a failure, even if the volume enters a terminal state like Failed or Deleting. For consistency with alloc_image (lines 432-434) and better user experience, consider checking for terminal states in each iteration and failing immediately.

Suggested enhancement
 for my $attempt (1..60) {
     my $vol = _api($scfg, 'GET', "/api/v2/volumes/$uuid?projectName=$project");
     $cur    = int($vol->{size} // 0);
     $state  = $vol->{state} // '';
     last if $cur >= $bytes && $state eq 'Available';
+    die "Lightbits volume $uuid resize failed on the cluster (state '$state')\n"
+        if $state =~ /^(Failed|Deleting|Deleted)$/i;
     sleep 2;
 }

This would report resize failures immediately rather than after the 2-minute timeout, matching the fail-fast behavior now present in alloc_image.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    my ($cur, $state) = (0, '');
    for my $attempt (1..60) {
        my $vol = _api($scfg, 'GET', "/api/v2/volumes/$uuid?projectName=$project");
        $cur    = int($vol->{size} // 0);
        $state  = $vol->{state} // '';
        last if $cur >= $bytes && $state eq 'Available';
        die "Lightbits volume $uuid resize failed on the cluster (state '$state')\n"
            if $state =~ /^(Failed|Deleting|Deleted)$/i;
        sleep 2;
    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@LightbitsPlugin.pm` around lines 585 - 592, The resize-poll loop in the block
using _api($scfg, 'GET', "/api/v2/volumes/$uuid?projectName=$project") should
detect terminal volume states (e.g., 'Failed', 'Deleting') each iteration and
fail fast instead of sleeping until timeout; after obtaining $vol and setting
$cur and $state in the for my $attempt loop, add a check for those terminal
states and bail out immediately (die or return an error) with a clear message
referencing $uuid and $state, mirroring the fail-fast behavior used by
alloc_image.

@roiyz-lb roiyz-lb merged commit e493e61 into main Jun 11, 2026
2 checks passed
@roiyz-lb roiyz-lb deleted the feat/alloc-failfast branch June 11, 2026 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant