[codex] docs: add Slurm academic sandbox plan and adapter #345

Open
zozo123 wants to merge 4 commits into openclaw:main from zozo123:codex/slurm-sandbox-provider

Conversation

@zozo123 zozo123 commented Jun 14, 2026

Contributor

Summary

  • Adds Slurm academic sandbox guidance that treats Slurm as a scheduler allocation system, not a VM/sandbox provider, and recommends provider: external as the first safe campus offer.
  • Adds examples/slurm-external-provider/, a reference Python external-provider adapter plus sample unprivileged sshd runner for campus pilots.
  • Cross-links the Slurm guidance from the main docs index, feature index, and bring-your-own infrastructure docs.

What the adapter does

  • Implements the external provider operations for doctor, acquire, resolve, list, release, touch, and cleanup.
  • Submits leases through sbatch --parsable, records the Slurm job as cloudId: slurm/job/<id>, watches squeue/optional sacct, and releases with scancel.
  • Waits for a scheduled allocation to publish an SSH endpoint, then returns the normal Crabbox external-provider SSH shape: host, port, user, key, proxy command, readiness check, and labels.
  • Keeps Slurm-specific policy in the adapter/config: account, partition, QOS, CPUs, memory, wall time, GRES, constraints, login-host proxying, runner script, and site work root.
  • Ships the runner only as a replaceable example; real campuses can swap in a site gateway, Open OnDemand-style connect script, Apptainer/Singularity wrapper, Pyxis/Enroot policy, or managed SSH service.
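The submission step described above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the config key names (`account`, `wallTime`, and so on) and both function names are assumptions made for the example, while the `sbatch` options themselves are standard Slurm flags.

```python
from typing import List, Mapping

# Hypothetical adapter config keys mapped to standard sbatch options.
SBATCH_OPTIONS = {
    "account": "--account",
    "partition": "--partition",
    "qos": "--qos",
    "cpus": "--cpus-per-task",
    "memory": "--mem",
    "wallTime": "--time",
    "gres": "--gres",
    "constraint": "--constraint",
}


def build_sbatch_argv(config: Mapping[str, object], runner_script: str) -> List[str]:
    """Build an sbatch command line, skipping options the site did not set."""
    argv = ["sbatch", "--parsable"]
    for key, flag in SBATCH_OPTIONS.items():
        value = config.get(key)
        if value not in (None, ""):
            argv.append(f"{flag}={value}")
    argv.append(runner_script)
    return argv


def cloud_id(job_id: str) -> str:
    """Format a Slurm job id in the cloudId shape named in the PR summary."""
    return f"slurm/job/{job_id}"
```

Keeping the mapping in one table means site policy stays in config rather than in code, which matches the stated goal of leaving Slurm-specific policy in the adapter.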

Research notes

  • SchedMD documents that sbatch returns as soon as the script is accepted, which can be well before the allocation actually runs, so the docs recommend warmup to absorb scheduler queue time.
  • SchedMD documents squeue for live job state and scancel for cancellation, which maps cleanly to external-provider list/status/release behavior.
  • SchedMD's REST guidance says slurmrestd is not designed to be directly internet-facing, so the docs keep Slurm control-plane access site-local.
  • JupyterHub BatchSpawner and Open OnDemand reinforce the same academic pattern: site-owned scheduler templates submit jobs, then expose an interactive endpoint while local policy stays local.

Closes #325

Validation

  • python3 -m py_compile examples/slurm-external-provider/slurm-cbx.py
  • bash -n examples/slurm-external-provider/runner-unprivileged-sshd.sh
  • git diff --check
  • Fake-Slurm protocol smoke for acquire, list, and release against stubbed sbatch, squeue, and scancel
  • scripts/check-docs.sh

clawsweeper Bot commented Jun 14, 2026

Contributor

Codex review: needs real behavior proof before merge. Reviewed June 14, 2026, 4:30 AM ET / 08:30 UTC.

Summary
The PR updates docs indexes and BYO guidance, adds Slurm academic sandbox docs, and adds a Python Slurm external-provider example with a shell SSH runner and pytest tests.

Reproducibility: not applicable; this is a feature/docs/example PR, not a bug report. Source review confirms current main lacks the requested Slurm surface and the PR adds it.

Review metrics: 2 noteworthy metrics.

  • Changed surface: 9 files, +1698/-1. The PR spans docs, example runtime code, shell runner behavior, and tests rather than a small documentation-only edit.
  • Runtime example code: 2 added scripts, 792 lines. The adapter and runner introduce Slurm process-management and SSH endpoint behavior that needs cleanup-safety review and real proof.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Add redacted terminal output or logs from a real Slurm or representative external-provider acquire/list/release run; redact IPs, keys, non-public endpoints, and other private details.
  • Preserve state on inconclusive scheduler status or scancel failures and cover those paths in the fake-Slurm tests.
  • [P1] Validate proxy/config requirements in doctor before any Slurm job can be submitted.
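A doctor-time check for the proxy requirement might look like the sketch below. The `sshMode` and `loginHost` keys and the `proxy-through-login` value come from the review findings; the function name and the flat-dict config shape are assumptions for illustration only.

```python
from typing import List, Mapping


def check_proxy_config(config: Mapping[str, object]) -> List[str]:
    """Return config problems that doctor should report before any sbatch call."""
    problems: List[str] = []
    login_host = str(config.get("loginHost") or "").strip()
    if config.get("sshMode") == "proxy-through-login" and not login_host:
        problems.append("sshMode=proxy-through-login requires loginHost")
    return problems
```

Returning a list rather than raising lets doctor report every problem in one pass, so a user fixes the config once instead of discovering each gap after a queued job.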

Proof guidance:

  • [P1] Needs real behavior proof before merge: The PR body and tests show syntax checks plus fake-Slurm coverage only; a contributor still needs redacted live Slurm or representative external-provider acquire/list/release output before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] Real behavior proof is still mock-only; no redacted live Slurm or representative external-provider acquire/list/release output shows the adapter working after the change.
  • [P1] The adapter can lose local state after inconclusive scheduler or cancellation results, which could leave a Slurm allocation and SSH endpoint running until wall time without a reliable Crabbox stop path.
  • [P1] The linked planning issue remains open with a product-decision label, so maintainers still need to approve the public Slurm external-adapter contract before merge.

Maintainer options:

  1. Fix adapter safety before merge (recommended)
    Preserve local state whenever Slurm status or cancellation is inconclusive, validate proxy/config requirements during doctor, and add focused fake-Slurm coverage for those paths.
  2. Accept example-only cleanup risk
    Maintainers may choose to treat the adapter as forkable sample code, but should explicitly accept that ambiguous Slurm control-plane failures can orphan allocations until wall time.
  3. Pause until live proof exists
    If no live or representative Slurm proof can be supplied, pause this PR and keep the planning issue open until a campus pilot can validate the contract.

Next step before merge

  • [P1] Contributor proof and maintainer contract approval are needed before merge, and the current proof gate means ClawSweeper should not queue an automated repair marker for this PR.

Security
Needs attention: The executable example introduces SSH and Slurm cleanup behavior, and the concrete concern is losing the stop path after inconclusive scheduler or cancellation results.

Review findings

  • [P2] Preserve state when Slurm status is unknown — examples/slurm-external-provider/slurm-cbx.py:431-432
  • [P2] Do not discard state after an ambiguous scancel failure — examples/slurm-external-provider/slurm-cbx.py:479-482
  • [P2] Validate proxy mode in doctor — examples/slurm-external-provider/slurm-cbx.py:122-129

Review details

Best possible solution:

Land a maintainer-approved external-provider-first Slurm contract only after the example adapter preserves state on cancellation/status uncertainty, validates config in doctor before job submission, and has redacted real-run proof.

Do we have a high-confidence way to reproduce the issue?

Not applicable; this is a feature/docs/example PR, not a bug report. Source review confirms current main lacks the requested Slurm surface and the PR adds it.

Is this the best way to solve the issue?

No as submitted; the external-provider-first direction matches the repository boundary, but the adapter needs cleanup-safety fixes, earlier config validation, maintainer contract approval, and real-run proof before it is the best merge path.

Full review comments:

  • [P2] Preserve state when Slurm status is unknown — examples/slurm-external-provider/slurm-cbx.py:431-432
    query_job_state returns an empty string both when a job is truly absent and when squeue/sacct fail or are unavailable, but refresh_state turns that into missing; cleanup can then delete the only local state for a live allocation. Distinguish unknown query failures from proven terminal/absent jobs before marking state cleanup-safe.
    Confidence: 0.87
  • [P2] Do not discard state after an ambiguous scancel failure — examples/slurm-external-provider/slurm-cbx.py:479-482
    When scancel exits nonzero and the follow-up state lookup returns no state, cancel_job falls through as success; release then removes the job directory. Keep the state and return an error unless cancellation, terminal state, or proven absence is established.
    Confidence: 0.88
  • [P2] Validate proxy mode in doctor — examples/slurm-external-provider/slurm-cbx.py:122-129
    The docs tell users to run crabbox doctor before warmup, but doctor only checks Slurm commands and the runner path. A config with sshMode=proxy-through-login and no loginHost passes doctor and fails only after acquire submits a job, so validate the proxy/config requirements before scheduling work.
    Confidence: 0.82
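One way to separate "unknown" from "absent" in the first finding is a three-valued lookup result, sketched below. The `Lookup` enum, `cleanup_safe`, and the injectable `runner` parameter are hypothetical names, and the terminal-state set is a conservative subset of Slurm job states. Treating an empty, successful `squeue` reply as absence is itself a simplification; a real adapter would likely confirm with `sacct` where it is available.

```python
import subprocess
from enum import Enum
from typing import Callable, Sequence, Tuple

# Conservative subset of Slurm states after which no allocation remains.
TERMINAL_STATES = frozenset({
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE",
    "FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
})


class Lookup(Enum):
    FOUND = "found"      # the scheduler reported a state
    ABSENT = "absent"    # the query succeeded and listed no such job
    UNKNOWN = "unknown"  # the query itself failed, so nothing is proven


def run(argv: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.run(list(argv), capture_output=True, text=True, check=False)


def query_job_state(job_id: str, runner: Callable = run) -> Tuple[Lookup, str]:
    """Ask squeue for a job's state without conflating failure and absence."""
    try:
        proc = runner(["squeue", "-h", "-j", job_id, "-o", "%T"])
    except OSError:  # squeue missing or not executable
        return Lookup.UNKNOWN, ""
    if proc.returncode != 0:
        return Lookup.UNKNOWN, ""
    words = proc.stdout.split()
    if words:
        return Lookup.FOUND, words[0]
    return Lookup.ABSENT, ""


def cleanup_safe(lookup: Lookup, state: str) -> bool:
    """Local state may be deleted only when the job is provably gone."""
    if lookup is Lookup.UNKNOWN:
        return False
    return lookup is Lookup.ABSENT or state in TERMINAL_STATES
```

With this shape, a failed scheduler query can never be read as permission to delete the only record of a live allocation.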

Overall correctness: patch is incorrect
Overall confidence: 0.86

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against ccc27374948c.

Label changes

Label justifications:

  • P3: This is a low-urgency docs/example-provider feature for a speculative Slurm integration path, not a current runtime regression.
  • merge-risk: 🚨 availability: The added adapter can erase local state after inconclusive Slurm status or cancellation results, leaving scheduled work without a reliable Crabbox stop path.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR body and tests show syntax checks plus fake-Slurm coverage only; a contributor still needs to add redacted live Slurm or representative external-provider acquire/list/release output before merge. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Evidence reviewed

Security concerns:

  • [medium] Do not drop cleanup state after inconclusive Slurm results — examples/slurm-external-provider/slurm-cbx.py:479
    A failed scancel or failed scheduler query can be treated as safe absence, after which the adapter removes local state; that can leave an SSH-enabled Slurm allocation running without a reliable retry path.
    Confidence: 0.85
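The release path this concern asks for could be sketched as below: local state is removed only after cancellation succeeds or the job is proven terminal or absent. All names here (`CancelUncertain`, `cancel_job`, `release`, the `lookup_state` callback and its `None`/empty-string convention) are assumptions for illustration, not the adapter's current interface.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Optional

# Conservative subset of Slurm states after which no allocation remains.
TERMINAL_STATES = frozenset({
    "BOOT_FAIL", "CANCELLED", "COMPLETED", "DEADLINE",
    "FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "TIMEOUT",
})


class CancelUncertain(RuntimeError):
    """scancel failed and the job could not be proven gone."""


def cancel_job(
    job_id: str,
    run_cmd: Callable[[list], subprocess.CompletedProcess],
    lookup_state: Callable[[str], Optional[str]],
) -> None:
    """Cancel a job; raise unless cancellation or absence is established."""
    proc = run_cmd(["scancel", job_id])
    if proc.returncode == 0:
        return
    # Assumed convention: None = lookup failed, "" = proven absent.
    state = lookup_state(job_id)
    if state is not None and (state == "" or state in TERMINAL_STATES):
        return
    raise CancelUncertain(
        f"scancel {job_id} exited {proc.returncode}; job state is {state or 'unknown'}"
    )


def release(job_dir: Path, job_id: str, run_cmd, lookup_state) -> None:
    """Remove the job directory only after cancel_job returns cleanly."""
    cancel_job(job_id, run_cmd, lookup_state)  # raises before anything is deleted
    shutil.rmtree(job_dir)
```

Because `release` deletes nothing when `cancel_job` raises, a later retry still has the job id and directory it needs to try `scancel` again.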

What I checked:

  • Repository policy read and applied: AGENTS.md was read fully; its provider-neutral boundary and security/config guidance applies because this PR adds provider-adjacent Slurm example code outside core. (AGENTS.md:13, ccc27374948c)
  • No maintainer notes found: There is no .agents/maintainer-notes directory in this checkout, so no matching maintainer note affected the review. (ccc27374948c)
  • Current main lacks Slurm surface: A current-main search for Slurm, sbatch, squeue, scancel, academic sandboxes, and slurm-external-provider returned no hits, so the central change is not already implemented on main. (ccc27374948c)
  • External-provider boundary supports the direction: Current external-provider docs say external tools own provisioning, inventory, resume, release, and private authentication while Crabbox owns sync, commands, results, and SSH sessions. (docs/providers/external.md:3, ccc27374948c)
  • PR scope includes executable behavior: The PR changes 9 files with +1698/-1 and adds a Python Slurm adapter, shell runner, and pytest suite, so this is not docs-only and needs runtime proof/security review. (e8b3a8d74b5e)
  • Inconclusive scancel can still discard state: At current head, a nonzero scancel only raises when a follow-up state query returns a non-terminal state; if the query returns no state, release proceeds to remove the job directory. (examples/slurm-external-provider/slurm-cbx.py:479, e8b3a8d74b5e)

Likely related people:

  • steipete: Blame and log show this person introduced the external-provider docs/backend and recently expanded the controller-capable external-provider contract this PR builds on. (role: external-provider architecture contributor; confidence: high; commits: 9e208c80cd1a, 81ef2a83be56, adc6d8da9cb9; files: docs/providers/external.md, docs/features/bring-your-own-infrastructure.md, internal/providers/external/backend.go)
  • Vincent Koc: History search shows recent fixes in external-provider rollback, secret, and requested-slug cleanup semantics adjacent to the adapter release and state-safety concerns. (role: external-provider cleanup contributor; confidence: medium; commits: 17a29bbe27a9, 801f2ea6f494, 859ed2736542; files: internal/providers/external/backend.go, docs/providers/external.md)
  • zozo123: Beyond authoring this PR, prior merged history shows provider and delegated-run contributions in nearby provider surfaces, and the linked issue/PR are the active Slurm proposal. (role: adjacent provider contributor and proposal owner; confidence: medium; commits: 8dfa7c348551, 8b246f6f96b7, 3b92643ab361; files: internal/providers, docs, internal/cli)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added labels on Jun 14, 2026:

  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • P3 (Low-risk cleanup, docs, polish, ergonomics, or speculative feature.)

@zozo123 zozo123 changed the title from "[codex] docs: plan Slurm academic sandboxes" to "[codex] docs: add Slurm academic sandbox plan and adapter" on Jun 14, 2026

zozo123 commented Jun 14, 2026

Contributor Author

@clawsweeper re-review

The PR now includes the reference Slurm external-provider adapter and sample runner in addition to the docs/product contract.

clawsweeper Bot commented Jun 14, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added labels on Jun 14, 2026:

  • rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
  • merge-risk: 🚨 availability (🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages.)

and removed labels:

  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)

zozo123 and others added 2 commits June 14, 2026 11:00
…tests

Take the Slurm reference adapter from draft to mergeable:

- Stop emitting the reserved routing label "slug" from lease responses; the
  broker rejects leases that set reserved labels (lease/slug/name/
  externalResourceName/externalResourceNameFromEnv), most visibly when
  idempotentLeaseId is enabled for acquire and list. Namespace adapter labels
  (state, slurmJobId, slurmState) instead.
- Parse sbatch --parsable output safely: tolerate empty stdout and the
  "<jobid>;<cluster>" form without IndexError.
- Parse sacct State robustly (strip "CANCELLED by <uid>" qualifiers, skip blank
  step rows) without indexing an empty split.
- Make list/cleanup/find_state resilient to corrupt or partially written
  state.json files via a shared iter_states() helper that skips unreadable
  files instead of aborting the whole operation.
- Add argparse help text and reject negative --poll-interval.

Add test_slurm_cbx.py (pytest) with sbatch/squeue/sacct/scancel mocked,
covering doctor, the acquire -> resolve -> release happy path, idempotent
re-acquire, lease-id parsing, endpoint-timeout cancellation and keep behavior,
list filtering, cleanup, expected-identity mismatch, proxy-through-login,
secret redaction, corrupt-state resilience, and the stdin main() doctor path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
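
The two parsing fixes listed in the commit message above can be illustrated with a short sketch. The function names are hypothetical, and treating the job id as all digits is an assumption made for this example rather than something the commit states.

```python
from typing import Optional


def parse_sbatch_job_id(stdout: str) -> Optional[str]:
    """Extract the job id from `sbatch --parsable` output.

    Accepts both "<jobid>" and "<jobid>;<cluster>" and returns None for
    empty or unexpected output instead of raising IndexError.
    """
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        return None
    job_id = lines[0].split(";", 1)[0].strip()
    return job_id if job_id.isdigit() else None


def parse_sacct_state(stdout: str) -> str:
    """Return the first non-blank State value, without qualifiers.

    "CANCELLED by 1000" becomes "CANCELLED"; blank step rows are skipped;
    an empty string means sacct reported nothing usable.
    """
    for line in stdout.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return ""
```

Both helpers return a sentinel (`None` or an empty string) on unusable input, which leaves the decision about how to treat that case to the caller.
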
- Add `python3 -m pytest examples/slurm-external-provider/ -q` to the local
  checks in the feature doc and example README, and describe what the suite
  covers.
- Ignore __pycache__, *.pyc, and .pytest_cache so test runs do not dirty the
  tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
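
The shared iter_states() helper named in the first commit message could work roughly as sketched here. The `<root>/<lease>/state.json` layout and the return shape are assumptions for illustration; only the helper name and the skip-unreadable-files behavior come from the commit text.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_states(root: Path) -> Iterator[Tuple[Path, dict]]:
    """Yield (job_dir, state) pairs, skipping unreadable or corrupt files.

    Assumes one state.json per job directory directly under root.
    """
    if not root.is_dir():
        return
    for path in sorted(root.glob("*/state.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):  # unreadable, truncated, or invalid JSON
            continue
        if isinstance(data, dict):
            yield path.parent, data
```

Skipping a bad file keeps list and cleanup working for the remaining leases; the skipped directory is left on disk, so nothing is deleted on the strength of a file that could not be read.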
@zozo123 zozo123 marked this pull request as ready for review June 14, 2026 08:25

Labels

  • merge-risk: 🚨 availability (🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages.)
  • P3 (Low-risk cleanup, docs, polish, ergonomics, or speculative feature.)
  • rating: 🧂 unranked krab (Not merge-ready due to missing proof or serious correctness/safety concerns.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Plan on-prem / Slurm sandboxes for academia

1 participant