sagequeue runs long SageMath experiments inside a rootless Podman container on Ubuntu (WSL2) using podman-compose, and executes partitions of the workload via a durable on-disk queue with workers supervised by systemd --user.
Primary target workload: stride/offset partitioned runs of rank_boundary_sat_v18.sage (e.g. STRIDE=8, offsets 0..7) for mod-2 boundary rank distribution experiments (e.g. Shrikhande graph, then rook graph).
Top-level:
- `Containerfile` — builds the local Sage image used by `podman-compose.yml` (includes `pycryptosat` for CryptoMiniSat).
- `podman-compose.yml` — runs the `sagemath` container and exposes Jupyter on port 8888.
- `Makefile` — jobset selection, queue operations, and `systemd --user` service control.
- `config/*.mk` — jobset configs (Shrikhande and rook live here).
- `systemd/` — user unit files installed into `~/.config/systemd/user/` (not `/etc/systemd/system`).
- `var/` — durable queue + logs (gitignored).
- `man-up.sh`, `man-down.sh`, `run-bash.sh` — container convenience helpers.
- `requirements.txt` — repo-local `.venv` with `podman-compose` (so `systemd --user` can find it). The venv is created by `bin/setup.sh` via `bin/venvfix.sh`.
- `sagequeue-progress.py` — progress monitor: reads `~/.config/sagequeue/sagequeue.env` plus `${HOME}/Jupyter/state_*` files written by `rank_boundary_sat_v18.sage` (via `--resume`) and reports per-offset completion against the total (e.g. C(15,12)=455).
bin/:
- `bin/build-image.sh` — builds the local image defined by `Containerfile`, seeds `pycryptosat` into the host DOT_SAGE directory `${HOME}/.sagequeue-dot_sage` (required because the DOT_SAGE bind mount masks image-built packages), and recreates the `sagemath` container.
- `bin/setup.sh` — one-time bootstrap (safe to re-run): creates bind mounts, fixes permissions/ACLs, and ensures the repo-local `.venv` exists by running `bin/venvfix.sh` (the only venv builder).
- `bin/venvfix.sh` — deterministic venv builder (fixed `python3`, fixed `./.venv`, requires `requirements.txt`); invoked only by `bin/setup.sh`.
- `bin/sagequeue-diag.sh` — one-shot diagnostic snapshot (queue + systemd + container + solver procs).
- `bin/sagequeue-ensure-container.sh` — oneshot "ensure container is up" (used by the systemd unit `sagequeue-container.service`).
- `bin/sagequeue-worker.sh` — worker loop (used by `sagequeue@.service`).
- `bin/sagequeue-recover.sh` — requeues orphaned `running/` jobs (used by the systemd unit `sagequeue-recover.service`).
- `bin/fix-bind-mounts.sh`, `bin/show-mapped-ids.sh` — rootless bind-mount permission helpers.
Notebook location: rank_boundary_sat_v18.sage is expected to exist on the host at:
- Linux/WSL path: `${HOME}/Jupyter/rank_boundary_sat_v18.sage`
- Windows path (same directory): `~\Jupyter\rank_boundary_sat_v18.sage`

That directory is bind-mounted as:

- container path: `/home/sage/notebooks/rank_boundary_sat_v18.sage`
The experiment uses --resume, so state files (e.g. state_shrikhande_r3_stride8_off0.txt) live alongside the notebook in ${HOME}/Jupyter/ and survive reboots and container recreation.
The default configs use `--solver sat --sat_backend cryptominisat`.
Sage’s cryptominisat backend requires pycryptosat inside Sage’s Python environment. Installing it manually in a running container works, but it is lost whenever the container is removed and recreated (e.g. podman-compose down).
This repository bakes pycryptosat into the image via Containerfile. The local image tag is:
`localhost/sagequeue-sagemath:10.7-pycryptosat` (for `SAGE_TAG=10.7`)
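For orientation, a Containerfile along these lines would produce such an image. This is a sketch, not the repo's actual `Containerfile`: the upstream base-image name and the exact `sage -pip` invocation are assumptions.

```dockerfile
# Sketch only: assumes the upstream sagemath/sagemath base image and the
# /sage/sage launcher used elsewhere in this README; the real Containerfile
# may differ.
ARG SAGE_TAG=10.7
FROM docker.io/sagemath/sagemath:${SAGE_TAG}

# Bake pycryptosat into Sage's Python so the cryptominisat SAT backend is
# available even after the container is removed and recreated.
RUN /sage/sage -pip install --no-cache-dir pycryptosat
```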
Important bind-mount constraint: podman-compose.yml bind-mounts ${HOME}/.sagequeue-dot_sage onto /home/sage/.sage. That bind mount masks any /home/sage/.sage content created in the image, so pycryptosat must be present in the host-mounted ${HOME}/.sagequeue-dot_sage. bin/build-image.sh automates this seeding and verifies import pycryptosat in the running container.
Edit `/etc/wsl.conf` (merge with existing sections; do not delete them):

```ini
[boot]
systemd=true
```

From Windows PowerShell:

```powershell
wsl.exe --shutdown
```

Re-open Ubuntu and confirm:

```bash
ps -p 1 -o comm=
podman ps
systemctl --user status >/dev/null
```

Then make the helper scripts executable and run the one-time bootstrap:

```bash
chmod +x bin/*.sh
bin/setup.sh
```

This script creates the bind-mount directories (using `.sagequeue-*` names), fixes permissions/ACLs for rootless Podman bind mounts, and ensures `podman-compose` exists via the repo-local `.venv`. `bin/venvfix.sh` is invoked only from `bin/setup.sh` (do not run it directly).
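The repo's own scripts (`bin/fix-bind-mounts.sh`, `bin/show-mapped-ids.sh`) implement the real permission fix; the snippet below is only a rough illustration of the usual rootless-Podman pattern, assuming the container's `sage` user is UID 1000, and is not a copy of those scripts.

```bash
# Illustration only; the repo's scripts may use ACLs or a different mapping.
# Show how container UIDs map onto host sub-UIDs for this user:
podman unshare cat /proc/self/uid_map

# Chown a bind-mount directory *inside* the user namespace so that the
# container's sage user (assumed UID 1000) can write to it:
podman unshare chown -R 1000:1000 "$HOME/.sagequeue-dot_sage"
```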
From the repo root:
```bash
chmod +x bin/build-image.sh
bin/build-image.sh
```

What this does (by design):

- builds `localhost/sagequeue-sagemath:${SAGE_TAG}-pycryptosat` from `Containerfile` (default `SAGE_TAG=10.7`)
- seeds `pycryptosat` into the host DOT_SAGE bind-mount directory `${HOME}/.sagequeue-dot_sage` (required because the bind mount masks image-built packages)
- removes any existing `sagemath` container
- runs `podman-compose up -d sagemath` to start a fresh container from the new image
Important constraint: SAGE_TAG is the base Sage version (e.g. 10.7). Do not include -pycryptosat in SAGE_TAG.
Shrikhande rank-3 jobset:
```bash
chmod +x bin/*.sh
make CONFIG=config/shrikhande_r3.mk enable
```

This:

- writes `~/.config/sagequeue/sagequeue.env` for the selected jobset
- installs user unit files into `~/.config/systemd/user/` (not `/etc/systemd/system`)
- enables and starts: `sagequeue-container.service`, `sagequeue-recover.timer`, `sagequeue@1.service` … `sagequeue@WORKERS.service`
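To double-check that the units were actually enabled and started, generic `systemctl --user` queries are enough (these are standard systemd commands, not repo-specific targets):

```bash
# List the jobset's user units and the recovery timer
systemctl --user list-units 'sagequeue*' --no-pager
systemctl --user list-timers 'sagequeue-recover*' --no-pager
```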
Enqueue the stride jobs and monitor progress:

```bash
make CONFIG=config/shrikhande_r3.mk enqueue-stride
make CONFIG=config/shrikhande_r3.mk progress
make CONFIG=config/shrikhande_r3.mk logs
python3 sagequeue-progress.py
```

`make progress` reports queue directory counts; `sagequeue-progress.py` reports case progress (e.g., 81/455) from the Sage state files.

A full snapshot:

```bash
make CONFIG=config/shrikhande_r3.mk diag
```

Logs land in:

- `var/shri_r3/log/shri_r3_off0.log`
- …
- `var/shri_r3/log/shri_r3_off7.log`
This repository includes a deterministic stride/offset smoke test workload script:
- Repo copy: `Jupyter/template.sage` (i.e., `$PROJECT_ROOT/Jupyter/template.sage`)
- Runtime copy (host bind mount): `${HOME}/Jupyter/template.sage`
- Container path: `/home/sage/notebooks/template.sage`
Because podman-compose.yml bind-mounts ${HOME}/Jupyter to /home/sage/notebooks, the container (and therefore the workers) can only run template.sage if it exists in ${HOME}/Jupyter.
From the repo root, copy the template into the notebook mount:
```bash
cp -f ./Jupyter/template.sage "$HOME/Jupyter/template.sage"
```

Then run the template jobset:

```bash
make CONFIG=config/template.mk enable
make CONFIG=config/template.mk enqueue-stride
```

Monitor with the standard commands:

```bash
make CONFIG=config/template.mk progress
make CONFIG=config/template.mk logs
make CONFIG=config/template.mk diag
```

A jobset config sets (at minimum):
- `JOBSET` — directory prefix under `var/`
- `STRIDE` — number of offsets
- `WORKERS` — number of `systemd` worker instances
- `SAGE_BASE_ARGS` — experiment flags excluding `--stride` and `--offset`
- stop file paths for host and container
The Makefile writes those into `~/.config/sagequeue/sagequeue.env`, and the systemd units source that file at runtime.
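For orientation, a jobset config might look roughly like the sketch below. Only `JOBSET`, `STRIDE`, `WORKERS`, and `SAGE_BASE_ARGS` are variable names documented here; the stop-file variable names are illustrative assumptions, so check `config/shrikhande_r3.mk` for the real layout.

```make
# Hypothetical sketch of a config/*.mk jobset file, not the repo's actual
# contents. Stop-file variable names below are assumptions.
JOBSET         = shri_r3
STRIDE         = 8
WORKERS        = 4

# Experiment flags, excluding --stride and --offset (the worker injects those)
SAGE_BASE_ARGS = --solver sat --sat_backend cryptominisat --resume

# Stop file as seen on the host and inside the container (names assumed)
STOP_FILE_HOST      = $(HOME)/Jupyter/stop_shri_r3
STOP_FILE_CONTAINER = /home/sage/notebooks/stop_shri_r3
```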
A jobset is a named experiment run (graph + rank + solver configuration + stride/worker count) with its own isolated queue and logs.
In practice, JOBSET is the short identifier used to namespace runtime state under var/ so multiple experiments do not collide.
- `JOBSET` is set by the selected `config/*.mk` file (e.g. `JOBSET=shri_r3`, `JOBSET=rook_r3`).
- Switching jobsets changes the `var/<JOBSET>/...` directories and the log prefix, but the worker logic is unchanged.
Example: with JOBSET=shri_r3, queue state lives under var/shri_r3/queue/... and logs under var/shri_r3/log/.
On-disk layout per jobset:
- `var/<JOBSET>/queue/pending/`
- `var/<JOBSET>/queue/running/`
- `var/<JOBSET>/queue/done/`
- `var/<JOBSET>/queue/failed/`
- `var/<JOBSET>/log/`
- `var/<JOBSET>/run/`
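Because queue state is just files in these directories, ad-hoc per-state counts need nothing beyond standard shell tools (shown for the `shri_r3` jobset):

```bash
# Count job files in each queue state for one jobset (here: shri_r3)
for state in pending running done failed; do
  printf '%-8s %s\n' "$state" \
    "$(find "var/shri_r3/queue/$state" -name '*.env' 2>/dev/null | wc -l)"
done
```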
Each job is a tiny env file (currently OFFSET=<k>). Workers:
- claim a job by an atomic move `pending → running`
- execute Sage inside the container with:
  - the configured `SAGE_BASE_ARGS`
  - plus injected `--stride STRIDE --offset OFFSET`
- move the job file to `done/` or `failed/`
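A minimal sketch of that claim/execute/move cycle is shown below. It is not the repo's `bin/sagequeue-worker.sh` (which also writes the owner sidecar, honors the stop file, and does more error handling); it only illustrates the mechanics, assuming `JOBSET`, `STRIDE`, and `SAGE_BASE_ARGS` are already set (e.g. sourced from `~/.config/sagequeue/sagequeue.env`).

```bash
#!/usr/bin/env bash
# Sketch of the worker claim/execute/move cycle, not the real worker script.
set -u
Q="var/${JOBSET}/queue"

for job in "$Q"/pending/*.env; do
  [ -e "$job" ] || break
  name=$(basename "$job")

  # Atomic claim: only one worker can win this rename.
  mv "$job" "$Q/running/$name" 2>/dev/null || continue

  # The job file provides OFFSET=<k>.
  . "$Q/running/$name"

  # Run the experiment inside the container with the injected partition flags.
  if podman exec sagemath bash -c \
      "cd /sage && ./sage /home/sage/notebooks/rank_boundary_sat_v18.sage ${SAGE_BASE_ARGS} --stride ${STRIDE} --offset ${OFFSET}"
  then
    mv "$Q/running/$name" "$Q/done/$name"
  else
    mv "$Q/running/$name" "$Q/failed/$name"
  fi
done
```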
The job’s state is defined by which directory contains that file:
- `pending/` — eligible to be claimed by a worker
- `running/` — claimed by a worker (ownership metadata stored in `*.owner`)
- `done/` — completed successfully (solver exit code 0)
- `failed/` — completed unsuccessfully (solver exit code nonzero) or malformed job file
State transitions are implemented as filesystem moves (mv) within the same jobset directory tree, so claiming work is an atomic pending → running move.
```mermaid
stateDiagram-v2
    direction TB
    state "pending/" as P
    state "running/" as R
    state "done/" as D
    state "failed/" as F
    note left of P: var/{JOBSET}/queue/pending
    note left of R: var/{JOBSET}/queue/running
    note right of D: var/{JOBSET}/queue/done
    note right of F: var/{JOBSET}/queue/failed
    [*] --> P: enqueue (create *.env)
    P --> R: claim
    R --> D: rc==0
    R --> F: rc!=0
    F --> P: retry (recover, bounded) / retry-failed
    R --> P: recover (orphan) / pause (stop_file)
```
Transition meanings (on disk):
- enqueue (create `*.env`): `make … enqueue-stride` writes one job file per offset into `var/{JOBSET}/queue/pending/`, e.g. `shri_r3_off7.env` containing at least `OFFSET=7` (and `ENQUEUED_AT=...`). See the sketch after this list.
- claim (`pending → running`): a worker claims work by an atomic rename, `mv var/{JOBSET}/queue/pending/<job>.env var/{JOBSET}/queue/running/<job>.env`. It then writes an owner sidecar file `var/{JOBSET}/queue/running/<job>.env.owner` containing `OWNER_PID=$$` (host PID of the worker process), `OWNER_WORKER_ID=<N>`, and `OWNER_TS=<timestamp>`.
- rc==0 (`running → done`): after `podman exec … ./sage …` exits with status 0, the worker moves the job file: `mv …/running/<job>.env …/done/<job>.env`.
- rc!=0 (`running → failed`): if Sage exits nonzero (including configuration errors detected by the worker after sourcing the job file), the worker moves the job file: `mv …/running/<job>.env …/failed/<job>.env`.
- retry-failed (`failed → pending`): `make … retry-failed` moves every `*.env` file from `failed/` back to `pending/` so workers will rerun them.
- retry (recover, capped) (`failed → pending`): `sagequeue-recover.timer` runs `bin/sagequeue-recover.sh`, which also scans `failed/*.env` and retries them automatically by moving them back to `pending/`. Each retry updates the job file in place by adding/updating `ATTEMPTS=<n>` and `LAST_RETRY_TS=<timestamp>`. Once `ATTEMPTS` reaches the script's `MAX_FAILED_RETRIES`, recovery leaves the job in `failed/` and logs `action=hold_failed`.
- pause (stop_file) (`running → pending`): if the host stop file exists and Sage exits cleanly due to `--stop_file`, the worker requeues the job to `pending/` (it will resume after `clear-stop`). Workers also avoid starting a newly claimed job if the stop file appears between claim and `podman exec`.
- recover (orphan) (`running → pending`): recovery scans `running/*.env` and treats a job as orphaned if its `*.owner` file is missing, does not contain a valid `OWNER_PID=<integer>` line, the recorded `OWNER_PID` is not alive (`kill -0` fails), or that PID is alive but is no longer a `sagequeue-worker.sh` process. For each orphaned job, recovery removes any stale `*.owner` and moves the job file: `mv …/running/<job>.env …/pending/<job>.env`. Recovery is guarded by a global `flock` on `var/{JOBSET}/run/recover.lock`, so it is safe to run from multiple workers and from the systemd timer.
A job in running/ is considered owned when it has a sibling owner file:
- `running/<job>.env`
- `running/<job>.env.owner`
When a worker claims a job (pending → running), it writes <job>.env.owner containing:
- `OWNER_PID=$$` (the worker's PID on the host)
- `OWNER_WORKER_ID=<N>`
- `OWNER_TS=<timestamp>`
The recovery script (bin/sagequeue-recover.sh) scans running/*.env and treats a job as orphaned if any of the following is true:
- the `.owner` file is missing, or
- the `.owner` file does not contain a valid `OWNER_PID=<integer>` line (recovery does not `source` the owner file), or
- `kill -0 $OWNER_PID` fails (worker PID no longer exists), or
- the PID exists but `ps -p $OWNER_PID -o args=` does not contain `sagequeue-worker.sh` (PID has been reused for some other process)
For each orphaned job, recovery performs the state transition:
running/<job>.env → pending/<job>.env
and removes the stale owner file.
Concurrency control: bin/sagequeue-recover.sh takes a single global lock (var/<JOBSET>/run/recover.lock) via flock, so multiple workers and the systemd timer can all invoke recovery safely.
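A compressed sketch of that orphan check, under the rules listed above (the real `bin/sagequeue-recover.sh` additionally handles the capped `failed/` retries and the structured logging):

```bash
# Sketch of the orphan-recovery pass, not the repo's actual script.
# Serialized with flock on the jobset-wide recover.lock, as described above.
(
  flock -n 9 || exit 0   # another recovery run is already in progress
  for job in "var/${JOBSET}/queue/running"/*.env; do
    [ -e "$job" ] || break
    owner="${job}.owner"
    pid=$(grep -oE '^OWNER_PID=[0-9]+$' "$owner" 2>/dev/null | cut -d= -f2)

    # Orphaned if: no owner file / no valid OWNER_PID / PID dead /
    # PID alive but no longer a sagequeue-worker.sh process.
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null \
       || ! ps -p "$pid" -o args= | grep -q 'sagequeue-worker\.sh'; then
      rm -f "$owner"
      mv "$job" "var/${JOBSET}/queue/pending/$(basename "$job")"
    fi
  done
) 9>"var/${JOBSET}/run/recover.lock"
```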
Where recovery runs:
- once at the start of every worker process (worker does a one-shot recovery before entering the main claim loop)
- periodically via `sagequeue-recover.timer` → `sagequeue-recover.service`
What recovery does not do: it does not inspect whether a container-side solver process is still running. Ownership is defined strictly in terms of the host worker PID recorded in the .owner file.
Recovery logging is grep/awk-friendly by design. Example audit commands:
```bash
journalctl --user -u sagequeue-recover.service -n 200 -o cat | grep '^\[recover\]'
journalctl --user -u sagequeue-recover.service -o cat | grep 'action=retry_failed'
journalctl --user -u sagequeue-recover.service -o cat | grep 'action=hold_failed'
```

`SAGE_BASE_ARGS` must not include `--stride` or `--offset`:

- `STRIDE` comes from `config/*.mk`
- `OFFSET` comes from the queued job file
- the worker injects both
Workers exit with a configuration error if SAGE_BASE_ARGS contains either flag.
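One way such a guard can be written in shell is sketched below; the exact test and exit code used by `bin/sagequeue-worker.sh` are assumptions.

```bash
# Refuse to run if the configured args already carry the injected flags.
# Assumed formulation; the real worker may phrase this differently.
case " ${SAGE_BASE_ARGS} " in
  *" --stride "*|*" --offset "*)
    echo "configuration error: SAGE_BASE_ARGS must not contain --stride/--offset" >&2
    exit 78   # EX_CONFIG from sysexits.h
    ;;
esac
```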
Units shipped in systemd/:
- `sagequeue-container.service` — oneshot service; ensures the `sagemath` container exists and is running. This is a dependency of the workers.
- `sagequeue@.service` — template unit; instance `sagequeue@N.service` runs one worker loop (see "Queue model").
- `sagequeue-recover.service` — scans `running/` for jobs left behind by crashes/reboots and requeues them to `pending/`.
- `sagequeue-recover.timer` — triggers `sagequeue-recover.service` periodically.
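For orientation, a worker template unit could be wired roughly as sketched below. This is hypothetical, not the shipped `systemd/sagequeue@.service`; the repo path in `ExecStart` is a placeholder, and the installed unit shown by `systemctl --user cat` is authoritative.

```ini
# Hypothetical sketch of a sagequeue@.service template unit; the shipped
# unit may differ. %i is the worker instance number, %h the user's home.
[Unit]
Description=sagequeue worker %i
Wants=sagequeue-container.service
After=sagequeue-container.service

[Service]
Type=simple
EnvironmentFile=%h/.config/sagequeue/sagequeue.env
ExecStart=%h/path/to/repo/bin/sagequeue-worker.sh %i
Restart=on-failure

[Install]
WantedBy=default.target
```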
Inspect the live installed units:
```bash
systemctl --user cat sagequeue-container.service
systemctl --user cat sagequeue@.service
systemctl --user cat sagequeue-recover.timer
```

Restart or stop the jobset services:

```bash
make CONFIG=config/shrikhande_r3.mk restart
make CONFIG=config/shrikhande_r3.mk stop
```

Request a pause (creates the host stop file that is bind-mounted into the container):

```bash
make CONFIG=config/shrikhande_r3.mk request-stop
```

To resume:

```bash
make CONFIG=config/shrikhande_r3.mk clear-stop
```

Requeue `running/` jobs and retry failed ones manually:

```bash
make CONFIG=config/shrikhande_r3.mk requeue-running
make CONFIG=config/shrikhande_r3.mk retry-failed
```

Failed jobs are also retried automatically by `sagequeue-recover.timer` up to `MAX_FAILED_RETRIES` (tracked in each job file as `ATTEMPTS=` / `LAST_RETRY_TS=`). Once the maximum is reached, recovery leaves the job in `failed/` and logs `action=hold_failed`.
Purge and re-enqueue a jobset's queue:

```bash
make CONFIG=config/shrikhande_r3.mk purge-queue
make CONFIG=config/shrikhande_r3.mk enqueue-stride
```

Switching to the rook jobset:

```bash
make CONFIG=config/rook_r3.mk env restart
make CONFIG=config/rook_r3.mk enqueue-stride
```

Queue state remains separated:

- `var/shri_r3/...`
- `var/rook_r3/...`
bin/build-image.sh removes and recreates the sagemath container.
If workers are running during container removal, their in-flight podman exec sagemath ...
calls fail and the corresponding job files move to var/<JOBSET>/queue/failed/
(typical exit codes: rc=137 / rc=255, including “container has already been removed”).
Run this sequence (example jobset: Shrikhande r=3):
```bash
# Stop new job claims, then stop all jobset services
make CONFIG=config/shrikhande_r3.mk request-stop
make CONFIG=config/shrikhande_r3.mk stop

# Rebuild image and recreate container
bin/build-image.sh

# Start services again
make CONFIG=config/shrikhande_r3.mk restart

# Requeue jobs that failed due to the rebuild
make CONFIG=config/shrikhande_r3.mk retry-failed

# Allow new job claims again
make CONFIG=config/shrikhande_r3.mk clear-stop

# Verification: confirm the running container is using the rebuilt image
podman inspect sagemath --format 'ImageName={{.ImageName}} ContainerImageID={{.Image}}'
podman image inspect localhost/sagequeue-sagemath:${SAGE_TAG:-10.7}-pycryptosat --format 'BuiltImageID={{.Id}} Tags={{.RepoTags}}'

# Verification: confirm pycryptosat is importable in the *running* container
podman exec -it sagemath bash -c 'cd /sage && ./sage -python -c "import pycryptosat; print(pycryptosat.__version__)"'

# Verification: confirm workers are active and jobs are being claimed
systemctl --user --no-pager -l status "sagequeue@1.service"
make CONFIG=config/shrikhande_r3.mk progress
podman exec sagemath bash -c "pgrep -af '^python3 .*rank_boundary_sat_v18\.sage\.py' | wc -l"
```

Jupyter URL: `http://localhost:8888`
Token extraction (note 2>&1, because Jupyter token lines may appear on stderr in podman logs output):
```bash
podman logs --tail 2000 sagemath 2>&1 | grep -Eo 'token=[0-9a-f]+' | tail -n 1
```

URL with token:

```bash
TOKEN="$(podman logs --tail 2000 sagemath 2>&1 | grep -Eo 'token=[0-9a-f]+' | tail -n 1)"
echo "http://localhost:8888/tree?${TOKEN}"
```

MIT. See LICENSE.