Cosmosage Unshelver: GPU capacity shortage on Jetstream2, UI stuck at 0% #4

@zonca

Problem

The Cosmosage chat interface at https://cosmosage.phy240259.projects.jetstream-cloud.org/ is currently non-functional: the llama.cpp server web UI loads, but generation stays stuck at 0% progress and no output appears.

Root Cause: GPU capacity shortage on Jetstream2

The original cosmosage_70b_zonca instance (flavor g3.xl = 1× full A100 40GB, 32 vCPU, 120GB RAM) could not be unshelved because Jetstream2 has no available g3.xl GPU hosts in the IU region. Every unshelve attempt since May 18, 2026 failed at the OpenStack scheduler with:

No valid host was found. There are not enough hosts available.

The error is raised in schedule_instances, which means no host satisfied the scheduler's requirements for the flavor. This is a capacity problem, not a transient failure: the 90 A100 GPU nodes (360 A100 SXM4 40GB GPUs) are fully occupied.

What was done

  1. Old instance deleted: The stuck g3.xl instance could not be unshelved, resized, or migrated (all blocked by OpenStack policy for SHELVED_OFFLOADED instances), so it was deleted. Deleting it also destroyed the original boot volume, because the volume attachment had delete_on_termination=True. An attempt to recreate the instance on g3.xl failed with the same scheduling error.

  2. New instance created on g3.medium: This smaller GPU flavor (3× A100X-10C = 30GB total VRAM, 8 vCPU, 30GB RAM) does have available capacity on Jetstream2. The instance is now ACTIVE with floating IP 149.165.155.205.

  3. Model weights recovered: The llmstorage volume (100GB) from the old llm_unshelve instance was detached and attached to the new instance. It contains Meta-Llama-3.1-70B-Instruct-Q3_K_L.gguf (35GB, Q3_K_L quantization, 70.5B parameters).

  4. llama.cpp server running: llama.cpp was built with CUDA support and the server listens on port 8080. The /health endpoint returns {"status":"ok"} and the web chat UI loads at the root path.
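The health check in step 4 can be scripted, for example as a building block for the unshelver controller. The following is a minimal sketch using only the Python standard library; the function name and the injectable `fetch` parameter are illustrative assumptions, and only the /health path, the port, and the {"status":"ok"} body come from the report above.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def server_is_healthy(base_url, fetch=None, timeout=5.0):
    """Return True if GET {base_url}/health answers with {"status": "ok"}.

    `fetch` is a hypothetical hook (url -> response text) so the check
    can be exercised without a running server.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
    try:
        body = fetch(base_url.rstrip("/") + "/health")
        return json.loads(body).get("status") == "ok"
    except (URLError, OSError, ValueError, AttributeError):
        # Unreachable server, HTTP error status, or an unexpected body.
        return False


if __name__ == "__main__":
    # Port 8080 is taken from the report; localhost is an assumption.
    print(server_is_healthy("http://localhost:8080"))
```

A passing check only shows that the server process is up. It says nothing about generation speed, which is the problem described in the next section.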

Current issue: generation stuck at 0%

The llama-server is configured with -ngl 40 (40 of ~81 layers on GPU, the rest on CPU) because the 35GB model cannot fit entirely in 30GB of VRAM. With only 8 vCPU and 30GB RAM on g3.medium (vs 32 vCPU / 120GB RAM on the old g3.xl), CPU-offloaded inference is extremely slow. Token generation is happening, but at an impractically slow rate, so the UI shows 0% and appears frozen.

The Q3_K_L 70B model is too large for the g3.medium flavor to run effectively with partial GPU offloading.
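As a rough sanity check on the offload split, the sketch below estimates how many layers could fit in VRAM and how much of the model stays in system RAM for a given -ngl value. It assumes all ~81 layers are the same size and reserves 4GB of VRAM for the KV cache and compute buffers. Both are simplifying assumptions, and the function names are illustrative. It also treats the 30GB as a single pool, whereas the flavor exposes it as three 10GB devices, so the practical limit may be lower.

```python
import math


def max_gpu_layers(model_gb, n_layers, vram_gb, reserve_gb=4.0):
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is an assumed allowance for KV cache and compute buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


def cpu_resident_gb(model_gb, n_layers, gpu_layers):
    """Weights (in GB) left in system RAM when gpu_layers are offloaded."""
    return model_gb * (n_layers - gpu_layers) / n_layers


# Figures from the report: 35GB model, ~81 layers, 30GB VRAM, -ngl 40.
print(max_gpu_layers(35, 81, 30))
print(round(cpu_resident_gb(35, 81, 40), 1))
```

Under these assumptions, -ngl 40 leaves roughly 17.7GB of weights on the CPU side, and every generated token has to pass through those layers on 8 vCPUs.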

Possible solutions

  1. Use a smaller quantization: A Q2_K or IQ2_XS quantization of Llama-3.1-70B would be ~20-25GB and could fit entirely in the 30GB VRAM, removing the need for CPU offloading. Quality would degrade but inference would be fast.

  2. Use a smaller model: cosmosage-v3.1 (based on a smaller base model) could run entirely in VRAM and would be faster.

  3. Wait for g3.xl capacity: When A100 hosts free up on Jetstream2, recreate the instance with the full A100 flavor. There is no ETA for this.

  4. Use g3.large (16 vCPU, 60GB RAM, A100X-20C = 2×10GB slices = 20GB VRAM): More CPU/RAM for offloading but still insufficient VRAM for the full model.

  5. Obtain the original setup scripts: The boot volume with the full server configuration (inference server, systemd services, model download scripts) was lost. If the original setup is documented or scripted, we can rebuild properly with whichever flavor/model combination is chosen.
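To compare the quantization options in solution 1, it helps to convert file sizes into average bits per weight. The sketch below does that arithmetic with the figures quoted in this issue (70.5B parameters, 35GB current file, 30GB VRAM, 20-25GB candidate files). The 4GB VRAM reserve and the helper names are assumptions for illustration, and 1GB is taken as 10^9 bytes.

```python
def bits_per_weight(file_gb, params_billion):
    """Average bits stored per parameter for a model file of file_gb GB."""
    return file_gb * 8 / params_billion


def max_file_gb(vram_gb, reserve_gb=4.0):
    """Largest model file that still leaves reserve_gb of VRAM free.

    reserve_gb is an assumed allowance for KV cache and compute buffers.
    """
    return max(vram_gb - reserve_gb, 0.0)


PARAMS_B = 70.5  # parameter count in billions, as reported for the GGUF file

current = bits_per_weight(35, PARAMS_B)              # recovered Q3_K_L file
budget = bits_per_weight(max_file_gb(30), PARAMS_B)  # ceiling for a full-VRAM fit

print(f"current file: {current:.2f} bits/weight")
print(f"VRAM budget:  {budget:.2f} bits/weight")
for size_gb in (20, 25):
    fits = size_gb <= max_file_gb(30)
    print(f"{size_gb}GB file: {bits_per_weight(size_gb, PARAMS_B):.2f} bits/weight, fits={fits}")
```

With these numbers the current file averages about 3.97 bits per weight, while a file that fits fully in VRAM with the assumed reserve has to stay under about 2.95, which is consistent with moving to a 2-bit class quantization.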

Infrastructure details

| | Old (lost) | Current |
|---|---|---|
| Flavor | g3.xl | g3.medium |
| GPU | 1× A100 40GB | 3× A100X-10C (30GB total) |
| vCPU | 32 | 8 |
| RAM | 120GB | 30GB |
| Model | Q3_K_L 70B (35GB) | Same (recovered from llmstorage) |
| Inference | Unknown (vLLM?) | llama.cpp with partial GPU offload |
| Instance ID | 814d3dd3-... (deleted) | a9220691-3479-4358-a2c8-c7298b16ac30 |

Action items

  • Decide on model/quantization that fits g3.medium VRAM
  • Recreate the inference server setup (original config lost with boot volume)
  • Set up systemd service for llama-server so it survives reboots
  • Test shelve/unshelve cycle with g3.medium (should work since this flavor has capacity)
  • Update the unshelver controller config if needed
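For the systemd action item, one possible starting point is to generate the unit file from a small script so the chosen model and -ngl value live in one place. This is a sketch, not the original configuration (which was lost with the boot volume). The binary path, model mount point, and bind address below are hypothetical placeholders; the llama-server flags used (-m, --host, --port, -ngl) are standard ones.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=llama.cpp server for Cosmosage
After=network-online.target
Wants=network-online.target

[Service]
ExecStart={binary} -m {model} --host {host} --port {port} -ngl {gpu_layers}
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
"""


def render_unit(binary, model, port=8080, gpu_layers=40, host="127.0.0.1"):
    """Return the text of a systemd unit that starts llama-server at boot."""
    return UNIT_TEMPLATE.format(
        binary=binary, model=model, host=host, port=port, gpu_layers=gpu_layers
    )


if __name__ == "__main__":
    # Both paths are hypothetical; adjust to the actual build and mount point.
    print(render_unit(
        binary="/opt/llama.cpp/build/bin/llama-server",
        model="/mnt/llmstorage/Meta-Llama-3.1-70B-Instruct-Q3_K_L.gguf",
    ))
```

The output could be saved as /etc/systemd/system/llama-server.service and activated with `systemctl enable --now llama-server.service`. The gpu_layers value should be revisited once the model/quantization decision above is made.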

cc @tijmen
