Problem
The Cosmosage chat interface at https://cosmosage.phy240259.projects.jetstream-cloud.org/ is currently non-functional — the llama.cpp server web UI loads but generation stays stuck at 0% progress and never produces output.
Root Cause: GPU capacity shortage on Jetstream2
The original cosmosage_70b_zonca instance (flavor g3.xl = 1× full A100 40GB, 32 vCPU, 120GB RAM) could not be unshelved because Jetstream2 has no available g3.xl GPU hosts in the IU region. Every unshelve attempt since May 18, 2026 failed at the OpenStack scheduler with:
No valid host was found. There are not enough hosts available.
The scheduler error occurs at schedule_instances — this is not a transient failure. The 90 A100 GPU nodes (360 A100 SXM4 40GB GPUs) are fully occupied.
What was done
-
Old instance deleted — The stuck g3.xl instance could not be unshelved, resized, or migrated (all blocked by OpenStack policy for SHELVED_OFFLOADED instances). After attempting to recreate it on g3.xl (which also failed with the same scheduling error), the original boot volume was unfortunately lost (the volume attachment had delete_on_termination=True).
-
New instance created on g3.medium — This smaller GPU flavor (3× A100X-10C = 30GB total VRAM, 8 vCPU, 30GB RAM) does have available capacity on Jetstream2. The instance is now ACTIVE with floating IP 149.165.155.205.
-
Model weights recovered — The llmstorage volume (100GB) from the old llm_unshelve instance was detached and attached to the new instance. It contains Meta-Llama-3.1-70B-Instruct-Q3_K_L.gguf (35GB, Q3_K_L quantization, 70.5B parameters).
-
llama.cpp server running — Built llama.cpp with CUDA support, running on port 8080. The /health endpoint returns {"status":"ok"} and the web chat UI loads at the root path.
Current issue: generation stuck at 0%
The llama-server is configured with -ngl 40 (40 of ~81 layers on GPU, rest on CPU) because the 35GB model cannot fully fit in 30GB VRAM. With only 8 vCPU and 30GB RAM on g3.medium (vs 32 vCPU / 120GB RAM on the old g3.xl), CPU-offloaded inference is extremely slow — the UI shows 0% and appears frozen because token generation is happening but at an impractically slow rate.
The Q3_K_L 70B model is too large for the g3.medium flavor to run effectively with partial GPU offloading.
Possible solutions
-
Use a smaller quantization — A Q2_K or IQ2_XS quantization of Llama-3.1-70B would be ~20-25GB and could fit entirely in the 30GB VRAM, avoiding CPU offloading entirely. Quality would degrade but inference would be fast.
-
Use a smaller model — cosmosage-v3.1 (based on a smaller base model) could run entirely in VRAM and would be faster.
-
Wait for g3.xl capacity — When A100 hosts free up on Jetstream2, recreate the instance with the full A100 flavor. There is no ETA for this.
-
Use g3.large (16 vCPU, 60GB RAM, A100X-20C = 2×10GB slices = 20GB VRAM) — More CPU/RAM for offloading but still insufficient VRAM for the full model.
-
Obtain the original setup scripts — The boot volume with the full server configuration (inference server, systemd services, model download scripts) was lost. If the original setup is documented or scripted, we can rebuild properly with whichever flavor/model combination is chosen.
Infrastructure details
|
Old (lost) |
Current |
| Flavor |
g3.xl |
g3.medium |
| GPU |
1× A100 40GB |
3× A100X-10C (30GB total) |
| vCPU |
32 |
8 |
| RAM |
120GB |
30GB |
| Model |
Q3_K_L 70B (35GB) |
Same (recovered from llmstorage) |
| Inference |
Unknown (vLLM?) |
llama.cpp with partial GPU offload |
| Instance ID |
814d3dd3-... (deleted) |
a9220691-3479-4358-a2c8-c7298b16ac30 |
Action items
cc @tijmen
Problem
The Cosmosage chat interface at
https://cosmosage.phy240259.projects.jetstream-cloud.org/is currently non-functional — the llama.cpp server web UI loads but generation stays stuck at 0% progress and never produces output.Root Cause: GPU capacity shortage on Jetstream2
The original
cosmosage_70b_zoncainstance (flavorg3.xl= 1× full A100 40GB, 32 vCPU, 120GB RAM) could not be unshelved because Jetstream2 has no availableg3.xlGPU hosts in the IU region. Every unshelve attempt since May 18, 2026 failed at the OpenStack scheduler with:The scheduler error occurs at
schedule_instances— this is not a transient failure. The 90 A100 GPU nodes (360 A100 SXM4 40GB GPUs) are fully occupied.What was done
Old instance deleted — The stuck
g3.xlinstance could not be unshelved, resized, or migrated (all blocked by OpenStack policy forSHELVED_OFFLOADEDinstances). After attempting to recreate it ong3.xl(which also failed with the same scheduling error), the original boot volume was unfortunately lost (the volume attachment haddelete_on_termination=True).New instance created on
g3.medium— This smaller GPU flavor (3× A100X-10C = 30GB total VRAM, 8 vCPU, 30GB RAM) does have available capacity on Jetstream2. The instance is nowACTIVEwith floating IP149.165.155.205.Model weights recovered — The
llmstoragevolume (100GB) from the oldllm_unshelveinstance was detached and attached to the new instance. It containsMeta-Llama-3.1-70B-Instruct-Q3_K_L.gguf(35GB, Q3_K_L quantization, 70.5B parameters).llama.cpp server running — Built llama.cpp with CUDA support, running on port 8080. The
/healthendpoint returns{"status":"ok"}and the web chat UI loads at the root path.Current issue: generation stuck at 0%
The llama-server is configured with
-ngl 40(40 of ~81 layers on GPU, rest on CPU) because the 35GB model cannot fully fit in 30GB VRAM. With only 8 vCPU and 30GB RAM ong3.medium(vs 32 vCPU / 120GB RAM on the oldg3.xl), CPU-offloaded inference is extremely slow — the UI shows 0% and appears frozen because token generation is happening but at an impractically slow rate.The Q3_K_L 70B model is too large for the
g3.mediumflavor to run effectively with partial GPU offloading.Possible solutions
Use a smaller quantization — A Q2_K or IQ2_XS quantization of Llama-3.1-70B would be ~20-25GB and could fit entirely in the 30GB VRAM, avoiding CPU offloading entirely. Quality would degrade but inference would be fast.
Use a smaller model —
cosmosage-v3.1(based on a smaller base model) could run entirely in VRAM and would be faster.Wait for
g3.xlcapacity — When A100 hosts free up on Jetstream2, recreate the instance with the full A100 flavor. There is no ETA for this.Use
g3.large(16 vCPU, 60GB RAM, A100X-20C = 2×10GB slices = 20GB VRAM) — More CPU/RAM for offloading but still insufficient VRAM for the full model.Obtain the original setup scripts — The boot volume with the full server configuration (inference server, systemd services, model download scripts) was lost. If the original setup is documented or scripted, we can rebuild properly with whichever flavor/model combination is chosen.
Infrastructure details
g3.xlg3.mediumllmstorage)814d3dd3-...(deleted)a9220691-3479-4358-a2c8-c7298b16ac30Action items
g3.mediumVRAMg3.medium(should work since this flavor has capacity)cc @tijmen