Cosmosage Unshelver: GPU capacity shortage on Jetstream2, UI stuck at 0%

## Problem

The Cosmosage chat interface at `https://cosmosage.phy240259.projects.jetstream-cloud.org/` is currently non-functional — the llama.cpp server web UI loads but generation stays stuck at 0% progress and never produces output.

### Root Cause: GPU capacity shortage on Jetstream2

The original `cosmosage_70b_zonca` instance (flavor `g3.xl` = 1× full A100 40GB, 32 vCPU, 120GB RAM) could not be unshelved because **Jetstream2 has no available `g3.xl` GPU hosts in the IU region**. Every unshelve attempt since May 18, 2026 failed at the OpenStack scheduler with:

> `No valid host was found. There are not enough hosts available.`

The error is raised at `schedule_instances`, before any host is selected, so this is a capacity problem rather than a transient failure, and retrying will not help until hosts free up. The 90 A100 GPU nodes (360 A100 SXM4 40GB GPUs) are fully occupied.

### What was done

1. **Old instance deleted.** The stuck `g3.xl` instance could not be unshelved, resized, or migrated (all blocked by OpenStack policy for `SHELVED_OFFLOADED` instances). An attempt to recreate it on `g3.xl` failed with the same scheduling error. Because the volume attachment had `delete_on_termination=True`, deleting the instance also deleted the original boot volume, which is now lost.

2. **New instance created on `g3.medium`.** This smaller GPU flavor (3× A100X-10C = 30GB total VRAM, 8 vCPU, 30GB RAM) **does have available capacity** on Jetstream2. The instance is now `ACTIVE` with floating IP `149.165.155.205`.

3. **Model weights recovered.** The `llmstorage` volume (100GB) was detached from the old `llm_unshelve` instance and attached to the new instance. It contains `Meta-Llama-3.1-70B-Instruct-Q3_K_L.gguf` (35GB, Q3_K_L quantization, 70.5B parameters).

4. **llama.cpp server running.** llama.cpp was built with CUDA support and the server is running on port 8080. The `/health` endpoint returns `{"status":"ok"}` and the web chat UI loads at the root path.
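The health check from step 4 can be scripted so it is easy to repeat after a restart. This is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8080` from the instance itself; the function names are illustrative and not part of llama.cpp.

```python
import json
import urllib.error
import urllib.request


def parse_health(body: bytes) -> bool:
    """Return True when the body is a JSON object with status == "ok"."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def server_is_healthy(base_url: str = "http://localhost:8080",
                      timeout: float = 5.0) -> bool:
    """GET <base_url>/health and report whether the server says it is ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read())
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling `server_is_healthy()` on the instance should return `True` in the state described above. Note that a healthy status only indicates the server is up and the model is loaded; it says nothing about generation speed.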

### Current issue: generation stuck at 0%

The llama-server is configured with `-ngl 40` (40 of ~81 layers on GPU, the rest on CPU) because the 35GB model cannot fit entirely in 30GB of VRAM. With only 8 vCPU and 30GB RAM on `g3.medium` (vs 32 vCPU / 120GB RAM on the old `g3.xl`), CPU-offloaded inference is extremely slow: generation is running, but at such an impractically slow rate that the UI stays at 0% and appears frozen.

**The Q3_K_L 70B model is too large for the `g3.medium` flavor to run effectively with partial GPU offloading.**
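The memory split behind this conclusion can be estimated with simple arithmetic. The sketch below assumes every layer is the same size, which is only approximately true for a GGUF file (the embedding and output tensors differ from the transformer blocks), and it ignores the KV cache and compute buffers. The inputs (35GB, 81 layers, `-ngl 40`) are the figures quoted above; the function name is illustrative.

```python
def offload_split(model_gb: float, total_layers: int, gpu_layers: int) -> tuple[float, float]:
    """Return (GB held on GPU, GB held on CPU), assuming uniform layer size."""
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers must be between 0 and total_layers")
    per_layer_gb = model_gb / total_layers
    on_gpu = per_layer_gb * gpu_layers
    return on_gpu, model_gb - on_gpu


gpu_gb, cpu_gb = offload_split(model_gb=35.0, total_layers=81, gpu_layers=40)
print(f"GPU: {gpu_gb:.1f} GB, CPU: {cpu_gb:.1f} GB")  # GPU: 17.3 GB, CPU: 17.7 GB
```

Under these assumptions about half of the weights stay in system RAM and have to be processed by the 8 vCPUs for every generated token, which is consistent with the very slow generation observed.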

### Possible solutions

1. **Use a smaller quantization** — A Q2_K or IQ2_XS quantization of Llama-3.1-70B would be ~20-25GB and could fit entirely in the 30GB VRAM, avoiding CPU offloading entirely. Quality would degrade but inference would be fast.

2. **Use a smaller model** — `cosmosage-v3.1` (based on a smaller base model) could run entirely in VRAM and would be faster.

3. **Wait for `g3.xl` capacity** — When A100 hosts free up on Jetstream2, recreate the instance with the full A100 flavor. There is no ETA for this.

4. **Use `g3.large`** (16 vCPU, 60GB RAM, A100X-20C = 2×10GB slices = 20GB VRAM) — More CPU/RAM for offloading but still insufficient VRAM for the full model.

5. **Obtain the original setup scripts** — The boot volume with the full server configuration (inference server, systemd services, model download scripts) was lost. If the original setup is documented or scripted, we can rebuild properly with whichever flavor/model combination is chosen.
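When comparing options 1 and 2, the weights are not the only thing that has to fit in VRAM: the KV cache grows with context length. The sketch below estimates whether a candidate fits, using the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and an f16 cache. The `overhead_gib` value for compute buffers is a placeholder assumption, the candidate sizes are the rough figures quoted in this issue rather than measured file sizes, and GB and GiB are treated as interchangeable for this rough check.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_ctx: int, n_layers: int = 80, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Size of the K and V caches for n_ctx tokens, in GiB."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * n_ctx / GIB


def fits_in_vram(weights_gib: float, n_ctx: int, vram_gib: float,
                 overhead_gib: float = 1.5) -> bool:
    """True if weights + KV cache + assumed overhead fit in the given VRAM."""
    return weights_gib + kv_cache_gib(n_ctx) + overhead_gib <= vram_gib


# Rough sizes from this issue: ~20-25GB for Q2_K/IQ2_XS, 35GB for Q3_K_L.
candidates = {"smaller quant (low estimate)": 20.0,
              "smaller quant (high estimate)": 25.0,
              "Q3_K_L (current)": 35.0}

for name, size in candidates.items():
    verdict = "fits" if fits_in_vram(size, n_ctx=8192, vram_gib=30.0) else "does not fit"
    print(f"{name}: {verdict} in 30GB at 8192 context")
```

An 8192-token f16 cache for this architecture is 2.5 GiB, so a quantization near the top of the 20-25GB range leaves little headroom. A shorter context reduces the cache size proportionally.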

### Infrastructure details

| | Old (lost) | Current |
|---|---|---|
| Flavor | `g3.xl` | `g3.medium` |
| GPU | 1× A100 40GB | 3× A100X-10C (30GB total) |
| vCPU | 32 | 8 |
| RAM | 120GB | 30GB |
| Model | Q3_K_L 70B (35GB) | Same (recovered from `llmstorage`) |
| Inference | Unknown (vLLM?) | llama.cpp with partial GPU offload |
| Instance ID | `814d3dd3-...` (deleted) | `a9220691-3479-4358-a2c8-c7298b16ac30` |

### Action items

- [ ] Decide on model/quantization that fits `g3.medium` VRAM
- [ ] Recreate the inference server setup (original config lost with boot volume)
- [ ] Set up systemd service for llama-server so it survives reboots
- [ ] Test shelve/unshelve cycle with `g3.medium` (should work since this flavor has capacity)
- [ ] Update the unshelver controller config if needed
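For the systemd action item, the unit file could look roughly like what the helper below renders. This is a sketch only: the binary path, model path, user name, and bind address are hypothetical placeholders, and the original service layout is unknown because it was lost with the boot volume. The `-m`, `-ngl`, `--host`, and `--port` flags are standard `llama-server` options.

```python
def render_llama_unit(binary: str, model: str, gpu_layers: int,
                      host: str = "127.0.0.1", port: int = 8080,
                      user: str = "ubuntu") -> str:
    """Render a systemd unit that starts llama-server at boot and restarts it on failure."""
    exec_start = f"{binary} -m {model} -ngl {gpu_layers} --host {host} --port {port}"
    lines = [
        "[Unit]",
        "Description=llama.cpp server for Cosmosage",
        "After=network-online.target",
        "Wants=network-online.target",
        "",
        "[Service]",
        f"User={user}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "RestartSec=10",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical paths; adjust to wherever llama.cpp was built and the volume is mounted.
unit = render_llama_unit(
    binary="/opt/llama.cpp/build/bin/llama-server",
    model="/mnt/llmstorage/Meta-Llama-3.1-70B-Instruct-Q3_K_L.gguf",
    gpu_layers=40,
)
print(unit)
```

The output would be saved as something like `/etc/systemd/system/llama-server.service`, followed by `systemctl daemon-reload` and `systemctl enable --now llama-server`. The value of `gpu_layers` should be revisited once the model and quantization are decided.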

cc @tijmen
