
HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest #53

Draft

FluffyAIcode wants to merge 6 commits into main from AgentMemory/hf-space-docker-bugfixes-c478

Conversation

FluffyAIcode (Owner) commented Apr 25, 2026

Summary

Finishes the HF Space deployment package for demos/hf_llama_kakeyalattice/:

  1. Fixes two user-reported bugs in app.py that were degrading the demo experience on the live Space (FluffyAIcode/LLM-KA-Cache-Compress).
  2. Adds the four files needed for a Docker-SDK HF Space.
  3. Drops the demo suffix from the page title.
  4. Switches the default model from Qwen2-0.5B to Qwen3-0.6B — head_dim=128 + GQA 16/8, matching production LLM shape.
  5. Rewrites the Space subtitle as a product pitch (non-Gaussian / heavy-tail framing).
  6. Adds a "When to pick KakeyaLattice over HQQ / Quanto / KIVI" comparison block to surface adjacent-method positioning on the Space's landing page.

Commits

1. fix(hf_demo): correct compression display + switch default prompt to avoid greedy-loop

Bug #1 — display Compression: Nx (-XX% bits vs bf16) instead of raw bits. Bug #2 — default prompt → "List five countries in Africa:", add gr.Examples + explanatory block. Also binds demo.launch(server_name="0.0.0.0", server_port=7860) for HF Space Docker.

2. feat(hf_demo): Docker-SDK Space manifest

Dockerfile, requirements.txt, SPACE_README.md, HF_SPACE_DEPLOY.md.

3. chore(hf_demo): drop 'demo' suffix from page title

Across app.py, README.md, SPACE_README.md.

4. feat(hf_demo): switch default model Qwen2-0.5B → Qwen3-0.6B

head_dim 128 + GQA 16/8 matches production LLMs. Bump transformers >= 4.51. GPU upgrade path updated to Qwen3-1.7B (T4-small) / Qwen3-4B (A10G-small).

5. chore(hf_demo): rewrite Space subtitle as product pitch

"By dynamically adapting to the empirical non-Gaussian patterns and heavy-tail characteristics of real LLM KV activations, our solution achieves near-lossless compression and performance gains on models like Qwen3."

6. docs(hf_demo): add 'When to pick KakeyaLattice over HQQ / Quanto / KIVI' comparison block

New section in SPACE_README.md positioning KakeyaLattice against adjacent methods:

  • HQQ / AWQ / GPTQ — flagged as weight quantisers (orthogonal).
  • QuantoQuantizedCache / HQQQuantizedCache — scalar; KakeyaLattice achieves a 9–38 % higher compression ratio at ≤1 % |Δppl| across the four benchmark models.
  • KIVI (2-bit KV) — hits similar bit budgets but cannot gaussianise heavy tails.
  • SnapKV / H2O / Scissorhands — eviction, orthogonal.

The 9–38 % range is taken verbatim from the iso-PPL table in PR #54 (new GitHub README), which reproduces the published n=8 iso-PPL JSON 1:1.

Deployment status

| Commit | Space commit | Note |
| --- | --- | --- |
| 1+2 | 292780d | initial Docker SDK deploy |
| 3 | 2b572cf | drop "demo" suffix |
| 4 | 23d974d | Qwen3-0.6B default |
| 5 | 5c37c08 | product-pitch subtitle |
| 6 | (pending) | not yet deployed to Space (old HF_TOKEN rotated; see PR #54 author note); will sync after merge |

Space is live at https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress.

Test plan

  • python -c "import ast; ast.parse(open('demos/hf_llama_kakeyalattice/app.py').read())" — syntax OK.
  • All other changed files are static markdown/config; nothing to execute.
  • Qwen3-0.6B config sanity-checked via HF config.json.
  • Live smoke: the Space has been building and serving successfully on each of the first 5 commits above.

cursoragent and others added 5 commits April 25, 2026 14:05
fix(hf_demo): correct compression display + switch default prompt to avoid greedy-loop

Two bugs reported in the HF Space demo (FluffyAIcode/LLM-KA-Cache-Compress):

Bug #1 — Compression label was misleading
  Old output showed 'Compression: 32x' for E8 Q=10 because the bare
  bits-per-token-per-head number (~320) was truncated/misread as a
  scalar ratio. Users mistook it for byte-level KV savings.

  Fix: report both the ratio relative to the bf16 baseline
  (bf16_bits = head_dim * 16) and the percentage bit saving, e.g.
  '3.20x (-69% bits vs bf16)'. Header now shows 'bf16 reference
  bits/vec' so the denominator is explicit.
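
  A minimal sketch of the fixed label, using a hypothetical helper name
  (format_compression is not claimed to be the actual app.py function)
  and the ~320 bits/vec E8 Q=10 figure above:

```python
# Hypothetical helper -- not the actual app.py code -- showing the
# corrected display math described above.
def format_compression(bits_per_vec: float, head_dim: int) -> str:
    bf16_bits = head_dim * 16                    # bf16 reference bits/vec
    ratio = bf16_bits / bits_per_vec             # e.g. 1024 / 320 = 3.20
    pct = int(100 * (1 - bits_per_vec / bf16_bits) + 0.5)  # % bits saved
    return f"Compression: {ratio:.2f}x (-{pct}% bits vs bf16)"

print(format_compression(320, 64))  # Compression: 3.20x (-69% bits vs bf16)
```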

Bug #2 — Default prompt triggered greedy-decode repetition
  Qwen2-0.5B under greedy decoding on the open-ended prompt
  'Explain in one paragraph why lattice quantisation can beat scalar
  quantisation:' falls into a same-sentence loop. Users then see all
  four configs produce near-identical looping output and conclude the
  codec is broken.

  Fix: switch default prompt to 'List five countries in Africa:'
  (short, fact-shaped), add gr.Examples with four safe prompts, and
  surface an 'About the default model' explanatory block clarifying
  that repetition loops on open-ended prompts are a small-model
  property, not a codec issue.

Also:
- Bind demo.launch() to 0.0.0.0:7860 so the app is reachable inside
  a HF Space Docker container (default 127.0.0.1 would 404).
- Minor: factor bf16_bits out of the row loop.
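
A sketch of the wiring the points above describe, with the prompt list
abbreviated (only the default prompt is quoted in this commit; the real
app adds four examples and the full comparison UI):

```python
import gradio as gr

with gr.Blocks(title="KakeyaLattice KV-cache compression") as demo:
    prompt = gr.Textbox(value="List five countries in Africa:", label="Prompt")
    # Short, fact-shaped prompts avoid greedy-decode repetition loops
    # on a small model; the other three example prompts are not
    # reproduced here.
    gr.Examples(examples=["List five countries in Africa:"], inputs=prompt)

# Bind to all interfaces on the port the Space proxies to; the Gradio
# default of 127.0.0.1 is unreachable from outside the container.
demo.launch(server_name="0.0.0.0", server_port=7860)
```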

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
feat(hf_demo): Docker-SDK Space manifest (Dockerfile, requirements, SPACE_README + deploy guide)

The HF Space FluffyAIcode/LLM-KA-Cache-Compress was created with the
Docker SDK (not the Gradio SDK). Docker gives full control over the
Python + system-package environment but requires Dockerfile +
requirements.txt + a Docker-flavoured README.md at the Space repo root.

Adds:

- Dockerfile: python:3.11-slim base, unprivileged UID 1000 (HF
  convention), CPU-only torch via
  https://download.pytorch.org/whl/cpu for small images and fast cold
  start on the free CPU tier. EXPOSE 7860 + CMD python app.py.

- requirements.txt: kakeyalattice[hf]>=1.5.0, gradio>=4.44,
  transformers>=4.45, torch>=2.1, plus sentencepiece/tiktoken for
  Qwen2 / LLaMA tokenisers.

- SPACE_README.md: HF-rendered YAML front-matter with sdk: docker,
  app_port: 7860. Describes the demo + caveats (KV roundtrip vs real
  HBM savings, decoder-latency overhead). This file is renamed to
  README.md at the Space repo root on deploy.

- HF_SPACE_DEPLOY.md: step-by-step deploy procedure (clone empty
  Space repo, copy four files, push), expected build behaviour, free
  tier specs, GPU upgrade path, troubleshooting.

The existing demos/hf_llama_kakeyalattice/README.md (sdk: gradio) is
retained for GitHub readers; the two serve different audiences and
must not be merged. Deploy guide documents this explicitly.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
chore(hf_demo): drop 'demo' suffix from page title (KakeyaLattice KV-cache compression)

Renames the Gradio page title and H1 from 'KakeyaLattice KV-cache
compression demo' to 'KakeyaLattice KV-cache compression' across the
three user-visible surfaces:

  - demos/hf_llama_kakeyalattice/app.py (gr.Blocks title + gr.Markdown H1)
  - demos/hf_llama_kakeyalattice/README.md (GitHub-readers README)
  - demos/hf_llama_kakeyalattice/SPACE_README.md (HF Space rendered README)

The HF_SPACE_DEPLOY.md sample 'git commit -m' line is left unchanged
since it is an illustrative shell snippet, not a title.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
feat(hf_demo): switch default model Qwen2-0.5B → Qwen3-0.6B

Qwen3-0.6B is a better fit for what this demo is showing than the old
Qwen2-0.5B default:

- head_dim = 128 (vs 64), divisible by 8 -> E8 codec runs natively
  at the same shape used by every production-scale LLM (Llama-3 / Qwen3
  / DeepSeek-V3 family all use head_dim=128). Compression numbers the
  user reads off the demo now translate directly to production.
- GQA 16 query / 8 KV heads (vs Qwen2-0.5B's pure MHA). Again, this
  matches production KV layouts, so the 'bits/vec' savings reported
  are representative of real KV memory savings.
- Still fits on the free HF CPU tier (~2.4 GB fp32 vs 16 GB RAM).
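
These properties can be checked directly against the published config
(standard transformers API; needs >= 4.51 for Qwen3, per the
requirements bump below):

```python
from transformers import AutoConfig

# Values in the comments are from Qwen/Qwen3-0.6B's config.json.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(cfg.head_dim)             # 128 -- divisible by 8, E8 runs natively
print(cfg.num_attention_heads)  # 16 query heads
print(cfg.num_key_value_heads)  # 8 KV heads -> GQA 16/8
```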

Trade-off: a full 'Run comparison' click on free CPU takes ~4-8 minutes
(4 generations, 128 tokens each, 2 vCPU). This is documented in the
About-the-default-model block, SPACE_README, and HF_SPACE_DEPLOY so
users aren't surprised.

Changes:
- app.py: DEFAULT_MODEL -> Qwen/Qwen3-0.6B; About-the-default-model
  block updated with timing caveat + new GPU upgrade path
  (Qwen3-1.7B / Qwen3-4B on T4-small / A10G-small).
- requirements.txt: transformers >= 4.51 (Qwen3ForCausalLM was added
  in 4.51; 4.45 would crash on from_pretrained).
- SPACE_README.md: default-model paragraph rewritten; code sample
  switched to Qwen3-0.6B; bits/vec reference table updated for
  head_dim=128 (baseline 2048 bits/vec) with note that percentage
  savings are head_dim-invariant (see the sketch after this list).
- README.md: default-model paragraph + KAKEYA_DEMO_MODEL examples
  switched to Qwen3 family.
- HF_SPACE_DEPLOY.md: free-tier spec section, GPU upgrade section,
  OOM troubleshooting entry all updated for Qwen3 defaults + timings.
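
A toy illustration of the head_dim-invariance note: bits/vec for a
fixed codec setting scales with the number of 8-dim sub-blocks, and so
does the bf16 baseline, so the ratio and percentage are unchanged. The
head_dim=128 figure below is extrapolated from commit 1's ~320 bits/vec
at head_dim=64, not quoted anywhere in this PR:

```python
# Illustrative numbers only; 640 bits/vec at head_dim=128 is assumed
# by linear scaling from ~320 bits/vec at head_dim=64 (E8 Q=10).
for head_dim, bits_per_vec in [(64, 320.0), (128, 640.0)]:
    bf16_bits = head_dim * 16   # 1024 and 2048 (the table's baselines)
    print(f"head_dim={head_dim}: {bf16_bits / bits_per_vec:.2f}x, "
          f"-{100 * (1 - bits_per_vec / bf16_bits):.2f}% bits")
# Both lines print 3.20x and -68.75% -- the saving is head_dim-invariant.
```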

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
chore(hf_demo): rewrite Space subtitle as product pitch (non-Gaussian / heavy-tail framing)

Replaces the technical-first subtitle on the Space landing page
('Compare generation output + latency ... The E8 variant uses 8-D
nested-lattice closest-point quantisation ...') with a
product-pitch framing that foregrounds the value proposition:

  By dynamically adapting to the empirical non-Gaussian patterns
  and heavy-tail characteristics of real LLM KV activations, our
  solution achieves near-lossless compression and performance
  gains on models like Qwen3.

Rationale:

- New framing matches the narrative in reports/paper/ and the
  DeepSeek-V4-Flash Stage 0.75 findings (the codec's edge comes
  from matching empirical KV distributions, not from generic
  lattice geometry).
- Mentions Qwen3 explicitly, matching the demo's new default model.
- Uses 'near-lossless' (standard term for <1% ppl loss) instead of
  the oxymoron 'highly lossless'; matches terminology already in
  SPACE_README.md and the per-row output labels.
- Full technical description of E8 (Sylvester-Hadamard rotation,
  per-vector adaptive scaling) is retained in SPACE_README.md's
  'How it works' section — moved out of the landing paragraph,
  not deleted.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(hf_demo): add 'When to pick KakeyaLattice over HQQ / Quanto / KIVI' comparison block

Adds a four-bullet comparison section that explicitly positions
KakeyaLattice against the adjacent KV-cache-quant and eviction methods
Space visitors are most likely to already know:

- HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal).
- QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar
  in transformers): cites the 9 %-38 % CR advantage at <=1 % |Δppl|
  across four models, with link back to the GitHub README table.
- KIVI (2-bit KV): explains why the Hadamard rotation matters.
- SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal).

The 9-38 % range is taken from the iso-PPL table in the new GitHub
README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL
JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the
Space page now surfaces the same numbers as the GitHub README.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>