
HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest #53

Draft

FluffyAIcode wants to merge 6 commits into main from AgentMemory/hf-space-docker-bugfixes-c478

Conversation

FluffyAIcode (Owner) commented Apr 25, 2026

Summary

Finishes the HF Space deployment package for demos/hf_llama_kakeyalattice/:

  1. Fixes two user-reported bugs in app.py that were degrading the demo experience on the live Space (FluffyAIcode/LLM-KA-Cache-Compress).
  2. Adds the four files needed for a Docker-SDK HF Space.
  3. Drops the demo suffix from the page title.
  4. Switches the default model from Qwen2-0.5B to Qwen3-0.6B — head_dim=128 + GQA 16/8, matching production LLM shape.
  5. Rewrites the Space subtitle as a product pitch (non-Gaussian / heavy-tail framing).
  6. Adds a "When to pick KakeyaLattice over HQQ / Quanto / KIVI" comparison block to surface adjacent-method positioning on the Space's landing page.

Commits

1. fix(hf_demo): correct compression display + switch default prompt to avoid greedy-loop

Bug #1 — display Compression: Nx (-XX% bits vs bf16) instead of raw bits. Bug #2 — default prompt → "List five countries in Africa:", add gr.Examples + explanatory block. Also binds demo.launch(server_name="0.0.0.0", server_port=7860) for HF Space Docker.

2. feat(hf_demo): Docker-SDK Space manifest

Dockerfile, requirements.txt, SPACE_README.md, HF_SPACE_DEPLOY.md.

3. chore(hf_demo): drop 'demo' suffix from page title

Across app.py, README.md, SPACE_README.md.

4. feat(hf_demo): switch default model Qwen2-0.5B → Qwen3-0.6B

head_dim 128 + GQA 16/8 matches production LLMs. Bump transformers >= 4.51. GPU upgrade path updated to Qwen3-1.7B (T4-small) / Qwen3-4B (A10G-small).

5. chore(hf_demo): rewrite Space subtitle as product pitch

"By dynamically adapting to the empirical non-Gaussian patterns and heavy-tail characteristics of real LLM KV activations, our solution achieves near-lossless compression and performance gains on models like Qwen3."

6. docs(hf_demo): add 'When to pick KakeyaLattice over HQQ / Quanto / KIVI' comparison block

New section in SPACE_README.md positioning KakeyaLattice against adjacent methods:

  • HQQ / AWQ / GPTQ — flagged as weight quantisers (orthogonal).
  • QuantoQuantizedCache / HQQQuantizedCache — scalar; KakeyaLattice achieves a 9–38 % higher compression ratio at ≤1 % |Δppl| across the four benchmark models.
  • KIVI (2-bit KV) — hits similar bit budgets but cannot gaussianise heavy tails.
  • SnapKV / H2O / Scissorhands — eviction, orthogonal.

The 9–38 % range is taken verbatim from the iso-PPL table in PR #54 (new GitHub README), which reproduces the published n=8 iso-PPL JSON 1:1.

Deployment status

| Commit | Space commit | Note |
| --- | --- | --- |
| 1+2 | 292780d | initial Docker SDK deploy |
| 3 | 2b572cf | drop "demo" suffix |
| 4 | 23d974d | Qwen3-0.6B default |
| 5 | 5c37c08 | product-pitch subtitle |
| 6 | (pending) | not yet deployed to Space (old HF_TOKEN rotated; see PR #54 author note); will sync after merge |

Space is live at https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress.

Test plan

  • python -c "import ast; ast.parse(open('demos/hf_llama_kakeyalattice/app.py').read())" — syntax OK.
  • All other changed files are static markdown/config; nothing to execute.
  • Qwen3-0.6B config sanity-checked via HF config.json.
  • Live smoke: the Space has been building and serving successfully on each of the first 5 commits above.

cursoragent and others added 5 commits April 25, 2026 14:05
fix(hf_demo): correct compression display + switch default prompt to avoid greedy-loop

Two bugs reported in the HF Space demo (FluffyAIcode/LLM-KA-Cache-Compress):

Bug #1 — Compression label was misleading
  Old output showed 'Compression: 32x' for E8 Q=10 because the bare
  bits-per-token-per-head number (~320) was truncated/misread as a
  scalar ratio. Users mistook it for byte-level KV savings.

  Fix: report both the ratio relative to the bf16 baseline
  (bf16_bits = head_dim * 16) and the percentage bit saving, e.g.
  '3.20x (-69% bits vs bf16)'. Header now shows 'bf16 reference
  bits/vec' so the denominator is explicit.
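
  A minimal sketch of the fixed label, using a hypothetical helper name
  (format_compression is not claimed to be the actual app.py function)
  and the ~320 bits/vec E8 Q=10 figure above:

```python
# Hypothetical helper -- not the actual app.py code -- showing the
# corrected display math described above.
def format_compression(bits_per_vec: float, head_dim: int) -> str:
    bf16_bits = head_dim * 16                    # bf16 reference bits/vec
    ratio = bf16_bits / bits_per_vec             # e.g. 1024 / 320 = 3.20
    pct = int(100 * (1 - bits_per_vec / bf16_bits) + 0.5)  # % bits saved
    return f"Compression: {ratio:.2f}x (-{pct}% bits vs bf16)"

print(format_compression(320, 64))  # Compression: 3.20x (-69% bits vs bf16)
```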

Bug #2 — Default prompt triggered greedy-decode repetition
  Qwen2-0.5B under greedy decoding on the open-ended prompt
  'Explain in one paragraph why lattice quantisation can beat scalar
  quantisation:' falls into a same-sentence loop. Users then see all
  four configs produce near-identical looping output and conclude the
  codec is broken.

  Fix: switch default prompt to 'List five countries in Africa:'
  (short, fact-shaped), add gr.Examples with four safe prompts, and
  surface an 'About the default model' explanatory block clarifying
  that repetition loops on open-ended prompts are a small-model
  property, not a codec issue.

Also:
- Bind demo.launch() to 0.0.0.0:7860 so the app is reachable inside
  a HF Space Docker container (default 127.0.0.1 would 404).
- Minor: factor bf16_bits out of the row loop.
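
A sketch of the wiring the points above describe, with the prompt list
abbreviated (only the default prompt is quoted in this commit; the real
app adds four examples and the full comparison UI):

```python
import gradio as gr

with gr.Blocks(title="KakeyaLattice KV-cache compression") as demo:
    prompt = gr.Textbox(value="List five countries in Africa:", label="Prompt")
    # Short, fact-shaped prompts avoid greedy-decode repetition loops
    # on a small model; the other three example prompts are not
    # reproduced here.
    gr.Examples(examples=["List five countries in Africa:"], inputs=prompt)

# Bind to all interfaces on the port the Space proxies to; the Gradio
# default of 127.0.0.1 is unreachable from outside the container.
demo.launch(server_name="0.0.0.0", server_port=7860)
```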

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
feat(hf_demo): Docker-SDK Space manifest (Dockerfile, requirements, SPACE_README + deploy guide)

The HF Space FluffyAIcode/LLM-KA-Cache-Compress was created with the
Docker SDK (not the Gradio SDK). Docker gives full control over the
Python + system-package environment but requires Dockerfile +
requirements.txt + a Docker-flavoured README.md at the Space repo root.

Adds:

- Dockerfile: python:3.11-slim base, unprivileged UID 1000 (HF
  convention), CPU-only torch via
  https://download.pytorch.org/whl/cpu for small images and fast cold
  start on the free CPU tier. EXPOSE 7860 + CMD python app.py.

- requirements.txt: kakeyalattice[hf]>=1.5.0, gradio>=4.44,
  transformers>=4.45, torch>=2.1, plus sentencepiece/tiktoken for
  Qwen2 / LLaMA tokenisers.

- SPACE_README.md: HF-rendered YAML front-matter with sdk: docker,
  app_port: 7860. Describes the demo + caveats (KV roundtrip vs real
  HBM savings, decoder-latency overhead). This file is renamed to
  README.md at the Space repo root on deploy.

- HF_SPACE_DEPLOY.md: step-by-step deploy procedure (clone empty
  Space repo, copy four files, push), expected build behaviour, free
  tier specs, GPU upgrade path, troubleshooting.

The existing demos/hf_llama_kakeyalattice/README.md (sdk: gradio) is
retained for GitHub readers; the two serve different audiences and
must not be merged. Deploy guide documents this explicitly.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
chore(hf_demo): drop 'demo' suffix from page title (KakeyaLattice KV-cache compression)

Renames the Gradio page title and H1 from 'KakeyaLattice KV-cache
compression demo' to 'KakeyaLattice KV-cache compression' across the
three user-visible surfaces:

  - demos/hf_llama_kakeyalattice/app.py (gr.Blocks title + gr.Markdown H1)
  - demos/hf_llama_kakeyalattice/README.md (GitHub-readers README)
  - demos/hf_llama_kakeyalattice/SPACE_README.md (HF Space rendered README)

The HF_SPACE_DEPLOY.md sample 'git commit -m' line is left unchanged
since it is an illustrative shell snippet, not a title.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
feat(hf_demo): switch default model Qwen2-0.5B → Qwen3-0.6B

Qwen3-0.6B is a better fit for what this demo is showing than the old
Qwen2-0.5B default:

- head_dim = 128 (vs 64), divisible by 8 -> E8 codec runs natively
  at the same shape used by every production-scale LLM (Llama-3 / Qwen3
  / DeepSeek-V3 family all use head_dim=128). Compression numbers the
  user reads off the demo now translate directly to production.
- GQA 16 query / 8 KV heads (vs Qwen2-0.5B's pure MHA). Again, this
  matches production KV layouts, so the 'bits/vec' savings reported
  are representative of real KV memory savings.
- Still fits on the free HF CPU tier (~2.4 GB fp32 vs 16 GB RAM).
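
These properties can be checked directly against the published config
(standard transformers API; needs >= 4.51 for Qwen3, per the
requirements bump below):

```python
from transformers import AutoConfig

# Values in the comments are from Qwen/Qwen3-0.6B's config.json.
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print(cfg.head_dim)             # 128 -- divisible by 8, E8 runs natively
print(cfg.num_attention_heads)  # 16 query heads
print(cfg.num_key_value_heads)  # 8 KV heads -> GQA 16/8
```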

Trade-off: a full 'Run comparison' click on free CPU takes ~4-8 minutes
(4 generations, 128 tokens each, 2 vCPU). This is documented in the
About-the-default-model block, SPACE_README, and HF_SPACE_DEPLOY so
users aren't surprised.

Changes:
- app.py: DEFAULT_MODEL -> Qwen/Qwen3-0.6B; About-the-default-model
  block updated with timing caveat + new GPU upgrade path
  (Qwen3-1.7B / Qwen3-4B on T4-small / A10G-small).
- requirements.txt: transformers >= 4.51 (Qwen3ForCausalLM was added
  in 4.51; 4.45 would crash on from_pretrained).
- SPACE_README.md: default-model paragraph rewritten; code sample
  switched to Qwen3-0.6B; bits/vec reference table updated for
  head_dim=128 (baseline 2048 bits/vec) with note that percentage
  savings are head_dim-invariant (see the sketch after this list).
- README.md: default-model paragraph + KAKEYA_DEMO_MODEL examples
  switched to Qwen3 family.
- HF_SPACE_DEPLOY.md: free-tier spec section, GPU upgrade section,
  OOM troubleshooting entry all updated for Qwen3 defaults + timings.
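
A toy illustration of the head_dim-invariance note: bits/vec for a
fixed codec setting scales with the number of 8-dim sub-blocks, and so
does the bf16 baseline, so the ratio and percentage are unchanged. The
head_dim=128 figure below is extrapolated from commit 1's ~320 bits/vec
at head_dim=64, not quoted anywhere in this PR:

```python
# Illustrative numbers only; 640 bits/vec at head_dim=128 is assumed
# by linear scaling from ~320 bits/vec at head_dim=64 (E8 Q=10).
for head_dim, bits_per_vec in [(64, 320.0), (128, 640.0)]:
    bf16_bits = head_dim * 16   # 1024 and 2048 (the table's baselines)
    print(f"head_dim={head_dim}: {bf16_bits / bits_per_vec:.2f}x, "
          f"-{100 * (1 - bits_per_vec / bf16_bits):.2f}% bits")
# Both lines print 3.20x and -68.75% -- the saving is head_dim-invariant.
```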

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
chore(hf_demo): rewrite Space subtitle as product pitch (non-Gaussian / heavy-tail framing)

Replaces the technical-first subtitle on the Space landing page
('Compare generation output + latency ... The E8 variant uses 8-D
nested-lattice closest-point quantisation ...') with a
product-pitch framing that foregrounds the value proposition:

  By dynamically adapting to the empirical non-Gaussian patterns
  and heavy-tail characteristics of real LLM KV activations, our
  solution achieves near-lossless compression and performance
  gains on models like Qwen3.

Rationale:

- New framing matches the narrative in reports/paper/ and the
  DeepSeek-V4-Flash Stage 0.75 findings (the codec's edge comes
  from matching empirical KV distributions, not from generic
  lattice geometry).
- Mentions Qwen3 explicitly, matching the demo's new default model.
- Uses 'near-lossless' (standard term for <1% ppl loss) instead of
  the oxymoron 'highly lossless'; matches terminology already in
  SPACE_README.md and the per-row output labels.
- Full technical description of E8 (Sylvester-Hadamard rotation,
  per-vector adaptive scaling) is retained in SPACE_README.md's
  'How it works' section — moved out of the landing paragraph,
  not deleted.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
docs(hf_demo): add 'When to pick KakeyaLattice over HQQ / Quanto / KIVI' comparison block

Adds a four-bullet comparison section that explicitly positions
KakeyaLattice against the adjacent KV-cache-quant and eviction methods
Space visitors are most likely to already know:

- HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal).
- QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar
  in transformers): cites the 9 %-38 % CR advantage at <=1 % |Δppl|
  across four models, with link back to the GitHub README table.
- KIVI (2-bit KV): explains why the Hadamard rotation matters.
- SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal).

The 9-38 % range is taken from the iso-PPL table in the new GitHub
README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL
JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the
Space page now surfaces the same numbers as the GitHub README.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>