HF Space demo: fix compression display + repetition-loop prompt, add Docker SDK manifest#53
Draft
FluffyAIcode wants to merge 6 commits into main from
Conversation
…avoid greedy-loop

Two bugs reported in the HF Space demo (FluffyAIcode/LLM-KA-Cache-Compress):

Bug #1 — Compression label was misleading
Old output showed 'Compression: 32x' for E8 Q=10 because the bare bits-per-token-per-head number (~320) was truncated/misread as a scalar ratio. Users mistook it for byte-level KV savings.
Fix: report both the ratio relative to the bf16 baseline (bf16_bits = head_dim * 16) and the percentage bit saving, e.g. '3.20x (-69% bits vs bf16)'. The header now shows 'bf16 reference bits/vec' so the denominator is explicit.

Bug #2 — Default prompt triggered greedy-decode repetition
Qwen2-0.5B under greedy decoding falls into a same-sentence loop on the open-ended prompt 'Explain in one paragraph why lattice quantisation can beat scalar quantisation:'. Users then see all four configs produce near-identical looping output and conclude the codec is broken.
Fix: switch the default prompt to 'List five countries in Africa:' (short, fact-shaped), add gr.Examples with four safe prompts, and surface an 'About the default model' explanatory block clarifying that repetition loops on open-ended prompts are a small-model property, not a codec issue.

Also:
- Bind demo.launch() to 0.0.0.0:7860 so the app is reachable inside an HF Space Docker container (the default 127.0.0.1 would 404).
- Minor: factor bf16_bits out of the row loop.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
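The Bug #1 fix can be sketched as a small formatting helper (function and variable names are illustrative, not the demo's actual code; only the bf16_bits = head_dim * 16 baseline and the '3.20x (-69% bits vs bf16)' label format come from the commit message):

```python
def compression_label(bits_per_vec: float, head_dim: int) -> str:
    """Format a KV-cache compression label relative to a bf16 baseline.

    bf16 stores head_dim values at 16 bits each, so the reference cost
    per vector is head_dim * 16 bits. Reporting the ratio against this
    explicit denominator avoids the old bug of showing the raw
    bits-per-token-per-head number (~320) as if it were a scalar ratio.
    """
    bf16_bits = head_dim * 16            # e.g. 64 * 16 = 1024 for Qwen2-0.5B
    ratio = bf16_bits / bits_per_vec     # >1 means the codec is smaller
    saving_pct = 100.0 * (1.0 - bits_per_vec / bf16_bits)
    return f"Compression: {ratio:.2f}x (-{saving_pct:.0f}% bits vs bf16)"

print(compression_label(320, 64))  # → Compression: 3.20x (-69% bits vs bf16)
```

Factoring bf16_bits out of the per-row loop (the commit's minor cleanup) follows naturally, since it depends only on head_dim.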
… SPACE_README + deploy guide)

The HF Space FluffyAIcode/LLM-KA-Cache-Compress was created with the Docker SDK (not the Gradio SDK). Docker gives full control over the Python and system-package environment but requires a Dockerfile, requirements.txt, and a Docker-flavoured README.md at the Space repo root.

Adds:
- Dockerfile: python:3.11-slim base, unprivileged UID 1000 (HF convention), CPU-only torch via https://download.pytorch.org/whl/cpu for small images and fast cold start on the free CPU tier. EXPOSE 7860 + CMD python app.py.
- requirements.txt: kakeyalattice[hf]>=1.5.0, gradio>=4.44, transformers>=4.45, torch>=2.1, plus sentencepiece/tiktoken for the Qwen2 / LLaMA tokenisers.
- SPACE_README.md: HF-rendered YAML front matter with sdk: docker, app_port: 7860. Describes the demo and its caveats (KV roundtrip vs real HBM savings, decoder-latency overhead). This file is renamed to README.md at the Space repo root on deploy.
- HF_SPACE_DEPLOY.md: step-by-step deploy procedure (clone the empty Space repo, copy the four files, push), expected build behaviour, free-tier specs, GPU upgrade path, troubleshooting.

The existing demos/hf_llama_kakeyalattice/README.md (sdk: gradio) is retained for GitHub readers; the two serve different audiences and must not be merged. The deploy guide documents this explicitly.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
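A Dockerfile following the conventions described above might look like this (a minimal sketch, not the PR's actual file; exact pins and paths may differ):

```dockerfile
FROM python:3.11-slim

# HF Spaces run user code as UID 1000 by convention
RUN useradd -m -u 1000 user
USER user
ENV PATH="/home/user/.local/bin:${PATH}"
WORKDIR /home/user/app

# CPU-only torch wheels keep the image small and cold start fast
# on the free CPU tier
COPY --chown=user requirements.txt .
RUN pip install --no-cache-dir \
    --extra-index-url https://download.pytorch.org/whl/cpu \
    -r requirements.txt

COPY --chown=user . .
EXPOSE 7860
CMD ["python", "app.py"]
```

Installing requirements before copying the rest of the app keeps the pip layer cached across rebuilds that only touch app code.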
…cache compression)

Renames the Gradio page title and H1 from 'KakeyaLattice KV-cache compression demo' to 'KakeyaLattice KV-cache compression' across the three user-visible surfaces:
- demos/hf_llama_kakeyalattice/app.py (gr.Blocks title + gr.Markdown H1)
- demos/hf_llama_kakeyalattice/README.md (GitHub readers' README)
- demos/hf_llama_kakeyalattice/SPACE_README.md (HF Space rendered README)

The HF_SPACE_DEPLOY.md sample 'git commit -m' line is left unchanged since it is an illustrative shell snippet, not a title.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Qwen3-0.6B is a better fit for what this demo is showing than the old Qwen2-0.5B default:
- head_dim = 128 (vs 64), divisible by 8, so the E8 codec runs natively at the same shape used by every production-scale LLM (the Llama-3 / Qwen3 / DeepSeek-V3 families all use head_dim = 128). Compression numbers the user reads off the demo now translate directly to production.
- GQA with 16 query / 8 KV heads (vs Qwen2-0.5B's pure MHA). Again, this matches production KV layouts, so the 'bits/vec' savings reported are representative of real KV memory savings.
- Still fits on the free HF CPU tier (~2.4 GB fp32 vs 16 GB RAM).

Trade-off: a full 'Run comparison' click on the free CPU tier takes ~4-8 minutes (4 generations, 128 tokens each, 2 vCPU). This is documented in the About-the-default-model block, SPACE_README, and HF_SPACE_DEPLOY so users aren't surprised.

Changes:
- app.py: DEFAULT_MODEL -> Qwen/Qwen3-0.6B; About-the-default-model block updated with the timing caveat and the new GPU upgrade path (Qwen3-1.7B / Qwen3-4B on T4-small / A10G-small).
- requirements.txt: transformers >= 4.51 (Qwen3ForCausalLM was added in 4.51; 4.45 would crash on from_pretrained).
- SPACE_README.md: default-model paragraph rewritten; code sample switched to Qwen3-0.6B; bits/vec reference table updated for head_dim=128 (baseline 2048 bits/vec) with a note that percentage savings are head_dim-invariant.
- README.md: default-model paragraph and KAKEYA_DEMO_MODEL examples switched to the Qwen3 family.
- HF_SPACE_DEPLOY.md: free-tier spec section, GPU upgrade section, and OOM troubleshooting entry all updated for the Qwen3 defaults and timings.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
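The head_dim-invariance note can be illustrated with a few lines of arithmetic. The 5 bits/dim codec rate below is an assumption chosen only to match the ~320 bits/vec figure at head_dim=64 from the earlier commit; the point is that any fixed bits-per-dimension rate makes head_dim cancel out of the percentage:

```python
BITS_PER_DIM = 5  # assumed codec rate (illustrative): 320 bits/vec / 64 dims

def pct_saving(head_dim: int, bits_per_dim: int = BITS_PER_DIM) -> float:
    """Percentage bit saving vs a bf16 baseline of 16 bits per dimension."""
    codec_bits = bits_per_dim * head_dim  # codec cost scales with head_dim...
    bf16_bits = 16 * head_dim             # ...and so does the bf16 baseline
    return 100.0 * (1.0 - codec_bits / bf16_bits)

# head_dim cancels: Qwen2-0.5B (64) and Qwen3-0.6B (128) give the same %
print(pct_saving(64), pct_saving(128))  # → 68.75 68.75
# but the absolute baseline doubles: 1024 vs 2048 bits/vec
print(16 * 64, 16 * 128)  # → 1024 2048
```

This is why the SPACE_README table could be rebuilt for the 2048 bits/vec baseline without the percentage columns changing.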
… / heavy-tail framing)
Replaces the technical-first subtitle on the Space landing page
('Compare generation output + latency ... The E8 variant uses 8-D
nested-lattice closest-point quantisation ...') with a
product-pitch framing that foregrounds the value proposition:
By dynamically adapting to the empirical non-Gaussian patterns
and heavy-tail characteristics of real LLM KV activations, our
solution achieves near-lossless compression and performance
gains on models like Qwen3.
Rationale:
- New framing matches the narrative in reports/paper/ and the
DeepSeek-V4-Flash Stage 0.75 findings (the codec's edge comes
from matching empirical KV distributions, not from generic
lattice geometry).
- Mentions Qwen3 explicitly, matching the demo's new default model.
- Uses 'near-lossless' (standard term for <1% ppl loss) instead of
the oxymoron 'highly lossless'; matches terminology already in
SPACE_README.md and the per-row output labels.
- Full technical description of E8 (Sylvester-Hadamard rotation,
per-vector adaptive scaling) is retained in SPACE_README.md's
'How it works' section — moved out of the landing paragraph,
not deleted.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…VI' comparison block

Adds a four-bullet comparison section that explicitly positions KakeyaLattice against the adjacent KV-cache-quant and eviction methods Space visitors are most likely to already know:
- HQQ / AWQ / GPTQ: flagged as weight quantisers (orthogonal).
- QuantoQuantizedCache / HQQQuantizedCache (per-channel scalar in transformers): cites the 9-38 % CR advantage at <=1 % |Δppl| across four models, with a link back to the GitHub README table.
- KIVI (2-bit KV): explains why the Hadamard rotation matters.
- SnapKV / H2O / Scissorhands: flagged as eviction (orthogonal).

The 9-38 % range is taken from the iso-PPL table in the new GitHub README (PR #54), which is in turn reproduced 1:1 from the n=8 iso-PPL JSON under reports/v1_4_release/kv_128k_isoppl_n8/. No new claims — the Space page now surfaces the same numbers as the GitHub README.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
Finishes the HF Space deployment package for demos/hf_llama_kakeyalattice/: fixes two app.py bugs that were degrading the demo experience on the live Space (FluffyAIcode/LLM-KA-Cache-Compress), adds the Docker-SDK manifest, and drops the 'demo' suffix from the page title.

Commits
1. fix(hf_demo): correct compression display + switch default prompt to avoid greedy-loop
   Bug #1 — display Compression: Nx (-XX% bits vs bf16) instead of raw bits. Bug #2 — default prompt → "List five countries in Africa:", add gr.Examples + explanatory block. Also binds demo.launch(server_name="0.0.0.0", server_port=7860) for HF Space Docker.
2. feat(hf_demo): Docker-SDK Space manifest
   Dockerfile, requirements.txt, SPACE_README.md, HF_SPACE_DEPLOY.md.
3. chore(hf_demo): drop 'demo' suffix from page title
   Across app.py, README.md, SPACE_README.md.
4. feat(hf_demo): switch default model Qwen2-0.5B → Qwen3-0.6B
   head_dim 128 + GQA 16/8 matches production LLMs. Bump transformers >= 4.51. GPU upgrade path updated to Qwen3-1.7B (T4-small) / Qwen3-4B (A10G-small).
5. chore(hf_demo): rewrite Space subtitle as product pitch
6. docs(hf_demo): add 'When to pick KakeyaLattice over HQQ / Quanto / KIVI' comparison block
   New section in SPACE_README.md positioning KakeyaLattice against adjacent methods. The 9–38 % range is taken verbatim from the iso-PPL table in PR #54 (new GitHub README), which reproduces the published n=8 iso-PPL JSON 1:1.
Deployment status
Space is live at https://huggingface.co/spaces/FluffyAIcode/LLM-KA-Cache-Compress.
Related
Test plan
python -c "import ast; ast.parse(open('demos/hf_llama_kakeyalattice/app.py').read())" — syntax OK.
config.json.