Releases: squishai/squish

Squish v9.0.0 – Cutting-Edge Attention Variants & Distributed Inference

12 Mar 14:47

Release Summary

Squish v9.0.0 introduces 28 new modules across Wave 25 (Cutting-Edge Attention Variants & Compute Fusion) and Wave 26 (Distributed Inference & Production Reliability).

Total modules now: 222 | Total tests: 4,876 | Test coverage: 100%


Wave 25: Cutting-Edge Attention Variants & Compute Fusion (14 modules)

Production-ready attention patterns from DeepSeek-V2/V3, kernel fusions, and speculative decode enhancements:

  • FlashMLA – DeepSeek-V2 multi-head latent attention; 4× KV compression; 0.55 µs append, 38.65 µs attend
  • NativeSparseAttn – DeepSeek-V3 block-sparse + sliding-window; ~87% attention sparsity; 646.6 µs forward
  • FusedSampler – Fused temperature/top-k/top-p/min-p/rep-penalty single-pass sampling; 1767 µs at vocab=32k
  • KVDefrag – Online KV cache defragmentation; drives the fragmentation ratio to zero; 349 µs defrag
  • DualChunkAttn – Intra+inter-chunk for 1M+ contexts; O(chunk²) not O(seq²); 93.3 µs forward
  • ActivationOffload – Layer activation offload to CPU; reduces peak GPU memory; 6.34 µs fetch
  • MorphAttn – Per-layer pattern selection (full/sparse/linear); ~40% FLOP reduction at seq=2048
  • HydraSpec – Multi-draft head speculation; n_heads tokens/step; 1229 µs verify
  • SeqCompact – In-place KV compaction after token pruning; zero-copy repack; 141 µs
  • LatencyPredictor – OLS latency forecasting for scheduling; sub-microsecond predict (0.82 µs)
  • ParallelSampler – Best-of-n sampling with diversity; quality improves with n candidates
  • ContextSummarizer – Inference-time context compression; keeps semantics, sheds tokens; 62.5 µs
  • TokenWatermark – Kirchenbauer statistical watermarking; detectable attribution
  • SchemaGen – FSM-accelerated constrained JSON; zero invalid tokens; 5.38 µs constrain
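To give a flavour of the scheduling maths, here is a minimal sketch of what an OLS latency forecaster in the spirit of LatencyPredictor might look like. The class name, method names, and feature choice (batch size and sequence length) are hypothetical illustrations, not Squish's actual API:

```python
import numpy as np

class OLSLatencyPredictor:
    """Minimal OLS latency model: latency ~ a*batch + b*seq_len + c."""

    def __init__(self):
        self.coef = None

    def fit(self, batches, seq_lens, latencies_us):
        # Design matrix with an intercept column.
        X = np.column_stack([batches, seq_lens, np.ones(len(batches))])
        y = np.asarray(latencies_us, dtype=float)
        self.coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(self, batch, seq_len):
        a, b, c = self.coef
        return a * batch + b * seq_len + c

pred = OLSLatencyPredictor()
pred.fit(batches=[1, 2, 4, 8], seq_lens=[128, 128, 256, 256],
         latencies_us=[79.0, 89.0, 173.0, 213.0])
estimate = pred.predict(4, 256)  # data above is exactly linear, so ~173 µs
```

Because prediction is a single dot product, a model like this can plausibly stay in the sub-microsecond range quoted above.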

Wave 26: Distributed Inference & Production Reliability (14 modules)

Tensor/sequence parallelism, request scheduling, safety, monitoring, and audit logging:

  • TensorParallel – Row/column tensor sharding + all-reduce; linear memory scaling
  • SequenceParallel – Ulysses-style sequence scatter/gather; attention FLOPs distributed
  • KVMigrate – Live KV migration + checksum; zero-recompute worker handoff
  • DisaggPrefill – Disaggregated prefill→decode; hardware specialisation
  • RequestPreempt – SRPT preemption scheduler; priority inversion elimination
  • InferGateway – Smart routing + health + load balancing; single ingress, N workers
  • ModelVersionSwap – Zero-downtime version swaps; canary → promote → rollback in-flight
  • ProductionProfiler – APM per-op tracking; p50/p99/p999 per operation; sub-200ns record
  • AdaptiveBatcher – Throughput/latency SLO-aware batching; 1.91 µs next_batch
  • SafetyLayer – Inline safety classification; zero extra forward pass
  • SemanticResponseCache – Embedding-similarity dedup; exact + fuzzy cache hits
  • RateLimiter – Token-bucket per-tenant limiting; 0.92 µs consume
  • SchemaValidator – JSON schema validation; 100% schema-compliant outputs
  • AuditLogger – SHA-256 chained audit log; tamper-evident request provenance; 1.92 µs log
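As an illustration of the rate-limiting approach, here is a minimal token-bucket sketch. The class name, the injectable clock, and the internals are illustrative assumptions, not Squish's actual RateLimiter API:

```python
import time

class TokenBucket:
    """Per-tenant token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock            # injectable for deterministic testing
        self.tokens = float(capacity)
        self.last = clock()

    def consume(self, n=1):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True               # request admitted
        return False                  # request rejected / throttled
```

The hot path is a couple of arithmetic operations and a comparison, which is consistent with the sub-microsecond consume time quoted above.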

Highlights

✅ 222 modules total across 26 waves (v1–v9)
✅ 4,876 unit + integration tests — 100% coverage
✅ Micro-benchmarks for all modules (Wave 25+26 in dev/benchmarks/bench_wave25_26.py)
✅ Demo GIF (dev/demos/squish-v9-demo.gif) — 1.95 MB, 10+ scenes from Wave 25+26
✅ arXiv paper draft (docs/paper.md) — abstract, background, architecture, benchmarks, ethics
✅ HuggingFace integration (dev/publish_hf.py) — ready to publish pre-squished weights
✅ Production hardening — fault tolerance, observability, schema validation, audit logging



What's Next?

Phase 3: Hardware Validation

Run end-to-end benchmarks on M-series hardware:

squish serve --model qwen2.5:1.5b --port 11435 &
python3 dev/benchmarks/bench_eoe.py --runs 5 --output results/eoe_2026_03_12.json
# Results → README + paper Section 4.1 (TTFT/tok-s)

Phase 4: Community & Publication

  • MMLU evaluation: lm_eval --tasks mmlu --limit 14042 → docs/RESULTS.md + paper
  • HuggingFace weights: python3 dev/publish_hf.py --model-dir ~/.cache/squish/...
  • Community posts: Hacker News, r/LocalLLaMA, Twitter/X
  • arXiv submission: docs/paper.md → LaTeX, submit to arxiv.org

Installation

pip install squish

# Pull a model (auto-caches after first conversion)
squish pull qwen2.5:1.5b

# Run inference at sub-second load time
squish run qwen2.5:1.5b "What is machine learning?"

# Drop-in OpenAI-compatible server
squish serve qwen2.5:1.5b --port 11435
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"squish","messages":[{"role":"user","content":"Hello!"}]}'

Acknowledgments

Squish builds on work from MLX, HuggingFace, Meta (Llama), OpenAI, AnthropicAI, Stanford (SWEET), Microsoft (AWQ), QuIP#, VPTQ, and other research communities. See docs/paper.md Section 2 for full citations.

v1.0.1 - KV cache fix, embeddings fix, AWQ CLI and log-level

09 Mar 16:18

What's fixed in v1.0.1

KV Cache Quantisation now actually works

The --kv-cache-mode int8 and --kv-cache-mode snap flags were silently no-ops in v1.0.0. Two bugs caused this:

  • KVLayerCache was missing update_and_fetch() and .offset — the per-layer cache interface required by mlx_lm. The cache was created but never called into during generation.
  • QuantizedKVCache.__getitem__ returned a _LayerCacheView wrapper that lacked the protocol method, instead of returning the KVLayerCache itself.

Both are fixed. --kv-cache-mode int8/snap now routes each decode step through model(x, cache=layer_caches) with a graceful fallback to mlx_lm.stream_generate on error.
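For readers unfamiliar with the per-layer cache interface, here is a minimal numpy sketch of the protocol described above (update_and_fetch() plus .offset). The internals are illustrative only; Squish's actual cache is quantised and operates on MLX arrays:

```python
import numpy as np

class KVLayerCache:
    """Per-layer KV cache exposing the protocol the generation loop calls:
    update_and_fetch(new_keys, new_values) and an .offset attribute."""

    def __init__(self):
        self.keys = None    # shape: (batch, heads, seq, head_dim)
        self.values = None
        self.offset = 0     # number of cached sequence positions

    def update_and_fetch(self, new_keys, new_values):
        # Append this decode step's K/V along the sequence axis (axis=2)
        # and return the full cache for attention.
        if self.keys is None:
            self.keys, self.values = new_keys, new_values
        else:
            self.keys = np.concatenate([self.keys, new_keys], axis=2)
            self.values = np.concatenate([self.values, new_values], axis=2)
        self.offset += new_keys.shape[2]
        return self.keys, self.values
```

In v1.0.0 neither the method nor the attribute existed, so the generation loop never touched the cache, which is why the flags were silent no-ops.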

Semantic embeddings (/v1/embeddings) now return correct vectors

The /v1/embeddings endpoint was returning input-token embeddings, which are not useful for semantic similarity. It now uses model.model(x) (last hidden state) as the primary path, falling back to embed_tokens then logits mean-pool.

Server sampling helper added

Added a module-level _sample_mx() temperature + nucleus-sampling helper used by the quantized KV cache generation path (it was referenced but not defined in v1.0.0).
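Temperature plus nucleus (top-p) sampling, the technique _sample_mx implements, can be sketched as follows. This is an illustrative numpy version, not the helper itself, which operates on MLX arrays:

```python
import numpy as np

def sample_temperature_top_p(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample a token id from a 1-D logits vector with temperature + top-p."""
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))            # greedy decoding
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, top_p)) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalise the nucleus
    return int(rng.choice(keep, p=kept))
```

With a sharply peaked distribution and a small top_p, the nucleus collapses to the single most likely token, which makes the behaviour easy to sanity-check.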

--log-level flag added end-to-end

Server verbosity was previously hardcoded. You can now pass it from squish run / squish serve and it is forwarded to the uvicorn process:

squish serve qwen3:8b --log-level debug
squish run qwen3:8b --log-level warning   # default

Accepted values: critical / error / warning / info / debug / trace

AWQ calibration exposed on squish compress

The AWQ activation-calibration pass was implemented internally but unreachable from the CLI. Two new flags fix that:

squish compress my-model --awq
squish compress my-model --awq --awq-samples 64

The command loads the full model, collects per-layer activation scales, and passes --awq-scales to the conversion subprocess automatically.
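Conceptually, the calibration pass reduces to tracking per-channel activation magnitudes across calibration samples. A hypothetical sketch (function name, shapes, and the max-abs statistic are assumptions for illustration, not Squish's internals):

```python
import numpy as np

def collect_activation_scales(layer_inputs):
    """Per-channel max-abs activation scale for one layer.

    layer_inputs: list of (tokens, hidden) activation matrices, one per
    calibration sample, as captured at the layer's input.
    """
    scales = None
    for x in layer_inputs:
        m = np.max(np.abs(x), axis=0)    # max |activation| per channel
        scales = m if scales is None else np.maximum(scales, m)
    return scales
```

Scales like these are what AWQ uses to decide which weight channels are activation-salient and should be protected during quantisation.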

INT4 help-text corrected

  • Disk savings corrected from ~50% to ~44%.
  • Added explicit warning: INT4 produces degenerate output on models smaller than 3B parameters. Use INT8 (--int8, the default) for 1.5B models.

Eval report corrected

eval_output/eval_report.md contained physically impossible benchmark numbers (+14.1% ARC, +15.2% HellaSwag after lossy compression). Replaced with validated results from a clean re-run and a clearly labelled validity-notice header.


Upgrade

brew upgrade wesleyscholl/squish/squish   # Homebrew
pip install --upgrade squish              # pip

Full changelog: https://github.com/wesleyscholl/squish/blob/main/CHANGELOG.md