Releases: am423/dflash-robot
dflash-robot v0.1.0
Initial release of dflash-robot, a GGUF-native DFlash speculative decoding runtime.
Highlights
- 2.41x mean speedup over autoregressive decoding on Qwen3.6-27B (90.90 vs 37.73 tok/s)
- Adapter-based architecture for expanding to any GGUF model with a compatible DFlash draft
- dflash_inspect CLI to check any GGUF model's DFlash compatibility
- 70 tests passing (58 C++ adapter/registry + 12 Python benchmark/registry)
Benchmark Results (RTX 3090)
| Task | Autoregressive (tok/s) | DFlash (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 37.54 | 108.42 | 2.89x |
| GSM8K | 37.79 | 72.61 | 1.92x |
| Math500 | 37.86 | 91.67 | 2.42x |
| Mean | 37.73 | 90.90 | 2.41x |
Draft model: z-lab/Qwen3.5-27B-DFlash. Note the cross-generation mismatch: this Qwen3.5 draft is paired with a Qwen3.6-27B target.
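The speedup column can be reproduced from the throughput columns. The short Python sketch below recomputes it from the figures in the table; the variable and function names are illustrative only and are not part of the dflash-robot benchmark scripts.

```python
# Throughput figures copied from the benchmark table above, in tok/s:
# task -> (autoregressive, DFlash)
results = {
    "HumanEval": (37.54, 108.42),
    "GSM8K": (37.79, 72.61),
    "Math500": (37.86, 91.67),
}


def speedup(ar_tps, dflash_tps):
    """Speedup is DFlash throughput divided by autoregressive throughput."""
    return dflash_tps / ar_tps


# Per-task speedup, rounded to two decimals as in the table.
per_task = {task: round(speedup(ar, df), 2) for task, (ar, df) in results.items()}

# Column means for the "Mean" row.
mean_ar = sum(ar for ar, _ in results.values()) / len(results)
mean_dflash = sum(df for _, df in results.values()) / len(results)
mean_speedup = sum(speedup(ar, df) for ar, df in results.values()) / len(results)

print(per_task)
print(round(mean_ar, 2), round(mean_dflash, 2), round(mean_speedup, 2))
```

For these numbers, the mean of the per-task speedups and the ratio of the mean throughputs both round to 2.41x.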
What's Included
- Luce DFlash import with upstream attribution
- ModelAdapter/DraftAdapter/RuntimeOrchestrator/CompatibilityRegistry interfaces
- qwen35 adapter and qwen3_dflash draft adapter
- dflash_inspect compatibility CLI
- DFlash paper-to-code mapping and architecture docs
- RTX 3090 baseline data and known DFlash model registry
- Benchmark comparison reports
- GitHub Actions CI
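To illustrate the idea behind a compatibility registry, here is a minimal Python sketch that records which draft adapters may be paired with which target adapters. It is not the project's API: the class, method, and field names are hypothetical, and using the adapter names `qwen35` and `qwen3_dflash` as lookup keys is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pairing:
    """A (target adapter, draft adapter) pair. Hypothetical type."""
    target: str
    draft: str


@dataclass
class CompatibilityRegistry:
    """Toy registry of known-compatible target/draft pairings (hypothetical)."""
    _pairs: set = field(default_factory=set)

    def register(self, target, draft):
        self._pairs.add(Pairing(target, draft))

    def is_compatible(self, target, draft):
        return Pairing(target, draft) in self._pairs


registry = CompatibilityRegistry()
# Adapter names taken from the release notes; the pairing key format is assumed.
registry.register("qwen35", "qwen3_dflash")

print(registry.is_compatible("qwen35", "qwen3_dflash"))
print(registry.is_compatible("llama", "qwen3_dflash"))
```

Keeping the pairing table separate from the adapters means a new model can be added by registering a pair, without touching the orchestration code.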
Known Limitations
- Supports Qwen3.6-27B only (via the qwen35 adapter); more adapters are planned for v0.2
- Batch size 1, greedy decoding only
- No draft training pipeline
- CUDA/NVIDIA only
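For context on the greedy-only limitation, the sketch below shows the textbook acceptance rule for greedy speculative decoding: drafted tokens are kept while they match the target model's argmax, and the target's own token is emitted at the first mismatch. This is a generic illustration, not dflash-robot's implementation; the function name and the list-based inputs are assumptions.

```python
def accept_greedy(draft_tokens, target_argmax):
    """Return the tokens to emit for one speculative step under greedy decoding.

    draft_tokens:  k token ids proposed by the draft model.
    target_argmax: k + 1 token ids, where entry i is the target model's greedy
                   choice given the prefix plus draft_tokens[:i].
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have one more entry than draft_tokens")

    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            break  # first disagreement: stop accepting drafted tokens
        accepted.append(tok)

    # Emit the target's token at the mismatch position, or the extra
    # "bonus" token when every drafted token was accepted.
    accepted.append(target_argmax[len(accepted)])
    return accepted


print(accept_greedy([5, 7, 9], [5, 7, 2, 4]))  # draft diverges at the third token
print(accept_greedy([5, 7, 9], [5, 7, 9, 4]))  # all drafted tokens accepted
```

Because every emitted token equals the target's greedy choice, the output matches plain autoregressive greedy decoding; the speedup comes from verifying several drafted tokens in a single target pass.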
What's Next (v0.2)
- Qwen3.6-35B-A3B support (a z-lab draft model already exists)
- Dense transformer adapter (Llama-3.1-8B-Instruct)
- Sampling/rejection beyond greedy
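As background for the sampling item, the following sketch shows the standard speculative-sampling acceptance rule from the speculative decoding literature: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(0, p - q). It is a generic illustration under that assumption, not a description of what v0.2 will implement; all names are hypothetical.

```python
import random


def speculative_accept(token, p_target, q_draft, rng):
    """Accept or replace one drafted token.

    token:    token id that was sampled from q_draft.
    p_target: target model probabilities over the vocabulary.
    q_draft:  draft model probabilities over the vocabulary.
    rng:      a random.Random instance.
    Returns (emitted_token, was_accepted).
    """
    p, q = p_target[token], q_draft[token]
    if q > 0 and rng.random() < min(1.0, p / q):
        return token, True

    # Rejected: resample from the part of the target distribution
    # that the draft under-covers.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate input (e.g. q was 0 for the drafted token while p == q
        # elsewhere): fall back to sampling from the target directly.
        residual = list(p_target)
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False


rng = random.Random(0)
print(speculative_accept(0, [0.5, 0.5], [0.5, 0.5], rng))  # identical distributions
print(speculative_accept(0, [0.0, 1.0], [1.0, 0.0], rng))  # target never picks token 0
```

This rule keeps the emitted tokens distributed according to the target model, which is what allows sampling-based decoding to benefit from drafting without changing output statistics.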