Releases: am423/dflash-robot

dflash-robot v0.1.0

05 May 02:51

Initial release of dflash-robot: GGUF-native DFlash speculative decoding runtime.

Highlights

  • 2.41x mean speedup over autoregressive decoding on Qwen3.6-27B (90.90 vs 37.73 tok/s)
  • Adapter-based architecture for expanding to any GGUF model with a compatible DFlash draft
  • dflash_inspect CLI to check any GGUF model's DFlash compatibility
  • 70 tests passing (58 C++ adapter/registry + 12 Python benchmark/registry)

Benchmark Results (RTX 3090)

Task        AR tok/s   DFlash tok/s   Speedup
HumanEval   37.54      108.42         2.89x
GSM8K       37.79      72.61          1.92x
Math500     37.86      91.67          2.42x
Mean        37.73      90.90          2.41x

Draft: z-lab/Qwen3.5-27B-DFlash. The draft is a Qwen3.5-generation model while the target is Qwen3.6-27B, so these results were measured with a cross-generation draft/target mismatch.
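
As a quick check on how the Speedup column relates to the throughput columns, the short Python sketch below recomputes it from the figures in the table. The variable and function names are illustrative only and are not part of dflash-robot; the sketch assumes speedup is simply DFlash tok/s divided by AR tok/s.

```python
# Throughput figures copied from the benchmark table above (tok/s, RTX 3090).
# Each entry is (autoregressive tok/s, DFlash tok/s).
results = {
    "HumanEval": (37.54, 108.42),
    "GSM8K": (37.79, 72.61),
    "Math500": (37.86, 91.67),
}


def speedup(ar_tps: float, dflash_tps: float) -> float:
    """Speedup of DFlash over autoregressive decoding (assumed: a plain ratio)."""
    return dflash_tps / ar_tps


# Per-task speedups, plus the column means reported in the "Mean" row.
per_task = {task: speedup(ar, df) for task, (ar, df) in results.items()}
mean_ar = sum(ar for ar, _ in results.values()) / len(results)
mean_dflash = sum(df for _, df in results.values()) / len(results)
mean_speedup = sum(per_task.values()) / len(per_task)

for task, value in per_task.items():
    print(f"{task:<10} {value:.2f}x")
print(f"{'Mean':<10} {mean_ar:.2f} -> {mean_dflash:.2f} tok/s, {mean_speedup:.2f}x")
```

With these inputs, the mean of the per-task speedups and the ratio of the mean throughputs both round to the same two-decimal value, so the Mean row is consistent under either reading.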

What's Included

  • Luce DFlash import with upstream attribution
  • ModelAdapter/DraftAdapter/RuntimeOrchestrator/CompatibilityRegistry interfaces
  • qwen35 adapter and qwen3_dflash draft adapter
  • dflash_inspect compatibility CLI
  • DFlash paper-to-code mapping and architecture docs
  • RTX 3090 baseline data and known DFlash model registry
  • Benchmark comparison reports
  • GitHub Actions CI

Known Limitations

  • Qwen3.6-27B only (qwen35 adapter); more adapters planned for v0.2
  • Batch size 1, greedy decoding only
  • No draft training pipeline
  • CUDA/NVIDIA only
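
To illustrate what "greedy decoding only" means for speculative decoding in general, the sketch below shows the textbook greedy acceptance rule: drafted tokens are kept for as long as they match the target model's argmax at each position, and the target then contributes one token of its own. This is a generic sketch under stated assumptions, not the dflash-robot implementation; the function name `verify_greedy` and its list-based inputs are hypothetical.

```python
from typing import List, Sequence


def verify_greedy(draft_tokens: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Generic greedy verification for speculative decoding (illustrative only).

    draft_tokens:  tokens proposed by the draft model for the next k positions.
    target_argmax: the target model's greedy token at each of those k positions,
                   plus one extra position after the last drafted token, so
                   len(target_argmax) == len(draft_tokens) + 1.
    Returns the tokens to append to the output for this step.
    """
    if len(target_argmax) != len(draft_tokens) + 1:
        raise ValueError("target_argmax must have exactly one more entry than draft_tokens")

    accepted: List[int] = []
    for position, token in enumerate(draft_tokens):
        if token != target_argmax[position]:
            break  # first disagreement: discard this and all later drafted tokens
        accepted.append(token)

    # The target always contributes one token: the correction at the first
    # mismatch, or a bonus token when every drafted token was accepted.
    accepted.append(target_argmax[len(accepted)])
    return accepted


# Draft agrees on the first two positions, then diverges.
print(verify_greedy([5, 7, 9, 2], [5, 7, 3, 8, 1]))
# Draft agrees everywhere, so the target's extra token is appended as well.
print(verify_greedy([5, 7], [5, 7, 4]))
```

Under this rule the emitted sequence is identical to what plain greedy autoregressive decoding of the target would produce; the speedup comes only from verifying several drafted positions in a single target pass.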

What's Next (v0.2)

  • Qwen3.6-35B-A3B support (z-lab draft exists)
  • Dense transformer adapter (Llama-3.1-8B-Instruct)
  • Sampling and rejection-based acceptance beyond greedy decoding