05 May 02:51

am423

dflash-robot v0.1.0 Latest

Latest

dflash-robot v0.1.0

Initial release of dflash-robot: GGUF-native DFlash speculative decoding runtime.

Highlights

2.41x mean speedup on Qwen3.6-27B vs autoregressive (90.90 vs 37.73 tok/s)
Adapter-based architecture for expanding to any GGUF model with a compatible DFlash draft
dflash_inspect CLI to check any GGUF model's DFlash compatibility
70 tests passing (58 C++ adapter/registry + 12 Python benchmark/registry)

Benchmark Results (RTX 3090)

Task	AR tok/s	DFlash tok/s	Speedup
HumanEval	37.54	108.42	2.89x
GSM8K	37.79	72.61	1.92x
Math500	37.86	91.67	2.42x
Mean	37.73	90.90	2.41x

Draft: z-lab/Qwen3.5-27B-DFlash (cross-generation mismatch with Qwen3.6-27B target).

What's Included

Luce DFlash import with upstream attribution
ModelAdapter/DraftAdapter/RuntimeOrchestrator/CompatibilityRegistry interfaces
qwen35 adapter and qwen3_dflash draft adapter
dflash_inspect compatibility CLI
DFlash paper-to-code mapping and architecture docs
RTX 3090 baseline data and known DFlash model registry
Benchmark comparison reports
GitHub Actions CI

Known Limitations

Qwen3.6-27B only (qwen35 adapter); more adapters planned for v0.2
Batch size 1, greedy decoding only
No draft training pipeline
CUDA/NVIDIA only

What's Next (v0.2)

Qwen3.6-35B-A3B support (z-lab draft exists)
Dense transformer adapter (Llama-3.1-8B-Instruct)
Sampling/rejection beyond greedy

Assets 2