Status (re-measured)
The original 3-5× negative-level decode gap is closed to 1.3-1.4×; the missing dedicated greedy strategy has landed. This issue now tracks the residual negative-level decode parity work only.
Current numbers (i9, decodecorpus-z000033, level_-1_fast):
| Stream source | Rust decode | C decode | Gap (was) |
|---|---|---|---|
| c_stream | 907 µs (1.05 GiB/s) | 647 µs (1.47 GiB/s) | 1.40× (was 3.69×) |
| rust_stream | 1.18 ms (824 MiB/s) | 895 µs (1.06 GiB/s) | 1.32× (was 5.35×) |
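
For reference, the gap column is the Rust decode time divided by the C decode time for the same stream source. A minimal sketch of that arithmetic, where `decode_gap` is a hypothetical helper name and not part of the codebase:

```rust
/// Hypothetical helper: ratio of Rust decode time to C decode time.
/// Both arguments must use the same unit (microseconds here).
fn decode_gap(rust_us: f64, c_us: f64) -> f64 {
    rust_us / c_us
}

/// Formats a gap the way the table does, e.g. "1.40×".
fn format_gap(gap: f64) -> String {
    format!("{:.2}×", gap)
}
```

Feeding in the measured times from the table (907 µs vs 647 µs, and 1180 µs vs 895 µs) reproduces the 1.40× and 1.32× entries.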
Landed since the original report
- Greedy strategy: dedicated `StrategyTag::Greedy` (L5) with `lazy_depth = 0` on the Row finder, matching the reference's own greedy shape (its greedy/lazy share the row-search template with depth 0). The L4 dfast outlier was separately closed by the donor greedy double-fast port.
- XXH64 frame checksum: gated on the frame's checksum flag (−61% on flag-off frames) and hashed per-block while cache-hot on the direct path; no longer a post-decode cold walk.
- Per-kernel decoders: four ISA tiers (Scalar/BMI2/AVX2/VBMI2) with monolithic per-tier sequence loops, BMI2 bit-reader specialization, per-tier match-copy chains.
- SIMD wildcopy / overshooting copies: donor-shape overshoot-tolerant copies with bounded tails per kernel.
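
To illustrate the overshoot-tolerant copy shape named in the last item, here is a minimal sketch in safe Rust. It is not the per-kernel implementation: the name `wildcopy_16` and the 16-byte chunk width are assumptions, and real kernels would presumably use SIMD loads and stores rather than slice copies. The idea shown is only that whole chunks are copied while both buffers have room for a full chunk (possibly writing past `len` into slack), and an exact bounded tail handles the rest.

```rust
/// Hypothetical sketch of an overshoot-tolerant copy with a bounded tail.
/// Copies at least `len` bytes from `src` into the start of `dst`.
/// May write up to CHUNK - 1 bytes beyond `len` when `dst` has slack.
fn wildcopy_16(dst: &mut [u8], src: &[u8], len: usize) {
    const CHUNK: usize = 16; // assumption: one 128-bit vector per step
    assert!(len <= src.len() && len <= dst.len());

    let mut i = 0;
    // Fast path: copy whole chunks while a full chunk fits in both buffers.
    // The final chunk may overshoot `len`; that is tolerated by design.
    while i < len && i + CHUNK <= dst.len() && i + CHUNK <= src.len() {
        dst[i..i + CHUNK].copy_from_slice(&src[i..i + CHUNK]);
        i += CHUNK;
    }
    // Bounded tail: no room for a full chunk, so copy exactly what remains.
    if i < len {
        dst[i..len].copy_from_slice(&src[i..len]);
    }
}
```

Callers must treat any bytes past `len` in `dst` as scratch, since the fast path may have overwritten them.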
Residual (the 1.3-1.4×)
The remaining gap is the known sequence-decode body delta vs the reference's single `bmi2.constprop` monolith (the HUF 4-stream burst on literal-heavy frames is the biggest single item on weak-compression fixtures). Tracked levers, in ROI order:
- HUF burst decompress port (one inlined monolith vs our 3-fn split) — order-of-magnitude self-time delta on literal-heavy frames.
- Sequence-loop body instruction diff vs the reference per-tier.
Kill-switch criteria
Stop pulling individual levers when both stream sources sit within ~1.1× of the reference on the negative-level corpus, or when a lever returns <2% twice in a row (record the negative result and move on).
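
The stop rule above can be sketched as a small predicate. This is one reading of the criteria, not existing tooling: `should_stop` is a hypothetical name, "~1.1×" is treated as a hard 1.10 cutoff, and lever gains are assumed to be recorded as fractions (0.02 = 2%), oldest first.

```rust
/// Hypothetical helper: decides whether to stop pulling individual levers.
/// `gaps` holds the current Rust/C decode ratio for each stream source.
/// `lever_gains` holds the fractional improvement of each lever tried, oldest first.
fn should_stop(gaps: &[f64], lever_gains: &[f64]) -> bool {
    const PARITY_TARGET: f64 = 1.10; // assumption: "~1.1×" read as a hard cutoff
    const MIN_GAIN: f64 = 0.02; // "<2%" expressed as a fraction

    // Criterion 1: every stream source is within the parity target.
    let at_parity = !gaps.is_empty() && gaps.iter().all(|&g| g <= PARITY_TARGET);

    // Criterion 2: the two most recent levers both returned less than 2%.
    let stalled = lever_gains.len() >= 2
        && lever_gains[lever_gains.len() - 2..]
            .iter()
            .all(|&g| g < MIN_GAIN);

    at_parity || stalled
}
```

With the current gaps of 1.40 and 1.32 and no recorded lever results, the predicate returns `false`, so work continues.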