Problem
The btopt / btultra / btultra2 optimal parser (levels 16-22) has two
distinct gaps vs the upstream reference, both rooted in the same per-position
DP / match-selection code:
- A. Speed — slower than upstream only on small compressible /
dictionary inputs; on large and on incompressible inputs we already match
or beat it.
- B. Ratio (L18-L22, synthetic fixtures) — on adversarial synthetic 1 MiB
fixtures our btultra / btultra2 output is ~1% larger than upstream. This
was previously believed absent ("ratio is fine everywhere"); it is real but
small and fixture-specific — synthetic patterns only, real corpus files stay
<= upstream.
A. Speed gap
Dashboard outliers (x86_64-gnu, L22 btultra2):
small-4k-log-lines compress (no dict): 0.7523 (-24.77%)
small-10k-random compress-dict: 0.3734 (-62.66%)
i9 bench host:
small-1k-random compress (no dict): 2.9x faster (we win)
high-entropy-1m compress (no dict): 11.5x faster (we win)
small-10k-random compress-dict: rust 1.465 ms vs ref 0.485 ms = 3.0x
slower — the optimal-parser per-position cost is the whole residual (the
CDict-equivalent + zero-alloc + dms-cache work already landed).
Profile (encode_loop_dict, non-dict L22 logs4096, 59k samples):
build_optimal_plan_impl ~37%, collect_optimal_candidates_initialized
~16%, bt_insert_and_collect_matches ~10%, hash3_candidate ~5.6%
=> optimal parser ~70% (inherent L22 work).
__memset_avx2 7% = per-frame matcher-table zeroing on reset.
incompressible::update_sample_metrics 3.6% = the raw-fast-path pre-parse
sample (do NOT remove: it is exactly what gives the incompressible-small win).
- entropy (FSE / Huffman build) ~5%.
B. Ratio gap (L18-L22, synthetic fixtures)
compare_ffi REPORT (rust_bytes vs ffi_bytes, lower = better; i9/M1):
| fixture |
level |
rust |
ffi |
verdict |
| decodecorpus-synthetic-1m |
L16 btopt |
233 |
234 |
<= ok |
| decodecorpus-synthetic-1m |
L18 btultra |
231 |
228 |
+3 |
| decodecorpus-synthetic-1m |
L22 btultra2 |
231 |
228 |
+3 |
| decodecorpus-synthetic-1m |
L22 +ldm |
232 |
228 |
+4 |
| low-entropy-1m |
L16 btopt |
148 |
146 |
+2 |
| low-entropy-1m |
L18 btultra |
146 |
143 |
+3 |
| low-entropy-1m |
L22 btultra2 |
146 |
143 |
+3 |
| large-log-stream (100 MB) |
L18 btultra |
8944 |
8953 |
<= ok |
| large-log-stream (100 MB) |
L22 btultra2 |
8943 |
8940 |
+3 |
Characteristics:
- PRE-EXISTING — identical on the state before the block pre-splitter
sampling-tier fix (confirmed by an A/B at the commit just before it), so it
is NOT a pre-splitter regression; it is the optimal parser itself.
- Fixture-specific — only the SYNTHETIC adversarial fixtures
(decodecorpus-synthetic, low-entropy patterned). On the real decode-corpus
file z000033 L22 stays <= upstream
(cross_validation::level22_stays_within_ffi_level22_on_corpus_proxy passes).
- Small — ~1-2% on tiny (~230-byte) and 100 MB outputs alike.
- Same code as A — the btultra / btultra2 DP parser makes slightly worse
match choices than upstream on these patterns.
Levers
-
Optimal-parser per-position cost (~70%, the speed band-mover).
build_optimal_plan / collect_optimal_candidates / cost-model is already
structured like upstream btultra2 (const-folded optLevel,
generation-stamped price caches, seed pass, frontier DP). The residual is
instruction-level — asm diff of the per-position pipeline vs upstream's
bmi2-specialised inner loop. Brings the small-input SPEED metric back into
band.
-
L18-L22 ratio (B). Dump our btultra / btultra2 sequence stream vs
upstream's on decodecorpus-synthetic-1m + low-entropy-1m at L22, diff
the first divergence, attribute it to a price-model / sufficient_match_len
/ candidate-pruning difference vs upstream. Likely overlaps with lever 1
(same DP code). Gate: rust_bytes <= ffi_bytes on both synthetic fixtures
at L18-L22.
-
Per-frame table zeroing (7%, speed). MatchTable::reset fills
hash_table + chain_table + hash3_table every frame. Either right-size
the small-input tables to upstream's ZSTD_adjustCParams shape (blocked by
the deliberate 16 KiB hinted-window floor — ratio-sensitive, per-config
validation), or index-rebase instead of zeroing (match-table surgery,
roundtrip-gated).
-
Dictionary per-frame prime — DONE (prepared-dict snapshot/restore +
cached BT dms tree + zero steady-state allocations/frame).
Tooling
zstd/examples/encode_loop_dict.rs — clean per-frame profile (dict /
no-dict, reused context). logs<N> reproduces the *-log-lines fixtures
byte-for-byte.
- For the ratio gap:
STRUCTURED_ZSTD_EMIT_REPORT=1 on compare_ffi prints
rust_bytes vs ffi_bytes per fixture/level.
- Use these, not the
compare_ffi dict flamegraph (training / decode
pollution).
Out of scope
P0 "incompressible early-out": measured non-issue — we already beat upstream
on incompressible L22; the raw-fast-path gate is correct as-is, do not touch.
Estimate
- Lever 1 (speed instruction-parity): ~3-4d (deep asm-diff, multi-pass).
- Lever 2 (L18-L22 ratio): ~1d (sequence diff + price-model fix).
- Lever 3 (table zeroing): ~1d 4h.
Total ~5-6d.
Problem
The btopt / btultra / btultra2 optimal parser (levels 16-22) has two
distinct gaps vs the upstream reference, both rooted in the same per-position
DP / match-selection code:
dictionary inputs; on large and on incompressible inputs we already match
or beat it.
fixtures our btultra / btultra2 output is ~1% larger than upstream. This
was previously believed absent ("ratio is fine everywhere"); it is real but
small and fixture-specific — synthetic patterns only, real corpus files stay
<=upstream.A. Speed gap
Dashboard outliers (x86_64-gnu, L22
btultra2):small-4k-log-linescompress (no dict):0.7523(-24.77%)small-10k-randomcompress-dict:0.3734(-62.66%)i9 bench host:
small-1k-randomcompress (no dict): 2.9x faster (we win)high-entropy-1mcompress (no dict): 11.5x faster (we win)small-10k-randomcompress-dict: rust 1.465 ms vs ref 0.485 ms = 3.0xslower — the optimal-parser per-position cost is the whole residual (the
CDict-equivalent + zero-alloc + dms-cache work already landed).
Profile (
encode_loop_dict, non-dict L22logs4096, 59k samples):build_optimal_plan_impl~37%,collect_optimal_candidates_initialized~16%,
bt_insert_and_collect_matches~10%,hash3_candidate~5.6%=> optimal parser ~70% (inherent L22 work).
__memset_avx27% = per-frame matcher-table zeroing on reset.incompressible::update_sample_metrics3.6% = the raw-fast-path pre-parsesample (do NOT remove: it is exactly what gives the incompressible-small win).
B. Ratio gap (L18-L22, synthetic fixtures)
compare_ffiREPORT (rust_bytesvsffi_bytes, lower = better; i9/M1):<=ok<=okCharacteristics:
sampling-tier fix (confirmed by an A/B at the commit just before it), so it
is NOT a pre-splitter regression; it is the optimal parser itself.
(decodecorpus-synthetic, low-entropy patterned). On the real decode-corpus
file
z000033L22 stays<=upstream(
cross_validation::level22_stays_within_ffi_level22_on_corpus_proxypasses).match choices than upstream on these patterns.
Levers
Optimal-parser per-position cost (~70%, the speed band-mover).
build_optimal_plan/collect_optimal_candidates/ cost-model is alreadystructured like upstream
btultra2(const-folded optLevel,generation-stamped price caches, seed pass, frontier DP). The residual is
instruction-level — asm diff of the per-position pipeline vs upstream's
bmi2-specialised inner loop. Brings the small-input SPEED metric back intoband.
L18-L22 ratio (B). Dump our btultra / btultra2 sequence stream vs
upstream's on
decodecorpus-synthetic-1m+low-entropy-1mat L22, diffthe first divergence, attribute it to a price-model /
sufficient_match_len/ candidate-pruning difference vs upstream. Likely overlaps with lever 1
(same DP code). Gate:
rust_bytes <= ffi_byteson both synthetic fixturesat L18-L22.
Per-frame table zeroing (7%, speed).
MatchTable::resetfillshash_table+chain_table+hash3_tableevery frame. Either right-sizethe small-input tables to upstream's
ZSTD_adjustCParamsshape (blocked bythe deliberate 16 KiB hinted-window floor — ratio-sensitive, per-config
validation), or index-rebase instead of zeroing (match-table surgery,
roundtrip-gated).
Dictionary per-frame prime— DONE (prepared-dict snapshot/restore +cached BT dms tree + zero steady-state allocations/frame).
Tooling
zstd/examples/encode_loop_dict.rs— clean per-frame profile (dict /no-dict, reused context).
logs<N>reproduces the*-log-linesfixturesbyte-for-byte.
STRUCTURED_ZSTD_EMIT_REPORT=1oncompare_ffiprintsrust_bytesvsffi_bytesper fixture/level.compare_ffidict flamegraph (training / decodepollution).
Out of scope
P0 "incompressible early-out": measured non-issue — we already beat upstream
on incompressible L22; the raw-fast-path gate is correct as-is, do not touch.
Estimate
Total ~5-6d.