Skip to content

perf(encode): optimal parser (L16-L22) — small-input speed + L18-L22 synthetic-fixture ratio gap #337

Description

@polaz

Problem

The btopt / btultra / btultra2 optimal parser (levels 16-22) has two
distinct gaps vs the upstream reference, both rooted in the same per-position
DP / match-selection code:

  • A. Speed — slower than upstream only on small compressible /
    dictionary inputs
    ; on large and on incompressible inputs we already match
    or beat it.
  • B. Ratio (L18-L22, synthetic fixtures) — on adversarial synthetic 1 MiB
    fixtures our btultra / btultra2 output is ~1% larger than upstream. This
    was previously believed absent ("ratio is fine everywhere"); it is real but
    small and fixture-specific — synthetic patterns only, real corpus files stay
    <= upstream.

A. Speed gap

Dashboard outliers (x86_64-gnu, L22 btultra2):

  • small-4k-log-lines compress (no dict): 0.7523 (-24.77%)
  • small-10k-random compress-dict: 0.3734 (-62.66%)

i9 bench host:

  • small-1k-random compress (no dict): 2.9x faster (we win)
  • high-entropy-1m compress (no dict): 11.5x faster (we win)
  • small-10k-random compress-dict: rust 1.465 ms vs ref 0.485 ms = 3.0x
    slower
    — the optimal-parser per-position cost is the whole residual (the
    CDict-equivalent + zero-alloc + dms-cache work already landed).

Profile (encode_loop_dict, non-dict L22 logs4096, 59k samples):

  • build_optimal_plan_impl ~37%, collect_optimal_candidates_initialized
    ~16%, bt_insert_and_collect_matches ~10%, hash3_candidate ~5.6%
    => optimal parser ~70% (inherent L22 work).
  • __memset_avx2 7% = per-frame matcher-table zeroing on reset.
  • incompressible::update_sample_metrics 3.6% = the raw-fast-path pre-parse
    sample (do NOT remove: it is exactly what gives the incompressible-small win).
  • entropy (FSE / Huffman build) ~5%.

B. Ratio gap (L18-L22, synthetic fixtures)

compare_ffi REPORT (rust_bytes vs ffi_bytes, lower = better; i9/M1):

fixture level rust ffi verdict
decodecorpus-synthetic-1m L16 btopt 233 234 <= ok
decodecorpus-synthetic-1m L18 btultra 231 228 +3
decodecorpus-synthetic-1m L22 btultra2 231 228 +3
decodecorpus-synthetic-1m L22 +ldm 232 228 +4
low-entropy-1m L16 btopt 148 146 +2
low-entropy-1m L18 btultra 146 143 +3
low-entropy-1m L22 btultra2 146 143 +3
large-log-stream (100 MB) L18 btultra 8944 8953 <= ok
large-log-stream (100 MB) L22 btultra2 8943 8940 +3

Characteristics:

  • PRE-EXISTING — identical on the state before the block pre-splitter
    sampling-tier fix (confirmed by an A/B at the commit just before it), so it
    is NOT a pre-splitter regression; it is the optimal parser itself.
  • Fixture-specific — only the SYNTHETIC adversarial fixtures
    (decodecorpus-synthetic, low-entropy patterned). On the real decode-corpus
    file z000033 L22 stays <= upstream
    (cross_validation::level22_stays_within_ffi_level22_on_corpus_proxy passes).
  • Small — ~1-2% on tiny (~230-byte) and 100 MB outputs alike.
  • Same code as A — the btultra / btultra2 DP parser makes slightly worse
    match choices than upstream on these patterns.

Levers

  1. Optimal-parser per-position cost (~70%, the speed band-mover).
    build_optimal_plan / collect_optimal_candidates / cost-model is already
    structured like upstream btultra2 (const-folded optLevel,
    generation-stamped price caches, seed pass, frontier DP). The residual is
    instruction-level — asm diff of the per-position pipeline vs upstream's
    bmi2-specialised inner loop. Brings the small-input SPEED metric back into
    band.

  2. L18-L22 ratio (B). Dump our btultra / btultra2 sequence stream vs
    upstream's on decodecorpus-synthetic-1m + low-entropy-1m at L22, diff
    the first divergence, attribute it to a price-model / sufficient_match_len
    / candidate-pruning difference vs upstream. Likely overlaps with lever 1
    (same DP code). Gate: rust_bytes <= ffi_bytes on both synthetic fixtures
    at L18-L22.

  3. Per-frame table zeroing (7%, speed). MatchTable::reset fills
    hash_table + chain_table + hash3_table every frame. Either right-size
    the small-input tables to upstream's ZSTD_adjustCParams shape (blocked by
    the deliberate 16 KiB hinted-window floor — ratio-sensitive, per-config
    validation), or index-rebase instead of zeroing (match-table surgery,
    roundtrip-gated).

  4. Dictionary per-frame primeDONE (prepared-dict snapshot/restore +
    cached BT dms tree + zero steady-state allocations/frame).


Tooling

  • zstd/examples/encode_loop_dict.rs — clean per-frame profile (dict /
    no-dict, reused context). logs<N> reproduces the *-log-lines fixtures
    byte-for-byte.
  • For the ratio gap: STRUCTURED_ZSTD_EMIT_REPORT=1 on compare_ffi prints
    rust_bytes vs ffi_bytes per fixture/level.
  • Use these, not the compare_ffi dict flamegraph (training / decode
    pollution).

Out of scope

P0 "incompressible early-out": measured non-issue — we already beat upstream
on incompressible L22; the raw-fast-path gate is correct as-is, do not touch.

Estimate

  • Lever 1 (speed instruction-parity): ~3-4d (deep asm-diff, multi-pass).
  • Lever 2 (L18-L22 ratio): ~1d (sequence diff + price-model fix).
  • Lever 3 (table zeroing): ~1d 4h.

Total ~5-6d.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2-mediumMedium priority — important improvementenhancementNew feature or requestperformancePerformance optimization

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions