perf(encode): optimal parser (L16-L22) — small-input speed + L18-L22 synthetic-fixture ratio gap

## Problem

The btopt / btultra / btultra2 **optimal parser** (levels 16-22) has two
distinct gaps vs the upstream reference, both rooted in the same per-position
DP / match-selection code:

- **A. Speed** — slower than upstream **only on small compressible /
  dictionary inputs**; on large and on incompressible inputs we already match
  or beat it.
- **B. Ratio (L18-L22, synthetic fixtures)** — on adversarial synthetic 1 MiB
  fixtures our btultra / btultra2 output is ~1% **larger** than upstream. This
  was previously believed absent ("ratio is fine everywhere"); it is real but
  small and fixture-specific — synthetic patterns only, real corpus files stay
  `<=` upstream.

---

## A. Speed gap

Dashboard outliers (x86_64-gnu, L22 `btultra2`):
- `small-4k-log-lines` compress (no dict): `0.7523` (-24.77%)
- `small-10k-random` compress-dict: `0.3734` (-62.66%)

i9 bench host:
- `small-1k-random` compress (no dict): **2.9x faster** (we win)
- `high-entropy-1m` compress (no dict): **11.5x faster** (we win)
- `small-10k-random` compress-dict: rust 1.465 ms vs ref 0.485 ms = **3.0x
  slower** — the optimal-parser per-position cost is the whole residual (the
  CDict-equivalent + zero-alloc + dms-cache work already landed).

Profile (`encode_loop_dict`, non-dict L22 `logs4096`, 59k samples):
- `build_optimal_plan_impl` ~37%, `collect_optimal_candidates_initialized`
  ~16%, `bt_insert_and_collect_matches` ~10%, `hash3_candidate` ~5.6%
  => **optimal parser ~70%** (inherent L22 work).
- `__memset_avx2` **7%** = per-frame matcher-table zeroing on reset.
- `incompressible::update_sample_metrics` 3.6% = the raw-fast-path pre-parse
  sample (do NOT remove: it is exactly what gives the incompressible-small win).
- entropy (FSE / Huffman build) ~5%.

---

## B. Ratio gap (L18-L22, synthetic fixtures)

`compare_ffi` REPORT (`rust_bytes` vs `ffi_bytes`, lower = better; i9/M1):

| fixture | level | rust | ffi | verdict |
|---|---|---|---|---|
| decodecorpus-synthetic-1m | L16 btopt    | 233 | 234 | `<=` ok |
| decodecorpus-synthetic-1m | L18 btultra  | 231 | 228 | +3 |
| decodecorpus-synthetic-1m | L22 btultra2 | 231 | 228 | +3 |
| decodecorpus-synthetic-1m | L22 +ldm     | 232 | 228 | +4 |
| low-entropy-1m            | L16 btopt    | 148 | 146 | +2 |
| low-entropy-1m            | L18 btultra  | 146 | 143 | +3 |
| low-entropy-1m            | L22 btultra2 | 146 | 143 | +3 |
| large-log-stream (100 MB) | L18 btultra  | 8944 | 8953 | `<=` ok |
| large-log-stream (100 MB) | L22 btultra2 | 8943 | 8940 | +3 |

Characteristics:
- **PRE-EXISTING** — identical on the state before the block pre-splitter
  sampling-tier fix (confirmed by an A/B at the commit just before it), so it
  is NOT a pre-splitter regression; it is the optimal parser itself.
- **Fixture-specific** — only the SYNTHETIC adversarial fixtures
  (decodecorpus-synthetic, low-entropy patterned). On the real decode-corpus
  file `z000033` L22 stays `<=` upstream
  (`cross_validation::level22_stays_within_ffi_level22_on_corpus_proxy` passes).
- **Small** — ~1-2% on tiny (~230-byte) and 100 MB outputs alike.
- **Same code as A** — the btultra / btultra2 DP parser makes slightly worse
  match choices than upstream on these patterns.

---

## Levers

1. **Optimal-parser per-position cost (~70%, the speed band-mover).**
   `build_optimal_plan` / `collect_optimal_candidates` / cost-model is already
   structured like upstream `btultra2` (const-folded optLevel,
   generation-stamped price caches, seed pass, frontier DP). The residual is
   instruction-level — asm diff of the per-position pipeline vs upstream's
   `bmi2`-specialised inner loop. Brings the small-input SPEED metric back into
   band.

2. **L18-L22 ratio (B).** Dump our btultra / btultra2 sequence stream vs
   upstream's on `decodecorpus-synthetic-1m` + `low-entropy-1m` at L22, diff
   the first divergence, attribute it to a price-model / `sufficient_match_len`
   / candidate-pruning difference vs upstream. Likely overlaps with lever 1
   (same DP code). Gate: `rust_bytes <= ffi_bytes` on both synthetic fixtures
   at L18-L22.

3. **Per-frame table zeroing (7%, speed).** `MatchTable::reset` fills
   `hash_table` + `chain_table` + `hash3_table` every frame. Either right-size
   the small-input tables to upstream's `ZSTD_adjustCParams` shape (blocked by
   the deliberate 16 KiB hinted-window floor — ratio-sensitive, per-config
   validation), or index-rebase instead of zeroing (match-table surgery,
   roundtrip-gated).

4. ~~Dictionary per-frame prime~~ — **DONE** (prepared-dict snapshot/restore +
   cached BT dms tree + zero steady-state allocations/frame).

---

## Tooling

- `zstd/examples/encode_loop_dict.rs` — clean per-frame profile (dict /
  no-dict, reused context). `logs<N>` reproduces the `*-log-lines` fixtures
  byte-for-byte.
- For the ratio gap: `STRUCTURED_ZSTD_EMIT_REPORT=1` on `compare_ffi` prints
  `rust_bytes` vs `ffi_bytes` per fixture/level.
- Use these, not the `compare_ffi` dict flamegraph (training / decode
  pollution).

## Out of scope

P0 "incompressible early-out": measured non-issue — we already beat upstream
on incompressible L22; the raw-fast-path gate is correct as-is, do not touch.

## Estimate

- Lever 1 (speed instruction-parity): ~3-4d (deep asm-diff, multi-pass).
- Lever 2 (L18-L22 ratio): ~1d (sequence diff + price-model fix).
- Lever 3 (table zeroing): ~1d 4h.

Total ~5-6d.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(encode): optimal parser (L16-L22) — small-input speed + L18-L22 synthetic-fixture ratio gap #337

Problem

A. Speed gap

B. Ratio gap (L18-L22, synthetic fixtures)

Levers

Tooling

Out of scope

Estimate

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

fixture	level	rust	ffi	verdict
decodecorpus-synthetic-1m	L16 btopt	233	234	`<=` ok
decodecorpus-synthetic-1m	L18 btultra	231	228	+3
decodecorpus-synthetic-1m	L22 btultra2	231	228	+3
decodecorpus-synthetic-1m	L22 +ldm	232	228	+4
low-entropy-1m	L16 btopt	148	146	+2
low-entropy-1m	L18 btultra	146	143	+3
low-entropy-1m	L22 btultra2	146	143	+3
large-log-stream (100 MB)	L18 btultra	8944	8953	`<=` ok
large-log-stream (100 MB)	L22 btultra2	8943	8940	+3

Uh oh!

perf(encode): optimal parser (L16-L22) — small-input speed + L18-L22 synthetic-fixture ratio gap #337

Description

Problem

A. Speed gap

B. Ratio gap (L18-L22, synthetic fixtures)

Levers

Tooling

Out of scope

Estimate

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions