
refactor: pluggable wire encoders and remove Pipeline::Lzr #118

Merged: ChrisLundquist merged 2 commits into master from claude/recursing-vaughan on Mar 10, 2026

Conversation

@ChrisLundquist
Owner

Summary

  • Pluggable wire encoders: New TokenEncoder trait with 3 implementations (Lz77Encoder, LzSeqEncoder, LzssEncoder). Match finding is decoupled from wire encoding via a universal LzToken type and tokenize() entry point.
  • Remove Pipeline::Lzr: After the wire encoder refactor, Lzr (LZ77 + rANS) was identical to LzSeqR. Pipeline ID 3 is reserved with a tombstone comment to prevent reuse.
  • Upgrade Lzf wire encoding: Lzf switched from LzDemuxer::Lz77 (3 streams, ~41% ratio) to LzDemuxer::LzSeq (6 streams, ~32% ratio).
  • Simplify SortLz: Replaced hand-rolled FSE stream management with LzSeqEncoder, removing ~164 lines of duplicate encode/decode logic. Wire format v2.
  • max_match_len propagation: Non-Deflate LZ pipelines auto-default to u16::MAX via adjusted_options(), and the value is threaded through SeqConfig → HashChainFinder.
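
The max_match_len defaulting described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Options` struct, the `Pipeline` variants listed, the signature of `adjusted_options`, and the choice to leave an explicit caller value untouched are all assumptions. The only facts used are that non-Deflate pipelines default to `u16::MAX` and that DEFLATE caps match length at 258 bytes.

```rust
/// Hypothetical subset of the pipelines; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pipeline {
    Deflate,
    Lzf,
    LzSeqR,
    SortLz,
}

/// Hypothetical options struct; `None` means "caller did not choose".
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Options {
    pub max_match_len: Option<u16>,
}

/// DEFLATE's format limit on match length (RFC 1951).
const DEFLATE_MAX_MATCH: u16 = 258;

/// Fill in a pipeline-appropriate default when the caller left
/// `max_match_len` unset. An explicit value is kept as-is (assumption).
pub fn adjusted_options(pipeline: Pipeline, opts: Options) -> Options {
    if opts.max_match_len.is_some() {
        return opts;
    }
    let default = match pipeline {
        Pipeline::Deflate => DEFLATE_MAX_MATCH,
        // Non-Deflate LZ pipelines are not bound by the 258-byte limit.
        _ => u16::MAX,
    };
    Options {
        max_match_len: Some(default),
    }
}
```

Resolving the default in one place means the match finder only ever sees a concrete limit, rather than each pipeline re-deriving it.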

Key files

| File | Change |
|---|---|
| src/lz_token.rs | NEW: LzToken, EncodedStreams, TokenEncoder trait, 3 encoder impls |
| src/lzseq/mod.rs | encode_from_tokens(), max_match_len in SeqConfig |
| src/pipeline/demux.rs | encoder_for_demuxer() dispatch, demux_lz77_matches returns PzResult |
| src/pipeline/mod.rs | tokenize() replaces lz77_matches_with_backend(), Lzr removed |
| src/sortlz.rs | Uses LzSeqEncoder for wire encoding (-164 lines) |
| 27 more files | Lzr removal from tests, CLI, benchmarks, scripts, examples, fuzz |

Test plan

  • cargo clippy --all-targets — zero warnings
  • cargo test — 706 tests pass, 0 failures
  • Pre-commit hooks pass (fmt, clippy, test)
  • Benchmarked: Lzr/Lzf ratio improved from ~41% to ~32%, throughput unchanged
  • Verify no regressions on CI

🤖 Generated with Claude Code

ChrisLundquist and others added 2 commits March 10, 2026 01:52
Two interleaved changes:

1. Pluggable wire encoders: New `src/lz_token.rs` with universal `LzToken`
   type, `TokenEncoder` trait, and three encoder implementations:
   - `Lz77Encoder`: DEFLATE-compatible 3-stream format
   - `LzSeqEncoder`: log2-coded 6-stream format (best ratio)
   - `LzssEncoder`: flag-based 4-stream format

   Match finders now produce `Vec<LzToken>` via `tokenize()`, and encoders
   convert token streams to independent byte streams for entropy coding.
   This decouples match finding from wire encoding.
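
   To illustrate the decoupling described above, here is a minimal sketch of a token type, an encoder trait, and one toy encoder. The names `LzToken`, `EncodedStreams`, and `TokenEncoder` come from this PR, but their fields and method signatures here are assumptions, and `ToyFlagEncoder` is a hypothetical stand-in, not the repo's `LzssEncoder`.

```rust
/// Universal token emitted by match finding (field layout is assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LzToken {
    Literal(u8),
    Match { offset: u32, length: u32 },
}

/// Independent byte streams handed to the entropy coder (layout is assumed).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct EncodedStreams {
    pub streams: Vec<Vec<u8>>,
}

/// Wire encoders only see tokens; they never run match finding themselves.
pub trait TokenEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams;
}

/// Hypothetical flag-based encoder producing four streams:
/// flags, literals, offsets, lengths.
pub struct ToyFlagEncoder;

impl TokenEncoder for ToyFlagEncoder {
    fn encode(&self, tokens: &[LzToken]) -> EncodedStreams {
        let mut flags = Vec::new();
        let mut literals = Vec::new();
        let mut offsets = Vec::new();
        let mut lengths = Vec::new();
        for token in tokens {
            match *token {
                LzToken::Literal(byte) => {
                    flags.push(0);
                    literals.push(byte);
                }
                LzToken::Match { offset, length } => {
                    flags.push(1);
                    // Fixed-width little-endian fields keep the sketch simple.
                    offsets.extend_from_slice(&offset.to_le_bytes());
                    lengths.extend_from_slice(&length.to_le_bytes());
                }
            }
        }
        EncodedStreams {
            streams: vec![flags, literals, offsets, lengths],
        }
    }
}

/// Any encoder can be swapped in without touching the token producer.
pub fn encode_with(encoder: &dyn TokenEncoder, tokens: &[LzToken]) -> EncodedStreams {
    encoder.encode(tokens)
}
```

   With this shape, changing a pipeline's wire format amounts to passing a different `TokenEncoder` to the same token stream.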

2. Remove Pipeline::Lzr: After the wire encoder refactor, Lzr became
   identical to LzSeqR (same demuxer, match finder, wire encoder, and
   entropy coder). Removed from enum, dispatch tables, CLI, tests,
   benchmarks, examples, scripts, and fuzz targets. Pipeline ID 3
   reserved with tombstone comment.
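
   A tombstoned pipeline ID might look like the sketch below. Only the fact that ID 3 belonged to Lzr and is reserved comes from this PR; the other variants' numeric IDs, the `from_id` helper, and the error messages are hypothetical.

```rust
/// Pipeline IDs as stored in the container header.
/// All numeric values except 3 are assumed for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Pipeline {
    Deflate = 0,
    Lzf = 1,
    LzSeqR = 2,
    // 3 = Lzr (removed: identical to LzSeqR after the wire encoder refactor).
    // Reserved. Do not reuse, or old streams would decode with the wrong pipeline.
    SortLz = 4,
}

impl Pipeline {
    /// Decode a header byte, rejecting the tombstoned ID explicitly.
    pub fn from_id(id: u8) -> Result<Pipeline, String> {
        match id {
            0 => Ok(Pipeline::Deflate),
            1 => Ok(Pipeline::Lzf),
            2 => Ok(Pipeline::LzSeqR),
            3 => Err("pipeline ID 3 (Lzr) was removed and is reserved".to_string()),
            4 => Ok(Pipeline::SortLz),
            other => Err(format!("unknown pipeline ID {}", other)),
        }
    }
}
```

   Keeping the gap in the numbering, rather than renumbering later variants, is what prevents a future pipeline from silently claiming the old ID.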

Additionally, Lzf's demuxer switches from Lz77 to LzSeq, upgrading its
compression ratio from ~41% to ~32% on typical data.

Wire format break (pre-1.0): SortLz now uses LzSeq-encoded streams + FSE
instead of hand-rolled flag/offset/length FSE streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add debug_assert for input_pos bounds in Lz77Encoder::encode
- Wire SeqConfig.max_match_len through encode_from_tokens
- Return PzResult from demux_lz77_matches instead of panicking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
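
The third review fix, returning a result instead of panicking on malformed input, follows a common pattern that can be sketched as below. `PzResult` is named in this PR, but the `PzError` enum, the `demux_records` function, and the 5-byte record layout are assumptions made for illustration and do not reflect the actual `demux_lz77_matches` signature.

```rust
/// Hypothetical error type; the crate's real error enum may differ.
#[derive(Debug, PartialEq, Eq)]
pub enum PzError {
    InvalidInput,
}

pub type PzResult<T> = Result<T, PzError>;

/// Assumed record layout: 2-byte offset, 2-byte length, 1-byte literal.
const RECORD_SIZE: usize = 5;

/// Split fixed-size records into three streams (offsets, lengths, literals).
/// Malformed input yields an error instead of a panic, so callers such as
/// fuzz targets can handle it gracefully.
pub fn demux_records(input: &[u8]) -> PzResult<[Vec<u8>; 3]> {
    if input.len() % RECORD_SIZE != 0 {
        return Err(PzError::InvalidInput);
    }
    let mut offsets = Vec::new();
    let mut lengths = Vec::new();
    let mut literals = Vec::new();
    for record in input.chunks_exact(RECORD_SIZE) {
        offsets.extend_from_slice(&record[0..2]);
        lengths.extend_from_slice(&record[2..4]);
        literals.push(record[4]);
    }
    Ok([offsets, lengths, literals])
}
```
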
@ChrisLundquist ChrisLundquist merged commit d6736d6 into master Mar 10, 2026
4 checks passed
ChrisLundquist added a commit that referenced this pull request Mar 12, 2026
Add architecture section documenting the unified token pipeline (PR #118),
active/removed pipelines table, and Silesia corpus benchmark data. Update
project layout to reflect lz_token.rs and removed modules. Update dead ends
with streaming path bottleneck finding and LzSeqR routing bug (PR #120).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ChrisLundquist added a commit that referenced this pull request Mar 12, 2026
* bench: enable parallel, large, and webgpu benchmarks for Lzfi

Lzfi was only benchmarked on the small Canterbury corpus with no
parallel, large-file, or WebGPU variants. Enable all modes to match
the LzSeqR and Lzf benchmark coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update CLAUDE.md with architecture overview and Silesia benchmarks

Add architecture section documenting the unified token pipeline (PR #118),
active/removed pipelines table, and Silesia corpus benchmark data. Update
project layout to reflect lz_token.rs and removed modules. Update dead ends
with streaming path bottleneck finding and LzSeqR routing bug (PR #120).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>