feat: add Streaming decoder — wuffs-style single-code architecture by lilith · Pull Request #69 · image-rs/weezl

lilith · 2026-04-09T07:51:54Z

Summary

Adds a TableStrategy enum to Configuration with a new Streaming decoder backend: a wuffs-style single-code-per-iteration decoder with a PreQ+SufQ(Q=8) table layout and mini-burst literal/short-copy fast paths.

Supersedes #65 — Streaming beats Chunked on every tested workload. There is no need for a separate Chunked strategy.
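For readers unfamiliar with the table layout, here is a minimal sketch of a prefix + 8-byte-suffix table, loosely modeled on the wuffs LZW layout. It is an illustration under stated assumptions, not the code in this PR: the names `Table`, `lm1`, `prefix`, `suffix`, `add`, `emit`, `decode` and `encode` are hypothetical, and the code stream is simplified to already-unpacked `u16` codes with no bit packing, clear codes, or end codes.

```rust
use std::collections::HashMap;

const Q: usize = 8; // suffix chunk width (the "Q=8" above)
const MAX_CODES: usize = 4096; // 12-bit LZW code space

/// Hypothetical table: each code stores its last (up to) Q bytes inline and
/// a link to the code that holds the bytes before that chunk.
struct Table {
    lm1: Vec<u16>,        // string length minus one
    prefix: Vec<u16>,     // code owning the preceding full chunk(s)
    suffix: Vec<[u8; Q]>, // trailing chunk; first (lm1 % Q) + 1 bytes are valid
}

impl Table {
    fn new() -> Self {
        let mut suffix = vec![[0u8; Q]; MAX_CODES];
        for b in 0..256 {
            suffix[b][0] = b as u8; // literals are one-byte strings
        }
        Table { lm1: vec![0; MAX_CODES], prefix: vec![0; MAX_CODES], suffix }
    }

    /// Define `new` as the string of `prev` followed by `byte`.
    fn add(&mut self, new: usize, prev: usize, byte: u8) {
        let lm1 = self.lm1[prev] + 1;
        let slot = lm1 as usize % Q;
        if slot == 0 {
            // `prev` ends on a full chunk: start a fresh chunk linked to it.
            self.prefix[new] = prev as u16;
            self.suffix[new] = [0; Q];
        } else {
            // Extend a copy of `prev`'s chunk and share its link.
            self.prefix[new] = self.prefix[prev];
            self.suffix[new] = self.suffix[prev];
        }
        self.suffix[new][slot] = byte;
        self.lm1[new] = lm1;
    }

    /// Append the string of `code` to `out`, one whole chunk per step.
    fn emit(&self, mut code: usize, out: &mut Vec<u8>) {
        let start = out.len();
        let len = self.lm1[code] as usize + 1;
        let mut o = start + (len - 1) / Q * Q; // offset of the last chunk
        out.resize(o + Q, 0); // slack so the last chunk may overshoot `len`
        out[o..o + Q].copy_from_slice(&self.suffix[code]);
        while o > start {
            o -= Q;
            code = self.prefix[code] as usize;
            out[o..o + Q].copy_from_slice(&self.suffix[code]);
        }
        out.truncate(start + len);
    }
}

/// Decode already-unpacked codes, one code per loop iteration.
fn decode(codes: &[u16]) -> Vec<u8> {
    let (mut table, mut out) = (Table::new(), Vec::<u8>::new());
    let (mut next, mut prev) = (256usize, None::<usize>);
    for &code in codes {
        let code = code as usize;
        // KwKwK: the code names the entry that is about to be defined.
        let kwkwk = code == next && next < MAX_CODES && prev.is_some();
        assert!(code < next || kwkwk, "invalid code {code}");
        let start = out.len();
        table.emit(if kwkwk { prev.unwrap() } else { code }, &mut out);
        let first = out[start];
        if kwkwk {
            out.push(first);
        }
        if let (Some(p), true) = (prev, next < MAX_CODES) {
            table.add(next, p, first);
            next += 1;
        }
        prev = Some(code);
    }
    out
}

/// Reference encoder for the same simplified code stream.
fn encode(data: &[u8]) -> Vec<u16> {
    let mut dict: HashMap<(u16, u8), u16> = HashMap::new();
    let (mut codes, mut cur) = (Vec::new(), None::<u16>);
    for &b in data {
        let hit = cur.and_then(|c| dict.get(&(c, b)).copied());
        if let (Some(c), None) = (cur, hit) {
            codes.push(c);
            if 256 + dict.len() < MAX_CODES {
                let next = (256 + dict.len()) as u16;
                dict.insert((c, b), next);
            }
        }
        cur = Some(hit.unwrap_or(b as u16));
    }
    codes.extend(cur);
    codes
}
```

In this sketch a string of length n costs about n/8 chunk copies rather than n single-byte link steps. A round trip such as `decode(&encode(data)) == data` is the natural sanity check.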

Performance

All numbers are from an AMD Ryzen 9 7950X, stable Rust, default target (no `-C target-cpu=native`). Throughput is measured over decoded output bytes.
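As a reading aid for the figures in this section, the arithmetic can be sketched as below. The helper names `throughput_mib_s` and `delta_percent` are hypothetical, and binary units (1 MiB = 2^20 bytes) are assumed from the "iB/s" labels; the benchmark harness may round differently.

```rust
use std::time::Duration;

/// Throughput in MiB/s (1 MiB = 1024 * 1024 bytes) over decoded output bytes.
fn throughput_mib_s(decoded_bytes: usize, elapsed: Duration) -> f64 {
    decoded_bytes as f64 / (1024.0 * 1024.0) / elapsed.as_secs_f64()
}

/// Relative change of `new` versus `base`, in percent.
fn delta_percent(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}
```

For example, `delta_percent(460.0, 557.0)` is about 21, which matches the +21% reported for the rich-screenshot TIFF row.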

Reproducible — `cargo bench --bench strategy_compare`:

Synthetic archetypes fitted to real corpus data via K-means clustering (k=5, n=49 TIFF files from 5 corpora) + Nelder-Mead optimization of generator parameters per cluster centroid. Each archetype reproduces the cluster's byte entropy, run-length distribution, repeat fraction, and LZW compression ratio.

| Archetype | Cluster (n files) | Classic | Streaming | Δ |
|---|---|---|---|---|
| Flat UI (TIFF) | C0: terminals, simple UIs (14) | 900 MiB/s | 2.06 GiB/s | +129% |
| Rich screenshot (TIFF) | C1: web pages, IDEs (8) | 460 MiB/s | 557 MiB/s | +21% |
| Photo + predictor (TIFF) | C2: TIFF photos w/ horiz diff (11) | 296 MiB/s | 350 MiB/s | +18% |
| Photo raw (TIFF) | C4: uncompressed photos (13) | 189 MiB/s | 209 MiB/s | +11% |
| Solid / KwKwK (TIFF) | pathological single-byte | 19.3 GiB/s | 6.78 GiB/s | -65% |
| Flat UI (GIF) | C0 in LSB mode | 900 MiB/s | 2.06 GiB/s | +129% |
| Rich screenshot (GIF) | C1 in LSB mode | 466 MiB/s | 652 MiB/s | +40% |

Validated against 20 real TIFF files (see investigation branch). Synthetic archetype throughput ratios track real corpus within 10–20%.

Streaming wins on every real-world archetype. Classic only wins on solid single-byte data (pure KwKwK), which doesn't occur in real images.
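Two of the byte-level features named above (byte entropy and run-length distribution) can be computed as in the sketch below. This only illustrates what those features measure; the function names are hypothetical, and the exact definitions used by the fitting pipeline (including "repeat fraction") are not given in this PR, so they are not reproduced here.

```rust
/// Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0).
fn byte_entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Lengths of maximal runs of identical bytes, in order of appearance.
fn run_lengths(data: &[u8]) -> Vec<usize> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < data.len() {
        let mut j = i + 1;
        while j < data.len() && data[j] == data[i] {
            j += 1;
        }
        runs.push(j - i);
        i = j;
    }
    runs
}
```

Solid single-byte data sits at one extreme of both features (zero entropy, one run), which is why it behaves so differently from the other archetypes.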

What's included

| Commit | Files | Description |
|---|---|---|
| 1 | `src/decode.rs`, `src/lib.rs` | `TableStrategy` enum, `DecodeStateStreaming`, 4 monomorphizations |
| 2 | `tests/`, `fuzz/` | 19 parity tests + fuzz target (200K+ runs clean) |
| 3 | `benches/`, `docs/`, `Cargo.toml` | Corpus-fitted zenbench benchmark + investigation doc |

Relationship to other PRs

Supersedes #65, "feat: chunked decode table (rebased) + reproducible corpus bench" (Chunked)
Independent of #67, "feat: support min_code_size 0 and 1" (lowbit): this PR has its own `bump_if_lowbit`
References #68, "yield_on_full + small output buffer: false NoProgress loses data" (yield_on_full bug): the parity tests document a pre-existing Classic bug; Streaming handles it correctly

Test plan

`cargo test --release` — all existing + 19 new parity tests
`cargo test --tests --benches --no-default-features --features ""` — no-features build
`cargo test --release --no-default-features --features alloc` — alloc-only build
Fuzz target `roundtrip_all` clean (200K+ runs)
`cargo bench --bench strategy_compare` — self-contained, reproducible, corpus-fitted
No clippy warnings from new code
CI on this PR

Adds a `TableStrategy` enum to `Configuration` with a new Streaming decoder backend: a wuffs-style single-code-per-iteration decoder with PreQ+SufQ(Q=8) table layout, mini-burst literal fast path, and short-copy fast path for codes ≤ 8 bytes. Beats Classic on every real-world workload tested (3–35% on TIFF/GIF aggregates), reaching 75–85% of wuffs C throughput. Only loses on synthetic solid single-byte data where Classic's burst memcpy wins. Drop-in replacement: changing the strategy does not affect the public API or output correctness.

19 deterministic oracle tests: encode → decode through Classic and Streaming, assert byte-for-byte equality. Covers all `min_code_size` values 0–12, both bit orders, GIF/TIFF, `yield_on_full`, a small-buffer sweep (1–64 bytes), KwKwK-heavy input, table fill cycles, and lowbit edge cases.

Fuzz target `roundtrip_all`: encode → decode through both strategies with fuzzer-chosen config. 200K+ runs clean.
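The small-buffer sweep mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the tests in this PR: `RleDecoder` is a hypothetical toy that stands in for a streaming LZW decoder (it expands `(count, byte)` pairs and carries pending output between calls), and `step`, `decode_with_buffer` and `sweep_parity` are made-up names, not weezl API.

```rust
/// Toy streaming decoder (hypothetical stand-in for an LZW decoder):
/// expands (count, byte) pairs and keeps a pending run between calls.
#[derive(Default)]
struct RleDecoder {
    pending: usize,
    byte: u8,
}

impl RleDecoder {
    /// Returns (input bytes consumed, output bytes produced).
    fn step(&mut self, input: &[u8], out: &mut [u8]) -> (usize, usize) {
        let (mut consumed, mut produced) = (0, 0);
        loop {
            // Flush as much of the pending run as the output buffer allows.
            let n = self.pending.min(out.len() - produced);
            out[produced..produced + n].fill(self.byte);
            produced += n;
            self.pending -= n;
            // Stop when the output is full or no complete pair is left.
            if self.pending > 0 || input.len() - consumed < 2 {
                break;
            }
            self.pending = input[consumed] as usize;
            self.byte = input[consumed + 1];
            consumed += 2;
        }
        (consumed, produced)
    }
}

/// Drive the decoder with a fixed-size output buffer until it makes no progress.
fn decode_with_buffer(input: &[u8], buf_size: usize) -> Vec<u8> {
    let mut dec = RleDecoder::default();
    let mut buf = vec![0u8; buf_size];
    let (mut pos, mut out) = (0, Vec::new());
    loop {
        let (consumed, produced) = dec.step(&input[pos..], &mut buf);
        pos += consumed;
        out.extend_from_slice(&buf[..produced]);
        if consumed == 0 && produced == 0 {
            return out;
        }
    }
}

/// Every buffer size from 1 to 64 bytes must yield the same bytes.
fn sweep_parity(input: &[u8], expected: &[u8]) {
    for buf_size in 1..=64 {
        let got = decode_with_buffer(input, buf_size);
        assert_eq!(got.as_slice(), expected, "buffer size {buf_size}");
    }
}
```

A decoder that reports no progress while it still holds pending output would truncate the result for some buffer size in this sweep, which is the class of failure described in #68.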

Investigation doc records the full arc: Wuffs audit, structural rewrite, experiments tried/failed, real-world corpora results.

`strategy_compare` bench (zenbench): Classic vs Streaming across 8 workloads (random, palette, RLE, solid, photo-predicted; LSB + MSB TIFF). Self-contained, no external files needed. Run: `cargo bench --bench strategy_compare`

lilith · 2026-04-10T04:10:47Z

The MSRV issue can be fixed by dropping the benchmarks (zenbench requires 1.85), but I thought it best to keep those around for now.

197g · 2026-04-10T19:31:18Z

Sure, will take a look. The LLM'ism of terming anything existing 'classic' is a constant mental tripping hazard. Particularly in code comments or docs it is perfectly nondescript and going to be outdated soon, and I'd rather not see it in the user-facing enum names. I never know whether it's really referring to established practice outside this implementation or if it's supposed to make some point.

Also odd that the pathological KwKwK case is the one where chunk-based reading does not outperform. Is that because of the link-walking in `first_of`, which the implementation with `Link` avoids?

lilith · 2026-04-10T20:28:06Z

It's always tricky figuring out what to call an existing algorithm when slotting in an alternative - any ideas?

197g · 2026-04-10T20:48:27Z

Purely descriptive, `ByteLink`, or the version it was introduced at maybe. Using weezl to describe it via the origin, like 'wuffs-style', would just be confusing. So what's left is that it should be worded after a particular strength, and here it doesn't seem to have much left. Well, I'll have a few shots at looking at the structures of both and seeing if anything comes to mind (it seems that this reverted #63 inadvertently, for instance).

197g · 2026-04-12T21:32:22Z

I have adapted the wuffs suffix style into the v0.1 burst-loop with some additional adjustments. Verification is outstanding, but so are the results. See the wuffs branch.

═══════════════════════════════════════════════════════════════

  tiff/flat-ui  40 rounds × 78 calls
                 mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     54.1 ±0.8µs  [53.7–54.4]µs      4.52G
  ╰─ streaming   64.4 ±2.3µs  [+17.4%–+20.7%]    3.79G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4.52 GiB/s
  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 3.79 GiB/s

  tiff/rich-screenshot  30 rounds × 12 calls
                  mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     259.0 ±4.1µs  [257.9–260.1]µs     965M
  ╰─ streaming   401.8 ±5.5µs  [+54.0%–+56.3%]     622M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 965 MiB/s
  streaming  ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 622 MiB/s

  tiff/photo-predicted  30 rounds × 7 calls
                   mean ±mad µs  95% CI vs base      iB/s
  ├─ classic      468.3 ±9.3µs  [465.7–470.9]µs     534M
  ╰─ streaming   677.1 ±12.5µs  [+43.3%–+45.9%]     369M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 534 MiB/s
  streaming  ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 369 MiB/s

  tiff/photo-raw  50 rounds × 4 calls
                mean ±mad ms  95% CI vs base   iB/s
  ├─ classic     1.1 ±0.0ms  [1.0–1.1]ms      237M
  ╰─ streaming   1.1 ±0.0ms  [-1.0%–+0.8%]    237M [1]

  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 237 MiB/s
  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 237 MiB/s
  [1] CI crosses zero

  tiff/solid-kwkwk  30 rounds × 35 calls
                 mean ±mad µs  95% CI vs base        iB/s
  ├─ classic     12.6 ±0.1µs  [12.5–12.6]µs        19.4G
  ╰─ streaming   33.3 ±0.4µs  [+163.6%–+166.1%]    7.33G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 19.4 GiB/s
  streaming  ██████████████████████████████████████████████████████████████████████████████████ 7.33 GiB/s

  gif/flat-ui  110 rounds × 24 calls
                 mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     56.0 ±1.6µs  [55.7–56.4]µs      4.36G
  ╰─ streaming   62.6 ±3.2µs  [+10.5%–+12.7%]    3.90G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4.36 GiB/s
  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 3.90 GiB/s

  gif/rich-screenshot  30 rounds × 13 calls
                  mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     258.0 ±5.3µs  [256.5–259.5]µs     969M
  ╰─ streaming   362.3 ±9.4µs  [+39.1%–+41.7%]     690M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 969 MiB/s
  streaming  ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 690 MiB/s

  total: 21.4s  (33 noisy rounds)
═══════════════════════════════════════════════════════════════
  filter: cargo bench -- --group=NAME  format: --format=llm|csv|md|json

197g · 2026-04-12T21:35:26Z

The crucial innovation of the branch is a simplified symbol reconstruction for the case where we have enough buffer space and are sure that a few bytes after the suffix can be safely clobbered with garbage (e.g. since they will get overwritten immediately again). That allows much faster 64-bit moves without having to use the depth for byte-precise copies.
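A minimal sketch of that idea, assuming an 8-byte suffix chunk with `len` valid bytes: when at least 8 bytes of output space remain, store the whole chunk and let the next write overwrite the garbage tail; otherwise fall back to a byte-precise copy. The function name `write_suffix` and its signature are hypothetical and not taken from the wuffs branch.

```rust
/// Write the first `len` (1..=8) bytes of `chunk` at `out[pos..]` and
/// return the new write position. The caller guarantees at least `len`
/// bytes of space remain.
fn write_suffix(out: &mut [u8], pos: usize, chunk: [u8; 8], len: usize) -> usize {
    debug_assert!((1..=8).contains(&len));
    if out.len() - pos >= 8 {
        // Fast path: one fixed-width 8-byte store. Bytes past `len` are
        // garbage that the next write (starting at pos + len) overwrites.
        out[pos..pos + 8].copy_from_slice(&chunk);
    } else {
        // Tight buffer: copy exactly `len` bytes so nothing past the end
        // of the output is touched.
        out[pos..pos + len].copy_from_slice(&chunk[..len]);
    }
    pos + len
}
```

The fixed-width store has a length known at compile time, so the compiler can lower it to a single 64-bit move instead of a variable-length copy. Any garbage left past the final write position has to be excluded by the caller, for example by reporting only the bytes up to the returned position.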

Per upstream reviewer feedback (image-rs#69): "Classic" is nondescript and will be outdated soon. `ByteLink` describes the data structure (link-chain byte reconstruction). `Classic` and `Chunked` are kept as deprecated `#[doc(hidden)]` aliases that map to `ByteLink`. Chunked's separate `ChunkedTable` is removed from the match arms; `ByteLink` now handles all non-Streaming configurations.

Also adds code-level fit documentation to `generators.rs`, acknowledging that earlier byte-stat-only synthetic data gave misleading perf comparisons between strategies.
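One way such aliases could look, as a sketch only: the deprecated names stay as hidden enum variants and a single match arm folds them into `ByteLink`. The `Backend` enum and the functions `backend_for` and `legacy_strategies` are hypothetical helpers for illustration; the actual dispatch code in the commit is not shown on this page.

```rust
#[derive(Clone, Copy)]
pub enum TableStrategy {
    /// Link-chain byte reconstruction.
    ByteLink,
    /// Single-code-per-iteration decoder.
    Streaming,
    #[doc(hidden)]
    #[deprecated(note = "use `TableStrategy::ByteLink`")]
    Classic,
    #[doc(hidden)]
    #[deprecated(note = "use `TableStrategy::ByteLink`")]
    Chunked,
}

/// Hypothetical internal selector for the decoder implementation.
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    ByteLink,
    Streaming,
}

#[allow(deprecated)]
fn backend_for(strategy: TableStrategy) -> Backend {
    match strategy {
        TableStrategy::Streaming => Backend::Streaming,
        // Every non-Streaming configuration resolves to ByteLink.
        TableStrategy::ByteLink | TableStrategy::Classic | TableStrategy::Chunked => {
            Backend::ByteLink
        }
    }
}

/// The deprecated names, collected in one place for compatibility checks.
#[allow(deprecated)]
fn legacy_strategies() -> [TableStrategy; 2] {
    [TableStrategy::Classic, TableStrategy::Chunked]
}
```

With this shape, existing callers that name `Classic` or `Chunked` keep compiling with a deprecation warning, while the hidden variants stay out of the rendered docs.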

lilith · 2026-04-13T03:06:01Z

Incredible work! I ran your branch against vastly improved synthetic data generation algorithms and it was broadly far superior. Long solid-color runs were the only exception, and even then it's plenty fast.

lilith · 2026-04-13T05:54:07Z

test: failing regression tests for yield_on_full data loss (#68) #71: Fixes the yield_on_full data loss bug on master. Fuzz-found, includes regression tests.
fix: yield_on_full correctness on wuffs branch (#68) #72: Same fix for your wuffs branch, plus a debug_assert panic fix and a reconstruct_simple overwrite guard for tight buffers.
bench: code-level-fitted synthetic generators #73 (also targets your wuffs branch): Better benchmark generators. The old ones matched byte stats but produced LZW code streams that looked nothing like real data (width_12 was off by 99%). These use pattern libraries and row-repeat templates to match code-level features too.

#73 shows even better results, I think, than the previous oversimplified ones.

lilith force-pushed the streaming-decoder branch 5 times, most recently from 380d0fe to 3e0b63b (April 9, 2026 09:05)

lilith added 2 commits (April 9, 2026 03:27)

lilith force-pushed the streaming-decoder branch 3 times, most recently from 66c38da to 5a52fd1 (April 9, 2026 10:35)

lilith force-pushed the streaming-decoder branch from 5a52fd1 to 61f0112 (April 9, 2026 11:09)

lilith mentioned this pull request on Apr 9, 2026, in #68 (yield_on_full + small output buffer: false NoProgress loses data; open)

lilith closed this Apr 13, 2026

lilith wants to merge 3 commits into image-rs:master from lilith:streaming-decoder
