feat: add Streaming decoder — wuffs-style single-code architecture#69

Closed
lilith wants to merge 3 commits into image-rs:master from lilith:streaming-decoder

@lilith
Contributor

@lilith lilith commented Apr 9, 2026

Summary

Adds a TableStrategy enum to Configuration with a new Streaming decoder backend: a wuffs-style single-code-per-iteration decoder with a PreQ+SufQ(Q=8) table layout and mini-burst literal/short-copy fast paths.
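
For readers unfamiliar with the PreQ+SufQ layout, here is a minimal sketch of the idea, assuming it follows the scheme used by the Wuffs LZW decoder: every code stores up to Q = 8 trailing bytes of its expansion in a suffix block, plus a prefix code that holds everything before that block. The names `SuffixTable`, `add`, and `expand` are hypothetical and are not taken from this PR or from weezl's API.

```rust
/// Hypothetical PreQ+SufQ table with Q = 8 (illustrative, not weezl's code).
const Q: usize = 8;
const MAX_CODES: usize = 4096;

struct SuffixTable {
    prefix: Vec<u16>,     // code holding the bytes that precede this suffix block
    suffix: Vec<[u8; Q]>, // up to Q trailing bytes of the expansion
    len_m1: Vec<u16>,     // expansion length minus one
}

impl SuffixTable {
    /// Create a table whose first `literals` codes expand to a single byte.
    fn new(literals: usize) -> Self {
        let mut t = SuffixTable {
            prefix: vec![0; MAX_CODES],
            suffix: vec![[0; Q]; MAX_CODES],
            len_m1: vec![0; MAX_CODES],
        };
        for c in 0..literals {
            t.suffix[c][0] = c as u8;
        }
        t
    }

    /// Define `code` as the expansion of `prev` followed by `byte`.
    fn add(&mut self, code: usize, prev: usize, byte: u8) {
        let lm1 = self.len_m1[prev] + 1;
        self.len_m1[code] = lm1;
        let slot = lm1 as usize % Q;
        if slot != 0 {
            // Room left in prev's block: copy the block and append one byte.
            self.prefix[code] = self.prefix[prev];
            self.suffix[code] = self.suffix[prev];
            self.suffix[code][slot] = byte;
        } else {
            // prev's block is full: start a new block chained to prev.
            self.prefix[code] = prev as u16;
            self.suffix[code] = [0; Q];
            self.suffix[code][0] = byte;
        }
    }

    /// Append the expansion of `code` to `out`, one block per chain step.
    fn expand(&self, code: usize, out: &mut Vec<u8>) {
        let len = self.len_m1[code] as usize + 1;
        let start = out.len();
        out.resize(start + len, 0);
        let (mut c, mut end) = (code, len);
        loop {
            let n = (end - 1) % Q + 1; // bytes held in this block
            out[start + end - n..start + end].copy_from_slice(&self.suffix[c][..n]);
            end -= n;
            if end == 0 {
                break;
            }
            c = self.prefix[c] as usize;
        }
    }
}
```

Under these assumptions, expanding a code of length n takes about n/8 chain steps rather than n, which is one plausible reason a single-code-per-iteration loop can stay competitive on long matches.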

Supersedes #65 — Streaming beats Chunked on every tested workload. There is no need for a separate Chunked strategy.

Performance

All numbers from AMD Ryzen 9 7950X, stable Rust, default target (no `-C target-cpu=native`). Throughput on decoded output bytes.

Reproducible — `cargo bench --bench strategy_compare`:

Synthetic archetypes fitted to real corpus data via K-means clustering (k=5, n=49 TIFF files from 5 corpora) + Nelder-Mead optimization of generator parameters per cluster centroid. Each archetype reproduces the cluster's byte entropy, run-length distribution, repeat fraction, and LZW compression ratio.

| Archetype | Cluster (n files) | Classic | Streaming | Δ |
|---|---|---|---|---|
| Flat UI (TIFF) | C0: terminals, simple UIs (14) | 900 MiB/s | 2.06 GiB/s | +129% |
| Rich screenshot (TIFF) | C1: web pages, IDEs (8) | 460 MiB/s | 557 MiB/s | +21% |
| Photo + predictor (TIFF) | C2: TIFF photos w/ horiz diff (11) | 296 MiB/s | 350 MiB/s | +18% |
| Photo raw (TIFF) | C4: uncompressed photos (13) | 189 MiB/s | 209 MiB/s | +11% |
| Solid / KwKwK (TIFF) | pathological single-byte | 19.3 GiB/s | 6.78 GiB/s | -65% |
| Flat UI (GIF) | C0 in LSB mode | 900 MiB/s | 2.06 GiB/s | +129% |
| Rich screenshot (GIF) | C1 in LSB mode | 466 MiB/s | 652 MiB/s | +40% |

Validated against 20 real TIFF files (see investigation branch). Synthetic archetype throughput ratios track real corpus within 10–20%.

Streaming wins on every real-world archetype. Classic only wins on solid single-byte data (pure KwKwK), which doesn't occur in real images.

What's included

| Commit | Files | Description |
|---|---|---|
| 1 | `src/decode.rs`, `src/lib.rs` | `TableStrategy` enum, `DecodeStateStreaming`, 4 monomorphizations |
| 2 | `tests/`, `fuzz/` | 19 parity tests + fuzz target (200K+ runs clean) |
| 3 | `benches/`, `docs/`, `Cargo.toml` | Corpus-fitted zenbench benchmark + investigation doc |

Relationship to other PRs

Test plan

  • `cargo test --release` — all existing + 19 new parity tests
  • `cargo test --tests --benches --no-default-features --features ""` — no-features build
  • `cargo test --release --no-default-features --features alloc` — alloc-only build
  • Fuzz target `roundtrip_all` clean (200K+ runs)
  • `cargo bench --bench strategy_compare` — self-contained, reproducible, corpus-fitted
  • No clippy warnings from new code
  • CI on this PR

@lilith lilith force-pushed the streaming-decoder branch 5 times, most recently from 380d0fe to 3e0b63b Compare April 9, 2026 09:05
lilith added 2 commits April 9, 2026 03:27
Adds a `TableStrategy` enum to `Configuration` with a new Streaming
decoder backend: a wuffs-style single-code-per-iteration decoder with
PreQ+SufQ(Q=8) table layout, mini-burst literal fast path, and
short-copy fast path for codes ≤ 8 bytes.

Beats Classic on every real-world workload tested (3–35% on TIFF/GIF
aggregates), reaching 75–85% of wuffs C throughput. Only loses on
synthetic solid single-byte data where Classic's burst memcpy wins.

Drop-in replacement: changing the strategy does not affect the
public API or output correctness.
19 deterministic oracle tests: encode → decode through Classic and
Streaming, assert byte-for-byte equality. Covers all min_code_size
0–12, both bit orders, GIF/TIFF, yield_on_full, small buffer sweep
(1–64 bytes), KwKwK-heavy, table fill cycles, and lowbit edge cases.

Fuzz target roundtrip_all: encode → decode through both strategies
with fuzzer-chosen config. 200K+ runs clean.
@lilith lilith force-pushed the streaming-decoder branch 3 times, most recently from 66c38da to 5a52fd1 Compare April 9, 2026 10:35
Investigation doc records the full arc: Wuffs audit, structural
rewrite, experiments tried/failed, real-world corpora results.

strategy_compare bench (zenbench): Classic vs Streaming across 8
workloads (random, palette, RLE, solid, photo-predicted; LSB + MSB
TIFF). Self-contained, no external files needed.

Run: cargo bench --bench strategy_compare
@lilith
Contributor Author

lilith commented Apr 10, 2026

The MSRV issue can be fixed by dropping the benchmarks (zenbench requires 1.85), but I thought it best to keep those around for now.

@197g
Member

197g commented Apr 10, 2026

Sure, will take a look. The LLM'ism of terming anything existing 'classic' is a constant mental tripping hazard. Particularly in code comments or docs, it is perfectly nondescript and going to be outdated soon, and I'd rather not see it in the user-facing enum names. I never know when it's really referring to established practice outside this implementation or if it's supposed to make some point.

Also odd that the pathological KwKwK case is the one where chunk-based reading does not outperform. Is that because of the link-walking in first_of, which the implementation with Link avoids?
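
To make the question concrete, here is a rough sketch of what link-walking means, under the assumption that a table entry only records its predecessor and its own last byte. The types and functions below (`Link`, `LinkWithFirst`, `first_of`, `derive`) are illustrative and are not weezl's actual definitions.

```rust
/// Hypothetical link-chain entry: points at the predecessor code and stores
/// only the last byte of its own expansion.
#[derive(Clone, Copy)]
struct Link {
    prev: u16,
    byte: u8,
}

/// Walk the chain back to a literal to find the first byte of `code`.
/// The cost grows with the length of the expansion.
fn first_of(table: &[Link], literals: u16, mut code: u16) -> u8 {
    while code >= literals {
        code = table[code as usize].prev;
    }
    table[code as usize].byte
}

/// Variant that caches the first byte in every entry, so no walk is needed.
#[derive(Clone, Copy)]
struct LinkWithFirst {
    prev: u16,
    byte: u8,
    first: u8,
}

/// Append a new entry derived from `prev`, inheriting its cached first byte.
/// Returns the newly assigned code.
fn derive(table: &mut Vec<LinkWithFirst>, prev: u16, byte: u8) -> u16 {
    let first = table[prev as usize].first;
    table.push(LinkWithFirst { prev, byte, first });
    (table.len() - 1) as u16
}
```

On solid single-byte input every new code extends the previous one by one byte, so an uncached `first_of` walk gets one step longer with each code, while the cached variant stays constant-time.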

@lilith
Contributor Author

lilith commented Apr 10, 2026

It's always tricky figuring out what to call an existing algorithm when slotting in an alternative - any ideas?

@197g
Member

197g commented Apr 10, 2026

Purely descriptive, ByteLink, or the version it was introduced at maybe. Using weezl to describe it via the origin like 'wuffs-style' would just be confusing. So what's left is that it should be worded after a particular strength and here it doesn't seem to have much left. Well, I'll have a few shots at looking at the two structures of both and seeing if anything comes to mind (it seems that this reverted #63 inadvertently for instance).

@197g
Member

197g commented Apr 12, 2026

I have adapted the wuffs suffix style into the v0.1 burst-loop with some additional adjustments. Verification outstanding but so are the results. See the wuffs branch.

═══════════════════════════════════════════════════════════════

  tiff/flat-ui  40 rounds × 78 calls
                 mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     54.1 ±0.8µs  [53.7–54.4]µs      4.52G
  ╰─ streaming   64.4 ±2.3µs  [+17.4%–+20.7%]    3.79G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4.52 GiB/s
  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 3.79 GiB/s

  tiff/rich-screenshot  30 rounds × 12 calls
                  mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     259.0 ±4.1µs  [257.9–260.1]µs     965M
  ╰─ streaming   401.8 ±5.5µs  [+54.0%–+56.3%]     622M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 965 MiB/s
  streaming  ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 622 MiB/s

  tiff/photo-predicted  30 rounds × 7 calls
                   mean ±mad µs  95% CI vs base      iB/s
  ├─ classic      468.3 ±9.3µs  [465.7–470.9]µs     534M
  ╰─ streaming   677.1 ±12.5µs  [+43.3%–+45.9%]     369M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 534 MiB/s
  streaming  ██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 369 MiB/s

  tiff/photo-raw  50 rounds × 4 calls
                mean ±mad ms  95% CI vs base   iB/s
  ├─ classic     1.1 ±0.0ms  [1.0–1.1]ms      237M
  ╰─ streaming   1.1 ±0.0ms  [-1.0%–+0.8%]    237M [1]

  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 237 MiB/s
  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 237 MiB/s
  [1] CI crosses zero

  tiff/solid-kwkwk  30 rounds × 35 calls
                 mean ±mad µs  95% CI vs base        iB/s
  ├─ classic     12.6 ±0.1µs  [12.5–12.6]µs        19.4G
  ╰─ streaming   33.3 ±0.4µs  [+163.6%–+166.1%]    7.33G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 19.4 GiB/s
  streaming  ██████████████████████████████████████████████████████████████████████████████████ 7.33 GiB/s

  gif/flat-ui  110 rounds × 24 calls
                 mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     56.0 ±1.6µs  [55.7–56.4]µs      4.36G
  ╰─ streaming   62.6 ±3.2µs  [+10.5%–+12.7%]    3.90G

  classic    ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4.36 GiB/s
  streaming  █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 3.90 GiB/s

  gif/rich-screenshot  30 rounds × 13 calls
                  mean ±mad µs  95% CI vs base      iB/s
  ├─ classic     258.0 ±5.3µs  [256.5–259.5]µs     969M
  ╰─ streaming   362.3 ±9.4µs  [+39.1%–+41.7%]     690M

  classic    █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 969 MiB/s
  streaming  ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 690 MiB/s

  total: 21.4s  (33 noisy rounds)
═══════════════════════════════════════════════════════════════
  filter: cargo bench -- --group=NAME  format: --format=llm|csv|md|json

@197g
Member

197g commented Apr 12, 2026

The crucial innovation of the branch is a simplified symbol reconstruction when we have enough buffer space and are sure that a few bytes after the suffix can be safely clobbered with garbage (e.g. since they will get overwritten immediately again). That allows much faster 64-bit moves without having to use the depth for byte-precise copies.
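
A minimal sketch of that idea, assuming symbols arrive as fixed 8-byte blocks with a valid length: when at least 8 bytes of output remain, the whole block is stored unconditionally and only the write position advances by the real length, so the trailing garbage is overwritten by the next store. The function names are hypothetical and the code is not taken from the wuffs branch.

```rust
/// Byte-precise copy: writes exactly `len` bytes of `block` at `pos`.
/// Requires `pos + len <= out.len()`.
fn copy_precise(out: &mut [u8], pos: usize, block: &[u8; 8], len: usize) {
    out[pos..pos + len].copy_from_slice(&block[..len]);
}

/// Clobbering copy: always stores all 8 bytes, which can compile down to a
/// single 64-bit move. Bytes past the valid length are garbage and are
/// expected to be overwritten by the next symbol.
/// Requires `pos + 8 <= out.len()`.
fn copy_clobber(out: &mut [u8], pos: usize, block: &[u8; 8]) {
    out[pos..pos + 8].copy_from_slice(block);
}

/// Emit a sequence of (block, valid length) symbols into `out`.
/// Uses the clobbering copy while there is slack, and falls back to the
/// precise copy near the end of the buffer. Returns the bytes written.
/// The caller must ensure the valid lengths fit into `out`.
fn emit_all(blocks: &[([u8; 8], usize)], out: &mut [u8]) -> usize {
    let mut pos = 0;
    for (block, len) in blocks {
        if pos + 8 <= out.len() {
            copy_clobber(out, pos, block);
        } else {
            copy_precise(out, pos, block, *len);
        }
        pos += *len;
    }
    pos
}
```

The fast path drops the data-dependent length from the copy itself, at the cost of needing a guarantee that the slack bytes are either rewritten or never observed.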

lilith added a commit to lilith/weezl that referenced this pull request Apr 12, 2026
Per upstream reviewer feedback (image-rs#69): "Classic" is
nondescript and will be outdated soon. ByteLink describes the data
structure (link-chain byte reconstruction).

Classic and Chunked are kept as deprecated #[doc(hidden)] aliases
that map to ByteLink. Chunked's separate ChunkedTable is removed
from the match arms — ByteLink now handles all non-Streaming
configurations.

Also adds code-level fit documentation to generators.rs
acknowledging that earlier byte-stat-only synthetic data gave
misleading perf comparisons between strategies.
@lilith
Contributor Author

lilith commented Apr 13, 2026

Incredible work! I ran your branch against vastly improved synthetic data generation algorithms and it was broadly far superior. Long solid-color runs were the only exception, and even then it's plenty fast.

@lilith
Contributor Author

lilith commented Apr 13, 2026

#73 shows even better results, I think, than the prev oversimplified ones

@lilith lilith closed this Apr 13, 2026