feat: add Streaming decoder — wuffs-style single-code architecture #69
lilith wants to merge 3 commits into image-rs:master from
Conversation
Force-pushed 380d0fe to 3e0b63b.
Adds a `TableStrategy` enum to `Configuration` with a new Streaming decoder backend: a wuffs-style single-code-per-iteration decoder with a PreQ+SufQ (Q=8) table layout, a mini-burst literal fast path, and a short-copy fast path for codes ≤ 8 bytes. Beats Classic on every real-world workload tested (3–35% faster on TIFF/GIF aggregates), reaching 75–85% of wuffs C throughput. It only loses on synthetic solid single-byte data, where Classic's burst memcpy wins. Drop-in replacement: changing the strategy does not affect the public API or output correctness.
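A minimal sketch of the configuration surface this describes. Only `TableStrategy` and `Configuration` are names from the PR; the variants' doc comments, the field name, and the builder method are illustrative, not the crate's actual API.

```rust
/// Selects which decoder backend the LZW decoder uses.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TableStrategy {
    /// The existing link-chain decoder.
    Classic,
    /// Wuffs-style single-code-per-iteration decoder (this PR).
    Streaming,
}

/// Sketch of a configuration carrying the strategy; real field set elided.
#[derive(Clone, Debug)]
pub struct Configuration {
    pub table_strategy: TableStrategy,
    // ...other configuration fields elided...
}

impl Configuration {
    /// Hypothetical builder-style setter for the strategy.
    pub fn with_table_strategy(mut self, s: TableStrategy) -> Self {
        self.table_strategy = s;
        self
    }
}
```

Because the strategy only changes the internal table layout and copy loops, swapping it is a one-line configuration change for callers.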
19 deterministic oracle tests: encode → decode through Classic and Streaming, assert byte-for-byte equality. Covers all min_code_size 0–12, both bit orders, GIF/TIFF, yield_on_full, small buffer sweep (1–64 bytes), KwKwK-heavy, table fill cycles, and lowbit edge cases. Fuzz target roundtrip_all: encode → decode through both strategies with fuzzer-chosen config. 200K+ runs clean.
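The oracle pattern the tests rely on can be sketched generically: run identical input through both strategies and require byte-for-byte equal output. The closures here stand in for the crate's real Classic and Streaming decoders, which are not shown.

```rust
/// Returns true when two decoder implementations produce identical
/// output for the same compressed input (the "oracle" check).
fn decoders_agree(
    input: &[u8],
    classic: impl Fn(&[u8]) -> Vec<u8>,
    streaming: impl Fn(&[u8]) -> Vec<u8>,
) -> bool {
    classic(input) == streaming(input)
}
```

Each of the 19 deterministic tests, and the fuzz target, reduces to this check over a different encode configuration.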
Force-pushed 66c38da to 5a52fd1.
Investigation doc records the full arc: Wuffs audit, structural rewrite, experiments tried/failed, real-world corpora results. strategy_compare bench (zenbench): Classic vs Streaming across 8 workloads (random, palette, RLE, solid, photo-predicted; LSB + MSB TIFF). Self-contained, no external files needed. Run: cargo bench --bench strategy_compare
Force-pushed 5a52fd1 to 61f0112.
The MSRV issue can be fixed by dropping the benchmarks (zenbench requires 1.85), but I thought it best to keep them around for now.
Sure, will take a look. The LLM-ism of terming anything existing 'classic' is a constant mental tripping hazard. Particularly in code comments or docs it is perfectly nondescript, going to be outdated soon, and I'd rather not see it in the user-facing enum names. I never know whether it's really referring to established practice outside this implementation or is supposed to make some point. Also odd that the pathological KwKwK case is the one where chunk-based reading does not outperform. Is that because the link-walking in
It's always tricky figuring out what to call an existing algorithm when slotting in an alternative. Any ideas?
Purely descriptive,
I have adapted the wuffs suffix style into the v0.1 burst-loop with some additional adjustments. Verification outstanding, but so are the results. See the
The crucial innovation of the branch is a simplified symbol reconstruction when we have enough buffer space and are sure that a few bytes after the suffix can be safely clobbered with garbage (e.g. because they will get overwritten immediately anyway). That allows much faster 64-bit moves without having to use the depth for byte-precise copies.
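The clobber-tolerant copy can be sketched as follows, under the assumption that the caller guarantees at least 8 bytes of slack past the true symbol length. The function name and signature are illustrative; safe slice copies stand in for what the real code likely does with raw 64-bit loads/stores.

```rust
/// Copy a short symbol (<= 8 bytes) as one full 8-byte word, letting the
/// bytes past `len` hold garbage that the next symbol will overwrite.
/// Avoids a byte-precise loop driven by the symbol's depth.
fn copy_short_symbol(dst: &mut [u8], pos: usize, suffix: &[u8; 8], len: usize) -> usize {
    debug_assert!(len <= 8);
    // Precondition: pos + 8 <= dst.len(); the trailing 8 - len bytes are
    // clobbered with garbage the caller promises will be rewritten.
    dst[pos..pos + 8].copy_from_slice(suffix);
    pos + len // advance only by the true symbol length
}
```

The win is that the copy length is a compile-time constant (one word move) while only the write cursor depends on the symbol's actual length.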
Per upstream reviewer feedback (image-rs#69): "Classic" is nondescript and will be outdated soon. ByteLink describes the data structure (link-chain byte reconstruction). Classic and Chunked are kept as deprecated #[doc(hidden)] aliases that map to ByteLink. Chunked's separate ChunkedTable is removed from the match arms — ByteLink now handles all non-Streaming configurations. Also adds code-level fit documentation to generators.rs acknowledging that earlier byte-stat-only synthetic data gave misleading perf comparisons between strategies.
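One way the deprecated `#[doc(hidden)]` aliases could be expressed is with associated constants that shadow the old variant names; this is a sketch of the pattern, not necessarily the crate's actual mechanism.

```rust
/// Post-rename strategy enum: "ByteLink" describes the data structure
/// (link-chain byte reconstruction) rather than its age.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TableStrategy {
    /// Link-chain byte reconstruction (formerly "Classic"/"Chunked").
    ByteLink,
    /// Wuffs-style single-code-per-iteration decoder.
    Streaming,
}

#[allow(non_upper_case_globals)]
impl TableStrategy {
    /// Deprecated alias kept so old code keeps compiling.
    #[doc(hidden)]
    #[deprecated(note = "renamed to TableStrategy::ByteLink")]
    pub const Classic: TableStrategy = TableStrategy::ByteLink;

    /// Deprecated alias; Chunked's separate table was removed.
    #[doc(hidden)]
    #[deprecated(note = "merged into TableStrategy::ByteLink")]
    pub const Chunked: TableStrategy = TableStrategy::ByteLink;
}
```

Existing code referencing `TableStrategy::Classic` or `TableStrategy::Chunked` then resolves to `ByteLink` with a deprecation warning instead of a breaking compile error.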
Incredible work! I ran your branch against vastly improved synthetic data generation algorithms and it was broadly far superior. Long solid-color runs were the only exception, and even then it's plenty fast.
#73 shows even better results, I think, than the previous oversimplified ones.
Summary
Adds a `TableStrategy` enum to `Configuration` with a new Streaming decoder backend: a wuffs-style single-code-per-iteration decoder with a PreQ+SufQ (Q=8) table layout and mini-burst literal/short-copy fast paths. Supersedes #65: Streaming beats Chunked on every tested workload, so there is no need for a separate Chunked strategy.
Performance
All numbers from AMD Ryzen 9 7950X, stable Rust, default target (no `-C target-cpu=native`). Throughput on decoded output bytes.
Reproducible — `cargo bench --bench strategy_compare`:
Synthetic archetypes fitted to real corpus data via K-means clustering (k=5, n=49 TIFF files from 5 corpora) + Nelder-Mead optimization of generator parameters per cluster centroid. Each archetype reproduces the cluster's byte entropy, run-length distribution, repeat fraction, and LZW compression ratio.
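The fitting targets are simple corpus statistics. As an illustration (these helpers are assumptions, not the benchmark's actual generator code), two of the named statistics can be computed like this:

```rust
/// Shannon entropy of the byte distribution, in bits per byte.
fn byte_entropy(data: &[u8]) -> f64 {
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Mean length of maximal runs of identical bytes.
fn mean_run_length(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // A new run starts at every position where the byte changes.
    let runs = 1 + data.windows(2).filter(|w| w[0] != w[1]).count();
    data.len() as f64 / runs as f64
}
```

The generator parameters are then tuned (per the text, via Nelder-Mead) until a synthetic archetype's statistics match its cluster centroid.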
Validated against 20 real TIFF files (see investigation branch). Synthetic archetype throughput ratios track real corpus within 10–20%.
Streaming wins on every real-world archetype. Classic only wins on solid single-byte data (pure KwKwK), which doesn't occur in real images.
What's included
Relationship to other PRs
Test plan