perf(ans): replace unconditional state normalization w/ explicit branch #704
xangelix wants to merge 2 commits into libjxl:main from
Benchmark @ 0d7fe8d
Comparing: e883140e (Base) vs d9cb52d4 (PR)
From a quick benchmark, I do not see performance gains nearly as significant as yours. In particular, I am not confident that the gains I see are above the noise threshold, at least on my test devices: a Pixel 7a, an AMD Ryzen Threadripper 7970X 32-Cores, and an AMD Ryzen AI 9 HX 370. In libjxl, we observed the branchy and branchless versions to be roughly equivalent in performance, with one of the two having a slight edge depending on the specific architecture. One thing to be careful about is that the benchmark is very sensitive to CPU performance fluctuations. To get the most consistent results, set the system to performance mode, disable dynamic frequency scaling, and ideally run as few other processes as possible.
Changes
Replaces the unconditionally executed state calculation in `AnsHistogram::read` with an explicit `if` statement for ANS state normalization.

The previous implementation attempted to be branchless (?) by evaluating the appended state on every loop iteration. However, this forced `BitReader::peek(16)` to execute on every single decoded symbol, incurring heavy logic via the `BitReader`'s internal buffer bounds-checks. Using an explicit branch allows the CPU's branch predictor to bypass the `BitReader` logic entirely when normalization is not required.

System Details
Kernel: Linux 6.19.6-2-cachyos
CPU: AMD Ryzen 9 9950X3D (32) @ 5.85 GHz
rustc: 1.95.0-nightly (873b4beb0 2026-02-15)
As compared to e883140.
I seem to get a fair amount of noise run-to-run on my system with criterion, so I'd love some validation of these numbers on other systems!
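For readers who want to see the shape of the change described under Changes, here is a minimal sketch of the two normalization styles. It is not the jxl-rs code: `ToyBitReader`, `normalize_unconditional`, `normalize_branchy`, and `ANS_LOWER_BOUND` are hypothetical names, and the sketch assumes a 32-bit ANS state that is refilled with 16 bits whenever it drops below `1 << 16`.

```rust
/// Toy bit reader over a byte slice (hypothetical; NOT the jxl-rs `BitReader`).
struct ToyBitReader<'a> {
    data: &'a [u8],
    bit_pos: usize,
}

impl<'a> ToyBitReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, bit_pos: 0 }
    }

    /// Returns the next `n` bits (n <= 32), least significant bit first,
    /// without consuming them. Bits past the end of the input read as zero.
    fn peek(&self, n: usize) -> u32 {
        let mut out = 0u32;
        for i in 0..n {
            let pos = self.bit_pos + i;
            let byte = self.data.get(pos / 8).copied().unwrap_or(0);
            out |= (((byte >> (pos % 8)) & 1) as u32) << i;
        }
        out
    }

    /// Advances the read position by `n` bits.
    fn consume(&mut self, n: usize) {
        self.bit_pos += n;
    }
}

/// Assumed threshold below which the state must be refilled.
const ANS_LOWER_BOUND: u32 = 1 << 16;

/// "Branchless-style" variant: always peeks 16 bits, then selects the result.
fn normalize_unconditional(state: u32, br: &mut ToyBitReader<'_>) -> u32 {
    let needs_bits = state < ANS_LOWER_BOUND;
    // The peek runs on every call, even when its result is discarded.
    let appended = (state << 16) | br.peek(16);
    br.consume(if needs_bits { 16 } else { 0 });
    if needs_bits { appended } else { state }
}

/// Explicit-branch variant: only touches the reader when a refill is needed.
fn normalize_branchy(state: u32, br: &mut ToyBitReader<'_>) -> u32 {
    if state < ANS_LOWER_BOUND {
        let appended = (state << 16) | br.peek(16);
        br.consume(16);
        appended
    } else {
        state
    }
}
```

Both functions return the same state and leave the reader at the same position; they differ only in whether `peek` runs when no refill is needed. Which one is faster in practice depends on how costly the real reader's `peek` is and how predictable the branch is, which is what the benchmarks in this thread are trying to measure.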
Local Comparison Test Code
Localized Results
Full Decode Results
Top 5 / Bottom 5
Top 5
squeeze_edge.jxl
Time Change: −21.36%
Throughput Change: +27.17%
squeeze_alpha.jxl
Time Change: −18.67%
Throughput Change: +22.95%
conformance_test_images/patches.jxl
Time Change: −8.69%
Throughput Change: +9.52%
conformance_test_images/patches_5.jxl
Time Change: −8.53%
Throughput Change: +9.33%
conformance_test_images/bicycles.jxl
Time Change: −8.44%
Throughput Change: +9.22%
Bottom 5
stp2_520x260_d25_e6.jxl
Time Change: +3.66%
Throughput Change: −3.53%
green_queen_modular_e3.jxl
Time Change: +3.65%
Throughput Change: −3.52%
conformance_test_images/lossless_pfm.jxl
Time Change: +3.54%
Throughput Change: −3.42%
conformance_test_images/alpha_nonpremultiplied.jxl
Time Change: +2.18%
Throughput Change: −2.14%
lossy_with_icc.jxl
Time Change: +2.08%
Throughput Change: −2.04%
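As a reading aid for the two columns above: the time and throughput changes are not symmetric, because throughput is the reciprocal of time. The sketch below shows the conversion, assuming throughput is exactly inversely proportional to decode time. The helper name `throughput_change_percent` is hypothetical and not part of criterion or jxl-rs.

```rust
/// Converts a relative time change (in percent) into the implied throughput
/// change (in percent), assuming throughput is proportional to 1 / time.
fn throughput_change_percent(time_change_percent: f64) -> f64 {
    // New time as a fraction of the old time, e.g. -21.36% -> 0.7864.
    let time_ratio = 1.0 + time_change_percent / 100.0;
    // Throughput scales by the reciprocal of the time ratio.
    (1.0 / time_ratio - 1.0) * 100.0
}
```

Under this assumption, a time change of −21.36% corresponds to a throughput change of roughly +27.2%, and +3.66% to roughly −3.5%, in line with the reported figures up to rounding.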