Improve has_constant_alpha* methods #2941

Open

RunDevelopment wants to merge 4 commits into image-rs:main from RunDevelopment:has_constant_alpha-impr

Conversation

@RunDevelopment
Member

Changes:

  • Use pixel.alpha() instead of *pixel.channels().last().unwrap()
  • Use bit-wise or to fill the accumulator instead of add.
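For illustration, the bit-wise-or accumulator idea might be sketched like this (a hypothetical standalone helper, not the actual patch): OR together the XOR differences against an expected value, so the accumulator can never overflow the way a sum could.

```rust
// Hypothetical sketch, not the image-rs implementation: check whether every
// alpha byte equals `expected` by OR-ing the XOR differences. Unlike a sum,
// an OR accumulator cannot overflow.
fn alpha_is_constant(alphas: &[u8], expected: u8) -> bool {
    let acc = alphas
        .iter()
        .fold(0u8, |acc, &a| acc | (a ^ expected)); // non-zero iff any byte differs
    acc == 0
}

fn main() {
    assert!(alpha_is_constant(&[0xFF, 0xFF, 0xFF], 0xFF));
    assert!(!alpha_is_constant(&[0xFF, 0x7F, 0xFF], 0xFF));
}
```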

Contributor

@awxkee awxkee left a comment


LGTM

Interesting idea, I hadn't thought about that. Modern CPUs with SIMD have the same latencies on ors and adds, so I'd expect no difference if it vectorizes successfully. I checked the codegen out of curiosity.

Without avx2, both produce garbage codegen, where one does pointless vectorization and the other doesn't: https://godbolt.org/z/fhvqe3T7M. With avx2 (https://godbolt.org/z/WTPY6qdqq) the two are bit-exact apart from the adds simply being replaced with ors, which is expected: since or and add have the same throughput and latencies, there is no difference.
Both do something remarkable on aarch64 (https://godbolt.org/z/f5MYcGeor). I don't actually want to benchmark it, but it seems the ors should be a bit better.

@RunDevelopment
Member Author

RunDevelopment commented May 10, 2026

Honestly, I didn't think about perf when making this. I wanted to use or instead of add because the u64 could overflow with add. Super, super, super unlikely, but making that impossible just felt better.

Regarding codegen, yeah, it's pretty bad. I think the root of all evil here is our memory access pattern: we're always reading 1 byte with 3 bytes in between. I think LLVM loses the information that it can safely access the bytes in between, so it doesn't use that to optimize. If I make it load whole pixels and mask out everything but the alpha channel, codegen becomes a lot better.
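The "load whole pixels, mask the rest" idea could look roughly like this (a sketch assuming an RGBA8 buffer with alpha as the fourth byte of each pixel, not the code from this PR). Reading a full 4-byte pixel per iteration gives LLVM contiguous loads instead of strided 1-byte ones:

```rust
// Sketch: OR together the alpha channels of an RGBA8 buffer by loading each
// 4-byte pixel as a u32 and masking out everything but the alpha byte.
// `from_le_bytes([r, g, b, a])` puts `a` in the top byte, so the mask
// 0xFF00_0000 keeps only alpha.
fn or_alphas(rgba: &[u8]) -> u32 {
    rgba.chunks_exact(4).fold(0u32, |acc, px| {
        let pixel = u32::from_le_bytes(px.try_into().unwrap());
        acc | (pixel & 0xFF00_0000) // keep only the alpha byte
    })
}

fn main() {
    assert_eq!(or_alphas(&[1, 2, 3, 0xFF, 4, 5, 6, 0xFF]), 0xFF00_0000);
    assert_eq!(or_alphas(&[9, 9, 9, 0x00]), 0);
}
```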

This could probably be made even better by chunking some constant numbers of pixels (e.g. 128) and then handling the rest separately. Then the CPU could churn through all those pixels without branching. But I would expect what I linked above to be memory bound already, so that's probably not worth it. I didn't do any benchmark though, so I could be wrong.
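The chunking idea sketched above might look like this (an illustrative helper over a plain alpha slice, not part of the PR): process fixed-size blocks with a known trip count so the inner loop can unroll and vectorize without branching, then mop up the tail via `remainder()`.

```rust
// Sketch of processing alphas in fixed-size blocks (128 here, as suggested
// above) and handling the leftover bytes separately. Perf claims are
// speculative; this is likely memory-bound anyway, as noted.
fn or_alphas_chunked(alphas: &[u8]) -> u8 {
    let mut acc = 0u8;
    let mut blocks = alphas.chunks_exact(128);
    for block in &mut blocks {
        // Fixed trip count: the compiler can unroll/vectorize this fold.
        acc |= block.iter().fold(0u8, |a, &x| a | x);
    }
    for &x in blocks.remainder() {
        acc |= x; // tail of fewer than 128 bytes
    }
    acc
}

fn main() {
    let mut data = vec![0x0Fu8; 128];
    data.extend(vec![0xF0u8; 72]); // 200 bytes total: full block + tail
    assert_eq!(or_alphas_chunked(&data), 0xFF);
}
```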

since or and add have the same throughput and latencies, there is no difference, which is expected.

Oh, I didn't know that. TIL.

@awxkee
Contributor

awxkee commented May 10, 2026

With masking, the codegen looks the way it should, on aarch64 as well. I wonder why LLVM's autovectorizer is able to figure it out only with avx2 exactly...

This could probably be made even better by chunking some constant numbers of pixels (e.g. 128) and then handling the rest separately. Then the CPU could churn through all those pixels without branching. But I would expect what I linked above to be memory bound already, so that's probably not worth it. I didn't do any benchmark though, so I could be wrong.

Yeah, I also think this would complicate the code with no measurable benefit.
