Improve has_constant_alpha* methods#2941
Conversation
There was a problem hiding this comment.
LGTM
Interesting idea, I haven't thought about that. Modern cpus with SIMD have the same latencies on ors and adds so I'd expect there is no difference if vectorized successfully, so I checked codegen out of curiosity.
Without avx2 both make garbage codegen where one does pointless vectorization whereas other doesn't https://godbolt.org/z/fhvqe3T7M. With avx2 https://godbolt.org/z/WTPY6qdqq it produces bit exact codegen where adds simply replaced with ors what is expected, since or and add have the same thoughtput and latencies there is no difference what is expected.
Both do something outstanding on aarch64 https://godbolt.org/z/f5MYcGeor, I don't want to benchmark actually but seems ors should be a bit better
|
Honestly, I didn't think about pref when making this. I wanted to use or instead of add, because the u64 could overflow with add. Super super super unlikely, but making that impossible just felt better. Regarding code gen, yeah it's pretty bad. I think the root of all evil here is our memory access pattern. We're always reading 1 byte with 3 bytes in between. I think LLVM loses the information that it can safely access the bytes in between and so doesn't use this to optimize. If I make it load whole pixels and mask out everything but the alpha channel, code gen becomes a lot better. This could probably be made even better by chunking some constant numbers of pixels (e.g. 128) and then handling the rest separately. Then the CPU could churn through all those pixels without branching. But I would expect what I linked above to be memory bound already, so that's probably not worth it. I didn't do any benchmark though, so I could be wrong.
Oh, I didn't know that. TIL. |
|
Masking looks in a way it should be, on aarch64 as well. I wonder why LLVM autovectorizer is able to figure it out only with exactly avx2...
Yeah, I also think this will complicate the code with no measurable outcome. |
Changes:
pixel.alpha()instead of*pixel.channels().last().unwrap()