Use SIMD instructions for bitwise operations (gcc specific). #5
Conversation
* Avoid the per-operator switch/case by generating BITWISE_FUNC and BITARRAY_FUNC with macros (a sketch of the pattern follows below).
* Change the type of `zero` and `one` to `unsigned char`, since the compiler complains about overflows under `-pedantic`.
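For readers following along, here is a minimal sketch of that generation pattern; the macro names match the PR, but the body is illustrative rather than the PR's exact code:

```c
#include <stddef.h>

/* One template per family; the ## paste builds a distinct function
 * for each operator, so no switch/case over an op code is needed. */
#define BITWISE_FUNC(name, OPEQ)                                    \
    static void bitwise_##name(unsigned char *a,                    \
                               const unsigned char *b, size_t n)    \
    {                                                               \
        size_t i;                                                   \
        for (i = 0; i < n; i++)                                     \
            a[i] OPEQ b[i];                                         \
    }

BITWISE_FUNC(and, &=)   /* expands to bitwise_and */
BITWISE_FUNC(or,  |=)   /* expands to bitwise_or  */
BITWISE_FUNC(xor, ^=)   /* expands to bitwise_xor */
```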
This resulted in a ~10x speedup on large bitarrays for me, using the following test:
```python
import bitarray
import timeit
a = bitarray.bitarray(50000)
b = bitarray.bitarray(50000)

def test_and():
    global a
    a &= b

def test_or():
    global a
    a |= b

def test_xor():
    global a
    a ^= b

print timeit.timeit("test_and()", "from __main__ import test_and")
print timeit.timeit("test_or()", "from __main__ import test_or")
print timeit.timeit("test_xor()", "from __main__ import test_xor")
```
```
upstream master:
20.3912520409
20.6214001179
20.5252711773
with this patch:
2.11912703514
2.14890694618
2.1437420845
```
About the `memcpy` usage in `simd_v16uc_op`: inspecting the compiler output, I found that `memcpy` does a good job. On my system it uses `movdqa` to copy memory to and from the xmm registers.
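For context, a sketch of what `vector_op` (used by `simd_v16uc_op`) boils down to, reconstructed from the diff hunks quoted further down; everything past the first `memcpy` line is my completion under the assumption of GCC vector extensions, not the PR verbatim:

```c
#include <string.h>

/* GCC-specific: a vector of 16 unsigned chars (hence "v16uc"). */
typedef unsigned char vec __attribute__ ((vector_size (16)));

/* memcpy here expresses an unaligned 16-byte load/store; the
 * compiler inlines it to movdqu/movups rather than calling libc. */
#define vector_op(A, B, OP) do {              \
        vec __a, __b, __r;                    \
        memcpy(&__a, (A), sizeof(vec));       \
        memcpy(&__b, (B), sizeof(vec));       \
        __r = __a OP __b;                     \
        memcpy((A), &__r, sizeof(vec));       \
    } while (0)
```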
I am unaware of how Python modules are built on non-Unix systems. Since these are GCC-specific options, will it break on Windows or Mac? If so, is there a way to degrade gracefully, say by putting a basic version and a high-speed GCC-optimized version in different C files and compiling and linking the appropriate one? If everything uses GCC, then I guess there is not really a problem.

Looks like you are already using the appropriate ifdefs to make sure it is portable.

@andre-merzky Do you have any opinions on this PR before I merge it? It looks pretty clean. No API changes, just consolidation of several functions into preprocessor-generated functions with optional high-speed implementations.

I can't really say much on the SIMD calls, that is not my area -- but it looks clean enough to me otherwise indeed. The define-concats do not make the code simpler or clearer, of course, but that is how it usually goes with perf optimization :)

Yeah, I was a bit concerned with the clever use of macros. But I think it is probably fine since it is only used for &, ^, |, etc.

If you don't mind me chiming in, I think the code looks great. The only thing is,

Thanks @eltoder, I like your suggestion about the defined macros.

@eltoder thanks for the feedback, I like the suggestion and cleaned it up accordingly. I'd also appreciate your opinion on using <emmintrin.h>.

This PR looks all good to go, but we might as well wait for feedback on emmintrin.h from @eltoder so we can all learn a thing or two.
bitarray/_bitarray.c (outdated)

```c
    for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) {    \
        vector_op((SELF)->ob_item + i, (OTHER)->ob_item + i, OP);  \
    }                                                              \
                                                                   \
```
Clever way to use the same counter for bulk operations and then the trailing bits.
eltoder left a comment:
@udoprog `<emmintrin.h>` is the old way; it only works on x86. Also, you're supposed to use intrinsics like `_mm_and_si128` with it instead of operators. I prefer `__attribute__ vector_size`, because it works on ARM and PPC, and the syntax is nicer.
Also, small comments inline.
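To make the contrast concrete, a sketch (not code from this PR) of the two styles:

```c
#include <emmintrin.h>   /* SSE2 intrinsics: x86 only */

static void and_sse2(unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) a, _mm_and_si128(va, vb));
}

/* GCC vector extension: plain operators, and the compiler can
 * target SSE, NEON (ARM) or AltiVec (PPC) from the same source. */
typedef unsigned char v16uc __attribute__ ((vector_size (16)));

static void and_vec(v16uc *a, const v16uc *b)
{
    *a &= *b;
}
```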
```c
 */
#define vector_op(A, B, OP) do {          \
        vec __a, __b, __r;                \
        memcpy(&__a, A, sizeof(vec));     \
```
If we make sure ob_item is 16 bytes aligned, we could get rid of memcpy, right?
I'm not sure how, could you elaborate?
Casting the pointers should work? But we'll have to align the memory for ob_item, e.g. by using posix_memalign() / _aligned_malloc().
To be clear, I'm asking if this is something you have tried, not suggesting you do it now. I expect it to generate fewer moves, since we'll be telling the compiler that the memory is aligned on 16 bytes.
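A hedged sketch of what that could look like; `alloc_item` and the stand-alone helper are illustrative names, and `_aligned_malloc` would be the Windows counterpart to `posix_memalign`:

```c
#include <stdlib.h>

typedef unsigned char vec __attribute__ ((vector_size (16)));

/* Allocate the buffer 16-byte aligned up front... */
static unsigned char *alloc_item(size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, nbytes) != 0)
        return NULL;
    return (unsigned char *) p;
}

/* ...then a direct cast is valid and the compiler may emit aligned
 * loads (movdqa) with no intermediate memcpy. */
static void and_aligned(unsigned char *a, const unsigned char *b)
{
    *(vec *) a &= *(const vec *) b;
}
```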
Ah, no, I haven't done anything to guarantee alignment. I did note that with the initial patch we performed an aligned move (movdqa), which was coincidental. Now it seems I get movdqu, which is the unaligned version. It doesn't appear to affect performance much with my limited test case, though, which is good, so I'm not sure it's worth aiming for.
Looking at the assembly, the loop largely seems to do a pretty good job at the moment.
There is one odd move which I would have expected it to optimize out, which I noticed when looking closer:
```asm
.L784:
        movq    %rdx, %rax
.L790:
        leaq    -16(%rax), %rcx
        movq    32(%rbx), %rdi    ; (SELF)->ob_item
        movq    %rcx, %rdx        ; (OTHER)->ob_item
        addq    32(%rbp), %rdx
        movdqu  -16(%rdi,%rax), %xmm0
        movdqu  (%rdx), %xmm1
        pand    %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        leaq    16(%rax), %rdx
        cmpq    %rdx, %rsi
        ja      .L784
```

After the latest patch (0b09add), this becomes:
```asm
.L796:
        movq    %rcx, %rax
.L822:
        movdqu  -16(%rdx,%rax), %xmm0
        leaq    16(%rax), %rcx
        movdqu  -16(%rsi,%rax), %xmm1
        cmpq    %rcx, %rdi
        pand    %xmm1, %xmm0
        movups  %xmm0, -16(%rdx,%rax)
        ja      .L796
```

Lo and behold, I go from:
```
0.521735191345
0.509098052979
0.512199163437
```

to:

```
0.393866062164
0.399346828461
0.393866062164
```
SO FAST!
It does not look like you changed the malloc to be aligned. How are you able to take advantage of this without aligning the memory?
Oh it looks like it is from moving the cast out of the loop. I see :)
@diamondman yes
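In other words, the win comes from computing the base pointers once, before the loop, rather than re-reading `ob_item` through the object on every iteration (the `movq 32(%rbx), %rdi` inside the first loop). An assumed shape of the change, with a stand-in struct rather than bitarray's real object layout:

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char vec __attribute__ ((vector_size (16)));

struct bits { size_t size; unsigned char *ob_item; };  /* stand-in */

static void and_bulk(struct bits *self, const struct bits *other)
{
    /* The hoist: read ob_item once, before the loop, instead of
     * re-loading it from the object on every iteration. */
    unsigned char *a = self->ob_item;
    const unsigned char *b = other->ob_item;
    size_t n = self->size, i;

    for (i = 0; i + sizeof(vec) <= n; i += sizeof(vec)) {
        vec va, vb;
        memcpy(&va, a + i, sizeof(vec));
        memcpy(&vb, b + i, sizeof(vec));
        va &= vb;
        memcpy(a + i, &va, sizeof(vec));
    }
    for (; i < n; i++)          /* same counter finishes the tail */
        a[i] &= b[i];
}
```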
@udoprog I see :-) Can you please post the new reference timings, so we know the actual speedup?
@diamondman Yes, compilers are fairly liberal with memcpy optimizations these days, and according to @udoprog's disassembly that is exactly what's happening here.
@eltoder I've updated the timings.
@udoprog Nice! So we get close to a 12x speedup.
Looks good then!
bitarray/_bitarray.c (outdated)

```c
#define BITWISE_FUNC_INTERNAL(SELF, OTHER, OP, OPEQ) do {          \
    Py_ssize_t i = 0;                                              \
                                                                   \
    for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) {    \
```
If you cache the size in a local variable, the compiler will have an easier time proving that it doesn't change inside the loop.
GCC already realized that with -O2, but per your suggestion I made it more explicit.
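The explicit form looks roughly like this; `BITWISE_LOOP` is a hypothetical name, and the real `BITWISE_FUNC_INTERNAL` body does more (the vector bulk loop plus the byte tail):

```c
#include <Python.h>

/* Cache the bound once; a const local is visibly loop-invariant,
 * whereas Py_SIZE(SELF) is a load the optimizer must reason about. */
#define BITWISE_LOOP(SELF, OTHER, OPEQ) do {                        \
        const Py_ssize_t size = Py_SIZE(SELF);                      \
        Py_ssize_t i;                                               \
        for (i = 0; i < size; i++)                                  \
            (SELF)->ob_item[i] OPEQ (OTHER)->ob_item[i];            \
    } while (0)
```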
Nice! Is there anything else you two want to do to this PR, or shall I accept it?
bitarray/_bitarray.c (outdated)

```c
enum op_type {
    OP_and,
    OP_or,
    OP_xor,
```
You can delete the enum too.
Done, thanks.
I am more than satisfied with this PR. If everyone agrees we are done here, I will accept it.