Conversation

@udoprog commented Oct 22, 2016

* Avoid the switch/case by generating `BITWISE_FUNC` and `BITARRAY_FUNC`
  (sketched below).
* Change the type of `zero` and `one` to `unsigned char`, since the compiler
  complains about overflows under `-pedantic`.
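
In spirit, the generation looks like this (a simplified sketch with
illustrative names, not the patch verbatim):

```c
#include <stddef.h>

/* Generate one function per operator instead of dispatching on an
 * op-code at run time.  `op` is the C operator token itself. */
#define BITWISE_FUNC(name, op)                                    \
    static void bitwise_##name(unsigned char *a,                  \
                               const unsigned char *b, size_t n)  \
    {                                                             \
        size_t i;                                                 \
        for (i = 0; i < n; i++)                                   \
            a[i] = a[i] op b[i];                                  \
    }

BITWISE_FUNC(and, &)  /* defines bitwise_and */
BITWISE_FUNC(or,  |)  /* defines bitwise_or  */
BITWISE_FUNC(xor, ^)  /* defines bitwise_xor */
```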

This resulted in a ~10x speedup on large bitarrays for me using the following
test.

```python
import bitarray
import timeit

a = bitarray.bitarray(50000)
b = bitarray.bitarray(50000)

def test_and():
    global a
    a &= b

def test_or():
    global a
    a |= b

def test_xor():
    global a
    a ^= b

print timeit.timeit("test_and()", "from __main__ import test_and")
print timeit.timeit("test_or()", "from __main__ import test_or")
print timeit.timeit("test_xor()", "from __main__ import test_xor")
```

```
upstream master (updated 2016-10):
4.73466801643
4.74092507362
4.6969268322

with this patch (updated 2016-10):
0.390768051147
0.403351068497
0.397214889526
```

About `memcpy` usage in `simd_v16uc_op`:

Inspecting the compiler output, I found that `memcpy` does a good job: on my
system it compiles down to `movdqa` moves to and from the xmm registers.
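
As a standalone illustration of that pattern (assuming the GCC/Clang
`vector_size` extension; the function name is made up):

```c
#include <string.h>

typedef unsigned char vec __attribute__((vector_size(16)));

/* memcpy into a local vector, operate, memcpy back.  A modern compiler
 * lowers each of these memcpys to a single 16-byte SSE move. */
static void v16_and(unsigned char *dst, const unsigned char *src)
{
    vec a, b;
    memcpy(&a, dst, sizeof(vec));
    memcpy(&b, src, sizeof(vec));
    a &= b;
    memcpy(dst, &a, sizeof(vec));
}
```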

@diamondman (Owner) commented:

I am unaware of how Python modules are built on non-Unix systems. Since these are GCC-specific options, will it break on Windows or Mac? If so, is there a way to have graceful degradation by, say, putting a basic version and a high-speed GCC-optimized version in different C files and compiling+linking the appropriate one?

If everything uses GCC, then I guess there is not really a problem.

@diamondman (Owner) commented:

Looks like you are already using the appropriate ifdefs to make sure it is portable.
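
For reference, the degradation can live inside one function with preprocessor
guards rather than separate C files; roughly (illustrative code, not the
patch's exact macros):

```c
#include <stddef.h>
#include <string.h>

/* One function, two paths: vectorized on GCC, a plain loop elsewhere. */
static void or_bytes(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;
#ifdef __GNUC__
    typedef unsigned char vec __attribute__((vector_size(16)));
    /* High-speed path: 16 bytes per iteration. */
    for (; i + sizeof(vec) <= n; i += sizeof(vec)) {
        vec va, vb;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        va |= vb;
        memcpy(a + i, &va, sizeof va);
    }
#endif
    /* Portable tail -- and the entire loop on other compilers. */
    for (; i < n; i++)
        a[i] |= b[i];
}
```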

@diamondman (Owner) commented:

@andre-merzky Do you have any opinions on this PR before I merge it? It looks pretty clean. No API changes, just consolidation of several functions into preprocessor-generated functions with optional high-speed implementations.

@andre-merzky commented:

I can't really say much on the SIMD calls, that is not my area -- but it looks clean enough to me otherwise, indeed. The define-concats do not make the code simpler or clearer, of course, but that is how it usually goes with perf optimization :)

@diamondman (Owner) commented:

Yeah, I was a bit concerned with the clever use of macros. But I think it is probably fine, since they are only used for `&`, `^`, `|`, etc.

@eltoder commented Oct 23, 2016

If you don't mind me chiming in, I think the code looks great. The only thing is, `IS_GCC` can be renamed to `HAS_VECTORS` or `SUPPORTS_VECTORS` and set to `(defined(__GNUC__) || defined(__clang__))`, since Clang supports the same syntax. It is slightly moot, since Clang appears to set `__GNUC__` as well, but is the right-er thing to do.
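
Spelled out, one robust way to write that (using the `#if`/`#else` form, since
`defined` inside a macro body is not portable across preprocessors):

```c
#if defined(__GNUC__) || defined(__clang__)
#  define HAS_VECTORS 1
#else
#  define HAS_VECTORS 0
#endif
```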

@diamondman (Owner) commented:

Thanks, @eltoder.
I have a lot of web server and embedded hardware experience, but not a lot of experience getting max performance out of Intel machines using C and GCC, so I really appreciate the 'chiming in'.

I like your suggestion about the defined macros.

@udoprog (Author) commented Oct 23, 2016

@eltoder Thanks for the feedback; I like the suggestion and cleaned it up accordingly. I'd also appreciate your opinion on using `emmintrin.h` and `__m128i` instead of the attribute syntax (see latest push). It should be more portable across compilers and yield the same result. `__m128i` should also make it a bit clearer that the feature is SSE2-specific.

@diamondman (Owner) commented:

This PR looks all good to go, but we might as well wait for feedback on `emmintrin.h` from @eltoder so we can all learn a thing or two.

```c
    for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) { \
        vector_op((SELF)->ob_item + i, (OTHER)->ob_item + i, OP); \
    } \
```
@diamondman (Owner) commented:

Clever way to use the same counter for bulk operations and then the trailing bits.

@eltoder left a comment:

@udoprog `<emmintrin.h>` is the old way; it only works on x86. Also, you're supposed to use intrinsics like `_mm_and_si128` with it instead of operators. I prefer `__attribute__((vector_size))`, because it works on ARM and PPC, and the syntax is nicer.

Also, small comments inline.
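
For contrast, the same 16-byte AND written in the intrinsics style being
described (illustrative function name; the PR ultimately kept the attribute
syntax):

```c
#include <emmintrin.h>  /* SSE2 */

/* x86-only: __m128i plus explicit intrinsics instead of plain operators. */
static void v16_and_sse2(unsigned char *dst, const unsigned char *src)
{
    __m128i a = _mm_loadu_si128((const __m128i *)dst);
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(a, b));
}
```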

```c
*/
#define vector_op(A, B, OP) do { \
        vec __a, __b, __r; \
        memcpy(&__a, A, sizeof(vec)); \
```
@eltoder commented:

If we make sure `ob_item` is 16-byte aligned, we could get rid of `memcpy`, right?

@udoprog (Author) commented:

I'm not sure how, could you elaborate?

@eltoder commented:

Casting the pointers should work? But we'll have to align the memory for `ob_item`, e.g. by using `posix_memalign()` / `_aligned_malloc()`.
To be clear, I'm asking if this is something you have tried, not suggesting you do it now. I expect it to generate fewer moves, as we'll tell the compiler that the memory is aligned on 16 bytes.
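
A sketch of what that combination could look like, assuming `posix_memalign`
for the allocation and GCC's `may_alias` attribute to make the pointer casts
well-defined (illustrative, not this PR's code):

```c
#include <stdlib.h>

/* may_alias makes casting unsigned char* to vec* well-defined in GCC. */
typedef unsigned char vec
    __attribute__((vector_size(16), may_alias));

/* 16-byte-aligned buffer: lets the compiler use aligned moves (movdqa). */
static unsigned char *alloc_bits(size_t nbytes)
{
    void *p = NULL;
    return posix_memalign(&p, 16, nbytes) == 0 ? p : NULL;
}

/* With alignment guaranteed, operate through casts -- no memcpy needed. */
static void and_aligned(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i;
    for (i = 0; i + sizeof(vec) <= n; i += sizeof(vec))
        *(vec *)(a + i) &= *(const vec *)(b + i);
    for (; i < n; i++)
        a[i] &= b[i];
}
```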

@udoprog (Author) commented Oct 23, 2016:

Ah, no, I haven't done anything to guarantee alignment. I did note that with the initial patch we performed an aligned move (`movdqa`), which was coincidental. Now it seems I get `movdqu`, which is the unaligned version. It doesn't appear to affect performance much with my limited test case though, which is good, so I'm not sure it's worth aiming for.

Looking at the assembly, the loop largely seems to do a pretty good job at the moment.
There is one odd move that I would have expected it to optimize out, which I noticed when looking closer:

```
.L784:
        movq    %rdx, %rax
.L790:
        leaq    -16(%rax), %rcx
        movq    32(%rbx), %rdi ; (SELF)->ob_item
        movq    %rcx, %rdx     ; (OTHER)->ob_item
        addq    32(%rbp), %rdx
        movdqu  -16(%rdi,%rax), %xmm0
        movdqu  (%rdx), %xmm1
        pand    %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        leaq    16(%rax), %rdx
        cmpq    %rdx, %rsi
        ja      .L784
```

After the latest patch (0b09add), this becomes:

```
.L796:
        movq    %rcx, %rax
.L822:
        movdqu  -16(%rdx,%rax), %xmm0
        leaq    16(%rax), %rcx
        movdqu  -16(%rsi,%rax), %xmm1
        cmpq    %rcx, %rdi
        pand    %xmm1, %xmm0
        movups  %xmm0, -16(%rdx,%rax)
        ja      .L796
```

Lo and behold, I go from:

```
0.521735191345
0.509098052979
0.512199163437
```

to:

```
0.393866062164
0.399346828461
0.393866062164
```

@diamondman (Owner) commented Oct 23, 2016:

SO FAST!
It does not look like you changed the malloc to be aligned. How are you able to take advantage of this without aligning the memory?

Oh, it looks like it is from moving the cast out of the loop. I see :)

@eltoder commented:

@udoprog I see :-) Can you please post the new reference timings, so we know the actual speedup?

@diamondman Yes, compilers are fairly liberal with `memcpy` optimizations these days, and according to @udoprog's disassembly this is exactly what's happening here.

@udoprog (Author) commented:

@eltoder I've updated the timings.

@eltoder commented:

@udoprog Nice! So we get close to a 12x speedup.

@diamondman (Owner) commented:

Looks good then!

```c
#define BITWISE_FUNC_INTERNAL(SELF, OTHER, OP, OPEQ) do { \
        Py_ssize_t i = 0; \
\
        for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) { \
```
@eltoder commented:

If you cache the size in a local variable, the compiler will have an easier time proving that it doesn't change inside the loop.
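
A tiny standalone illustration of the point (not the macro itself):

```c
#include <stddef.h>

/* The length lives behind a pointer while the loop writes through
 * another pointer; without the local, the compiler may have to
 * reload the bound on every iteration. */
static void xor_bytes(unsigned char *a, const unsigned char *b,
                      const size_t *len)
{
    const size_t n = *len;  /* cached once: provably loop-invariant */
    size_t i;
    for (i = 0; i < n; i++)
        a[i] ^= b[i];
}
```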

@udoprog (Author) commented:

GCC realized that with `-O2`, but as per your suggestion I made it more explicit.

@diamondman (Owner) commented:

Nice! Is there anything else you two want to do to this PR, or shall I accept it?

```c
enum op_type {
    OP_and,
    OP_or,
    OP_xor,
```
@eltoder commented:

You can delete the enum too.

@udoprog (Author) commented:

Done, thanks.

@diamondman (Owner) commented:

I am more than satisfied with this PR. If everyone agrees we are done here, I will accept it.
