Conversation

@udoprog commented Oct 22, 2016

* Avoid the switch/case by generating `BITWISE_FUNC` and `BITARRAY_FUNC`
  (sketched below).
* Change the type of `zero` and `one` to `unsigned char`, since the compiler
  complains about overflows under `-pedantic`.
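
In spirit, the generation looks like this (a simplified sketch with
illustrative names, not the patch verbatim):

```c
#include <stddef.h>

/* Generate one function per operator instead of dispatching on an
 * op-code at run time.  `op` is the C operator token itself. */
#define BITWISE_FUNC(name, op)                                    \
    static void bitwise_##name(unsigned char *a,                  \
                               const unsigned char *b, size_t n)  \
    {                                                             \
        size_t i;                                                 \
        for (i = 0; i < n; i++)                                   \
            a[i] = a[i] op b[i];                                  \
    }

BITWISE_FUNC(and, &)  /* defines bitwise_and */
BITWISE_FUNC(or,  |)  /* defines bitwise_or  */
BITWISE_FUNC(xor, ^)  /* defines bitwise_xor */
```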

This resulted in a ~10x speedup on large bitarrays for me using the following
test.

```python
import bitarray
import timeit

a = bitarray.bitarray(50000)
b = bitarray.bitarray(50000)

def test_and():
    global a
    a &= b

def test_or():
    global a
    a |= b

def test_xor():
    global a
    a ^= b

print timeit.timeit("test_and()", "from __main__ import test_and")
print timeit.timeit("test_or()", "from __main__ import test_or")
print timeit.timeit("test_xor()", "from __main__ import test_xor")
```

```
upstream master (updated 2016-10):
4.73466801643
4.74092507362
4.6969268322

with this patch (updated 2016-10):
0.390768051147
0.403351068497
0.397214889526
```

About `memcpy` usage in `simd_v16uc_op`:

Inspecting the compiler output, I found that `memcpy` does a good job: on my
system it compiles down to `movdqa` moves to and from the xmm registers.
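
As a standalone illustration of that pattern (assuming the GCC/Clang
`vector_size` extension; the function name is made up):

```c
#include <string.h>

typedef unsigned char vec __attribute__((vector_size(16)));

/* memcpy into a local vector, operate, memcpy back.  A modern compiler
 * lowers each of these memcpys to a single 16-byte SSE move. */
static void v16_and(unsigned char *dst, const unsigned char *src)
{
    vec a, b;
    memcpy(&a, dst, sizeof(vec));
    memcpy(&b, src, sizeof(vec));
    a &= b;
    memcpy(dst, &a, sizeof(vec));
}
```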

@diamondman (Owner) commented:

I am unaware of how Python modules are built on non-Unix systems. Since these are GCC-specific options, will it break on Windows or Mac? If so, is there a way to have graceful degradation by, say, putting a basic version and a high-speed GCC-optimized version in different C files and compiling+linking the appropriate one?

If everything uses GCC, then I guess there is not really a problem.

@diamondman (Owner) commented:

Looks like you are already using the appropriate ifdefs to make sure it is portable.
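
For reference, the degradation can live inside one function with preprocessor
guards rather than separate C files; roughly (illustrative code, not the
patch's exact macros):

```c
#include <stddef.h>
#include <string.h>

/* One function, two paths: vectorized on GCC, a plain loop elsewhere. */
static void or_bytes(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;
#ifdef __GNUC__
    typedef unsigned char vec __attribute__((vector_size(16)));
    /* High-speed path: 16 bytes per iteration. */
    for (; i + sizeof(vec) <= n; i += sizeof(vec)) {
        vec va, vb;
        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        va |= vb;
        memcpy(a + i, &va, sizeof va);
    }
#endif
    /* Portable tail -- and the entire loop on other compilers. */
    for (; i < n; i++)
        a[i] |= b[i];
}
```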

@diamondman (Owner) commented:

@andre-merzky Do you have any opinions on this PR before I merge it? It looks pretty clean. No API changes, just consolidation of several functions into preprocessor-generated functions with optional high-speed implementations.

@andre-merzky commented:

I can't really say much on the SIMD calls, that is not my area -- but it looks clean enough to me otherwise, indeed. The define-concats do not make the code simpler or clearer, of course, but that is how it usually goes with perf optimization :)

@diamondman (Owner) commented:

Yeah, I was a bit concerned with the clever use of macros. But I think it is probably fine, since they are only used for `&`, `^`, `|`, etc.

@eltoder commented Oct 23, 2016

If you don't mind me chiming in, I think the code looks great. The only thing is, `IS_GCC` can be renamed to `HAS_VECTORS` or `SUPPORTS_VECTORS` and set to `(defined(__GNUC__) || defined(__clang__))`, since Clang supports the same syntax. It is slightly moot, since Clang appears to set `__GNUC__` as well, but is the right-er thing to do.
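
Spelled out, one robust way to write that (using the `#if`/`#else` form, since
`defined` inside a macro body is not portable across preprocessors):

```c
#if defined(__GNUC__) || defined(__clang__)
#  define HAS_VECTORS 1
#else
#  define HAS_VECTORS 0
#endif
```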

@diamondman (Owner) commented:

Thanks, @eltoder.
I have a lot of web server and embedded hardware experience, but not a lot of experience getting max performance out of Intel machines using C and GCC, so I really appreciate the 'chiming in'.

I like your suggestion about the defined macros.

@udoprog (Author) commented Oct 23, 2016

@eltoder Thanks for the feedback; I like the suggestion and cleaned it up accordingly. I'd also appreciate your opinion on using `emmintrin.h` and `__m128i` instead of the attribute syntax (see latest push). It should be more portable across compilers and yield the same result. `__m128i` should also make it a bit clearer that the feature is SSE2-specific.

@diamondman (Owner) commented:

This PR looks all good to go, but we might as well wait for feedback on `emmintrin.h` from @eltoder so we can all learn a thing or two.

```c
    for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) { \
        vector_op((SELF)->ob_item + i, (OTHER)->ob_item + i, OP); \
    } \
```
@diamondman (Owner) commented:

Clever way to use the same counter for bulk operations and then the trailing bits.

@eltoder left a comment:

@udoprog `<emmintrin.h>` is the old way; it only works on x86. Also, you're supposed to use intrinsics like `_mm_and_si128` with it instead of operators. I prefer `__attribute__((vector_size))`, because it works on ARM and PPC, and the syntax is nicer.

Also, small comments inline.
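
For contrast, the same 16-byte AND written in the intrinsics style being
described (illustrative function name; the PR ultimately kept the attribute
syntax):

```c
#include <emmintrin.h>  /* SSE2 */

/* x86-only: __m128i plus explicit intrinsics instead of plain operators. */
static void v16_and_sse2(unsigned char *dst, const unsigned char *src)
{
    __m128i a = _mm_loadu_si128((const __m128i *)dst);
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_and_si128(a, b));
}
```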

```c
*/
#define vector_op(A, B, OP) do { \
        vec __a, __b, __r; \
        memcpy(&__a, A, sizeof(vec)); \
```
@eltoder commented:

If we make sure `ob_item` is 16-byte aligned, we could get rid of `memcpy`, right?

@udoprog (Author) commented:

I'm not sure how, could you elaborate?

@eltoder commented:

Casting the pointers should work? But we'll have to align the memory for `ob_item`, e.g. by using `posix_memalign()` / `_aligned_malloc()`.
To be clear, I'm asking if this is something you have tried, not suggesting you do it now. I expect it to generate fewer moves, as we'll tell the compiler that the memory is aligned on 16 bytes.
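
A sketch of what that combination could look like, assuming `posix_memalign`
for the allocation and GCC's `may_alias` attribute to make the pointer casts
well-defined (illustrative, not this PR's code):

```c
#include <stdlib.h>

/* may_alias makes casting unsigned char* to vec* well-defined in GCC. */
typedef unsigned char vec
    __attribute__((vector_size(16), may_alias));

/* 16-byte-aligned buffer: lets the compiler use aligned moves (movdqa). */
static unsigned char *alloc_bits(size_t nbytes)
{
    void *p = NULL;
    return posix_memalign(&p, 16, nbytes) == 0 ? p : NULL;
}

/* With alignment guaranteed, operate through casts -- no memcpy needed. */
static void and_aligned(unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i;
    for (i = 0; i + sizeof(vec) <= n; i += sizeof(vec))
        *(vec *)(a + i) &= *(const vec *)(b + i);
    for (; i < n; i++)
        a[i] &= b[i];
}
```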

@udoprog (Author) commented Oct 23, 2016:

Ah, no, I haven't done anything to guarantee alignment. I did note that with the initial patch we performed an aligned move (`movdqa`), which was coincidental. Now it seems I get `movdqu`, which is the unaligned version. It doesn't appear to affect performance much with my limited test case though, which is good, so I'm not sure it's worth aiming for.

Looking at the assembly, the loop largely seems to do a pretty good job at the moment.
There is one odd move that I would have expected it to optimize out, which I noticed when looking closer:

```
.L784:
        movq    %rdx, %rax
.L790:
        leaq    -16(%rax), %rcx
        movq    32(%rbx), %rdi ; (SELF)->ob_item
        movq    %rcx, %rdx     ; (OTHER)->ob_item
        addq    32(%rbp), %rdx
        movdqu  -16(%rdi,%rax), %xmm0
        movdqu  (%rdx), %xmm1
        pand    %xmm1, %xmm0
        movups  %xmm0, (%rdx)
        leaq    16(%rax), %rdx
        cmpq    %rdx, %rsi
        ja      .L784
```

After the latest patch (0b09add), this becomes:

```
.L796:
        movq    %rcx, %rax
.L822:
        movdqu  -16(%rdx,%rax), %xmm0
        leaq    16(%rax), %rcx
        movdqu  -16(%rsi,%rax), %xmm1
        cmpq    %rcx, %rdi
        pand    %xmm1, %xmm0
        movups  %xmm0, -16(%rdx,%rax)
        ja      .L796
```

Lo and behold, I go from:

```
0.521735191345
0.509098052979
0.512199163437
```

to:

```
0.393866062164
0.399346828461
0.393866062164
```

@diamondman (Owner) commented Oct 23, 2016:

SO FAST!
It does not look like you changed the malloc to be aligned. How are you able to take advantage of this without aligning the memory?

Oh, it looks like it is from moving the cast out of the loop. I see :)

@eltoder commented:

@udoprog I see :-) Can you please post the new reference timings, so we know the actual speedup?

@diamondman Yes, compilers are fairly liberal with `memcpy` optimizations these days, and according to @udoprog's disassembly this is exactly what's happening here.

@udoprog (Author) commented:

@eltoder I've updated the timings.

@eltoder commented:

@udoprog Nice! So we get close to a 12x speedup.

@diamondman (Owner) commented:

Looks good then!

```c
#define BITWISE_FUNC_INTERNAL(SELF, OTHER, OP, OPEQ) do { \
        Py_ssize_t i = 0; \
\
        for (; i + sizeof(vec) < Py_SIZE(SELF); i += sizeof(vec)) { \
```
@eltoder commented:

If you cache the size in a local variable, the compiler will have an easier time proving that it doesn't change inside the loop.
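
A tiny standalone illustration of the point (not the macro itself):

```c
#include <stddef.h>

/* The length lives behind a pointer while the loop writes through
 * another pointer; without the local, the compiler may have to
 * reload the bound on every iteration. */
static void xor_bytes(unsigned char *a, const unsigned char *b,
                      const size_t *len)
{
    const size_t n = *len;  /* cached once: provably loop-invariant */
    size_t i;
    for (i = 0; i < n; i++)
        a[i] ^= b[i];
}
```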

@udoprog (Author) commented:

GCC realized that with `-O2`, but as per your suggestion I made it more explicit.

@diamondman (Owner) commented:

Nice! Is there anything else you two want to do to this PR, or shall I accept it?

```c
enum op_type {
    OP_and,
    OP_or,
    OP_xor,
```
@eltoder commented:

You can delete the enum too.

@udoprog (Author) commented:

Done, thanks.

@diamondman (Owner) commented:

I am more than satisfied with this PR. If everyone agrees we are done here, I will accept it.
