[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392
Description
Problem Statement
In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.
#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates if performance can be improved further using SIMD (related to #9423).
1 — Baseline
The current approach uses linear search when the search space is small or binary search otherwise. This is because linear search is far superior for small arrays as it avoids the penalty of branch mispredictions and pipeline stalls and accesses memory sequentially. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
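The hybrid approach described above can be sketched as follows — a minimal illustration, not the exact OpenSearch implementation (class and method names are mine; the 64-element threshold comes from #9727):

```java
// Sketch of the baseline: linear search for small arrays, binary search
// otherwise. Names are illustrative; the 64-element threshold is from #9727.
public class HybridRounder {
    private static final int LINEAR_SEARCH_MAX_SIZE = 64;

    /** Returns the largest element of values[0..size) that is <= key.
     *  Assumes values is sorted ascending and key >= values[0]. */
    static long round(long[] values, int size, long key) {
        if (size <= LINEAR_SEARCH_MAX_SIZE) {
            // Sequential scan: predictable branches, cache-friendly access.
            int i = 1;
            while (i < size && values[i] <= key) {
                i++;
            }
            return values[i - 1];
        }
        // Binary search for the last element <= key.
        int lo = 0, hi = size - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (values[mid] <= key) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return values[lo];
    }
}
```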
2 — Vectorized Linear Search
Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array so that its length is a multiple of the vector width, using a sentinel larger than any key (e.g., Long.MAX_VALUE) so the padded lanes never produce a match.
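A minimal sketch of this idea using the (incubating) Java Vector API — the class and method names here are illustrative rather than the POC's exact code, and the Long.MAX_VALUE padding is an assumption so that padded lanes always compare greater than any key. Note that it requires `--add-modules jdk.incubator.vector`:

```java
import java.util.Arrays;

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Requires: --add-modules jdk.incubator.vector
public class VectorizedLinearRounder {
    // 4 lanes on AVX2, 8 on AVX-512.
    private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    private final long[] values;

    VectorizedLinearRounder(long[] sorted) {
        // Pad to a multiple of the vector length with Long.MAX_VALUE so the
        // padded lanes always compare greater than any key.
        int lanes = SPECIES.length();
        int padded = ((sorted.length + lanes - 1) / lanes) * lanes;
        values = new long[padded];
        Arrays.fill(values, Long.MAX_VALUE);
        System.arraycopy(sorted, 0, values, 0, sorted.length);
    }

    /** Largest value <= key; assumes key >= values[0]. */
    long round(long key) {
        for (int i = 0; i < values.length; i += SPECIES.length()) {
            LongVector block = LongVector.fromArray(SPECIES, values, i);
            // One vector compare checks SPECIES.length() elements at once.
            VectorMask<Long> greater = block.compare(VectorOperators.GT, key);
            int lane = greater.firstTrue(); // index of first element > key
            if (lane < SPECIES.length()) {
                return values[i + lane - 1]; // its predecessor is the answer
            }
        }
        return values[values.length - 1]; // key >= every element, no padding
    }
}
```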
3 — Vectorized B-Tree Search
Here, the idea is to improve binary search using vectorization. Since binary search compares values at non-contiguous memory locations, which is non-trivial to load into a vector register, the first step is to fix the memory layout.
- The first idea is to use the Eytzinger layout. It is an implicit binary tree packed into an array, where the left and right sub-trees of a node at index $i$ are stored at indices $2i+1$ and $2i+2$ respectively. This is also how binary heaps are laid out in memory.
- The second idea is to store multiple values ($B$) per node, which makes the branching factor $1+B$. This is essentially a B-tree! But unlike B-trees, where the number of values is chosen to fit in a single cache line, we choose it to be equal to the number of vector lanes.
- The underlying array is padded to the nearest multiple of vector lanes; we do not need to provision memory for completely empty nodes. I also experimented with a complete tree (all levels filled except the last, which is filled from left to right) but didn't see much difference in performance.
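For intuition, here is a plain (non-vectorized) sketch of the Eytzinger layout and a round-down search over it — helper names are mine, not from the POC:

```java
public class EytzingerRounder {
    /** Rearranges a sorted array into the Eytzinger layout:
     *  the children of the node at index i live at 2i+1 and 2i+2. */
    static long[] build(long[] sorted) {
        long[] tree = new long[sorted.length];
        fill(sorted, tree, 0, new int[] { 0 });
        return tree;
    }

    private static void fill(long[] sorted, long[] tree, int i, int[] next) {
        if (i >= tree.length) return;
        fill(sorted, tree, 2 * i + 1, next);  // smaller values go left
        tree[i] = sorted[next[0]++];          // in-order visit keeps BST order
        fill(sorted, tree, 2 * i + 2, next);  // larger values go right
    }

    /** Largest value <= key; assumes key >= the smallest value. */
    static long round(long[] tree, long key) {
        long best = Long.MIN_VALUE;
        int i = 0;
        while (i < tree.length) {
            if (tree[i] <= key) {
                best = tree[i];    // candidate; look for a larger one right
                i = 2 * i + 2;
            } else {
                i = 2 * i + 1;     // too big; go left
            }
        }
        return best;
    }
}
```

Unlike binary search over a sorted array, each descent step here touches a predictable pair of adjacent children, which is what makes the layout amenable to vector loads.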
To find the round-down point in this layout, we start the search at the root node. At each node, a single vector comparison checks the key against all of the node's values at once; the number of values less than or equal to the key identifies both the best candidate within the node and the child to descend into.
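Putting the two ideas together, a sketch of the vectorized B-tree search (again using the incubating Vector API; the node numbering and helper names are my own and not necessarily the POC's):

```java
import java.util.Arrays;

import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Requires: --add-modules jdk.incubator.vector
public class VectorizedBtreeRounder {
    private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;
    private static final int B = SPECIES.length(); // values per node = lanes

    private final long[] tree;

    VectorizedBtreeRounder(long[] sorted) {
        // Pad to a whole number of nodes; padding lanes hold Long.MAX_VALUE,
        // which compares greater than any key.
        int padded = ((sorted.length + B - 1) / B) * B;
        tree = new long[padded];
        Arrays.fill(tree, Long.MAX_VALUE);
        fill(sorted, 0, new int[] { 0 });
    }

    // In-order traversal of the implicit (B+1)-ary tree: node k stores its B
    // values at tree[k*B .. k*B+B-1]; its children are nodes k*(B+1)+1+j.
    private void fill(long[] sorted, int k, int[] next) {
        if (k * B >= tree.length) return;
        for (int j = 0; j < B; j++) {
            fill(sorted, k * (B + 1) + 1 + j, next);
            if (next[0] < sorted.length) {
                tree[k * B + j] = sorted[next[0]++];
            }
        }
        fill(sorted, k * (B + 1) + 1 + B, next);
    }

    /** Largest value <= key; assumes key >= the smallest value. */
    long round(long key) {
        long best = Long.MIN_VALUE;
        int k = 0;
        while (k * B < tree.length) {
            LongVector node = LongVector.fromArray(SPECIES, tree, k * B);
            // Rank = how many of the node's B values are <= key; it is both
            // the offset of the best candidate and the child to descend into.
            int rank = node.compare(VectorOperators.LE, key).trueCount();
            if (rank > 0) {
                best = tree[k * B + rank - 1];
            }
            k = k * (B + 1) + 1 + rank;
        }
        return best;
    }
}
```

Within a node the values are sorted, so the LE mask is always a prefix and its trueCount equals the rank; on x86 this can map to a vector compare followed by a population count of the mask, avoiding the per-lane scan discussed later.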
POC code
Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java
Results
Results are from a c6i.2xlarge EC2 instance (Intel Xeon Platinum 8375C CPU @ 2.90GHz), which supports full 512-bit registers (8 vector lanes for long).
```
Cpuinfo Version: 7.0.0
Vendor ID Raw: GenuineIntel
Hardware Raw:
Brand Raw: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Hz Advertised Friendly: 2.9000 GHz
Hz Actual Friendly: 3.4986 GHz
Hz Advertised: (2900000000, 0)
Hz Actual: (3498596000, 0)
Arch: X86_64
Bits: 64
Count: 16
Arch String Raw: x86_64
L1 Data Cache Size: 384 KiB (8 instances)
L1 Instruction Cache Size: 256 KiB (8 instances)
L2 Cache Size: 10 MiB (8 instances)
L2 Cache Line Size: 256
L2 Cache Associativity: 6
L3 Cache Size: 56623104
Stepping: 6
Model: 106
Family: 6
Processor Type:
Flags: 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, arch_capabilities, avx, avx2, avx512_bitalg, avx512_vbmi2, avx512_vnni, avx512_vpopcntdq, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq, bmi1, bmi2, clflush, clflushopt, clwb, cmov, constant_tsc, cpuid, cx16, cx8, de, erms, f16c, flush_l1d, fma, fpu, fsgsbase, fxsr, gfni, ht, hypervisor, ibpb, ibrs, ibrs_enhanced, ida, invpcid, invpcid_single, lahf_lm, lm, mca, mce, md_clear, mmx, movbe, msr, mtrr, nonstop_tsc, nopl, nx, ospke, osxsave, pae, pat, pcid, pclmulqdq, pdpe1gb, pge, pku, pni, popcnt, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, ss, ssbd, sse, sse2, sse4_1, sse4_2, ssse3, stibp, syscall, tme, tsc, tsc_adjust, tsc_deadline_timer, tsc_known_freq, tscdeadline, vaes, vme, vpclmulqdq, wbnoinvd, x2apic, xgetbv1, xsave, xsavec, xsaveopt, xsaves, xtopology
```
Micro Benchmarks
These results compare the performance of looking up round-down points for 1 million random keys drawn from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.
Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java
| size | binary | linear | hybrid | eytzinger | vectorized_linear | vectorized_eytzinger | unrolled_eytzinger | vectorized_btree |
|---|---|---|---|---|---|---|---|---|
| 1 | 295.947 | 323.76 | 301.949 | 338.849 | 244.832 | 204.102 | 253.831 | 234.225 |
| 2 | 90.548 | 93.462 | 90.892 | 91.012 | 244.903 | 204.151 | 87.779 | 234.251 |
| 3 | 71.746 | 76.643 | 76.824 | 74.248 | 244.886 | 204.12 | 106.255 | 234.136 |
| 4 | 65.067 | 68.915 | 69.974 | 61.451 | 244.737 | 204.287 | 89.979 | 234.172 |
| 5 | 55.412 | 65.889 | 66.058 | 57.786 | 244.869 | 204.388 | 79.007 | 234.26 |
| 6 | 51.574 | 64.382 | 63.812 | 53.777 | 244.905 | 204.282 | 69.455 | 234.156 |
| 7 | 48.416 | 61.33 | 62.01 | 50.647 | 244.747 | 204.372 | 65.809 | 234.111 |
| 8 | 47.702 | 60.355 | 60.123 | 47.504 | 162.481 | 204.162 | 63.483 | 233.679 |
| 9 | 45.354 | 59.581 | 59.805 | 46.404 | 121.948 | 148.625 | 56.026 | 114.642 |
| 10 | 43.744 | 58.828 | 58.401 | 44.667 | 101.297 | 120.654 | 52.898 | 99.97 |
| 12 | 40.655 | 56.81 | 56.354 | 42.518 | 82.603 | 92.188 | 53.281 | 83.234 |
| 14 | 38.731 | 56.113 | 56.068 | 40.256 | 77.681 | 77.288 | 51.574 | 75.347 |
| 16 | 37.615 | 54.779 | 54.723 | 37.298 | 74.428 | 72.294 | 50.882 | 73.562 |
| 18 | 36.908 | 53.505 | 53.931 | 36.632 | 68.818 | 78.296 | 49.609 | 124.952 |
| 20 | 35.854 | 52.526 | 52.702 | 36.131 | 65.477 | 82.456 | 49.973 | 107.786 |
| 22 | 35.123 | 51.663 | 52.095 | 35.515 | 65.466 | 86.55 | 50.202 | 97.138 |
| 24 | 34.347 | 50.718 | 51.107 | 34.691 | 62.389 | 89.973 | 50.378 | 88.548 |
| 26 | 32.883 | 50.147 | 50.473 | 33.915 | 62.359 | 97.614 | 48.697 | 136.888 |
| 29 | 30.583 | 48.334 | 49.443 | 32.826 | 60.524 | 100.794 | 46.673 | 115.876 |
| 32 | 29.595 | 47.177 | 47.937 | 31.706 | 56.593 | 68.765 | 40.408 | 66.419 |
| 37 | 29.848 | 46.967 | 46.555 | 31.43 | 57.42 | 72.925 | 39.564 | 77.961 |
| 41 | 29.811 | 45.541 | 45.27 | 30.857 | 52.363 | 76.323 | 39.398 | 83.197 |
| 45 | 29.556 | 44.105 | 43.571 | 29.062 | 50.98 | 79.744 | 39.099 | 85.851 |
| 49 | 28.931 | 42.526 | 42.466 | 29.751 | 52.5 | 81.896 | 37.444 | 86.226 |
| 54 | 27.374 | 40.689 | 41.094 | 28.304 | 51.97 | 85.37 | 37.668 | 87.361 |
| 60 | 25.949 | 39.313 | 39.48 | 27.921 | 50.749 | 88.148 | 36.952 | 90.138 |
| 64 | 25.58 | 37.665 | 38.654 | 27.902 | 49.935 | 90.495 | 35.73 | 90.597 |
| 74 | 25.846 | 37.815 | 35.785 | 27.282 | 48.425 | 93.44 | 35.207 | 92.326 |
| 83 | 25.802 | 36.663 | 35.015 | 27.173 | 44.904 | 88.798 | 33.979 | 107.161 |
| 90 | 25.667 | 35.771 | 34.606 | 26.868 | 46.483 | 101.417 | 34.684 | 87.596 |
| 98 | 25.189 | 35.14 | 34.186 | 26.426 | 45.846 | 84.157 | 34.626 | 74.626 |
| 108 | 24.39 | 34.062 | 33.866 | 25.769 | 45.428 | 72.028 | 33.819 | 67.345 |
| 118 | 23.36 | 32.94 | 33.167 | 25.148 | 45.474 | 63.726 | 33.487 | 61.756 |
| 128 | 22.822 | 32.137 | 32.764 | 24.592 | 44.685 | 58.101 | 33.241 | 58.654 |
| 144 | 22.918 | 31.154 | 29.949 | 24.454 | 44.492 | 54.808 | 32.561 | 55.242 |
| 159 | 22.866 | 30.141 | 29.577 | 24.587 | 43.129 | 54.998 | 32.347 | 55.568 |
| 171 | 22.899 | 29.35 | 29.319 | 24.223 | 42.683 | 55.818 | 32.496 | 57.786 |
| 187 | 22.614 | 28.353 | 28.732 | 23.867 | 42.226 | 58.316 | 32.375 | 60.868 |
| 204 | 22.161 | 27.33 | 28.79 | 23.311 | 41.591 | 60.952 | 32.127 | 62.728 |
| 229 | 21.273 | 25.934 | 28.433 | 22.11 | 40.616 | 64.439 | 31.989 | 66.387 |
| 256 | 20.548 | 24.543 | 27.909 | 21.573 | 40.063 | 66.471 | 31.358 | 68.556 |
Macro Benchmarks
These results compare the time to execute the following aggregation with varying intervals on the noaa dataset.
```
{
  "size": 0,
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```

| Interval | Num Buckets | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|---|
| year | 3 | 33659481 | 828 | 562 | 563 |
| quarter | 12 | 33659481 | 1046 | 887 | 878 |
| month | 36 | 33659481 | 1170 | 1408 | 1241 |
These results compare the time to execute the following aggregation with an increasing number of months.

```
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2014-01-01",
        "lt": "2015-01-01"
      }
    }
  },
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```

| Num Months | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|
| 12 | 11392659 | 438 | 336 | 296 |
| 14 | 13187069 | 510 | 393 | 337 |
| 16 | 15073450 | 586 | 456 | 383 |
| 18 | 17009688 | 648 | 584 | 485 |
| 20 | 18957437 | 735 | 642 | 537 |
| 22 | 20865502 | 804 | 727 | 592 |
| 24 | 22745413 | 874 | 789 | 629 |
| 26 | 24548269 | 927 | 877 | 717 |
| 28 | 26395510 | 960 | 985 | 798 |
| 30 | 28239883 | 1018 | 1085 | 896 |
| 32 | 30094407 | 1090 | 1163 | 960 |
| 34 | 31896193 | 1148 | 1303 | 1162 |
| 36 | 33659481 | 1199 | 1413 | 1255 |
Hot Methods Profile
To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.
```
70.69% jdk/incubator/vector/Long512Vector$Long512Mask.firstTrue ()I Inlined
 6.63% org/opensearch/common/util/ReorganizingLongHash.find (J)J Inlined
 5.33% org/opensearch/common/BidirectionalLinearSearchArrayRounding.round (J)J Inlined
 3.46% jdk/incubator/vector/LongVector$LongSpecies.broadcastBits (J)Ljdk/incubator/vector/LongVector; Inlined
 2.81% org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$5.longValue ()J Inlined
 1.70% org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregator$1.collect (IJ)V Inlined
 1.68% org/opensearch/common/util/BigArrays$LongArrayWrapper.get (J)J Inlined
 1.50% org/opensearch/search/aggregations/LeafBucketCollector.collect (I)V JIT compiled
```
It turns out that VectorMask::firstTrue itself performs a linear scan over the mask, which largely defeats the purpose of vectorization for this use-case. While there is still some improvement, it could have been much greater had Java intrinsified this with bit-scan instructions, which are widely available across platforms.
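For reference, a hypothetical workaround is to materialize the mask as a bitset and locate the first set bit with a bit-scan. Whether this actually compiles down to a movemask plus TZCNT depends on how the JIT intrinsifies toLong(), so this is a sketch of the idea rather than a guaranteed win:

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Requires: --add-modules jdk.incubator.vector
public class FirstTrueAlternative {
    private static final VectorSpecies<Long> SPECIES = LongVector.SPECIES_PREFERRED;

    /** Same contract as VectorMask::firstTrue (lane index of the first set
     *  lane, or the lane count if none are set), but via a bit-scan instead
     *  of a per-lane linear scan. */
    static int firstTrueViaBitScan(VectorMask<Long> mask) {
        long bits = mask.toLong(); // one bit per lane
        return bits == 0 ? SPECIES.length() : Long.numberOfTrailingZeros(bits);
    }
}
```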
Conclusion
TODO
Follow-Up Questions
- What makes the vectorized implementations slower as the array size approaches, say, 36 buckets?
- What makes the vectorized B-tree and Eytzinger implementations alternate between slow and fast as the size increases? (possibly branch mispredictions)
- How does the performance look on a platform that supports 128-bit and 256-bit vector registers? What's the minimum vector size to use?


