[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392

@ketanv3

Description

Problem Statement

In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.
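As a concrete illustration of the rounding itself, here is how an epoch-millisecond timestamp could be rounded down to the start of its UTC month with `java.time` (a plain-Java sketch only; the actual implementation precomputes interval boundaries and searches them, as discussed below):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class MonthRounding {
    // Round an epoch-millis timestamp down to the first instant of its UTC month.
    static long roundDownToMonth(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return t.withDayOfMonth(1).toLocalDate().atStartOfDay(ZoneOffset.UTC)
                .toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long t = Instant.parse("2023-10-05T09:54:31Z").toEpochMilli();
        long bucketKey = Instant.parse("2023-10-01T00:00:00Z").toEpochMilli();
        if (roundDownToMonth(t) != bucketKey) throw new AssertionError();
        System.out.println("bucket key: " + roundDownToMonth(t));
    }
}
```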

#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates whether performance can be improved further using SIMD (related to #9423).

1 — Baseline

The current approach uses linear search when the search space is small, and binary search otherwise. Linear search is far superior for small arrays because it accesses memory sequentially and avoids the penalty of branch mispredictions and pipeline stalls. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
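A minimal sketch of this baseline strategy, assuming `values` is the sorted array of precomputed interval start times (names and the threshold are illustrative, not the exact OpenSearch code):

```java
import java.util.Arrays;

public class HybridRounding {
    private static final int LINEAR_SEARCH_MAX_SIZE = 64; // threshold from #9727

    // Returns the largest element <= key; assumes key >= values[0] and values sorted ascending.
    static long round(long[] values, int size, long key) {
        if (size <= LINEAR_SEARCH_MAX_SIZE) {
            // Linear search: sequential memory access, trivially predictable branches.
            int i = 1;
            while (i < size && values[i] <= key) i++;
            return values[i - 1];
        }
        // Binary search: O(log n) comparisons, better for larger arrays.
        int idx = Arrays.binarySearch(values, 0, size, key);
        return idx >= 0 ? values[idx] : values[-2 - idx]; // insertion point minus one
    }

    public static void main(String[] args) {
        long[] months = {0, 31, 59, 90, 120}; // e.g. day offsets of month starts
        System.out.println(round(months, months.length, 45)); // prints 31
    }
}
```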

2 — Vectorized Linear Search

Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array with $\infty$ to make it an exact multiple of the number of vector lanes ($B$). The search stops once we find the first value greater than the desired key.

[figure: simd-linear]
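A scalar sketch of this stride-$B$ loop (the actual POC replaces the inner lane loop with a single vector compare via the JDK's incubating Vector API, as noted in the comment; class and method names here are illustrative):

```java
import java.util.Arrays;

public class VectorizedLinearSketch {
    static final int B = 4; // vector lanes: 4 longs per 256-bit AVX2 register

    // Pad to a multiple of B with Long.MAX_VALUE, standing in for +infinity.
    static long[] pad(long[] values) {
        int padded = ((values.length + B - 1) / B) * B;
        long[] out = new long[padded];
        Arrays.fill(out, Long.MAX_VALUE);
        System.arraycopy(values, 0, out, 0, values.length);
        return out;
    }

    // Returns the largest element <= key; assumes key >= values[0], values sorted ascending.
    static long round(long[] padded, long key) {
        for (int i = 0; i < padded.length; i += B) {
            // With the Vector API this inner loop is one compare:
            //   mask = LongVector.fromArray(SPECIES, padded, i).compare(VectorOperators.GT, key)
            // followed by mask.firstTrue().
            for (int lane = 0; lane < B; lane++) {
                if (padded[i + lane] > key) {
                    return padded[i + lane - 1]; // predecessor of the first value > key
                }
            }
        }
        return padded[padded.length - 1];
    }

    public static void main(String[] args) {
        long[] padded = pad(new long[]{0, 31, 59, 90, 120});
        System.out.println(round(padded, 45));  // prints 31
        System.out.println(round(padded, 500)); // prints 120
    }
}
```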

3 — Vectorized B-Tree Search

Here, the idea is to improve binary search using vectorization. Since binary search compares values at non-contiguous memory locations, which is non-trivial to load into a vector register, the first step is to fix the memory layout.

  1. The first idea is to use the Eytzinger layout. It is an implicit binary tree packed into an array where the left and the right sub-trees of a node at index $i$ are stored at $2i+1$ and $2i+2$ indices respectively. This is also how binary heaps are laid out in memory.
  2. The second idea is to store multiple values ($B$) per node, which makes the branching factor $B+1$. This is essentially a B-tree! But unlike B-trees, where the number of values per node is chosen to fit in a single cache line, we choose it to be equal to the number of vector lanes.
  3. The underlying array is padded to the nearest multiple of the number of vector lanes. We do not need to provision memory for completely empty nodes. I also experimented with a complete tree (all levels filled except the last, which is filled from left to right) but didn't see much difference in performance.

[figure: simd-btree-levels]

To find the round-down point in this layout, we start the search at the root node (index $0$), comparing $B$ values at once. The position of the first set bit in the vector mask tells us which of the $B+1$ sub-trees to descend into to converge on the round-down point.

[figure: simd-btree]
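The layout construction and descent can be sketched in plain Java as follows. The inner lane loop stands in for the vector compare plus `firstTrue`; the node numbering (children of node $t$ are nodes $t(B+1)+1 \ldots t(B+1)+B+1$) and all names are illustrative, not the exact POC code:

```java
import java.util.Arrays;

public class BTreeRounding {
    static final int B = 8; // values per node = vector lanes (8 longs in a 512-bit register)

    // Build the implicit B-tree by an in-order traversal of the node numbering,
    // so the tree's in-order sequence reproduces the sorted input.
    static long[] build(long[] sorted) {
        int nodes = (sorted.length + B - 1) / B;
        long[] tree = new long[nodes * B];
        Arrays.fill(tree, Long.MAX_VALUE); // padding stands in for +infinity
        fill(sorted, tree, 0, new int[]{0});
        return tree;
    }

    private static void fill(long[] sorted, long[] tree, int node, int[] next) {
        if (node * B >= tree.length) return; // node does not exist
        for (int i = 0; i < B; i++) {
            fill(sorted, tree, node * (B + 1) + i + 1, next); // child i, then value i
            if (next[0] < sorted.length) tree[node * B + i] = sorted[next[0]++];
        }
        fill(sorted, tree, node * (B + 1) + B + 1, next); // rightmost child
    }

    // Largest value <= key; assumes key >= the smallest value in the tree.
    static long round(long[] tree, long key) {
        long result = Long.MIN_VALUE;
        int node = 0;
        while (node * B < tree.length) {
            int i = 0;
            // In the vectorized version this inner loop is a single compare of all
            // B values against the key, followed by firstTrue() on the mask.
            while (i < B && tree[node * B + i] <= key) {
                result = tree[node * B + i]; // last value <= key on the descent path
                i++;
            }
            node = node * (B + 1) + i + 1; // descend into sub-tree i
        }
        return result;
    }

    public static void main(String[] args) {
        long[] sorted = new long[20];
        for (int i = 0; i < 20; i++) sorted[i] = 10L * i; // 0, 10, ..., 190
        long[] tree = build(sorted);
        System.out.println(round(tree, 105)); // prints 100
        System.out.println(round(tree, 190)); // prints 190
    }
}
```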

POC code

Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java

Results

Results are from a c6i.2xlarge EC2 instance (Intel Xeon Platinum 8375C CPU @ 2.90 GHz), which supports full 512-bit registers (8 vector lanes for long).

```
Cpuinfo Version: 7.0.0
Vendor ID Raw: GenuineIntel
Hardware Raw:
Brand Raw: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Hz Advertised Friendly: 2.9000 GHz
Hz Actual Friendly: 3.4986 GHz
Hz Advertised: (2900000000, 0)
Hz Actual: (3498596000, 0)
Arch: X86_64
Bits: 64
Count: 16
Arch String Raw: x86_64
L1 Data Cache Size: 384 KiB (8 instances)
L1 Instruction Cache Size: 256 KiB (8 instances)
L2 Cache Size: 10 MiB (8 instances)
L2 Cache Line Size: 256
L2 Cache Associativity: 6
L3 Cache Size: 56623104
Stepping: 6
Model: 106
Family: 6
Processor Type:
Flags: 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, arch_capabilities, avx, avx2, avx512_bitalg, avx512_vbmi2, avx512_vnni, avx512_vpopcntdq, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq, bmi1, bmi2, clflush, clflushopt, clwb, cmov, constant_tsc, cpuid, cx16, cx8, de, erms, f16c, flush_l1d, fma, fpu, fsgsbase, fxsr, gfni, ht, hypervisor, ibpb, ibrs, ibrs_enhanced, ida, invpcid, invpcid_single, lahf_lm, lm, mca, mce, md_clear, mmx, movbe, msr, mtrr, nonstop_tsc, nopl, nx, ospke, osxsave, pae, pat, pcid, pclmulqdq, pdpe1gb, pge, pku, pni, popcnt, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, ss, ssbd, sse, sse2, sse4_1, sse4_2, ssse3, stibp, syscall, tme, tsc, tsc_adjust, tsc_deadline_timer, tsc_known_freq, tscdeadline, vaes, vme, vpclmulqdq, wbnoinvd, x2apic, xgetbv1, xsave, xsavec, xsaveopt, xsaves, xtopology
```

Micro Benchmarks

These results compare the performance to look up round-down points for 1 million random keys picked from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.

Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java

| size | binary | linear | hybrid | eytzinger | vectorized_linear | vectorized_eytzinger | unrolled_eytzinger | vectorized_btree |
|---|---|---|---|---|---|---|---|---|
| 1 | 295.947 | 323.76 | 301.949 | 338.849 | 244.832 | 204.102 | 253.831 | 234.225 |
| 2 | 90.548 | 93.462 | 90.892 | 91.012 | 244.903 | 204.151 | 87.779 | 234.251 |
| 3 | 71.746 | 76.643 | 76.824 | 74.248 | 244.886 | 204.12 | 106.255 | 234.136 |
| 4 | 65.067 | 68.915 | 69.974 | 61.451 | 244.737 | 204.287 | 89.979 | 234.172 |
| 5 | 55.412 | 65.889 | 66.058 | 57.786 | 244.869 | 204.388 | 79.007 | 234.26 |
| 6 | 51.574 | 64.382 | 63.812 | 53.777 | 244.905 | 204.282 | 69.455 | 234.156 |
| 7 | 48.416 | 61.33 | 62.01 | 50.647 | 244.747 | 204.372 | 65.809 | 234.111 |
| 8 | 47.702 | 60.355 | 60.123 | 47.504 | 162.481 | 204.162 | 63.483 | 233.679 |
| 9 | 45.354 | 59.581 | 59.805 | 46.404 | 121.948 | 148.625 | 56.026 | 114.642 |
| 10 | 43.744 | 58.828 | 58.401 | 44.667 | 101.297 | 120.654 | 52.898 | 99.97 |
| 12 | 40.655 | 56.81 | 56.354 | 42.518 | 82.603 | 92.188 | 53.281 | 83.234 |
| 14 | 38.731 | 56.113 | 56.068 | 40.256 | 77.681 | 77.288 | 51.574 | 75.347 |
| 16 | 37.615 | 54.779 | 54.723 | 37.298 | 74.428 | 72.294 | 50.882 | 73.562 |
| 18 | 36.908 | 53.505 | 53.931 | 36.632 | 68.818 | 78.296 | 49.609 | 124.952 |
| 20 | 35.854 | 52.526 | 52.702 | 36.131 | 65.477 | 82.456 | 49.973 | 107.786 |
| 22 | 35.123 | 51.663 | 52.095 | 35.515 | 65.466 | 86.55 | 50.202 | 97.138 |
| 24 | 34.347 | 50.718 | 51.107 | 34.691 | 62.389 | 89.973 | 50.378 | 88.548 |
| 26 | 32.883 | 50.147 | 50.473 | 33.915 | 62.359 | 97.614 | 48.697 | 136.888 |
| 29 | 30.583 | 48.334 | 49.443 | 32.826 | 60.524 | 100.794 | 46.673 | 115.876 |
| 32 | 29.595 | 47.177 | 47.937 | 31.706 | 56.593 | 68.765 | 40.408 | 66.419 |
| 37 | 29.848 | 46.967 | 46.555 | 31.43 | 57.42 | 72.925 | 39.564 | 77.961 |
| 41 | 29.811 | 45.541 | 45.27 | 30.857 | 52.363 | 76.323 | 39.398 | 83.197 |
| 45 | 29.556 | 44.105 | 43.571 | 29.062 | 50.98 | 79.744 | 39.099 | 85.851 |
| 49 | 28.931 | 42.526 | 42.466 | 29.751 | 52.5 | 81.896 | 37.444 | 86.226 |
| 54 | 27.374 | 40.689 | 41.094 | 28.304 | 51.97 | 85.37 | 37.668 | 87.361 |
| 60 | 25.949 | 39.313 | 39.48 | 27.921 | 50.749 | 88.148 | 36.952 | 90.138 |
| 64 | 25.58 | 37.665 | 38.654 | 27.902 | 49.935 | 90.495 | 35.73 | 90.597 |
| 74 | 25.846 | 37.815 | 35.785 | 27.282 | 48.425 | 93.44 | 35.207 | 92.326 |
| 83 | 25.802 | 36.663 | 35.015 | 27.173 | 44.904 | 88.798 | 33.979 | 107.161 |
| 90 | 25.667 | 35.771 | 34.606 | 26.868 | 46.483 | 101.417 | 34.684 | 87.596 |
| 98 | 25.189 | 35.14 | 34.186 | 26.426 | 45.846 | 84.157 | 34.626 | 74.626 |
| 108 | 24.39 | 34.062 | 33.866 | 25.769 | 45.428 | 72.028 | 33.819 | 67.345 |
| 118 | 23.36 | 32.94 | 33.167 | 25.148 | 45.474 | 63.726 | 33.487 | 61.756 |
| 128 | 22.822 | 32.137 | 32.764 | 24.592 | 44.685 | 58.101 | 33.241 | 58.654 |
| 144 | 22.918 | 31.154 | 29.949 | 24.454 | 44.492 | 54.808 | 32.561 | 55.242 |
| 159 | 22.866 | 30.141 | 29.577 | 24.587 | 43.129 | 54.998 | 32.347 | 55.568 |
| 171 | 22.899 | 29.35 | 29.319 | 24.223 | 42.683 | 55.818 | 32.496 | 57.786 |
| 187 | 22.614 | 28.353 | 28.732 | 23.867 | 42.226 | 58.316 | 32.375 | 60.868 |
| 204 | 22.161 | 27.33 | 28.79 | 23.311 | 41.591 | 60.952 | 32.127 | 62.728 |
| 229 | 21.273 | 25.934 | 28.433 | 22.11 | 40.616 | 64.439 | 31.989 | 66.387 |
| 256 | 20.548 | 24.543 | 27.909 | 21.573 | 40.063 | 66.471 | 31.358 | 68.556 |

Macro Benchmarks

These results compare the performance to execute the following aggregation with varying intervals using the noaa dataset.

```json
{
  "size": 0,
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```
| Interval | Num Buckets | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|---|
| year | 3 | 33659481 | 828 | 562 | 563 |
| quarter | 12 | 33659481 | 1046 | 887 | 878 |
| month | 36 | 33659481 | 1170 | 1408 | 1241 |

These results compare the performance to execute the following aggregation with increasing number of months.

```json
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2014-01-01",
        "lt": "2015-01-01"
      }
    }
  },
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```
| Num Months | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|
| 12 | 11392659 | 438 | 336 | 296 |
| 14 | 13187069 | 510 | 393 | 337 |
| 16 | 15073450 | 586 | 456 | 383 |
| 18 | 17009688 | 648 | 584 | 485 |
| 20 | 18957437 | 735 | 642 | 537 |
| 22 | 20865502 | 804 | 727 | 592 |
| 24 | 22745413 | 874 | 789 | 629 |
| 26 | 24548269 | 927 | 877 | 717 |
| 28 | 26395510 | 960 | 985 | 798 |
| 30 | 28239883 | 1018 | 1085 | 896 |
| 32 | 30094407 | 1090 | 1163 | 960 |
| 34 | 31896193 | 1148 | 1303 | 1162 |
| 36 | 33659481 | 1199 | 1413 | 1255 |

Hot Methods Profile

To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.

```
70.69%  jdk/incubator/vector/Long512Vector$Long512Mask.firstTrue ()I                                              Inlined
 6.63%  org/opensearch/common/util/ReorganizingLongHash.find (J)J                                                 Inlined
 5.33%  org/opensearch/common/BidirectionalLinearSearchArrayRounding.round (J)J                                   Inlined
 3.46%  jdk/incubator/vector/LongVector$LongSpecies.broadcastBits (J)Ljdk/incubator/vector/LongVector;            Inlined
 2.81%  org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$5.longValue ()J                               Inlined
 1.70%  org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregator$1.collect (IJ)V               Inlined
 1.68%  org/opensearch/common/util/BigArrays$LongArrayWrapper.get (J)J                                            Inlined
 1.50%  org/opensearch/search/aggregations/LeafBucketCollector.collect (I)V                                       JIT compiled
```

It turns out that VectorMask::firstTrue itself performs a linear scan over the mask lanes, which largely defeats the purpose of vectorization for this use-case. While there is still some improvement, it could have been much greater had Java intrinsified this method with bit-scan instructions, which are widely available on almost all platforms.
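For reference, the equivalent of `firstTrue` can be computed with a single bit-scan once the mask is materialized as a long (e.g. via `VectorMask::toLong`): `Long.numberOfTrailingZeros` maps to one TZCNT/BSF instruction. A sketch of the idea, written independently of the Vector API (the lane/bit convention here is illustrative):

```java
public class FirstTrueSketch {
    // Given the mask lanes as a bitset (LSB = lane 0), a bit-scan finds the
    // first set lane without a per-lane loop, unlike VectorMask::firstTrue.
    static int firstTrue(long maskBits, int lanes) {
        int i = Long.numberOfTrailingZeros(maskBits); // 64 when no bit is set
        return Math.min(i, lanes); // "no lane set" maps to lanes, like firstTrue()
    }

    public static void main(String[] args) {
        System.out.println(firstTrue(0b0100L, 4)); // prints 2
        System.out.println(firstTrue(0L, 4));      // prints 4 (no match)
    }
}
```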

Conclusion

TODO

Follow-Up Questions

  • What makes the vectorized implementations slower as the array size approaches, say, 36 buckets?
  • What makes the vectorized B-tree and Eytzinger slow-then-fast as the size increases? (possibly branch misprediction).
  • How does the performance look on a platform that supports 128-bit and 256-bit vector registers? What's the minimum vector size to use?

Labels: Search:Performance, enhancement, v2.12.0, v3.0.0