[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation #10392

@ketanv3

Description

Problem Statement

In the date histogram aggregation, the timestamp value from each hit must be rounded down to the nearest interval (year, quarter, month, week, day, etc.) as defined in the search request. This rounded-down timestamp serves as the bucket key to aggregate results.
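As a concrete illustration of the rounding itself, here is how an epoch-millisecond timestamp could be rounded down to the start of its UTC month with `java.time` (a plain-Java sketch only; the actual implementation precomputes interval boundaries and searches them, as discussed below):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class MonthRounding {
    // Round an epoch-millis timestamp down to the first instant of its UTC month.
    static long roundDownToMonth(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return t.withDayOfMonth(1).toLocalDate().atStartOfDay(ZoneOffset.UTC)
                .toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long t = Instant.parse("2023-10-05T09:54:31Z").toEpochMilli();
        long bucketKey = Instant.parse("2023-10-01T00:00:00Z").toEpochMilli();
        if (roundDownToMonth(t) != bucketKey) throw new AssertionError();
        System.out.println("bucket key: " + roundDownToMonth(t));
    }
}
```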

#9727 improved performance by using linear search when the array size is small (<= 64 elements). This POC evaluates whether performance can be improved further using SIMD (related to #9423).

1 — Baseline

The current approach uses linear search when the search space is small, and binary search otherwise. Linear search is far superior for small arrays because it accesses memory sequentially and avoids the penalty of branch mispredictions and pipeline stalls. To simplify the remaining comparison, I'll assume that each value is equally likely to be picked (uniform distribution), so we need not consider bidirectional linear search in this POC.
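A minimal sketch of this baseline strategy, assuming `values` is the sorted array of precomputed interval start times (names and the threshold are illustrative, not the exact OpenSearch code):

```java
import java.util.Arrays;

public class HybridRounding {
    private static final int LINEAR_SEARCH_MAX_SIZE = 64; // threshold from #9727

    // Returns the largest element <= key; assumes key >= values[0] and values sorted ascending.
    static long round(long[] values, int size, long key) {
        if (size <= LINEAR_SEARCH_MAX_SIZE) {
            // Linear search: sequential memory access, trivially predictable branches.
            int i = 1;
            while (i < size && values[i] <= key) i++;
            return values[i - 1];
        }
        // Binary search: O(log n) comparisons, better for larger arrays.
        int idx = Arrays.binarySearch(values, 0, size, key);
        return idx >= 0 ? values[idx] : values[-2 - idx]; // insertion point minus one
    }

    public static void main(String[] args) {
        long[] months = {0, 31, 59, 90, 120}; // e.g. day offsets of month starts
        System.out.println(round(months, months.length, 45)); // prints 31
    }
}
```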

2 — Vectorized Linear Search

Assume a platform that supports AVX2 instructions (i.e., 256-bit wide registers). We perform a linear search comparing 4 longs at once, taking a stride of 4 elements with each iteration. We pad the array with $\infty$ to make it an exact multiple of the number of vector lanes ($B$). The search stops once we find the first value greater than the desired key.

[figure: simd-linear]
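A scalar sketch of this stride-$B$ loop (the actual POC replaces the inner lane loop with a single vector compare via the JDK's incubating Vector API, as noted in the comment; class and method names here are illustrative):

```java
import java.util.Arrays;

public class VectorizedLinearSketch {
    static final int B = 4; // vector lanes: 4 longs per 256-bit AVX2 register

    // Pad to a multiple of B with Long.MAX_VALUE, standing in for +infinity.
    static long[] pad(long[] values) {
        int padded = ((values.length + B - 1) / B) * B;
        long[] out = new long[padded];
        Arrays.fill(out, Long.MAX_VALUE);
        System.arraycopy(values, 0, out, 0, values.length);
        return out;
    }

    // Returns the largest element <= key; assumes key >= values[0], values sorted ascending.
    static long round(long[] padded, long key) {
        for (int i = 0; i < padded.length; i += B) {
            // With the Vector API this inner loop is one compare:
            //   mask = LongVector.fromArray(SPECIES, padded, i).compare(VectorOperators.GT, key)
            // followed by mask.firstTrue().
            for (int lane = 0; lane < B; lane++) {
                if (padded[i + lane] > key) {
                    return padded[i + lane - 1]; // predecessor of the first value > key
                }
            }
        }
        return padded[padded.length - 1];
    }

    public static void main(String[] args) {
        long[] padded = pad(new long[]{0, 31, 59, 90, 120});
        System.out.println(round(padded, 45));  // prints 31
        System.out.println(round(padded, 500)); // prints 120
    }
}
```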

3 — Vectorized B-Tree Search

Here, the idea is to improve binary search using vectorization. Since binary search compares values at non-contiguous memory locations, which is non-trivial to load into a vector register, the first step is to fix the memory layout.

  1. The first idea is to use the Eytzinger layout. It is an implicit binary tree packed into an array where the left and the right sub-trees of a node at index $i$ are stored at $2i+1$ and $2i+2$ indices respectively. This is also how binary heaps are laid out in memory.
  2. The second idea is to store multiple values ($B$) per node, which makes the branching factor $B+1$. This is essentially a B-tree! But unlike B-trees, where the number of values per node is chosen to fit in a single cache line, we choose it to be equal to the number of vector lanes.
  3. The underlying array is padded to the nearest multiple of the number of vector lanes. We do not need to provision memory for completely empty nodes. I also experimented with a complete tree (all levels filled except the last, which is filled from left to right) but didn't see much difference in performance.

[figure: simd-btree-levels]

To find the round-down point in this layout, we start the search at the root node (index $0$), comparing $B$ values at once. The position of the first set bit in the vector mask tells us which of the $B+1$ sub-trees to descend into to converge on the round-down point.

[figure: simd-btree]
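The layout construction and descent can be sketched in plain Java as follows. The inner lane loop stands in for the vector compare plus `firstTrue`; the node numbering (children of node $t$ are nodes $t(B+1)+1 \ldots t(B+1)+B+1$) and all names are illustrative, not the exact POC code:

```java
import java.util.Arrays;

public class BTreeRounding {
    static final int B = 8; // values per node = vector lanes (8 longs in a 512-bit register)

    // Build the implicit B-tree by an in-order traversal of the node numbering,
    // so the tree's in-order sequence reproduces the sorted input.
    static long[] build(long[] sorted) {
        int nodes = (sorted.length + B - 1) / B;
        long[] tree = new long[nodes * B];
        Arrays.fill(tree, Long.MAX_VALUE); // padding stands in for +infinity
        fill(sorted, tree, 0, new int[]{0});
        return tree;
    }

    private static void fill(long[] sorted, long[] tree, int node, int[] next) {
        if (node * B >= tree.length) return; // node does not exist
        for (int i = 0; i < B; i++) {
            fill(sorted, tree, node * (B + 1) + i + 1, next); // child i, then value i
            if (next[0] < sorted.length) tree[node * B + i] = sorted[next[0]++];
        }
        fill(sorted, tree, node * (B + 1) + B + 1, next); // rightmost child
    }

    // Largest value <= key; assumes key >= the smallest value in the tree.
    static long round(long[] tree, long key) {
        long result = Long.MIN_VALUE;
        int node = 0;
        while (node * B < tree.length) {
            int i = 0;
            // In the vectorized version this inner loop is a single compare of all
            // B values against the key, followed by firstTrue() on the mask.
            while (i < B && tree[node * B + i] <= key) {
                result = tree[node * B + i]; // last value <= key on the descent path
                i++;
            }
            node = node * (B + 1) + i + 1; // descend into sub-tree i
        }
        return result;
    }

    public static void main(String[] args) {
        long[] sorted = new long[20];
        for (int i = 0; i < 20; i++) sorted[i] = 10L * i; // 0, 10, ..., 190
        long[] tree = build(sorted);
        System.out.println(round(tree, 105)); // prints 100
        System.out.println(round(tree, 190)); // prints 190
    }
}
```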

POC code

Code for the above implementations (and many others) can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/main/java/org/opensearch/Rounder.java

Results

Results are from a c6i.2xlarge EC2 instance (Intel Xeon Platinum 8375C CPU @ 2.90 GHz), which supports full 512-bit registers (8 vector lanes for long).

```
Cpuinfo Version: 7.0.0
Vendor ID Raw: GenuineIntel
Hardware Raw:
Brand Raw: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Hz Advertised Friendly: 2.9000 GHz
Hz Actual Friendly: 3.4986 GHz
Hz Advertised: (2900000000, 0)
Hz Actual: (3498596000, 0)
Arch: X86_64
Bits: 64
Count: 16
Arch String Raw: x86_64
L1 Data Cache Size: 384 KiB (8 instances)
L1 Instruction Cache Size: 256 KiB (8 instances)
L2 Cache Size: 10 MiB (8 instances)
L2 Cache Line Size: 256
L2 Cache Associativity: 6
L3 Cache Size: 56623104
Stepping: 6
Model: 106
Family: 6
Processor Type:
Flags: 3dnowprefetch, abm, adx, aes, aperfmperf, apic, arat, arch_capabilities, avx, avx2, avx512_bitalg, avx512_vbmi2, avx512_vnni, avx512_vpopcntdq, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vpopcntdq, bmi1, bmi2, clflush, clflushopt, clwb, cmov, constant_tsc, cpuid, cx16, cx8, de, erms, f16c, flush_l1d, fma, fpu, fsgsbase, fxsr, gfni, ht, hypervisor, ibpb, ibrs, ibrs_enhanced, ida, invpcid, invpcid_single, lahf_lm, lm, mca, mce, md_clear, mmx, movbe, msr, mtrr, nonstop_tsc, nopl, nx, ospke, osxsave, pae, pat, pcid, pclmulqdq, pdpe1gb, pge, pku, pni, popcnt, pse, pse36, rdpid, rdrand, rdrnd, rdseed, rdtscp, rep_good, sep, sha, sha_ni, smap, smep, ss, ssbd, sse, sse2, sse4_1, sse4_2, ssse3, stibp, syscall, tme, tsc, tsc_adjust, tsc_deadline_timer, tsc_known_freq, tscdeadline, vaes, vme, vpclmulqdq, wbnoinvd, x2apic, xgetbv1, xsave, xsavec, xsaveopt, xsaves, xtopology
```

Micro Benchmarks

These results compare the performance to look up round-down points for 1 million random keys picked from a uniform distribution. Since pre-computed array rounding is only used for small arrays (code), I've limited the benchmarks to array sizes up to 256.

Code for the micro benchmark harness can be found here:
https://github.com/ketanv3/opensearch-simd-playground/blob/main/src/jmh/java/org/opensearch/RounderBenchmark.java

| size | binary | linear | hybrid | eytzinger | vectorized_linear | vectorized_eytzinger | unrolled_eytzinger | vectorized_btree |
|---|---|---|---|---|---|---|---|---|
| 1 | 295.947 | 323.76 | 301.949 | 338.849 | 244.832 | 204.102 | 253.831 | 234.225 |
| 2 | 90.548 | 93.462 | 90.892 | 91.012 | 244.903 | 204.151 | 87.779 | 234.251 |
| 3 | 71.746 | 76.643 | 76.824 | 74.248 | 244.886 | 204.12 | 106.255 | 234.136 |
| 4 | 65.067 | 68.915 | 69.974 | 61.451 | 244.737 | 204.287 | 89.979 | 234.172 |
| 5 | 55.412 | 65.889 | 66.058 | 57.786 | 244.869 | 204.388 | 79.007 | 234.26 |
| 6 | 51.574 | 64.382 | 63.812 | 53.777 | 244.905 | 204.282 | 69.455 | 234.156 |
| 7 | 48.416 | 61.33 | 62.01 | 50.647 | 244.747 | 204.372 | 65.809 | 234.111 |
| 8 | 47.702 | 60.355 | 60.123 | 47.504 | 162.481 | 204.162 | 63.483 | 233.679 |
| 9 | 45.354 | 59.581 | 59.805 | 46.404 | 121.948 | 148.625 | 56.026 | 114.642 |
| 10 | 43.744 | 58.828 | 58.401 | 44.667 | 101.297 | 120.654 | 52.898 | 99.97 |
| 12 | 40.655 | 56.81 | 56.354 | 42.518 | 82.603 | 92.188 | 53.281 | 83.234 |
| 14 | 38.731 | 56.113 | 56.068 | 40.256 | 77.681 | 77.288 | 51.574 | 75.347 |
| 16 | 37.615 | 54.779 | 54.723 | 37.298 | 74.428 | 72.294 | 50.882 | 73.562 |
| 18 | 36.908 | 53.505 | 53.931 | 36.632 | 68.818 | 78.296 | 49.609 | 124.952 |
| 20 | 35.854 | 52.526 | 52.702 | 36.131 | 65.477 | 82.456 | 49.973 | 107.786 |
| 22 | 35.123 | 51.663 | 52.095 | 35.515 | 65.466 | 86.55 | 50.202 | 97.138 |
| 24 | 34.347 | 50.718 | 51.107 | 34.691 | 62.389 | 89.973 | 50.378 | 88.548 |
| 26 | 32.883 | 50.147 | 50.473 | 33.915 | 62.359 | 97.614 | 48.697 | 136.888 |
| 29 | 30.583 | 48.334 | 49.443 | 32.826 | 60.524 | 100.794 | 46.673 | 115.876 |
| 32 | 29.595 | 47.177 | 47.937 | 31.706 | 56.593 | 68.765 | 40.408 | 66.419 |
| 37 | 29.848 | 46.967 | 46.555 | 31.43 | 57.42 | 72.925 | 39.564 | 77.961 |
| 41 | 29.811 | 45.541 | 45.27 | 30.857 | 52.363 | 76.323 | 39.398 | 83.197 |
| 45 | 29.556 | 44.105 | 43.571 | 29.062 | 50.98 | 79.744 | 39.099 | 85.851 |
| 49 | 28.931 | 42.526 | 42.466 | 29.751 | 52.5 | 81.896 | 37.444 | 86.226 |
| 54 | 27.374 | 40.689 | 41.094 | 28.304 | 51.97 | 85.37 | 37.668 | 87.361 |
| 60 | 25.949 | 39.313 | 39.48 | 27.921 | 50.749 | 88.148 | 36.952 | 90.138 |
| 64 | 25.58 | 37.665 | 38.654 | 27.902 | 49.935 | 90.495 | 35.73 | 90.597 |
| 74 | 25.846 | 37.815 | 35.785 | 27.282 | 48.425 | 93.44 | 35.207 | 92.326 |
| 83 | 25.802 | 36.663 | 35.015 | 27.173 | 44.904 | 88.798 | 33.979 | 107.161 |
| 90 | 25.667 | 35.771 | 34.606 | 26.868 | 46.483 | 101.417 | 34.684 | 87.596 |
| 98 | 25.189 | 35.14 | 34.186 | 26.426 | 45.846 | 84.157 | 34.626 | 74.626 |
| 108 | 24.39 | 34.062 | 33.866 | 25.769 | 45.428 | 72.028 | 33.819 | 67.345 |
| 118 | 23.36 | 32.94 | 33.167 | 25.148 | 45.474 | 63.726 | 33.487 | 61.756 |
| 128 | 22.822 | 32.137 | 32.764 | 24.592 | 44.685 | 58.101 | 33.241 | 58.654 |
| 144 | 22.918 | 31.154 | 29.949 | 24.454 | 44.492 | 54.808 | 32.561 | 55.242 |
| 159 | 22.866 | 30.141 | 29.577 | 24.587 | 43.129 | 54.998 | 32.347 | 55.568 |
| 171 | 22.899 | 29.35 | 29.319 | 24.223 | 42.683 | 55.818 | 32.496 | 57.786 |
| 187 | 22.614 | 28.353 | 28.732 | 23.867 | 42.226 | 58.316 | 32.375 | 60.868 |
| 204 | 22.161 | 27.33 | 28.79 | 23.311 | 41.591 | 60.952 | 32.127 | 62.728 |
| 229 | 21.273 | 25.934 | 28.433 | 22.11 | 40.616 | 64.439 | 31.989 | 66.387 |
| 256 | 20.548 | 24.543 | 27.909 | 21.573 | 40.063 | 66.471 | 31.358 | 68.556 |

Macro Benchmarks

These results compare the performance to execute the following aggregation with varying intervals using the noaa dataset.

```json
{
  "size": 0,
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```
| Interval | Num Buckets | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|---|
| year | 3 | 33659481 | 828 | 562 | 563 |
| quarter | 12 | 33659481 | 1046 | 887 | 878 |
| month | 36 | 33659481 | 1170 | 1408 | 1241 |

These results compare the performance to execute the following aggregation with increasing number of months.

```json
{
  "size": 0,
  "query": {
    "range": {
      "date": {
        "gte": "2014-01-01",
        "lt": "2015-01-01"
      }
    }
  },
  "aggs": {
    "observations_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      }
    }
  }
}
```
| Num Months | Num Hits | Baseline (ms) | Vectorized Linear (ms) | Vectorized B-Tree (ms) |
|---|---|---|---|---|
| 12 | 11392659 | 438 | 336 | 296 |
| 14 | 13187069 | 510 | 393 | 337 |
| 16 | 15073450 | 586 | 456 | 383 |
| 18 | 17009688 | 648 | 584 | 485 |
| 20 | 18957437 | 735 | 642 | 537 |
| 22 | 20865502 | 804 | 727 | 592 |
| 24 | 22745413 | 874 | 789 | 629 |
| 26 | 24548269 | 927 | 877 | 717 |
| 28 | 26395510 | 960 | 985 | 798 |
| 30 | 28239883 | 1018 | 1085 | 896 |
| 32 | 30094407 | 1090 | 1163 | 960 |
| 34 | 31896193 | 1148 | 1303 | 1162 |
| 36 | 33659481 | 1199 | 1413 | 1255 |

Hot Methods Profile

To understand why the 36-month aggregation isn't seeing the expected speed-up, I took a profile of hot methods.

```
70.69%  jdk/incubator/vector/Long512Vector$Long512Mask.firstTrue ()I                                              Inlined
 6.63%  org/opensearch/common/util/ReorganizingLongHash.find (J)J                                                 Inlined
 5.33%  org/opensearch/common/BidirectionalLinearSearchArrayRounding.round (J)J                                   Inlined
 3.46%  jdk/incubator/vector/LongVector$LongSpecies.broadcastBits (J)Ljdk/incubator/vector/LongVector;            Inlined
 2.81%  org/apache/lucene/codecs/lucene90/Lucene90DocValuesProducer$5.longValue ()J                               Inlined
 1.70%  org/opensearch/search/aggregations/bucket/histogram/DateHistogramAggregator$1.collect (IJ)V               Inlined
 1.68%  org/opensearch/common/util/BigArrays$LongArrayWrapper.get (J)J                                            Inlined
 1.50%  org/opensearch/search/aggregations/LeafBucketCollector.collect (I)V                                       JIT compiled
```

It turns out that VectorMask::firstTrue itself performs a linear scan over the mask lanes, which largely defeats the purpose of vectorization for this use-case. While there is still some improvement, it could have been much greater had Java intrinsified this method with bit-scan instructions, which are widely available on almost all platforms.
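For reference, the equivalent of `firstTrue` can be computed with a single bit-scan once the mask is materialized as a long (e.g. via `VectorMask::toLong`): `Long.numberOfTrailingZeros` maps to one TZCNT/BSF instruction. A sketch of the idea, written independently of the Vector API (the lane/bit convention here is illustrative):

```java
public class FirstTrueSketch {
    // Given the mask lanes as a bitset (LSB = lane 0), a bit-scan finds the
    // first set lane without a per-lane loop, unlike VectorMask::firstTrue.
    static int firstTrue(long maskBits, int lanes) {
        int i = Long.numberOfTrailingZeros(maskBits); // 64 when no bit is set
        return Math.min(i, lanes); // "no lane set" maps to lanes, like firstTrue()
    }

    public static void main(String[] args) {
        System.out.println(firstTrue(0b0100L, 4)); // prints 2
        System.out.println(firstTrue(0L, 4));      // prints 4 (no match)
    }
}
```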

Conclusion

TODO

Follow-Up Questions

  • What makes the vectorized implementations slower as the array size approaches, say, 36 buckets?
  • What makes the vectorized B-tree and Eytzinger slow-then-fast as the size increases? (possibly branch misprediction).
  • How does the performance look on a platform that supports 128-bit and 256-bit vector registers? What's the minimum vector size to use?

Labels: Search:Performance, enhancement, v2.12.0, v3.0.0