
Add BF16 scalar quantization support for FAISS-backed k-NN indices.#3190

Open
mulugetam wants to merge 2 commits into opensearch-project:main from mulugetam:bf16

Conversation

Contributor

@mulugetam mulugetam commented Mar 19, 2026

Description

This PR adds BFloat16 (BF16) scalar quantization support for FAISS-backed k-NN indices. Key changes/additions:

  • Implement bulk BF16 vector similarity for inner product and L2 distance.
  • Add AVX512-BF16 SIMD kernels for BF16 vector similarity (for avx512_spr, with up to a 45% speedup on IP).
  • Register "bf16" as a FAISS SQ encoder type in KNN constants.
  • Add FaissBF16Util with validation and clipping logic for BF16 vectors.
  • Update FAISS index builders, memory-optimized searchers, and reconstructors for BF16.
  • Add FaissBF16Reconstructor for decoding BF16 quantized vectors.
  • Add integration tests for cagra-to-hnsw BF16 index creation and search.
  • Add JNI-level unit tests for BF16 similarity functions.
  • Include a binary test fixture with 300 BF16 vectors of 768 dimensions.
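Registering "bf16" as an SQ encoder type presumably lets users request it the same way the existing fp16 encoder is configured. The mapping below is an illustrative sketch mirroring the documented fp16 syntax (index and field names are made up); the bf16 value is what this PR adds:

```json
PUT /my-bf16-index
{
  "settings": { "index": { "knn": true } },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "innerproduct",
          "parameters": {
            "encoder": { "name": "sq", "parameters": { "type": "bf16" } }
          }
        }
      }
    }
  }
}
```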

The BF16 bulk distance implementations live in avx512_simd_similarity_function.cpp and avx512_spr_simd_similarity_function.cpp.

  • In the first, we convert BF16 to FP32 and compute both inner product (IP) and L2 entirely in FP32.
  • In the second, we use AVX512-BF16 instructions for IP. L2 falls back to BF16 → FP32 conversion because (1) AVX512-BF16 has no subtraction instruction yet, and (2) expanding ||a - b||^2 = ||a||^2 - 2·a·b + ||b||^2 turns out to be slower.

Below are the results comparing the BF16 IP implementation in avx512 vs avx512_spr.

Source: https://gist.github.com/mulugetam/f23317bbb9057e9798b86f4d02713fd7

--------------------------------------------------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations    bytes/s        dim     elem/s   num_vecs      vec/s
--------------------------------------------------------------------------------------------------------------------
BM_Fma/384/64           0.854 us        0.852 us       825096 59.4851G/s        384 28.8413G/s         64 75.1074M/s
BM_Fma/384/256           3.37 us         3.36 us       208303 58.9559G/s        384 29.2494G/s        256 76.1704M/s
BM_Fma/768/64            1.65 us         1.64 us       426300 61.6926G/s        768 29.9115G/s         64 38.9473M/s
BM_Fma/768/256           6.58 us         6.57 us       106817  60.279G/s        768 29.9059G/s        256 38.9399M/s
BM_Fma/1536/64           3.23 us         3.23 us       218414 62.7522G/s     1.536k 30.4253G/s         64 19.8082M/s
BM_Fma/1536/256          12.8 us         12.8 us        54715   62.04G/s     1.536k 30.7795G/s        256 20.0388M/s
BM_DpBf16/384/64        0.610 us        0.609 us      1147908 81.9618G/s        384 40.3504G/s         64 105.079M/s
BM_DpBf16/384/256        2.41 us         2.41 us       291026 82.0689G/s        384 40.8748G/s        256 106.445M/s
BM_DpBf16/768/64         1.17 us         1.17 us       601518 85.5797G/s        768 42.1316G/s         64 54.8588M/s
BM_DpBf16/768/256        4.54 us         4.53 us       155123 87.0562G/s        768 43.3588G/s        256 56.4567M/s
BM_DpBf16/1536/64        2.31 us         2.31 us       304559 86.6281G/s     1.536k 42.6477G/s         64 27.7654M/s
BM_DpBf16/1536/256       9.00 us         8.98 us        77854 87.8933G/s     1.536k 43.7756G/s        256 28.4998M/s

Related Issues

#3189

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 96.77419% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.19%. Comparing base (167cc88) to head (3c33cb6).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...optsearch/faiss/FaissIndexScalarQuantizedFlat.java 50.00% 0 Missing and 1 partial ⚠️
...arch/faiss/reconstruct/FaissBF16Reconstructor.java 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3190      +/-   ##
============================================
+ Coverage     83.10%   83.19%   +0.08%     
- Complexity     4168     4198      +30     
============================================
  Files           447      449       +2     
  Lines         15317    15367      +50     
  Branches       1965     1978      +13     
============================================
+ Hits          12729    12784      +55     
- Misses         1797     1799       +2     
+ Partials        791      784       -7     


mulugetam force-pushed the bf16 branch 4 times, most recently from 415c1f9 to 5512f4e on March 20, 2026 at 07:55
Contributor Author

mulugetam commented Mar 20, 2026

The table below compares bulk similarity results for:

  • BM_FP16 -- the existing FP16 implementation
  • BM_BF16 -- the implementation included in this PR (avx512_simd_similarity)
  • BM_FP32 -- the FP32 "ground truth"

The delta column shows the difference in average score relative to the ground truth.

Source: https://gist.github.com/mulugetam/c04a80f048e0f42520e245cc9dd615e7

----------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time            CPU   Iterations      score_min      score_max      score_avg      delta
----------------------------------------------------------------------------------------------------------------------------------------------------
BM_FP32_IP/dim:348/numVecs:64                  16627.1 ns     16608.1 ns        42172     -12.094973      11.835379      -0.115094     0.0000
BM_FP32_IP/dim:348/numVecs:256                 66454.6 ns     66179.7 ns        10588     -12.561736      17.609922       0.203629     0.0000
BM_FP32_IP/dim:768/numVecs:64                  41617.8 ns     41549.1 ns        16755     -23.029579      20.356783      -0.159242     0.0000
BM_FP32_IP/dim:768/numVecs:256                166283.7 ns    165966.6 ns         4218     -29.475765      25.550085       0.194226     0.0000
BM_FP32_IP/dim:1536/numVecs:64                 88270.9 ns     88165.9 ns         7920     -25.994150      23.933500      -0.593281     0.0000
BM_FP32_IP/dim:1536/numVecs:256               352265.7 ns    351657.7 ns         1969     -33.999916      32.676365      -0.659965     0.0000
BM_FP32_L2/dim:348/numVecs:64                  16965.0 ns     16936.2 ns        41503     200.982468     257.338165     230.966003     0.0000
BM_FP32_L2/dim:348/numVecs:256                 67886.2 ns     67739.3 ns        10372     184.938354     268.275879     230.821060     0.0000
BM_FP32_L2/dim:768/numVecs:64                  41874.4 ns     41823.0 ns        16796     465.449432     566.249695     514.804504     0.0000
BM_FP32_L2/dim:768/numVecs:256                167611.5 ns    167374.6 ns         4154     465.449432     582.071472     514.450500     0.0000
BM_FP32_L2/dim:1536/numVecs:64                 88518.0 ns     88369.0 ns         7946     957.362366    1088.597778    1024.700562     0.0000
BM_FP32_L2/dim:1536/numVecs:256               354245.1 ns    353595.4 ns         1987     957.282104    1099.622437    1025.945435     0.0000
BM_FP16_IP/dim:348/numVecs:64                    792.8 ns       790.2 ns       891489     -12.094796      11.837462      -0.114988     0.0917
BM_FP16_IP/dim:348/numVecs:256                  3193.0 ns      3184.8 ns       221860     -12.561245      17.608433       0.203593     0.0174
BM_FP16_IP/dim:768/numVecs:64                   1594.1 ns      1590.7 ns       441886     -23.029171      20.356787      -0.159039     0.1271
BM_FP16_IP/dim:768/numVecs:256                  6320.8 ns      6309.0 ns       110838     -29.478863      25.550512       0.194271     0.0228
BM_FP16_IP/dim:1536/numVecs:64                  3065.3 ns      3059.3 ns       225541     -25.995598      23.932552      -0.593526     0.0413
BM_FP16_IP/dim:1536/numVecs:256                12298.2 ns     12274.3 ns        56004     -34.001713      32.675831      -0.659997     0.0050
BM_FP16_L2/dim:348/numVecs:64                    978.4 ns       976.9 ns       718439     200.977905     257.338928     230.965210     0.0003
BM_FP16_L2/dim:348/numVecs:256                  4110.0 ns      4103.4 ns       170998     184.935822     268.276886     230.821289     0.0001
BM_FP16_L2/dim:768/numVecs:64                   2002.3 ns      1999.3 ns       351316     465.449707     566.244751     514.803833     0.0001
BM_FP16_L2/dim:768/numVecs:256                  7988.3 ns      7975.5 ns        88107     465.449707     582.074768     514.450317     0.0000
BM_FP16_L2/dim:1536/numVecs:64                  3986.3 ns      3979.4 ns       175302     957.367432    1088.599854    1024.701660     0.0001
BM_FP16_L2/dim:1536/numVecs:256                15954.6 ns     15907.4 ns        44256     957.280579    1099.629395    1025.945557     0.0000
BM_BF16_IP/dim:348/numVecs:64                    782.7 ns       781.5 ns       899809     -12.063955      11.804710      -0.113893     1.0427
BM_BF16_IP/dim:348/numVecs:256                  3245.5 ns      3240.1 ns       216869     -12.528446      17.564714       0.203129     0.2452
BM_BF16_IP/dim:768/numVecs:64                   1656.6 ns      1654.4 ns       427408     -22.960255      20.301678      -0.158550     0.4342
BM_BF16_IP/dim:768/numVecs:256                  6539.0 ns      6525.6 ns       107082     -29.387886      25.490051       0.193595     0.3251
BM_BF16_IP/dim:1536/numVecs:64                  3161.2 ns      3156.4 ns       217072     -25.925896      23.870255      -0.597385     0.6916
BM_BF16_IP/dim:1536/numVecs:256                12475.2 ns     12450.0 ns        54291     -33.880539      32.600971      -0.657721     0.3399
BM_BF16_L2/dim:348/numVecs:64                   1003.0 ns      1001.7 ns       701712     200.464600     256.661743     230.383896     0.2520
BM_BF16_L2/dim:348/numVecs:256                  4005.6 ns      3995.4 ns       173795     184.490585     267.596924     230.241714     0.2510
BM_BF16_L2/dim:768/numVecs:64                   2053.4 ns      2050.9 ns       343077     464.324890     564.767395     513.519165     0.2497
BM_BF16_L2/dim:768/numVecs:256                  8193.1 ns      8178.8 ns        85962     464.324890     580.589600     513.168213     0.2493
BM_BF16_L2/dim:1536/numVecs:64                  4007.8 ns      3996.5 ns       175314     954.962891    1085.887207    1022.150391     0.2489
BM_BF16_L2/dim:1536/numVecs:256                15983.2 ns     15946.0 ns        44147     954.962891    1096.839966    1023.372864     0.2508

Member

vamshin commented Mar 20, 2026

@mulugetam thanks for the PR. I could not understand the score_min and score_max in the above benchmarks. Are we taking all the top-k scores, averaging them, and then looking at the difference in recall?

Wondering if we can run experiments with datasets like Cohere 10M, 768D, or the datasets here. We could then see the recall difference between the two approaches, BM_FP16 (existing) and BM_BF16 (this PR).

Contributor Author

mulugetam commented Mar 20, 2026

@vamshin The scores are just the min, max, and average distances from the query vector to the database vectors in my bulk similarity benchmark. The goal is simply to measure the precision loss of the bulk similarity for FP16 and BF16 relative to FP32, nothing more.

Yeah, I’m already working on Cohere and will share the data once it’s ready.

@navneet1v
Collaborator

@mulugetam can you also share the setup for how you are running the benchmarks, so we can try reproducing it on our side? Or if you can contribute it directly to the repo, that would be awesome.

Contributor Author

mulugetam commented Mar 24, 2026

@vamshin @navneet1v Below are results from running Performance768D10M (Cohere10M) with vectordbbench on a single-node setup. My primary goal was to evaluate differences in recall accuracy. Values are reported as average ± standard deviation across three runs.

vectordbbench doesn’t currently support bf16 as a valid quantization type, so I had to apply a small patch to enable it.

--ef-search = 256

| Quantization | Avg QPS | Recall | Serial P99 (ms) | Serial P95 (ms) |
|--------------|---------|--------|-----------------|-----------------|
| none (fp32)  | 4,679.4 ± 36.1 | 0.9390 | 5.73 ± 0.06 | 5.47 ± 0.06 |
| fp16         | 5,177.7 ± 65.3 | 0.9392 | 6.00 ± 0.00 | 5.67 ± 0.06 |
| bf16         | 5,286.0 ± 13.4 | 0.9316 | 5.93 ± 0.06 | 5.63 ± 0.06 |

--ef-search = 512

| Quantization | Avg QPS | Recall | Serial P99 (ms) | Serial P95 (ms) |
|--------------|---------|--------|-----------------|-----------------|
| none (fp32)  | 2,889.4 ± 365.0 | 0.9684 | 8.60 ± 0.00 | 8.03 ± 0.06 |
| fp16         | 3,670.4 ± 115.3 | 0.9676 | 8.47 ± 0.12 | 7.97 ± 0.06 |
| bf16         | 3,473.5 ± 204.5 | 0.9568 | 8.93 ± 0.12 | 8.37 ± 0.06 |

Concurrent Latency Results

--ef-search = 256 — Concurrent Latency

| Quantization | Avg QPS | Conc Avg (ms) | Conc P95 (ms) | Conc P99 (ms) |
|--------------|---------|---------------|---------------|---------------|
| none (fp32)  | 4,679.4 ± 36.1 | 6.83 ± 0.05 | 8.97 ± 0.08 | 10.63 ± 0.13 |
| fp16         | 5,177.7 ± 65.3 | 6.17 ± 0.08 | 7.82 ± 0.04 | 9.05 ± 0.01 |
| bf16         | 5,286.0 ± 13.4 | 6.04 ± 0.02 | 7.71 ± 0.11 | 8.85 ± 0.17 |

--ef-search = 512 — Concurrent Latency

| Quantization | Avg QPS | Conc Avg (ms) | Conc P95 (ms) | Conc P99 (ms) |
|--------------|---------|---------------|---------------|---------------|
| none (fp32)  | 2,889.4 ± 365.0 | 11.19 ± 1.49 | 14.69 ± 1.59 | 17.38 ± 1.79 |
| fp16         | 3,670.4 ± 115.3 | 8.71 ± 0.28 | 10.79 ± 0.15 | 12.40 ± 0.28 |
| bf16         | 3,473.5 ± 204.5 | 9.22 ± 0.56 | 11.96 ± 0.46 | 13.82 ± 0.60 |

vectordbbench awsopensearch parameters

index parameters:

LOG_LEVEL=INFO NUM_PER_BATCH=1000 vectordbbench awsopensearch \
  --db-label cohere10mdb \
  --host "$OS_HOST" --port "$OS_PORT" \
  --user "$OS_USER" --password "$OS_PASS" \
  --case-type Performance768D10M \
  --number-of-shards 1 \
  --number-of-replicas 0 \
  --k 100 \
  --metric-type COSINE \
  --m 16 \
  --ef-construction 256 \
  --ef-search 256 \
  --engine faiss \
  --quantization-type <fp16|bf16> \
  --number-of-indexing-clients 32 \
  --index-thread-qty 32 \
  --num-concurrency 32

Search parameters:

LOG_LEVEL=INFO NUM_PER_BATCH=1000 vectordbbench awsopensearch \
  --skip-drop-old --skip-load \
  --db-label cohere10mdb \
  --host "$OS_HOST" --port "$OS_PORT" \
  --user "$OS_USER" --password "$OS_PASS" \
  --case-type Performance768D10M \
  --number-of-shards 1 \
  --number-of-replicas 0 \
  --k 100 \
  --metric-type COSINE \
  --m 16 \
  --ef-search <256|512> \
  --engine faiss \
  --quantization-type <fp16|bf16> \
  --num-concurrency 32

The JVM heap size was 31 GB.

Contributor Author

mulugetam commented Mar 24, 2026

> @mulugetam can you also share the setup on how you are running the benchmarks? we can also try reproducing it on our side? Or if you can contribute directly in the repo that will be awesome.

The bulk similarity benchmarks are standalone Google Benchmark tests that compare the kernels of the similarity functions. I’ve updated the results to also include the benchmark harness that was used.

mulugetam force-pushed the bf16 branch 3 times, most recently from e09c6df to 655ffd2 on March 31, 2026 at 16:13
Introduce BF16 as a new scalar quantizer type alongside FP16.

  - Implement bulk BF16 vector similarity for inner product and L2 distance.
  - Add AVX512-BF16 SIMD kernels for BF16 vector similarity.
  - Register "bf16" as a FAISS SQ encoder type in KNN constants.
  - Add FaissBF16Util with validation and clipping logic for BF16 vectors.
  - Update FAISS index builders, memory-optimized searchers, and reconstructors for BF16.
  - Add FaissBF16Reconstructor for decoding BF16 quantized vectors.
  - Add integration tests for cagra-to-hnsw BF16 index creation and search.
  - Add JNI-level unit tests for BF16 similarity functions.
  - Include a binary test fixture with 300 BF16 vectors of 768 dimensions.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

Refactor based on recent changes.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
