Store vector L2 norm to restore original vectors for Faiss cosine with derived source #3216

Open
krocky-cooky wants to merge 6 commits into opensearch-project:main from krocky-cooky:feat/return-original-vec-3083

Conversation

@krocky-cooky

Description

Problem

When the Faiss engine is used with cosine similarity and derived source is enabled, vectors are L2-normalized in place
during indexing. Because derived source reconstructs _source from doc values, it returns the normalized vectors instead
of the user's original input.

Solution

Store the pre-normalization L2 norm as a FloatDocValuesField (_knn_norm_<field_name>) alongside the vector. On _source reconstruction, read the norm and multiply each component of the normalized vector by it to recover the original vector.
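
A minimal sketch of the round trip described above, written as a standalone class rather than the plugin's actual code. The class name NormRoundTripSketch and the bodies of its helpers are assumptions made for illustration; only the idea (compute the norm before normalizing, store it, multiply it back on read) comes from this PR.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the write and read paths; not the plugin's classes. */
final class NormRoundTripSketch {

    /** L2 norm of the vector, accumulated in double to limit rounding error. */
    static float computeL2Norm(float[] vector) {
        double sumOfSquares = 0.0;
        for (float v : vector) {
            sumOfSquares += (double) v * v;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    /** Write path: normalize in place and return the pre-normalization norm. */
    static float normalizeInPlace(float[] vector) {
        float norm = computeL2Norm(vector);
        for (int i = 0; i < vector.length; i++) {
            vector[i] /= norm;
        }
        return norm;
    }

    /** Read path: multiply the stored norm back to approximate the original input. */
    static float[] denormalize(float[] normalized, float norm) {
        float[] restored = Arrays.copyOf(normalized, normalized.length);
        for (int i = 0; i < restored.length; i++) {
            restored[i] *= norm;
        }
        return restored;
    }

    public static void main(String[] args) {
        float[] vector = {3.0f, 4.0f};
        float norm = normalizeInPlace(vector);        // the value that would be stored as the norm doc value
        float[] restored = denormalize(vector, norm); // close to the original {3.0, 4.0}
        System.out.println(norm + " " + Arrays.toString(restored));
    }
}
```

The sketch does not guard against a zero norm; how that case should be handled is left to the validation and review discussion.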

Key changes

Write path:

  • VectorTransformer / NormalizeVectorTransformer: Added computeL2Norm() to calculate norm before normalization
  • KNNVectorFieldMapper.parseCreateField: Computes norm, passes it to getFieldsForFloatVector which adds a
    FloatDocValuesField when derived source is enabled and norm ≠ 1.0
  • DerivedSourceIndexOperationListener: Denormalizes vectors before injecting into translog source
  • KNN10010DerivedSourceStoredFieldsWriter: Records norm field names in segment attributes

Read path:

  • DerivedSourceNormSupplier: Functional interface with UNIT (no-op) and fromDocValues (reads from NumericDocValues) implementations
  • PerFieldDerivedVectorTransformerFactory: Looks up norm FieldInfo and creates the appropriate DerivedSourceNormSupplier
  • AbstractPerFieldDerivedVectorTransformer.formatVector: Lazily reads norm and applies denormalization on both
    byte[] (legacy BinaryDocValues) and float[] (Lucene-based vector format) paths

Utilities:

  • KNNVectorUtil: Added denormalize(vector, norm, inplace) and getNormFieldName(fieldName)

Backward compatibility

  • Segments without norm attributes fall back to norm = 1.0 (no denormalization)
  • Only Faiss + cosinesimil + derived source triggers norm storage; all other configurations are unaffected

Related Issues

Resolves #3083

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here

@krocky-cooky
Author

Among several implementation approaches, I believe this implementation is the best option, though there are a few
considerations:

  • Due to floating-point rounding errors, there will be minor differences between the original vector and the vector returned in _source.
    Since the difference is less than 1e-4, this should not be a significant issue in practice.
  • The norm is stored as doc values under _knn_norm_<vector_field>, which incurs additional storage usage.
    Since it is a single floating-point value and is only stored under the cosinesimil + faiss condition, the impact is
    minimal.
  • During the read phase, there is additional overhead from reading the norm from disk.
    Since the norm is stored in the same file as the vectors, OS-level caching can be leveraged, keeping the impact small.
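
To make the rounding-error point concrete, here is a rough sketch that measures the largest absolute difference between an original component and its value after a normalize/denormalize round trip in float arithmetic. The class and method names are hypothetical, and the value range (roughly -10 to 10) is an assumption; the absolute error grows with the magnitude of the components, so the result says nothing about much larger inputs.

```java
import java.util.Random;

/** Hypothetical measurement helper; not part of the PR. */
final class RoundTripErrorSketch {

    /** Largest absolute difference between original and restored components. */
    static float maxRoundTripError(int dimension, int trials, long seed) {
        Random random = new Random(seed);
        float maxError = 0.0f;
        for (int t = 0; t < trials; t++) {
            float[] original = new float[dimension];
            double sumOfSquares = 0.0;
            for (int i = 0; i < dimension; i++) {
                original[i] = (random.nextFloat() - 0.5f) * 20.0f; // assumed range, roughly [-10, 10)
                sumOfSquares += (double) original[i] * original[i];
            }
            float norm = (float) Math.sqrt(sumOfSquares);
            if (norm == 0.0f) {
                continue; // skip the degenerate all-zero case
            }
            for (float value : original) {
                float restored = (value / norm) * norm; // write-time divide, read-time multiply
                maxError = Math.max(maxError, Math.abs(restored - value));
            }
        }
        return maxError;
    }
}
```

Calling RoundTripErrorSketch.maxRoundTripError(128, 1000, 42L) gives a quick sanity check against the 1e-4 bound for inputs in that range.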

Happy to address any feedback!

@krocky-cooky force-pushed the feat/return-original-vec-3083 branch from e29f585 to 0fcdcaa on March 27, 2026 01:12
…iss cosine with derived source

When Faiss engine with cosine similarity and derived source are both enabled,
vectors are L2-normalized at index time, causing _source to return normalized
vectors instead of the originals. This change stores the pre-normalization L2
norm as a FloatDocValuesField and uses it to denormalize vectors when
reconstructing _source from doc values or translog.

Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
@krocky-cooky force-pushed the feat/return-original-vec-3083 branch from 0fcdcaa to b452d43 on March 27, 2026 01:18
@navneet1v
Collaborator

@krocky-cooky thanks for raising the PR. Will start reviewing the PR.

cc: @Vikasht34 , @shatejas , @kotwanikunal

Member

@kotwanikunal kotwanikunal left a comment

Thanks for making the changes. Took an initial pass through the changes.
I'll revisit it again.

 * @return the denormalized vector
 */
public static float[] denormalize(float[] vector, float norm, boolean inplace) {
    float[] result = inplace ? vector : Arrays.copyOf(vector, vector.length);
Member

nit: add in checks here to prevent NPEs

public static float[] denormalize(float[] vector, float norm, boolean inplace) {
    Objects.requireNonNull(vector, "vector must not be null");
    if (norm <= 0 || !Float.isFinite(norm)) {
        throw new IllegalArgumentException("norm must be a positive finite value, got: " + norm);
    }
    ...
}
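
For reference, a self-contained sketch of what the validated method could look like once the elided part is filled in. The multiplication loop is an assumption based on the PR description ("multiply it back"); the actual body in the PR is not shown in this thread, and the wrapper class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical wrapper class; the PR places this method in KNNVectorUtil. */
final class DenormalizeSketch {

    static float[] denormalize(float[] vector, float norm, boolean inplace) {
        Objects.requireNonNull(vector, "vector must not be null");
        // NaN fails the isFinite check; zero and negative values fail the first check.
        if (norm <= 0 || !Float.isFinite(norm)) {
            throw new IllegalArgumentException("norm must be a positive finite value, got: " + norm);
        }
        float[] result = inplace ? vector : Arrays.copyOf(vector, vector.length);
        for (int i = 0; i < result.length; i++) {
            result[i] *= norm; // assumed body: scale each component by the stored norm
        }
        return result;
    }
}
```

With inplace set to true the input array is mutated and returned; with false the caller's array is left untouched.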

static DerivedSourceNormSupplier fromDocValues(CheckedSupplier<NumericDocValues, IOException> supplier) {
    return (docId) -> {
        NumericDocValues dv = supplier.get();
        dv.advance(docId);
Member

This could advance to NO_MORE_DOCS if the norm is missing from doc values (for compatibility reasons or otherwise).
Can you add checks to ensure it returns the correct value? Or better, use advanceExact: https://lucene.apache.org/core/10_4_0/core/org/apache/lucene/index/NumericDocValues.html#advanceExact(int)
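
The difference can be illustrated with a tiny stand-in iterator. SparseValues below is a hypothetical class that only mimics the advance / advanceExact contract; it is not Lucene's NumericDocValues. The fallback to 1.0f for a missing norm follows the backward-compatibility note in the PR description, and decoding with Float.intBitsToFloat assumes the norm was written as raw float bits, which is how FloatDocValuesField encodes its value.

```java
/** Self-contained illustration; SparseValues is NOT a Lucene class. */
final class AdvanceExactSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal sparse doc-values iterator over sorted doc IDs. */
    static final class SparseValues {
        private final int[] docIds;  // sorted doc IDs that have a norm
        private final long[] values; // raw float bits, widened to long
        private int index = -1;

        SparseValues(int[] docIds, float[] norms) {
            this.docIds = docIds;
            this.values = new long[norms.length];
            for (int i = 0; i < norms.length; i++) {
                values[i] = Float.floatToRawIntBits(norms[i]);
            }
        }

        /** Moves to the first doc >= target; may land on a later doc or NO_MORE_DOCS. */
        int advance(int target) {
            index = Math.max(index, 0);
            while (index < docIds.length && docIds[index] < target) {
                index++;
            }
            return index < docIds.length ? docIds[index] : NO_MORE_DOCS;
        }

        /** Returns true only if the target doc itself has a value. */
        boolean advanceExact(int target) {
            return advance(target) == target;
        }

        long longValue() {
            return values[index];
        }
    }

    /** Reads the norm for docId, falling back to 1.0f (no denormalization) when absent. */
    static float readNorm(SparseValues dv, int docId) {
        if (dv == null || !dv.advanceExact(docId)) {
            return 1.0f;
        }
        return Float.intBitsToFloat((int) dv.longValue());
    }
}
```

With norms stored only for docs 0 and 2, a plain advance(1) lands on doc 2, so reading the value afterwards would return another document's norm; readNorm(dv, 1) returns 1.0f instead.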

getVectorValidator().validateVector(array);
getVectorTransformer().transform(array, true);
context.doc().addAll(getFieldsForFloatVector(array, isDerivedEnabled(context)));
VectorTransformer transformer = getVectorTransformer();
Member

Shouldn't this be limited just to the Cosine space? Given that we have access to VectorDataType?

return KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType);
Object deserialized = KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType);
if (deserialized instanceof float[] floatVector) {
float norm = normSupplier.get();
Member

nit: This code seems duplicated. Consider moving it into a method

float[] vector = ...
float norm = normSupplier.get();
if (norm != 1.0f) {
    KNNVectorUtil.denormalize(vector, norm, true);
}

);

// Index documents with non-unit vectors
Random random = new Random(42);
Member

Can we also test with all-zero vectors ([0, 0, ...])? Just to ensure we cover the multiplicative factors correctly?
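
A sketch of why the all-zero case deserves its own test: the L2 norm is 0, so a naive normalization computes 0/0, which is NaN in float arithmetic. The guarded variant below is one possible policy and purely an assumption (leave the vector untouched and report a norm of 1.0 so the read path becomes a no-op); the PR may handle this differently, for example by rejecting zero vectors during validation before this code is reached. All names are hypothetical.

```java
/** Hypothetical illustration of the zero-vector edge case; not the PR's code. */
final class ZeroVectorSketch {

    static float l2Norm(float[] vector) {
        double sumOfSquares = 0.0;
        for (float v : vector) {
            sumOfSquares += (double) v * v;
        }
        return (float) Math.sqrt(sumOfSquares);
    }

    /** Naive normalization: an all-zero input yields NaN components (0/0). */
    static float[] normalizeNaive(float[] vector) {
        float norm = l2Norm(vector);
        float[] out = new float[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = vector[i] / norm;
        }
        return out;
    }

    /**
     * Guarded normalization (assumed policy): a zero vector is left as-is and
     * 1.0f is returned, so multiplying the norm back later changes nothing.
     */
    static float normalizeGuarded(float[] vector) {
        float norm = l2Norm(vector);
        if (norm == 0.0f) {
            return 1.0f;
        }
        for (int i = 0; i < vector.length; i++) {
            vector[i] /= norm;
        }
        return norm;
    }
}
```

Reporting 1.0 rather than 0 for the zero vector also keeps the stored value compatible with a "norm must be positive" check on the read side.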

* @param isDerivedEnabled boolean to indicate if derived source is enabled
*/
public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled) {
public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled, float norm) {
Member

Can we maintain the original signature with a 1.0 value?

public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled) {
    ...
    this.vectorNorm = 1.0f;
}

Development

Successfully merging this pull request may close these issues.

[FEATURE] Enable cosine similarity to return original vectors rather than normalized vectors with Faiss Engine

3 participants