Store vector L2 norm to restore original vectors for Faiss cosine with derived source#3216
Conversation
|
Among several implementation approaches, I believe this implementation is the best option, though there are a few
Happy to address any feedback! |
e29f585 to
0fcdcaa
Compare
…iss cosine with derived source When Faiss engine with cosine similarity and derived source are both enabled, vectors are L2-normalized at index time, causing _source to return normalized vectors instead of the originals. This change stores the pre-normalization L2 norm as a FloatDocValuesField and uses it to denormalize vectors when reconstructing _source from doc values or translog. Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
Signed-off-by: Takuo Kuroki <kurotaku9679.sub@gmail.com>
0fcdcaa to
b452d43
Compare
|
@krocky-cooky thanks for raising the PR. Will start reviewing the PR. cc: @Vikasht34 , @shatejas , @kotwanikunal |
kotwanikunal
left a comment
There was a problem hiding this comment.
Thanks for making the changes. Took an initial pass through the changes.
I'll revisit it again.
| * @return the denormalized vector | ||
| */ | ||
| public static float[] denormalize(float[] vector, float norm, boolean inplace) { | ||
| float[] result = inplace ? vector : Arrays.copyOf(vector, vector.length); |
There was a problem hiding this comment.
nit: add in checks here to prevent NPEs
public static float[] denormalize(float[] vector, float norm, boolean inplace) {
Objects.requireNonNull(vector, "vector must not be null");
if (norm <= 0 || !Float.isFinite(norm)) {
throw new IllegalArgumentException("norm must be a positive finite value, got: " + norm);
}
...
}
| static DerivedSourceNormSupplier fromDocValues(CheckedSupplier<NumericDocValues, IOException> supplier) { | ||
| return (docId) -> { | ||
| NumericDocValues dv = supplier.get(); | ||
| dv.advance(docId); |
There was a problem hiding this comment.
This could go past to NO_MORE_DOCS if we are missing the norm in doc values (compatibility or otherwise).
Can you add in checks to ensure it returns the correct value? Or better use advanceExact - https://lucene.apache.org/core/10_4_0/core/org/apache/lucene/index/NumericDocValues.html#advanceExact(int)
| getVectorValidator().validateVector(array); | ||
| getVectorTransformer().transform(array, true); | ||
| context.doc().addAll(getFieldsForFloatVector(array, isDerivedEnabled(context))); | ||
| VectorTransformer transformer = getVectorTransformer(); |
There was a problem hiding this comment.
Shouldn't this be limited just to the Cosine space? Given that we have access to VectorDataType?
| return KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType); | ||
| Object deserialized = KNNVectorFieldMapperUtil.deserializeStoredVector(vectorBytesRef, vectorDataType); | ||
| if (deserialized instanceof float[] floatVector) { | ||
| float norm = normSupplier.get(); |
There was a problem hiding this comment.
nit: This code seems duplicated. Consider moving it into a method
float[] vector = ...
float norm = normSupplier.get();
if (norm != 1.0f) {
KNNVectorUtil.denormalize(vector, norm, true);
}
| ); | ||
|
|
||
| // Index documents with non-unit vectors | ||
| Random random = new Random(42); |
There was a problem hiding this comment.
Can we also test with [0, 0,...] values? Just to ensure we cover the multiplicative factors correctly?
| * @param isDerivedEnabled boolean to indicate if derived source is enabled | ||
| */ | ||
| public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled) { | ||
| public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled, float norm) { |
There was a problem hiding this comment.
Can we maintain the original signature with a 1.0 value?
public DerivedKnnFloatVectorField(String name, float[] vector, boolean isDerivedEnabled) {
...
this.vectorNorm = 1.0f;
}
Description
Problem
When Faiss engine with cosine similarity and derived source are both enabled, vectors are L2-normalized in-place
during indexing. Since derived source reconstructs
_sourcefrom doc values, it returns normalized vectors insteadof the user's original input.
Solution
Store the pre-normalization L2 norm as a
FloatDocValuesField(_knn_norm_<field_name>) alongside the vector. On_sourcereconstruction, read the norm and multiply it back to recover the original vector.Key changes
Write path:
VectorTransformer/NormalizeVectorTransformer: AddedcomputeL2Norm()to calculate norm before normalizationKNNVectorFieldMapper.parseCreateField: Computes norm, passes it togetFieldsForFloatVectorwhich adds aFloatDocValuesFieldwhen derived source is enabled and norm ≠ 1.0DerivedSourceIndexOperationListener: Denormalizes vectors before injecting into translog sourceKNN10010DerivedSourceStoredFieldsWriter: Records norm field names in segment attributesRead path:
DerivedSourceNormSupplier: Functional interface withUNIT(no-op) andfromDocValues(reads fromNumericDocValues) implementationsPerFieldDerivedVectorTransformerFactory: Looks up norm FieldInfo and creates the appropriateDerivedSourceNormSupplierAbstractPerFieldDerivedVectorTransformer.formatVector: Lazily reads norm and applies denormalization on bothbyte[] (legacy BinaryDocValues) and float[] (Lucene-based vector format) paths
Utilities:
KNNVectorUtil: Addeddenormalize(vector, norm, inplace)andgetNormFieldName(fieldName)Backward compatibility
Related Issues
Resolves #3083
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here