
Introduce vector field, vector query and rescoring based on them #31615

@mayya-sharipova


Introduce a new field type, vector, on which vector calculations can be performed during the rescoring phase:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}

Indexing

Allow only a single vector value per document.
Allow indexing of both dense and sparse vectors?

Dense form:

PUT my_index/_doc/1
{
  "my_feature":   [11.5, 10.4, 23.0]
}

Sparse form (represented as a map from dimension names to the values of the corresponding dimensions):

PUT my_index/_doc/1
{
  "my_feature": {"1": 11.5, "5": 10.5,  "101": 23.0}
}
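
To make the two input forms concrete, here is a minimal Python sketch of how a document value could be normalized into a single dimension-to-value map before encoding. The function name normalize_vector and the error messages are hypothetical and not part of this proposal; the zero-based numbering of dense positions follows the [4, 12] -> {0: 4, 1: 12} example in the Internal encoding section.

```python
from typing import Dict, List, Union

# Hypothetical type alias: a dense JSON array or a sparse JSON object.
VectorInput = Union[List[float], Dict[str, float]]


def normalize_vector(value: VectorInput) -> Dict[int, float]:
    """Map a dense list or a sparse {"dim": value} object to {dim: value}."""
    if isinstance(value, list):
        # A nested array or object would mean several values per document,
        # which the proposal does not allow.
        if any(isinstance(v, (list, dict)) for v in value):
            raise ValueError("only a single vector per document is allowed")
        # Dense form: the position in the array is the dimension.
        return {dim: float(v) for dim, v in enumerate(value)}
    if isinstance(value, dict):
        # Sparse form: JSON keys are strings, so parse them as integers.
        return {int(dim): float(v) for dim, v in value.items()}
    raise ValueError("vector must be a JSON array or a JSON object")
```

For example, normalize_vector([11.5, 10.4, 23.0]) gives {0: 11.5, 1: 10.4, 2: 23.0}, and the sparse document above gives {1: 11.5, 5: 10.5, 101: 23.0}.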

Query and Rescoring

Introduce a special type of vector query:

"vector" : {
   "field" : "my_feature",
   "query_vector": {"1": 3, "5": 10.5,  "101": 12}
}

This query can only be used in the rescoring context.
It produces a score for every document in the rescoring window as follows:

  1. If a document has no vector value for field, a score of 0 is returned.
  2. If a document has a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
    dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}
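
The scoring rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the proposed implementation: the function name cosine_score is hypothetical, vectors are represented as dimension-to-value maps, and returning 0 for an all-zero vector is an assumption the proposal does not address.

```python
import math
from typing import Dict, Optional


def cosine_score(doc_vector: Optional[Dict[int, float]],
                 query_vector: Dict[int, float]) -> float:
    """Cosine similarity between two sparse vectors; 0 if the doc has none."""
    # Rule 1: a document without a vector value scores 0.
    if not doc_vector:
        return 0.0
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(v * query_vector.get(dim, 0.0) for dim, v in doc_vector.items())
    # The denominator uses the Euclidean norm of each vector.
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    # Rule 2: cosine similarity.
    return dot / (doc_norm * query_norm)
```

Vectors pointing in the same direction score 1 regardless of magnitude, and vectors with no dimensions in common score 0.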

Internal encoding

  1. Encoding of vectors:
    Internally, should both dense and sparse vectors be encoded as sorted hashes (maps sorted by key)?
    A dense array would then be transformed as follows:
    [4, 12] -> {0: 4, 1: 12}
    Keys are sorted, so we can iterate over them in order instead of computing hashes.

  2. What should the values in vectors be?

    • floats?
    • something smaller than floats? (loses some precision, but reduces index size)
  3. Vectors are encoded as binary.
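
One possible reading of these three points, sketched in Python. The byte layout here (a 4-byte count followed by big-endian dimension/value pairs) and the names encode_vector, decode_vector and dot_product are assumptions made for illustration; the proposal does not specify a format. The half_precision flag illustrates the trade-off in point 2 using the 16-bit float format of Python's struct module.

```python
import struct
from typing import Dict, List, Tuple


def encode_vector(vector: Dict[int, float], half_precision: bool = False) -> bytes:
    """Encode {dim: value} as bytes, with pairs sorted by dimension."""
    value_fmt = "e" if half_precision else "f"  # 2-byte or 4-byte float
    pairs = sorted(vector.items())
    chunks = [struct.pack(">I", len(pairs))]  # hypothetical header: pair count
    for dim, value in pairs:
        chunks.append(struct.pack(">I" + value_fmt, dim, value))
    return b"".join(chunks)


def decode_vector(data: bytes, half_precision: bool = False) -> List[Tuple[int, float]]:
    """Decode bytes back into a list of (dim, value) pairs, sorted by dim."""
    pair_fmt = ">I" + ("e" if half_precision else "f")
    pair_size = struct.calcsize(pair_fmt)
    (count,) = struct.unpack_from(">I", data, 0)
    return [struct.unpack_from(pair_fmt, data, 4 + i * pair_size)
            for i in range(count)]


def dot_product(a: List[Tuple[int, float]], b: List[Tuple[int, float]]) -> float:
    """Dot product of two key-sorted pair lists via a merge walk, no hashing."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return total
```

With this layout each pair takes 8 bytes with 4-byte floats and 6 bytes with 2-byte floats, which is the index-size saving that point 2 weighs against precision.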
