Closed
Labels: Indexing, bug, ingest-pipeline
Description
Describe the bug
Document values are updated by an update_by_query call even though an ingest pipeline is configured and one of the processors in that pipeline has failed.
Related component
Indexing
To Reproduce
- Set up a cluster running the OpenSearch 2.11 distribution with the following plugins: ml-commons, k-NN, neural-search. Create an index (index-test in the steps below) with settings similar to the following:
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "pipeline-test"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "passage_embedding"
      ]
    },
    "properties": {
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 8
          }
        }
      },
      "name": {
        "type": "text"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
- Set up a model using a remote connector from ml-commons (https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/connectors/) and configure it so that it throttles requests. In our test we used an OpenAI model configured to accept 6 requests per minute. Note the model ID of that model.
- Create an ingest pipeline with at least one processor whose "ignore_failure" flag is set to false:
PUT /_ingest/pipeline/pipeline-test
{
  "description": "An NLP ingest pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "name": "passage_embedding"
        },
        "ignore_failure": false
      }
    }
  ]
}
- Ingest several documents:
POST /_bulk
{ "index": { "_index": "index-test" } }
{ "name": "permission", "test": "Writing a list of random sentences is harder than I initially thought it would be.", "doc_keyword": "workable", "doc_index": 4976 }
{ "index": { "_index": "index-test" } }
{ "name": "sister", "test": "The fifty mannequin heads floating in the pool kind of freaked them out", "doc_keyword": "angry"}
{ "index": { "_index": "index-test" } }
{ "name": "hair", "test": "Too many prisons have become early coffins", "doc_keyword": "likeable", "doc_index": 2351 }
{ "index": { "_index": "index-test" } }
{ "name": "editor", "test": "Greetings from the real universe", "doc_index": 9871 }
{ "index": { "_index": "index-test" } }
{ "name": "statement", "test": "People keep telling me orange but I still prefer pink", "doc_keyword": "entire", "doc_index": 8242 }
- Check that there are no documents with an empty passage_embedding value:
GET /index-test/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "passage_embedding"
          }
        }
      ]
    }
  }
}
- Execute the update_by_query request multiple times until you get an error from the model:
POST /index-test/_update_by_query
{
  "query": {
    "range": {
      "doc_index": {
        "gte": 4000,
        "lte": 5000
      }
    }
  },
  "script": {
    "source": "ctx._source.doc_index++; ctx._source.doc_keyword=\"key1\";ctx._source.test=\"Text random 1\"",
    "lang": "painless"
  }
}
- Run the check for documents with an empty passage_embedding again. If the search returns any hits (>= 1), there are documents without embeddings. This is incorrect behavior: all documents were ingested with embeddings, and the only operation that could have removed them was the update:
GET /index-test/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "exists": {
            "field": "passage_embedding"
          }
        }
      ]
    }
  }
}
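The reproduction steps above can also be scripted. The following is a minimal Python sketch and not part of the original report. It assumes an unsecured cluster at http://localhost:9200 (a hypothetical address) and uses only the standard library. The helper names (bulk_body, missing_embedding_query, call, count_missing, reproduce) are illustrative. It further assumes that a failed update surfaces either as an HTTP error or as a non-empty "failures" array in the response, which may not match how the failure is actually reported.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local cluster without auth
INDEX = "index-test"


def bulk_body(docs, index=INDEX):
    """Build the NDJSON payload for POST /_bulk (one action line per document)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


def missing_embedding_query(field="passage_embedding"):
    """Query matching documents that have no value for the embedding field."""
    return {"query": {"bool": {"must_not": [{"exists": {"field": field}}]}}}


def call(method, path, body=None, content_type="application/json"):
    """Send one request to the cluster and return the decoded JSON response."""
    data = None
    if body is not None:
        data = (body if isinstance(body, str) else json.dumps(body)).encode("utf-8")
    req = urllib.request.Request(BASE_URL + path, data=data, method=method,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_missing(send=call):
    """Number of documents that currently lack passage_embedding."""
    result = send("POST", f"/{INDEX}/_search", missing_embedding_query())
    return result["hits"]["total"]["value"]


def reproduce(update_body, send=call, max_attempts=20):
    """Repeat _update_by_query until a failure is seen, then count affected docs."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = send("POST", f"/{INDEX}/_update_by_query", update_body)
            failed = bool(response.get("failures"))
        except urllib.error.HTTPError:
            failed = True
        if failed:
            send("POST", f"/{INDEX}/_refresh")  # make the update visible to search
            return attempt, count_missing(send)
    return None, count_missing(send)
```

Calling reproduce with the update_by_query body from the step above returns the attempt number at which the failure occurred and the number of documents without an embedding. A non-zero count after a failure corresponds to the behavior reported here.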
Expected behavior
Because the processor is configured with "ignore_failure": false, we expect the update call to fail and no changes to be stored.
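One way to express this expectation as a check is sketched below in Python. It is an illustration and not part of the original report. The helper name update_was_rejected is hypothetical, and the sketch assumes the failure would be reported through the "failures" array of the _update_by_query response together with an "updated" count of zero. The way a processor failure actually surfaces may differ.

```python
def update_was_rejected(response):
    """Return True when the response reports failures and no updated documents."""
    has_failures = len(response.get("failures", [])) > 0
    nothing_updated = response.get("updated", 0) == 0
    return has_failures and nothing_updated


# Hypothetical response shapes, for illustration only.
expected = {"updated": 0, "failures": [{"cause": {"type": "exception"}}]}
observed = {"updated": 1, "failures": []}

print(update_was_rejected(expected))
print(update_was_rejected(observed))
```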
Additional Details
Plugins
ml-commons, k-NN, neural-search
Host/Environment (please complete the following information):
- Version 2.11
Additional context
I've tried the same scenario without the excludes setting for the "passage_embedding" field, and it works as expected. This is the setting that was removed:
"_source": {
  "excludes": [
    "passage_embedding"
  ]
},
I assume that behind the scenes the document is still updated in that case too, but because no fields are excluded from _source, the passage_embedding value is copied over from the original document.
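This assumption can be illustrated with a toy model in Python. It is not OpenSearch code: stored_source, failing_pipeline, update_by_query and bump are hypothetical stand-ins. The sketch models an update as reading the stored _source, applying the script, and re-running the pipeline, and it mirrors the reported behavior in which the write goes through even when the pipeline fails.

```python
def stored_source(doc, excludes=()):
    """What is kept in _source at index time: the document minus excluded fields."""
    return {k: v for k, v in doc.items() if k not in excludes}


def failing_pipeline(doc):
    """Stand-in for a text_embedding processor whose model call is throttled."""
    raise RuntimeError("model throttled")


def update_by_query(source, script, pipeline):
    """Toy update: start from the stored _source, apply the script, run the pipeline."""
    doc = dict(source)
    script(doc)
    try:
        doc = pipeline(doc)
    except RuntimeError:
        pass  # models the reported bug: the failure does not abort the write
    return doc


def bump(doc):
    """Stand-in for the painless script from the reproduction steps."""
    doc["doc_index"] += 1


original = {"name": "permission", "doc_index": 4976, "passage_embedding": [0.1, 0.2]}

with_excludes = update_by_query(
    stored_source(original, excludes=("passage_embedding",)), bump, failing_pipeline)
without_excludes = update_by_query(stored_source(original), bump, failing_pipeline)

print("passage_embedding" in with_excludes)     # False: the embedding is lost
print("passage_embedding" in without_excludes)  # True: the old value is carried over
```

In this model the embedding survives without excludes only because the old value is still present in _source, so the failed processor goes unnoticed. With excludes there is nothing to carry over and the field disappears.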