
[BUG] Update_by_query call updates document even if ingest pipeline processor has failed with exception #14337

@martin-gaievski


Describe the bug

Document field values are updated by an update_by_query call even when an ingest pipeline is configured and one of the processors in that pipeline has failed.

Related component

Indexing

To Reproduce

  1. Set up a cluster running OpenSearch 2.11 with the following plugins: ml-commons, k-NN, neural-search. Create an index (named index-test in the steps below) with settings similar to the following:
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "pipeline-test"
    },
    "mappings": {
        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 8
                    }
                }
            },
            "name": {
                "type": "text"
            },
            "passage_text": {
                "type": "text"
            }
        }
    }
}
  2. Set up a model using a remote connector of ml-commons (https://opensearch.org/docs/latest/ml-commons-plugin/remote-models/connectors/) and configure it so that it throttles requests. In our test we used an OpenAI model configured to accept 6 requests per minute. Note the model ID of that model.
  3. Create an ingest pipeline with at least one processor that has the "ignore_failure" flag set to false:
PUT /_ingest/pipeline/pipeline-test
{
    "description": "An NLP ingest pipeline",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<model_id>",
                "field_map": {
                    "name": "passage_embedding"
                },
                "ignore_failure": false
            }
        }
    ]
}
  4. Ingest several documents:
POST /_bulk
{ "index": { "_index": "index-test" } }
{ "name": "permission", "test": "Writing a list of random sentences is harder than I initially thought it would be.", "doc_keyword": "workable", "doc_index": 4976 }
{ "index": { "_index": "index-test" } }
{ "name": "sister", "test": "The fifty mannequin heads floating in the pool kind of freaked them out", "doc_keyword": "angry"}
{ "index": { "_index": "index-test" } }
{ "name": "hair", "test": "Too many prisons have become early coffins", "doc_keyword": "likeable", "doc_index": 2351  }
{ "index": { "_index": "index-test" } }
{ "name": "editor", "test": "Greetings from the real universe", "doc_index": 9871 }
{ "index": { "_index": "index-test" } }
{ "name": "statement", "test": "People keep telling me orange but I still prefer pink", "doc_keyword": "entire", "doc_index": 8242  } 
  5. Check that there are no documents without a passage_embedding value:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
  6. Execute the update_by_query request multiple times until you get an error from the model:
POST /index-test/_update_by_query
{
  "query": {
    "range": {
      "doc_index": {
        "gte": 4000,
        "lte": 5000
      }
    }
  },
  "script" : {
    "source": "ctx._source.doc_index++; ctx._source.doc_keyword=\"key1\";ctx._source.test=\"Text random 1\"",
    "lang": "painless"
  }
}
  7. Run the check for documents without a passage_embedding value again. If the search returns any hits (>= 1), there are documents without embeddings. This is not the correct behavior: all documents were ingested with embeddings, and the only operation that could have caused the embeddings to disappear was the update:
GET /index-test/_search
{
    "query": {
        "bool": {
            "must_not": [
                {
                    "exists": {
                        "field": "passage_embedding"
                    }
                }
            ]
        }
    }
}
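
Steps 6 and 7 can be scripted so the update is retried until embeddings go missing. The following is a minimal sketch, not part of the original report: `send(method, path, body)` is a hypothetical caller-supplied function that performs the HTTP request against the cluster and returns the parsed JSON response, `max_attempts` is an arbitrary limit, and the `refresh=true` parameter is an addition so that the search sees the result of each update immediately.

```python
INDEX = "index-test"  # index name used in the steps above

# Same query as in steps 5 and 7: documents that have no passage_embedding.
MISSING_EMBEDDING_QUERY = {
    "query": {"bool": {"must_not": [{"exists": {"field": "passage_embedding"}}]}}
}

# Same body as the update_by_query request in step 6.
UPDATE_BODY = {
    "query": {"range": {"doc_index": {"gte": 4000, "lte": 5000}}},
    "script": {
        "source": 'ctx._source.doc_index++; ctx._source.doc_keyword="key1";'
                  'ctx._source.test="Text random 1"',
        "lang": "painless",
    },
}


def count_missing_embeddings(send):
    """Return how many documents have no passage_embedding field."""
    response = send("GET", f"/{INDEX}/_search", MISSING_EMBEDDING_QUERY)
    return response["hits"]["total"]["value"]


def reproduce(send, max_attempts=20):
    """Repeat update_by_query until documents without embeddings appear.

    `send` is a hypothetical transport function (an assumption of this
    sketch); it is expected to return parsed JSON even for error responses.
    Returns the attempt number at which missing embeddings were first seen,
    or None if none appeared within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        # refresh=true is an addition to the original request (assumption).
        send("POST", f"/{INDEX}/_update_by_query?refresh=true", UPDATE_BODY)
        if count_missing_embeddings(send) > 0:
            return attempt
    return None
```

With a model throttled to 6 requests per minute, a return value other than None indicates that the bug has been reproduced.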

Expected behavior

Because the processor has been configured with "ignore_failure": false, we expect the update call to fail and no changes to be stored.

Additional Details

Plugins
ml-commons, k-NN, neural-search

Host/Environment:

  • Version 2.11

Additional context
I tried the same scenario without the excludes setting for the "passage_embedding" field, and it works as expected.

        "_source": {
            "excludes": [
                "passage_embedding"
            ]
        },

I assume that behind the scenes the document is still updated, but because all fields are included in _source, the passage_embedding value is copied from the original document.
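
The assumption above can be illustrated with a toy model. This is a sketch of the suspected flow only, not OpenSearch's actual implementation: `update_from_source`, `strip_excluded`, and `ModelThrottled` are hypothetical names, and the silent `pass` on failure represents the suspected bug rather than any confirmed code path.

```python
from copy import deepcopy


class ModelThrottled(Exception):
    """Stands in for the remote model rejecting a throttled request."""


def strip_excluded(doc, excludes):
    """Return the fields that would be kept in _source."""
    return {k: v for k, v in doc.items() if k not in excludes}


def update_from_source(indexed_doc, excludes, script, embed):
    """Toy model: rebuild the document from _source, then run the pipeline.

    Assumption: the update starts from the stored _source, so any field
    excluded from _source is absent unless the pipeline regenerates it.
    """
    source = strip_excluded(indexed_doc, excludes)
    updated = script(deepcopy(source))
    try:
        updated["passage_embedding"] = embed(updated["name"])
    except ModelThrottled:
        # Suspected bug: the processor failure does not abort the write.
        pass
    return updated


def bump(source):
    """Mimics the doc_index++ part of the update script."""
    source["doc_index"] += 1
    return source


def throttled(text):
    """Embedding call that always fails, as when the model is throttled."""
    raise ModelThrottled()


doc = {"name": "permission", "doc_index": 4976, "passage_embedding": [0.1, 0.2]}

# With the exclude, the embedding is lost when the model call fails.
with_exclude = update_from_source(doc, {"passage_embedding"}, bump, throttled)

# Without the exclude, the old embedding is carried over from _source.
without_exclude = update_from_source(doc, set(), bump, throttled)
```

In this model the document is written in both cases; the exclude only determines whether the failed processor is noticeable afterwards.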


Labels: Indexing, bug, ingest-pipeline
