Describe the bug
I used the index template below to create an index, along with a single document for testing. The document is indexed such that only the digits are kept; for example, indexing a value like “aaa123” would only retain the “123” portion.
Likewise, if a user searches for something like “zzz999”, the only token generated should be “999”.
This works for the most part. But if the user enters a query_string that contains an asterisk, then the document is returned regardless of whether it actually matches. For example, searching for:
*asdf*
produces a match with the document below, which has a value of 1234. This contrasts with the two _analyze requests listed below: neither produces a token that matches the indexed value.
Thanks in advance for looking at this issue.
Related component
Search
To Reproduce
Create an index as follows:
PUT ds1
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_nondigits": { "type": "pattern_replace", "pattern": "\\D", "replacement": "" }
      },
      "filter": {
        "remove_empty_tokens": { "type": "length", "min": 1 },
        "replace_empty_with_null": { "type": "pattern_replace", "pattern": "^$", "replacement": "<NULL>" }
      },
      "analyzer": {
        "special_number_analyzer": {
          "type": "custom",
          "char_filter": [ "strip_nondigits" ],
          "tokenizer": "keyword",
          "filter": [ "remove_empty_tokens" ]
        },
        "special_number_analyzer_search": {
          "type": "custom",
          "char_filter": [ "strip_nondigits" ],
          "tokenizer": "keyword",
          "filter": [ "replace_empty_with_null" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "special_number_field": {
        "type": "text",
        "analyzer": "special_number_analyzer",
        "search_analyzer": "special_number_analyzer_search"
      }
    }
  }
}
Add a test document:
POST /_bulk?refresh=true
{ "index": { "_index": "ds1" } }
{ "special_number_field": "1234" }
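The bulk API expects newline-delimited JSON: one action line, then one source line, with a trailing newline after the last line. For anyone reproducing this from a script rather than the console, here is a minimal Python sketch that builds that body. The helper name `build_bulk_body` is hypothetical and not part of any client library; sending the body (for example with an HTTP client and a `Content-Type: application/x-ndjson` header) is left out.

```python
import json


def build_bulk_body(index_name, documents):
    """Build an NDJSON bulk body: one action line plus one source line per document."""
    lines = []
    for doc in documents:
        # Action line: index the following document into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("ds1", [{"special_number_field": "1234"}])
```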
Test with analyzer:
GET ds1/_analyze { "text": "*asdf*", "analyzer": "special_number_analyzer" }
correctly produces:
{ "tokens": [] }
Test with search_analyzer:
GET ds1/_analyze { "text": "*asdf*", "analyzer": "special_number_analyzer_search" }
correctly produces:
{ "tokens": [ { "token": "<NULL>", "start_offset": 6, "end_offset": 6, "type": "word", "position": 0 } ] }
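For reference, the two _analyze results can be approximated offline with a small Python emulation of the analysis chain. This is only a sketch of the configuration in this report, not OpenSearch code: the function names are made up for illustration, and it assumes (as the _analyze output suggests) that the keyword tokenizer emits a single empty token when the char filter strips everything.

```python
import re
from typing import List


def strip_nondigits(text: str) -> str:
    # Mirrors the "strip_nondigits" char_filter: every non-digit is removed.
    return re.sub(r"\D", "", text)


def keyword_tokenize(text: str) -> List[str]:
    # Assumption: the keyword tokenizer yields the whole input as one token,
    # including a single empty token for empty input.
    return [text]


def index_tokens(text: str) -> List[str]:
    # special_number_analyzer: drop tokens shorter than 1 character.
    return [t for t in keyword_tokenize(strip_nondigits(text)) if len(t) >= 1]


def search_tokens(text: str) -> List[str]:
    # special_number_analyzer_search: replace an empty token with "<NULL>".
    return [re.sub(r"^$", "<NULL>", t) for t in keyword_tokenize(strip_nondigits(text))]
```

With these helpers, `index_tokens("*asdf*")` gives an empty list and `search_tokens("*asdf*")` gives a single `<NULL>` token, matching the two responses shown.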
The bug occurs when you run this query:
GET ds1/_search { "query": { "query_string": { "query": "*asdf*", "analyzer": "special_number_analyzer" } } }
produces this unwanted hit:
{ "took": 4, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "ds1", "_id": "vB0QnJ0Bpkf8R5zRYeIl", "_score": 1, "_source": { "special_number_field": "1234" } } ] } }
Likewise, with the search_analyzer:
GET ds1/_search { "query": { "query_string": { "query": "*asdf*", "analyzer": "special_number_analyzer_search" } } }
produces this unwanted hit:
{ "took": 5, "timed_out": false, "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 1, "relation": "eq" }, "max_score": 1, "hits": [ { "_index": "ds1", "_id": "vB0QnJ0Bpkf8R5zRYeIl", "_score": 1, "_source": { "special_number_field": "1234" } } ] } }
Expected behavior
The expected behaviour is that no hits should be produced, since the analyzer does not produce any tokens. If an analyzer either produces no tokens or produces tokens that do not match the tokens of an indexed document, the query_string should not return any hits regardless of whether an asterisk is used in the search.
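To make the expectation concrete, here is a short Python sketch of the matching rule I had in mind. It models only the behaviour I expected (a hit requires at least one search token to equal an indexed token); it does not describe how query_string actually parses or executes wildcard queries, and the names `analyze` and `expected_match` are hypothetical.

```python
import re


def analyze(text, search=False):
    """Rough model of the two analyzers in this report."""
    digits = re.sub(r"\D", "", text)
    if search:
        # Search analyzer: an empty token becomes "<NULL>".
        return [digits or "<NULL>"]
    # Index analyzer: empty tokens are dropped.
    return [digits] if digits else []


def expected_match(query, indexed_value):
    # Expected rule: hit only if a search token equals an indexed token.
    return bool(set(analyze(query, search=True)) & set(analyze(indexed_value)))
```

Under this model, `expected_match("*asdf*", "1234")` is false, while a query such as `zzz1234` would match the test document.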
Additional Details
Plugins
n/a
Screenshots
n/a
Host/Environment (please complete the following information):
- OpenSearch Version 2.19.4