Closed
Description
Describe the bug
For the simple_pattern tokenizer, index creation fails if the automaton generated during initialization is non-deterministic. This is a recent regression: the same regex pattern worked in older OpenSearch versions.
Related component
Other
To Reproduce
- Create an index with the following settings and mappings:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "some_tokenizer": {
            "type": "simple_pattern",
            "pattern": "[A-Za-z0-9]+|[^\n - ]"
          }
        },
        "analyzer": {
          "some_analyzer": {
            "type": "custom",
            "tokenizer": "some_tokenizer",
            "filter": ["lowercase"]
          }
        }
      },
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "dynamic": "false",
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "some_analyzer"
      }
    }
  }
}
- Observe the following error response:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"please determinize the incoming automaton first"}],"type":"illegal_argument_exception","reason":"please determinize the incoming automaton first"},"status":400}
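The reproduction steps above can be scripted. The following is a minimal sketch, not part of the original report: the helper names `build_index_body` and `create_index` are hypothetical, and it assumes an unsecured node reachable over plain HTTP. It uses only the Python standard library and returns the HTTP status and response body so the 400 error can be inspected.

```python
import json
import urllib.error
import urllib.request

# Pattern from the report. The JSON body writes it with the escape \n,
# which JSON decodes to a newline character, so Python uses "\n" as well.
PATTERN = "[A-Za-z0-9]+|[^\n - ]"


def build_index_body(pattern):
    """Return the settings and mappings from the reproduction steps."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "tokenizer": {
                        "some_tokenizer": {
                            "type": "simple_pattern",
                            "pattern": pattern,
                        }
                    },
                    "analyzer": {
                        "some_analyzer": {
                            "type": "custom",
                            "tokenizer": "some_tokenizer",
                            "filter": ["lowercase"],
                        }
                    },
                },
                "number_of_shards": "1",
                "number_of_replicas": "0",
            }
        },
        "mappings": {
            "dynamic": "false",
            "properties": {
                "field1": {"type": "text", "analyzer": "some_analyzer"}
            },
        },
    }


def create_index(base_url, index_name, body):
    """Send PUT <base_url>/<index_name> and return (status, response text).

    base_url and index_name are supplied by the caller; nothing here is
    taken from the report beyond the request body itself.
    """
    request = urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # A 4xx/5xx reply raises HTTPError; return its body for inspection.
        return err.code, err.read().decode("utf-8")
```

Calling `create_index("http://localhost:9200", "test-index", build_index_body(PATTERN))` against an affected node would be expected to return status 400 with the error shown above; the URL and index name here are placeholders, not values from the report.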
Expected behavior
Index creation should go through.
Additional Details
Plugins
analysis-commons (module)
Screenshots
NA
Additional context
This works in OpenSearch 2.x and is broken in OpenSearch 3.x. It is related to github.com/apache/lucene/pull/485, after which the generated Automaton is no longer determinized by default.