[BUG] Create Index Fails for index using simple_pattern tokenizer #20349

@mgodwan

Describe the bug

For the simple_pattern tokenizer, index creation fails if the automaton generated during initialization is non-deterministic. This is a recent regression: the same regex pattern worked in older OpenSearch versions.

Related component

Other

To Reproduce

  • Create an index with the following body
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "some_tokenizer": {
            "type": "simple_pattern",
            "pattern": "[A-Za-z0-9]+|[^\n - ]"
          }
        },
        "analyzer": {
          "some_analyzer": {
            "type": "custom",
            "tokenizer": "some_tokenizer",
            "filter": ["lowercase"]
          }
        }
      },
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "dynamic": "false",
    "properties": {
      "field1": {
        "type": "text",
        "analyzer": "some_analyzer"
      }
    }
  }
}
  • Observe the following error
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"please determinize the incoming automaton first"}],"type":"illegal_argument_exception","reason":"please determinize the incoming automaton first"},"status":400}
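
For convenience, the reproduction steps above can be scripted. This is a minimal sketch that uses only the Python standard library; the base URL http://localhost:9200, the index name "repro-index" and the lack of authentication are assumptions about a local test cluster, not details from this report. Note that the \n in the JSON pattern is decoded to a real newline character before the tokenizer sees it.

```python
import json
import urllib.error
import urllib.request

# Index definition copied from the report. The "\n" below is a real newline,
# which json.dumps re-encodes as the two characters \n in the request body.
INDEX_BODY = {
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "some_tokenizer": {
                        "type": "simple_pattern",
                        "pattern": "[A-Za-z0-9]+|[^\n - ]",
                    }
                },
                "analyzer": {
                    "some_analyzer": {
                        "type": "custom",
                        "tokenizer": "some_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            },
            "number_of_shards": "1",
            "number_of_replicas": "0",
        }
    },
    "mappings": {
        "dynamic": "false",
        "properties": {
            "field1": {"type": "text", "analyzer": "some_analyzer"}
        },
    },
}


def create_index(base_url, index_name, body=INDEX_BODY):
    """PUT the index definition and return (HTTP status, parsed JSON response)."""
    request = urllib.request.Request(
        f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, json.load(response)
    except urllib.error.HTTPError as err:
        # Error responses (such as the 400 in this report) also carry JSON.
        return err.code, json.load(err)


if __name__ == "__main__":
    # Hypothetical local cluster and index name; adjust for your setup.
    status, payload = create_index("http://localhost:9200", "repro-index")
    print(status, json.dumps(payload))
```

On an affected 3.x node the call is expected to return status 400 with the illegal_argument_exception shown above; on 2.x it should create the index.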

Expected behavior

Index creation should succeed.

Additional Details

Plugins
analysis-commons (module)

Screenshots
NA

Additional context
This works up to OpenSearch 2.x and is broken in OpenSearch 3.x. It is related to github.com/apache/lucene/pull/485, after which the generated Automaton is no longer determinized by default.
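
To illustrate what "determinize" means here, the following is a toy sketch of subset construction in plain Python. It is not Lucene's implementation, and the three-symbol alphabet and the hand-built NFA are hypothetical simplifications of the reported pattern. Presumably the pattern yields a non-deterministic automaton because both alternatives can consume a letter as the first character; the toy NFA models exactly that overlap.

```python
from collections import deque

# Toy alphabet: "a" stands for any letter/digit, "!" for any other
# non-space character, " " for a space (hypothetical simplification).
ALPHABET = ["a", "!", " "]

# Hand-built NFA for "a+|[^ ]". From state 0 the symbol "a" leads to two
# states at once, which is what makes the automaton non-deterministic.
NFA = {
    0: {"a": {1, 2}, "!": {2}},
    1: {"a": {1}},  # inside the "a+" branch
    2: {},          # single-character branch, no outgoing transitions
}
NFA_START = 0
NFA_ACCEPT = {1, 2}


def is_deterministic(nfa):
    """True if no (state, symbol) pair has more than one target state."""
    return all(
        len(targets) <= 1
        for transitions in nfa.values()
        for targets in transitions.values()
    )


def determinize(nfa, start, accept):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa, dfa_accept = {}, set()
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        if current & accept:
            dfa_accept.add(current)
        for symbol in ALPHABET:
            target = frozenset(
                t for state in current for t in nfa[state].get(symbol, ())
            )
            if target:
                dfa[current][symbol] = target  # exactly one target per symbol
                queue.append(target)
    return dfa, start_set, dfa_accept


def accepts(dfa, start, accept, text):
    """Run the DFA over text; reject as soon as a transition is missing."""
    state = start
    for symbol in text:
        if symbol not in dfa[state]:
            return False
        state = dfa[state][symbol]
    return state in accept


if __name__ == "__main__":
    dfa, dfa_start, dfa_accept = determinize(NFA, NFA_START, NFA_ACCEPT)
    print("NFA deterministic:", is_deterministic(NFA))
    print("DFA states:", len(dfa))
    print("accepts 'aaa':", accepts(dfa, dfa_start, dfa_accept, "aaa"))
```

In Lucene itself the equivalent step is Operations.determinize; whether the fix belongs in the OpenSearch tokenizer factory or elsewhere is left open here.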
