[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037

@nupurjaiswal

Description

Describe the bug

I'm encountering a bug similar to #16263 when configuring analyzers that use both word_delimiter_graph and synonym_graph. I'm currently migrating from Solr to OpenSearch 2.19 and hit a limitation with the synonym_graph filter when it references a custom synonym_analyzer (whitespace tokenizer).

When I define a simple synonym analyzer using the whitespace tokenizer (i.e., no_split_synonym_analyzer) and point the synonym_graph filter at it, everything works as expected.

However, the moment I add any additional filters such as word_delimiter_graph, asciifolding, or hunspell, I encounter the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
  },
  "status": 400
}

Use Case
In our Solr configuration, we handle synonym normalization for terms like:

covid, covid-19, covid 19

skydiving, sky diving, sky-diving

handheld, hand-held

This works seamlessly in Solr, even when filters like WordDelimiterGraphFilterFactory or Hunspell are part of the chain.

We want to achieve similar behavior in OpenSearch, using a synonym_graph filter together with a custom analyzer that includes:

word_delimiter_graph (with preserve_original or catenate_all)

asciifolding (with preserve_original)

hunspell

and a pattern_replace filter

Sample Config (Works):

"analyzer": {
  "test_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": [
      "lowercase",
      "custom_synonym_graph-replacement_filter"
    ]
  },
  "no_split_synonym_analyzer": {
    "type": "custom",
    "tokenizer": "whitespace"
  }
}
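
The working behavior can be checked with the _analyze API. This is an illustrative request only (the index name test-index and the sample text are assumptions, not from my actual setup); with the working config above it returns the expanded synonym graph without error:

```json
GET /test-index/_analyze
{
  "analyzer": "test_analyzer",
  "text": "covid-19"
}
```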

Sample Config (Fails):
When adding custom_word_delimiter, asciifolding, or hunspell to the same analyzer:

"test_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "custom_word_delimiter",
    "custom_hunspell_stemmer",
    "custom_synonym_graph-replacement_filter"
  ]
}

Results in:

Token filter [custom_word_delimiter] cannot be used to parse synonyms

It would be great if OpenSearch could enhance the synonym_graph behavior to allow more flexible use of filters alongside a custom synonym_analyzer, especially word_delimiter_graph, which is commonly used in language-normalization pipelines.

A similar issue was resolved in #16263; perhaps this one can be handled in a similar fashion.
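
For anyone hitting the same error in the meantime, one possible workaround (a sketch only, untested on 2.19, and it changes how split tokens and synonyms interact, so results may not match the Solr behavior) is to move synonym_graph ahead of word_delimiter_graph in the chain, so that no synonym-incompatible filter precedes the synonym filter:

```json
"test_analyzer": {
  "type": "custom",
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    "custom_synonym_graph-replacement_filter",
    "custom_word_delimiter",
    "custom_hunspell_stemmer"
  ]
}
```

The error is raised while building the synonym filter's internal analysis chain from the filters that come before it, so reordering avoids the check, at the cost of synonyms being applied before word splitting.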

Related component

Indexing

To Reproduce

Create Mapping

{
  "settings": {
    "analysis": {
      "char_filter": {
        "custom_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "[({.,\\[\\]“”/})]",
          "replacement": " "
        }
      },
      "filter": {
        "custom_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": true
        },
        "custom_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "(-)",
          "replacement": " ",
          "all": true
        },
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "laptop, notebook",
            "covid, covid-19, covid 19",
            "skydiving, sky diving, sky-diving",
            "handheld, hand-held"
          ],
          "synonym_analyzer": "no_split_synonym_analyzer"
        },
        "custom_word_delimiter": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "catenate_all": true,
          "split_on_numerics": false,
          "split_on_case_change": false
        },
        "custom_hunspell_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "char_filter": [
            "custom_pattern_replace"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "custom_ascii_folding",
            "lowercase",
            "custom_word_delimiter",
            "custom_hunspell_stemmer",
            "custom_synonym_graph-replacement_filter",
            "custom_pattern_replace_filter",
            "flatten_graph"
          ]
        },
        "no_split_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
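
A minimal reproduction, stripped down to just the two interacting filters (asciifolding, hunspell, and the char_filter are not needed to trigger the error; the index name is arbitrary):

```json
PUT /synonym-bug-test
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_word_delimiter": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": ["covid, covid-19, covid 19"],
          "synonym_analyzer": "no_split_synonym_analyzer"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_word_delimiter",
            "custom_synonym_graph-replacement_filter"
          ]
        },
        "no_split_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
```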

Error:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
    },
    "status": 400
}

Expected behavior

  • The synonym_graph filter should accept a custom synonym_analyzer that uses the whitespace or classic tokenizer, even when the main analyzer chain contains filters such as word_delimiter_graph, asciifolding, or hunspell.
  • It should not throw an error when a custom synonym_analyzer is provided.
  • Currently, this works only when the synonym_analyzer uses the standard tokenizer alongside other filters.
  • It should also work with the whitespace or classic tokenizer, allowing more flexibility.
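
For comparison, per the behavior described above, changing only the synonym_analyzer to use the standard tokenizer avoids the error (everything else in the settings unchanged), although the standard tokenizer then splits terms like "covid-19" before synonym matching:

```json
"no_split_synonym_analyzer": {
  "type": "custom",
  "tokenizer": "standard"
}
```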

Additional Details

Host/Environment (please complete the following information):

  • OpenSearch version: 2.19


    Labels

    Indexing, bug
