[BUG] synonym_graph filter fails with word_delimiter_graph when using whitespace or classic tokenizer in synonym_analyzer – similar to #16263 #18037
Description
Describe the bug
I’m encountering a bug similar to #16263 when configuring analyzers that use both word_delimiter_graph and synonym_graph. I’m currently migrating from Solr to OpenSearch 2.19 and hit this limitation while working with the synonym_graph filter and a custom synonym_analyzer (whitespace tokenizer).
When I define a simple synonym analyzer with the whitespace tokenizer (i.e., no_split_synonym_analyzer) and reference it from the synonym_graph filter, everything works as expected.
However, the moment I add any additional filters such as word_delimiter_graph, asciifolding, or hunspell, I encounter the following error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
}
],
"type": "illegal_argument_exception",
"reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
},
"status": 400
}
Use Case
In our Solr configuration, we handle synonym normalization for terms like:
covid, covid-19, covid 19
skydiving, sky diving, sky-diving
handheld, hand-held
This works seamlessly there even when using filters like WordDelimiterGraphFilterFactory, Hunspell, etc.
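To make the goal concrete, the intended delimiter handling can be sketched in plain Python. This is an illustration only, not OpenSearch code, and the `normalize` helper is hypothetical:

```python
def normalize(term: str) -> str:
    """Lowercase a term and treat '-' as whitespace, approximating the
    combined effect of lowercase + word_delimiter_graph on one term."""
    return " ".join(term.lower().replace("-", " ").split())

# Hyphenated and space-separated spellings reduce to the same token
# sequence; the synonym rules then tie that sequence to the joined form.
print(normalize("covid-19"))    # covid 19
print(normalize("Sky-Diving"))  # sky diving
print(normalize("hand-held"))   # hand held
```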
We want to achieve the same behavior in OpenSearch, using a synonym_graph filter together with a custom analyzer chain that includes:
word_delimiter_graph (with preserve_original or catenate_all)
asciifolding (with preserve_original)
hunspell
and a pattern_replace filter
Sample Config (Works):
"analyzer": {
"test_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_synonym_graph-replacement_filter"
]
},
"no_split_synonym_analyzer": {
"type": "custom",
"tokenizer": "whitespace"
}
}
Sample Config (Fails):
When adding custom_word_delimiter, asciifolding, or hunspell to the same analyzer:
"test_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"custom_word_delimiter",
"custom_hunspell_stemmer",
"custom_synonym_graph-replacement_filter"
]
}
Results in:
Token filter [custom_word_delimiter] cannot be used to parse synonyms
It would be great if OpenSearch could enhance the synonym_graph behavior to allow more flexible use of filters in the synonym_analyzer, especially word_delimiter_graph, which is commonly used in language normalization pipelines.
A similar issue was resolved in the past in #16263; perhaps this one can be handled in a similar fashion.
Related component
Indexing
To Reproduce
Create the index with the following settings:
{
"settings": {
"analysis": {
"char_filter": {
"custom_pattern_replace": {
"type": "pattern_replace",
"pattern": "[({.,\\[\\]“”/})]",
"replacement": " "
}
},
"filter": {
"custom_ascii_folding": {
"type": "asciifolding",
"preserve_original": true
},
"custom_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "(-)",
"replacement": " ",
"all": true
},
"custom_synonym_graph-replacement_filter": {
"type": "synonym_graph",
"synonyms": [
"laptop, notebook",
"covid, covid-19, covid 19",
"skydiving,sky diving,sky-diving",
"handheld,hand-held"
],
"synonym_analyzer": "no_split_synonym_analyzer"
},
"custom_word_delimiter": {
"type": "word_delimiter_graph",
"generate_word_parts": true,
"catenate_all": true,
"split_on_numerics": false,
"split_on_case_change": false
},
"custom_hunspell_stemmer": {
"type": "hunspell",
"locale": "en_US"
}
},
"analyzer": {
"test_analyzer": {
"type": "custom",
"char_filter": [
"custom_pattern_replace"
],
"tokenizer": "whitespace",
"filter": [
"custom_ascii_folding",
"lowercase",
"custom_word_delimiter",
"custom_hunspell_stemmer",
"custom_synonym_graph-replacement_filter",
"custom_pattern_replace_filter",
"flatten_graph"
]
},
"no_split_synonym_analyzer": {
"type": "custom",
"tokenizer": "whitespace"
}
}
}
}
}
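For a scripted reproduction, the settings above can be reduced to a minimal payload and built from Python. This is a sketch; the index name (`synonym-test`) and the endpoint in the trailing comment are placeholders:

```python
import json

def build_settings() -> dict:
    """Return a trimmed version of the settings above, keeping only the
    pieces needed to trigger the error."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "custom_word_delimiter": {
                        "type": "word_delimiter_graph",
                        "generate_word_parts": True,
                        "catenate_all": True,
                    },
                    "custom_synonym_graph-replacement_filter": {
                        "type": "synonym_graph",
                        "synonyms": ["covid, covid-19, covid 19"],
                        "synonym_analyzer": "no_split_synonym_analyzer",
                    },
                },
                "analyzer": {
                    "test_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "lowercase",
                            "custom_word_delimiter",
                            "custom_synonym_graph-replacement_filter",
                        ],
                    },
                    "no_split_synonym_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                    },
                },
            }
        }
    }

# Send with, e.g. (placeholder endpoint/index):
#   curl -XPUT 'localhost:9200/synonym-test' \
#        -H 'Content-Type: application/json' -d '<this JSON>'
print(json.dumps(build_settings(), indent=2))
```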
Error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
}
],
"type": "illegal_argument_exception",
"reason": "Token filter [custom_word_delimiter] cannot be used to parse synonyms"
},
"status": 400
}
Expected behavior
- The synonym_graph filter with a whitespace or classic tokenizer in its synonym_analyzer should work even when the main analyzer chain uses filters like word_delimiter_graph, asciifolding, or hunspell.
- It should not throw an error when a custom synonym_analyzer is provided.
- Currently, it works only if the synonym_analyzer uses the standard tokenizer with other filters.
- It should also work with the whitespace or classic tokenizer, allowing more flexibility.
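As an interim workaround on our side (an assumption, not an official fix), the synonym rules can be expanded offline so every hyphen/space spelling is listed explicitly; the synonym_analyzer then stays a bare whitespace tokenizer and word_delimiter_graph is never asked to parse synonyms. The `expand_entry` helper below is hypothetical:

```python
def expand_entry(entry: str) -> str:
    """For each term in a comma-separated synonym rule, also add the
    variant with hyphens replaced by spaces, preserving order and
    skipping duplicates."""
    variants = []
    for term in (t.strip() for t in entry.split(",")):
        for v in (term, term.replace("-", " ")):
            if v not in variants:
                variants.append(v)
    return ", ".join(variants)

rules = ["covid, covid-19, covid 19", "handheld,hand-held"]
for rule in rules:
    print(expand_entry(rule))
# covid, covid-19, covid 19
# handheld, hand-held, hand held
```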
Additional Details
Host/Environment (please complete the following information):
- OpenSearch version: 2.19