Description
Describe the bug
Bug: phrase_prefix query includes IDF scores for expanded terms that don't exist in the document
Summary
When using multi_match with type: phrase_prefix, OpenSearch computes and includes IDF scores for expanded terms that do not exist in the matching document. This artificially inflates document scores and produces incorrect relevance rankings.
Environment
- OpenSearch Version: 2.11
- Query Type: multi_match with type: phrase_prefix
Problem Description
When searching for "Data manipulation with Strings", documents that don't match the course name are ranking higher than documents that do match, despite the course name field having a higher boost (^3.0 vs ^1.5).
The root cause: phrase_prefix expands the last term ("strings") into many possible terms, and OpenSearch includes IDF scores for all expanded terms in the score calculation, even when those terms don't appear in the document.
Query
GET /learningobject/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "accountId": {
              "value": "xxxxxx"
            }
          }
        },
        {
          "multi_match": {
            "query": "Data manipulation with Strings",
            "fields": [
              "courseDescription_en_AU^1.5",
              "courseDescription_en_GB^1.5",
              "courseDescription_en_US^1.5",
              "courseName_en_AU^3.0",
              "courseName_en_GB^3.0",
              "courseName_en_US^3.0",
              "courseOverview_en_AU^1.5",
              "courseOverview_en_GB^1.5",
              "courseOverview_en_US^1.5"
            ],
            "type": "phrase_prefix",
            "operator": "OR",
            "slop": 0,
            "prefix_length": 0,
            "max_expansions": 350,
            "zero_terms_query": "NONE",
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": false,
            "boost": 1
          }
        }
      ]
    }
  }
}
Example: Incorrect Behavior
Top Result (Score: 126.02) - WRONG
Document: "Invoke Method and Invoke Code in Studio (v2024.10)-2"
- Course Name: Does NOT contain "Data manipulation with Strings"
- Course Overview: Contains "Data manipulation with strings in Studio" (line 121)
Explanation shows:
weight(courseOverview_en_US:"data manipulation with (stringscreating stringstreams strings.xml strings stringsbuild stringslesson stringsoperator stringsâ)" in 495)
IDF scores included for terms NOT in document:
- stringscreating - IDF: 13.730124 (n=2 documents) - NOT IN DOCUMENT
- stringstreams - IDF: 14.240951 (n=1 document) - NOT IN DOCUMENT
- stringsbuild - IDF: 10.850926 (n=44 documents) - NOT IN DOCUMENT
- stringslesson - IDF: 14.240951 (n=1 document) - NOT IN DOCUMENT
- stringsoperator - IDF: 14.240951 (n=1 document) - NOT IN DOCUMENT
- strings.xml - IDF: 5.877608 (n=6430 documents) - NOT IN DOCUMENT
Actual content in document: Only contains "Data manipulation with strings in Studio"
Third Result (Score: 109.73) - CORRECT
Document: "Data manipulation with Strings in Studio (v2024.10)-1"
- Course Name: Contains "Data manipulation with Strings" ✅
- Course Overview: Contains "Data manipulation with Strings in Studio"
Explanation shows:
weight(courseName_en_US:"data manipulation with strings" in 2780)
This document correctly matches the course name but ranks lower due to inflated scores from non-existent terms.
Core Issue
According to my understanding, IDF (Inverse Document Frequency) should only be computed for terms that actually exist in the document being scored.
The current behavior:
- phrase_prefix expands "strings" → [strings, stringscreating, stringstreams, strings.xml, stringsbuild, ...]
- OpenSearch computes IDF for all expanded terms
- IDF scores are summed and included in the document score
- Terms that don't exist in the document still contribute to the score
This violates the fundamental principle of TF-IDF scoring: if a term doesn't appear in a document, it should contribute zero to that document's score.
Expected Behavior
IDF should only be computed and included for terms that:
- Are part of the expanded query terms
- AND actually appear in the document being scored
Terms that don't exist in the document should contribute 0 to the IDF sum, not their global IDF value.
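For illustration, the gap between the reported and the expected IDF sum can be estimated from the n and N values in the explain output. This is only a sketch: the helper and variable names are hypothetical, the formula is the one printed in the explain descriptions, and it assumes that exactly one expansion (the literal "strings") matches at the prefix position. Because the explain output does not label which n belongs to which term, the expected sum is given as an upper bound rather than an exact value.

```python
import math

# N, total number of documents with the field (from the explain output)
TOTAL_DOCS = 2_295_394


def bm25_idf(doc_freq, total_docs=TOTAL_DOCS):
    """IDF as printed in the explain output: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))


# n values of the first three idf entries ("data", "manipulation", "with")
leading_term_doc_freqs = [304_683, 12_886, 1_216_677]
# n values of the eight entries produced by expanding "strings", in explain order
expansion_doc_freqs = [2, 1, 44, 6_430, 2, 1, 1, 1]

leading_idf = sum(bm25_idf(n) for n in leading_term_doc_freqs)
expansion_idfs = [bm25_idf(n) for n in expansion_doc_freqs]

# Current behavior: every expansion contributes, whether or not it is in the document.
reported_idf_sum = leading_idf + sum(expansion_idfs)

# Expected behavior (assumption: one expansion matches): at most the largest single IDF.
expected_idf_upper_bound = leading_idf + max(expansion_idfs)

print(f"reported idf sum:         {reported_idf_sum:.4f}")
print(f"expected idf upper bound: {expected_idf_upper_bound:.4f}")
```

With these inputs the reported sum comes to about 108.99, while the bound for the expected sum is about 22.08.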
Impact
- Incorrect relevance ranking: Documents that don't match the query rank higher than documents that do
- Field boost ignored: Higher boosts on important fields (like course name) are negated by inflated scores from non-existent terms
- Poor search quality: Users see irrelevant results at the top
Workaround
Using type: phrase instead of phrase_prefix avoids the expansion issue, but loses prefix matching functionality.
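As a sketch of the workaround, the multi_match clause could be built with the type as a parameter, so that the phrase and phrase_prefix variants can be sent with explain enabled and compared side by side. The helper name and the shortened field list are illustrative and not part of the original query; the accountId filter is omitted for brevity.

```python
import json

# Shortened, illustrative field list; the original query uses nine fields.
FIELDS = [
    "courseName_en_US^3.0",
    "courseOverview_en_US^1.5",
    "courseDescription_en_US^1.5",
]


def build_multi_match(query_text, match_type):
    """Return a search body whose multi_match clause uses the given type."""
    clause = {
        "query": query_text,
        "fields": FIELDS,
        "type": match_type,
        "slop": 0,
    }
    # max_expansions only matters when the last term is prefix-expanded.
    if match_type == "phrase_prefix":
        clause["max_expansions"] = 350
    return {"explain": True, "query": {"multi_match": clause}}


original_body = build_multi_match("Data manipulation with Strings", "phrase_prefix")
workaround_body = build_multi_match("Data manipulation with Strings", "phrase")

print(json.dumps(workaround_body, indent=2))
```

The printed body can be pasted after GET /learningobject/_search; as noted above, the phrase variant no longer matches partial last words.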
Additional Context
The explanation output shows the IDF sum includes contributions from terms with very high IDF values (14.24 for terms appearing in only 1 document), which significantly inflates scores even though these terms don't exist in the matched document.
Related component
No response
To Reproduce
The explain output for the document that receives the highest score:
"_explanation": {
"value": 126.020096,
"description": "sum of:",
"details": [
{
"value": 1,
"description": "accountId:[132311 TO 132311]",
"details": []
},
{
"value": 125.020096,
"description": "max of:",
"details": [
{
"value": 125.020096,
"description": """weight(courseOverview_en_US:"data manipulation with (stringscreating stringstreams strings.xml strings stringsbuild stringslesson stringsoperator stringsâ)" in 495) [PerFieldSimilarity], result of:""",
"details": [
{
"value": 125.020096,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 3.3000002,
"description": "boost",
"details": []
},
{
"value": 108.989235,
"description": "idf, sum of:",
"details": [
{
"value": 2.0193868,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 304683,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 5.18248,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 12886,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 0.6347812,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1216677,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 13.730124,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 14.240951,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 10.850926,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 44,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 5.877608,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 6430,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 13.730124,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 14.240951,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 14.240951,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 14.240951,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 1,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 2295394,
"description": "N, total number of documents with field",
"details": []
}
]
}
]
},
{
"value": 0.347602,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1,
"description": "phraseFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 216,
"description": "dl, length of field (approximate)",
"details": []
},
{
"value": 123.28348,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
}
]
}
]
}
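The top-level numbers in this output can be re-derived by hand, which suggests that the idf sum is what drives the score. A minimal sketch using only values copied from the explain output (function and variable names are illustrative):

```python
def bm25_tf(freq, k1, b, dl, avgdl):
    """tf as printed in the explain output: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


boost = 3.3000002        # "boost"
idf_sum = 108.989235     # "idf, sum of:"
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=216, avgdl=123.28348)

# score(freq=1.0), computed as boost * idf * tf
phrase_score = boost * idf_sum * tf

# The outer "sum of:" adds 1 for the accountId clause.
total_score = 1 + phrase_score

# Share of the idf sum that comes from the eight expansions of "strings":
# everything except the first three entries ("data", "manipulation", "with").
leading_idf = 2.0193868 + 5.18248 + 0.6347812
expansion_share = (idf_sum - leading_idf) / idf_sum

print(f"tf = {tf:.6f}")
print(f"phrase score = {phrase_score:.4f}, total = {total_score:.4f}")
print(f"expansion share of idf sum = {expansion_share:.1%}")
```

The result agrees with the reported 125.020096 and 126.020096 up to rounding, and roughly 93% of the idf sum comes from the expanded terms.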
Expected behavior
Expansions that do not appear in the document should not contribute to its score.