phrase_prefix query includes IDF scores for expanded terms that don't exist in the document

### Describe the bug

# Bug: `phrase_prefix` query includes IDF scores for expanded terms that don't exist in the document

## Summary

When using `multi_match` with `type: phrase_prefix`, OpenSearch computes and includes IDF scores for expanded terms that **do not exist in the matching document**. This artificially inflates document scores and produces incorrect relevance rankings.

## Environment

- **OpenSearch Version**: 2.11
- **Query Type**: `multi_match` with `type: phrase_prefix`

## Problem Description

When searching for "Data manipulation with Strings", documents whose course name does not match rank higher than documents whose course name does, despite the course name fields having a higher boost (`^3.0` vs `^1.5`).

The root cause: `phrase_prefix` expands the last term ("strings") into many possible terms, and OpenSearch includes IDF scores for **all** expanded terms in the score calculation, even when those terms don't appear in the document.

## Query

```json
GET /learningobject/_search
{
  "explain": true, 
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "accountId": {
              "value": "xxxxxx"
            }
          }
        },
        {
          "multi_match": {
            "query": "Data manipulation with Strings",
            "fields": [
              "courseDescription_en_AU^1.5",
              "courseDescription_en_GB^1.5",
              "courseDescription_en_US^1.5",
              "courseName_en_AU^3.0",
              "courseName_en_GB^3.0",
              "courseName_en_US^3.0",
              "courseOverview_en_AU^1.5",
              "courseOverview_en_GB^1.5",
              "courseOverview_en_US^1.5"
            ],
            "type": "phrase_prefix",
            "operator": "OR",
            "slop": 0,
            "prefix_length": 0,
            "max_expansions": 350,
            "zero_terms_query": "NONE",
            "auto_generate_synonyms_phrase_query": true,
            "fuzzy_transpositions": false,
            "boost": 1
          }
        }
      ]
    }
  }
}
```

## Example: Incorrect Behavior

### Top Result (Score: 126.02) - **WRONG**

**Document**: "Invoke Method and Invoke Code in Studio (v2024.10)-2"
- **Course Name**: Does NOT contain "Data manipulation with Strings"
- **Course Overview**: Contains "Data manipulation with strings in Studio" (line 121)

**Explanation shows**:
```
weight(courseOverview_en_US:"data manipulation with (stringscreating stringstreams strings.xml strings stringsbuild stringslesson stringsoperator stringsâ)" in 495)
```

**IDF scores included for terms NOT in document**:
- `stringscreating` - IDF: 13.730124 (n=2 documents) - **NOT IN DOCUMENT**
- `stringstreams` - IDF: 14.240951 (n=1 document) - **NOT IN DOCUMENT**
- `stringsbuild` - IDF: 10.850926 (n=44 documents) - **NOT IN DOCUMENT**
- `stringslesson` - IDF: 14.240951 (n=1 document) - **NOT IN DOCUMENT**
- `stringsoperator` - IDF: 14.240951 (n=1 document) - **NOT IN DOCUMENT**
- `strings.xml` - IDF: 5.877608 (n=6430 documents) - **NOT IN DOCUMENT**

**Actual content in document**: Only contains "Data manipulation with strings in Studio"

### Third Result (Score: 109.73) - **CORRECT**

**Document**: "Data manipulation with Strings in Studio (v2024.10)-1"
- **Course Name**: Contains "Data manipulation with Strings" ✅
- **Course Overview**: Contains "Data manipulation with Strings in Studio"

**Explanation shows**:
```
weight(courseName_en_US:"data manipulation with strings" in 2780)
```

This document correctly matches the course name but ranks lower due to inflated scores from non-existent terms.

## Core Issue

**As I understand it, IDF (Inverse Document Frequency) should only be computed for terms that actually exist in the document being scored.**

The current behavior:
1. `phrase_prefix` expands "strings" → `[strings, stringscreating, stringstreams, strings.xml, stringsbuild, ...]`
2. OpenSearch computes IDF for **all** expanded terms
3. IDF scores are summed and included in the document score
4. Terms that don't exist in the document still contribute to the score

This violates the fundamental principle of TF-IDF scoring: **if a term doesn't appear in a document, it should contribute zero to that document's score**.

## Expected Behavior

IDF should only be computed and included for terms that:
1. Are part of the expanded query terms
2. **AND** actually appear in the document being scored

Terms that don't exist in the document should contribute `0` to the IDF sum, not their global IDF value.
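
To make the expectation concrete, here is a minimal Python sketch contrasting the two ways of summing IDF. Only the IDF formula is copied from the explain output. The function names, the value of `N`, the document frequencies, and the term set are hypothetical illustration values, and this is not a description of how Lucene or OpenSearch implements scoring internally.

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """IDF formula as printed in the explain output."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def idf_sum_all(doc_freqs: dict, N: int) -> float:
    """Observed behavior: every expanded term contributes its global IDF."""
    return sum(bm25_idf(n, N) for n in doc_freqs.values())

def idf_sum_present(doc_freqs: dict, doc_terms: set, N: int) -> float:
    """Requested behavior: only expansions found in the document contribute."""
    return sum(
        bm25_idf(n, N) for term, n in doc_freqs.items() if term in doc_terms
    )

# Hypothetical numbers, chosen only to illustrate the difference.
N = 1_000_000
expansions = {"strings": 5_000, "stringstreams": 1, "stringslesson": 1}
doc_terms = {"data", "manipulation", "with", "strings", "in", "studio"}

observed = idf_sum_all(expansions, N)
expected = idf_sum_present(expansions, doc_terms, N)
print(f"observed={observed:.3f} expected={expected:.3f}")
```

In this sketch the two rare expansions dominate the observed sum, while the expected sum contains only the IDF of the one expansion that is actually present.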

## Impact

- **Incorrect relevance ranking**: Documents that don't match the query rank higher than documents that do
- **Field boosts outweighed**: Higher boosts on important fields (like course name) are negated by inflated scores from non-existent terms
- **Poor search quality**: Users see irrelevant results at the top

## Workaround

Using `type: phrase` instead of `phrase_prefix` avoids the expansion issue, but loses prefix matching functionality.
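
A rough sketch of the workaround as a request body, built in Python so it can be pasted into a client script. The helper name `build_phrase_query` is hypothetical, the field list is shortened for readability, and the prefix-related options (`max_expansions`, `prefix_length`) are left out on the assumption that they are not needed once the type is `phrase`.

```python
import json

def build_phrase_query(text, fields, account_id):
    """Same bool/must structure as the query in this report, with type "phrase"."""
    return {
        "explain": True,
        "query": {
            "bool": {
                "must": [
                    {"term": {"accountId": {"value": account_id}}},
                    {
                        "multi_match": {
                            "query": text,
                            "fields": fields,
                            # "phrase" does not expand the last term
                            "type": "phrase",
                            "slop": 0,
                        }
                    },
                ]
            }
        },
    }

body = build_phrase_query(
    "Data manipulation with Strings",
    ["courseName_en_US^3.0", "courseOverview_en_US^1.5"],
    "xxxxxx",
)
print(json.dumps(body, indent=2))
```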

## Additional Context

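The explanation output shows an IDF sum of 108.989235. Only about 7.84 of that comes from the first three entries (n = 304683, 12886, and 1216677); the remaining eight entries, which correspond to the expansions of "strings", contribute about 101.15. Four of them have an IDF of 14.240951 each (n = 1), which significantly inflates the score even though these terms don't exist in the matched document.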
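
The arithmetic in the explain output can be re-derived with a few lines of Python. This is a sketch that plugs the `n`, `N`, `k1`, `b`, `dl`, `avgdl`, and `boost` values from the explanation into the two formulas it prints. Treating the first three IDF entries as the fixed phrase terms and the remaining eight as the expansions is an assumption based on the order of the entries. Small differences from the reported values are expected because this uses double precision.

```python
import math

N = 2295394  # "N, total number of documents with field"

# n values in the order they appear under "idf, sum of:".
leading_n = [304683, 12886, 1216677]        # first three entries
expansion_n = [2, 1, 44, 6430, 2, 1, 1, 1]  # remaining eight entries

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, k1, b, dl, avgdl):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

leading_idf = sum(bm25_idf(n, N) for n in leading_n)
expansion_idf = sum(bm25_idf(n, N) for n in expansion_n)
idf_sum = leading_idf + expansion_idf

tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=216, avgdl=123.28348)
boost = 3.3000002  # "boost" value from the explanation
score = boost * idf_sum * tf

print(f"idf_sum={idf_sum:.4f} (explain reports 108.989235)")
print(f"share from expansions={expansion_idf / idf_sum:.1%}")
print(f"score={score:.4f} (explain reports 125.020096)")
```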


### Related component

_No response_

### To Reproduce

Run the query above with `"explain": true`. The explanation for the document that gets the highest score:

        "_explanation": {
          "value": 126.020096,
          "description": "sum of:",
          "details": [
            {
              "value": 1,
              "description": "accountId:[132311 TO 132311]",
              "details": []
            },
            {
              "value": 125.020096,
              "description": "max of:",
              "details": [
                {
                  "value": 125.020096,
                  "description": """weight(courseOverview_en_US:"data manipulation with (stringscreating stringstreams strings.xml strings stringsbuild stringslesson stringsoperator stringsâ)" in 495) [PerFieldSimilarity], result of:""",
                  "details": [
                    {
                      "value": 125.020096,
                      "description": "score(freq=1.0), computed as boost * idf * tf from:",
                      "details": [
                        {
                          "value": 3.3000002,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 108.989235,
                          "description": "idf, sum of:",
                          "details": [
                            {
                              "value": 2.0193868,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 304683,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 5.18248,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 12886,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 0.6347812,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1216677,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 13.730124,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 2,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 14.240951,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 10.850926,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 44,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 5.877608,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 6430,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 13.730124,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 2,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 14.240951,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 14.240951,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 14.240951,
                              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "n, number of documents containing term",
                                  "details": []
                                },
                                {
                                  "value": 2295394,
                                  "description": "N, total number of documents with field",
                                  "details": []
                                }
                              ]
                            }
                          ]
                        },
                        {
                          "value": 0.347602,
                          "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details": [
                            {
                              "value": 1,
                              "description": "phraseFreq=1.0",
                              "details": []
                            },
                            {
                              "value": 1.2,
                              "description": "k1, term saturation parameter",
                              "details": []
                            },
                            {
                              "value": 0.75,
                              "description": "b, length normalization parameter",
                              "details": []
                            },
                            {
                              "value": 216,
                              "description": "dl, length of field (approximate)",
                              "details": []
                            },
                            {
                              "value": 123.28348,
                              "description": "avgdl, average length of field",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }

### Expected behavior

Expansions that are not part of the document should not contribute to its score.


