Avoid negative scores returned from multi_match query with cross_fields #13829
Merged
msfroh merged 4 commits into opensearch-project:main on May 31, 2024
Contributor
❌ Gradle check result for a50898c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
msfroh
commented
May 25, 2024
Contributor
Author
For some context, I came up with this fix after talking myself through the logic of the (previously) failing test in https://github.com/opensearch-project/OpenSearch/pull/13627/files#r1614240922
Contributor
@msfroh mind please backport to 2.x manually? thank you
msfroh
added a commit
to msfroh/OpenSearch
that referenced
this pull request
Jun 5, 2024
…lds` (opensearch-project#13829)

Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency calculation in the BM25 formula.

Specifically, the IDF is calculated as:

```
log(1 + (N - n + 0.5) / (n + 0.5))
```

where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Obviously, `n` should always be less than or equal to `N`.

Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields.

This change finds the (nonzero) value of `N` for each field and uses that as an upper bound for the new value of `n`.

Signed-off-by: Michael Froh <froh@amazon.com>

---------

Signed-off-by: Michael Froh <froh@amazon.com>
(cherry picked from commit fffd101)
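To make the arithmetic in the commit message concrete, here is a minimal Python sketch of the quoted IDF expression. The function name `bm25_idf` and the sample counts are illustrative assumptions, not code from OpenSearch or Lucene. It shows that the value stays positive while `n <= N` and drops below zero once `n` exceeds `N`.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF as quoted in the commit message: log(1 + (N - n + 0.5) / (n + 0.5)).

    doc_count is N (documents containing the field) and doc_freq is n
    (documents containing the term in that field). Both names are
    hypothetical and chosen only for this illustration.
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Consistent statistics (n <= N): the ratio is positive, so the log is positive.
print(bm25_idf(10, 5) > 0)

# Inconsistent statistics (n > N), as can happen when one blended n is
# reused for a field with fewer documents: the log turns negative.
print(bm25_idf(10, 50) < 0)
```

Because `1 + (N - n + 0.5) / (n + 0.5)` simplifies to `(N + 1) / (n + 0.5)`, the logarithm is always defined for non-negative counts; it is only its sign that flips when `n` grows past `N`.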
Contributor
Author
Backport PR is ready: #13983
msfroh
added a commit
that referenced
this pull request
Jun 6, 2024
parv0201
pushed a commit
to parv0201/OpenSearch
that referenced
this pull request
Jun 10, 2024
…lds` (opensearch-project#13829)
kkewwei
pushed a commit
to kkewwei/OpenSearch
that referenced
this pull request
Jul 24, 2024
…lds` (opensearch-project#13829) (opensearch-project#13983) Signed-off-by: kkewwei <kkewwei@163.com>
wdongyu
pushed a commit
to wdongyu/OpenSearch
that referenced
this pull request
Aug 22, 2024
…lds` (opensearch-project#13829)
Description
Under specific circumstances, when using `cross_fields` scoring on a `multi_match` query, we can end up with negative scores from the inverse document frequency (IDF) calculation in the BM25 formula.

Specifically, the IDF is calculated as `log(1 + (N - n + 0.5) / (n + 0.5))`, where `N` is the number of documents containing the field and `n` is the number of documents containing the given term in the field. Every document that contains the term in a field also contains that field, so `n` should always be less than or equal to `N`.

Unfortunately, `cross_fields` makes up a new value for `n` and tries to use it across all fields. For a field where that value exceeds `N`, the argument of the logarithm falls below 1 and the IDF becomes negative.

This change finds the minimum (nonzero) value of `N` and uses that as an upper bound for the new value of `n`.

Related Issues
Resolves #7860
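The bounding step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code from this pull request: the function name `clamp_blended_doc_freq`, its parameters, and the sample counts are all hypothetical. It caps a blended document frequency at the smallest nonzero per-field document count, so that `n <= N` holds for every field that has documents.

```python
from typing import Sequence


def clamp_blended_doc_freq(blended_df: int, field_doc_counts: Sequence[int]) -> int:
    """Cap a blended document frequency (n) at the minimum nonzero N.

    field_doc_counts holds one N per field, i.e. the number of documents
    that contain that field. Fields with N == 0 are ignored, because
    clamping against them would force n to zero.
    """
    nonzero_counts = [count for count in field_doc_counts if count > 0]
    if not nonzero_counts:
        # No field has any documents, so there is nothing to clamp against.
        return blended_df
    return min(blended_df, min(nonzero_counts))


# A blended n of 50 is capped at 10, the smallest nonzero field doc count.
print(clamp_blended_doc_freq(50, [0, 10, 200]))

# A blended n that is already within bounds is returned unchanged.
print(clamp_blended_doc_freq(5, [0, 10, 200]))
```

Using the minimum across fields is the most conservative single bound: one clamped value of `n` is then safe to reuse for every field, at the cost of understating the document frequency for fields with many more documents.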
Check List
- New functionality has been documented.
- New functionality has javadoc added.
- API changes companion pull request created.
- Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.