Do not pass negative scores into function_score or script_score queries by msfroh · Pull Request #13627 · opensearch-project/OpenSearch

msfroh · 2024-05-10T22:05:57Z

In theory, Lucene scores should never go negative. To stop users from writing function_score and script_score queries that return negative values, we explicitly check their outputs and throw an exception when negative.

Unfortunately, due to a subtle, more complicated bug in multi_match queries, sometimes those might (incorrectly) return negative scores.

While that problem is also worth solving, we should protect function and script scoring from throwing an exception just for passing through a negative value that they had no hand in computing.

Description

[Describe what this change achieves]

Related Issues

Resolves #7860

Check List

New functionality includes testing.
- All tests pass
~~New functionality has been documented.~~
- ~~New functionality has javadoc added~~
~~API changes companion pull request created.~~
Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)
~~Public documentation issue/PR created~~

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-05-10T23:03:59Z

❌ Gradle check result for 0f6c9b1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-05-13T18:09:00Z

❕ Gradle check result for 1081bb9: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2024-05-13T18:10:55Z

Codecov Report

Attention: Patch coverage is 33.33333% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 71.49%. Comparing base (b15cb0c) to head (528bbd5).
Report is 309 commits behind head on main.

❗ Current head 528bbd5 differs from pull request most recent head 983e43c

Please upload reports for the commit 983e43c to get more accurate results.

Files	Patch %	Lines
...on/lucene/search/function/ScriptScoreFunction.java	0.00%	4 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #13627      +/-   ##
============================================
+ Coverage     71.42%   71.49%   +0.07%     
- Complexity    59978    61132    +1154     
============================================
  Files          4985     5059      +74     
  Lines        282275   287523    +5248     
  Branches      40946    41646     +700     
============================================
+ Hits         201603   205560    +3957     
- Misses        63999    64970     +971     
- Partials      16673    16993     +320

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-05-13T20:23:25Z

❌ Gradle check result for 31c61c7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

mingshl · 2024-05-13T20:23:58Z

modules/lang-painless/src/yamlRestTest/resources/rest-api-spec/test/painless/30_search.yml

+              }]
+          explain: true
+  - match: { hits.total.value: 3 }
+  - match: { hits.hits.2._score: 0.0 }


nice, so the functional score on a document will be at least zero, instead of negative. This will fix the score per document level.

Wondering if this also fix the per field level score is negative? can we run profile api on yaml test and check perFieldSimilarity score is negative or zero?

The logic focuses on the input score passed to a function or script, regardless of where that score came from.

Both before and after this change, we have checks that the score returned by a function or script is not negative (and throw an exception if it is). Unfortunately, this meant that a function/script that just passes a score through would throw an exception if the input score is negative.

So, for any case where a function or script gets an illegal negative value as input, this change defensively treats it as zero.

awesome! thank you

msfroh · 2024-05-13T21:39:46Z

Gradle check failed due to #13600. Retrying...

github-actions · 2024-05-13T22:11:54Z

❌ Gradle check result for 31c61c7:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

reta · 2024-05-14T00:01:34Z

server/src/main/java/org/opensearch/common/lucene/search/function/ScriptScoreFunction.java

+        public void score(float subScore) {
+            // We check to make sure the script score function never makes a score negative, but we need to make
+            // sure the script score function does not receive negative input.
+            this.score = Math.max(0.0f, subScore);


@msfroh don't we try to hide the problem instead of fixing the source of negative scoring? As you mentioned, we do catch that in some cases (and thrown an exception) but it slips in other, why cannot we fix that?

There was some discussion about this on #7860.

Essentially, the scoring logic in cross_fields match is pretty complicated. Actually, the Elasticsearch documentation for cross_fields even contains the following warning:

The cross_fields type blends field statistics in a complex way that can be hard to interpret. The score combination can even be incorrect, in particular when some documents contain some of the search fields, but not all of them.

As @ylwu-amzn called out on the issue, any change we make to the cross_fields scoring formula could impact people's "regular" search results. We'd need to be extremely careful.

Meanwhile, this change just turns a query that was throwing an exception (which is what people have complained about) into one that doesn't.

We could possibly wrap the query returned from MatchQueryBuilder in a custom scorer that clamps negative values to zero. Functionally that's what happens if tie_breaker is 0. I would still include the changes in this PR as an extra layer of protection.

@reta -- I spent some more time today trying to prevent the negative score from bubbling up. So far as I can tell, this is the query that's returning the negative scores:

OpenSearch/server/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java

Line 410 in 4a42150

return new DisjunctionMaxQuery(queries, tieBreakerMultiplier);

Unfortunately, that just returns a Lucene DisjunctionMaxQuery, which normally does the correct thing. The problem arises because of the adjustDF and/or the adjustTTF methods. Let's pretend that Lucene term stats are something other than what they really are... yay!

I believe it's specifically adjustTTF method (called from the blend method) that breaks us, because it lowers the sumTotalTermFreq to the minimum of sumTotalTermFreq across all of the fields, which may be lower than any individual field's term's max termFreq (from another field). In theory, the shared sumTotalTermFreq should never be lower than the maximum of termFreq for any doc across all fields, but we don't have a good way of getting that value without iterating through all documents matched by the terms. I'd love to add a line just before here saying minSumTTF = Math.max(minSumTTF, maxTermFreq).

tl;dr: Preventing the negative scores from getting into function/script scoring functions is hard. This cross_field scoring is a complicated mess. While it is the root cause, the immediate pain point is not the negative scores, but that the negative scores make perfectly valid function/script scoring throw an exception.

I'd like to commit this fix as-is to address the immediate problem. We can address the complicated cross_field logic down the road.

When we enter the blend() method for the integ test I wrote, we start off with color:red with df=3, sumTtf=3, while shape:red has df=1,sumTtf=1. Also, maxDoc is 3.

We exit the first loop with minSumTTF=2 and max = 3. We set maxDoc to 2 (since it's bigger than minSumTTF).

Later, we set actualDf to 2.

The relevant piece of the explain output is:

{ "value" : -0.22314355, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 1, "description" : "N, total number of documents with field", "details" : [ ] } ] },

That n, number of documents containing the term came from actualDf.

I think I can actually fix it by guaranteeing that actualDf is never greater than N, which I can get from reader.getDocCount(terms[i].field()).

So, after saying that fixing the underlying problem is too hard, I think I can fix the underlying problem instead. 😁

Here we go: #13829

It may still be worth merging this PR as an extra layer of protection, but I'd need to update / remove the YAML REST test, because we no longer get a negative score that we force to zero. Alternatively, we could just close this PR, since the underlying problem is (probably) fixed.

Thanks a lot for looking into this @msfroh (I learn quite a few things from your clear explanation)

github-actions · 2024-05-14T18:14:23Z

❕ Gradle check result for 528bbd5: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

In theory, Lucene scores should never go negative. To stop users from writing `function_score` and `script_score` queries that return negative values, we explicitly check their outputs and throw an exception when negative. Unfortunately, due to a subtle, more complicated bug in multi_match queries, sometimes those might (incorrectly) return negative scores. While that problem is also worth solving, we should protect function and script scoring from throwing an exception just for passing through a negative value that they had no hand in computing. Signed-off-by: Michael Froh <froh@amazon.com>

github-actions · 2024-05-25T02:16:55Z

❌ Gradle check result for 983e43c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

msfroh · 2024-05-31T00:56:23Z

Canceling this one in favor of #13829

Thanks again @reta for pushing me to solve the underlying problem!

github-actions bot added bug Something isn't working Search:Relevance labels May 10, 2024

msfroh force-pushed the suppress_negative_input_scores branch from 0f6c9b1 to 1081bb9 Compare May 13, 2024 17:17

msfroh added the backport 2.x Backport to 2.x branch label May 13, 2024

msfroh marked this pull request as ready for review May 13, 2024 18:27

msfroh requested review from Bukhtawar, CEHENKLE, Rishikesh1159, VachaShah, anasalkouz, andrross, dblock, dbwiddis, dreamer-89, gbbafna, kotwanikunal, mch2, nknize, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja, sohami and tlfeng as code owners May 13, 2024 18:27

msfroh force-pushed the suppress_negative_input_scores branch from 1081bb9 to 31c61c7 Compare May 13, 2024 19:31

mingshl reviewed May 13, 2024

View reviewed changes

reta reviewed May 14, 2024

View reviewed changes

msfroh force-pushed the suppress_negative_input_scores branch from 31c61c7 to 528bbd5 Compare May 14, 2024 17:15

msfroh force-pushed the suppress_negative_input_scores branch from 528bbd5 to 983e43c Compare May 25, 2024 01:26

msfroh closed this May 31, 2024

prudhvigodithi mentioned this pull request Jun 18, 2024

[AUTOCUT] Gradle Check Flaky Test Report for IndicesRequestCacheIT prudhvigodithi/OpenSearch#27

Closed

This was referenced Jun 27, 2024

[AUTOCUT] Gradle Check Flaky Test Report for IndicesRequestCacheIT #14288

Open

[AUTOCUT] Gradle Check Flaky Test Report for ResourceAwareTasksTests #14293

Closed

Conversation

msfroh commented May 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Check List

Uh oh!

github-actions bot commented May 10, 2024

Uh oh!

github-actions bot commented May 13, 2024

Uh oh!

codecov bot commented May 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions bot commented May 13, 2024

Uh oh!

mingshl May 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msfroh May 13, 2024

Choose a reason for hiding this comment

Uh oh!

mingshl May 13, 2024

Choose a reason for hiding this comment

Uh oh!

msfroh commented May 13, 2024

Uh oh!

github-actions bot commented May 13, 2024

Uh oh!

reta May 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msfroh May 14, 2024

Choose a reason for hiding this comment

Uh oh!

msfroh May 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

msfroh May 25, 2024

Choose a reason for hiding this comment

Uh oh!

msfroh May 25, 2024

Choose a reason for hiding this comment

Uh oh!

reta May 27, 2024

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented May 14, 2024

Uh oh!

github-actions bot commented May 25, 2024

Uh oh!

msfroh commented May 31, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

msfroh commented May 10, 2024 •

edited

Loading

codecov bot commented May 13, 2024 •

edited

Loading

mingshl May 13, 2024 •

edited

Loading

reta May 14, 2024 •

edited

Loading

msfroh May 25, 2024 •

edited

Loading