Skip to content

[Feature Request] Set terminate_after to trackTotalHitsUpTo for speedup in boolean queries with only filter clauses #18510

@peteralfonsi

Description

@peteralfonsi

Is your feature request related to a problem? Please describe

Background

This idea is vaguely inspired by Lucene's ImpactsDISI + MaxScoreCache. On eligible queries, Lucene allows doc id set iterators to start skipping blocks of docs that have non-competitive scores, once the collector for the query already has trackTotalHitsUpTo hits (10k by default). This is acceptable since after trackTotalHitsUpTo hits, we've given up counting the total number of hits, and there's no reason to check if non-competitive docs match the query at all. If docs have equal scores, ties are broken by picking the one with lower doc id, which means later docs can safely be skipped.

This only applies in certain cases (exactly 1 must clause which uses impacts, plus stuff about its cost relative to other clauses) but when it works it can have huge performance improvements since it can let you skip most documents. On simple 2-clause http_logs query this can be as much as a 90x speedup (2494 -> 28 ms).
Example query:

"bool": {
  "filter": {
    "match": {
      "status": "200"
    }
  },
  "must": {
    "match": {
      "request": "images"
    }
  }
}

Idea

Lucene does not skip non-competitive docs in an all-filter case. (It's using ScoreMode.COMPLETE_NO_SCORES instead of ScoreMode.TOP_SCORES). So, the above query but with both clauses being filter can take much longer than 28 ms.

But, in a filter clause, all documents are equally competitive, since scores don't matter. Like before, we break ties by choosing the doc with lower doc id, and we don't care about counting exact hits past trackTotalHitsUpTo. So, as I understand it, there's no benefit in continuing to iterate past trackTotalHitsUpTo matching documents. If we stopped early we should get similar speedups that ImpactsDISI gives us.

(This doesn't apply if we're doing aggs, pagination, sorting, or other similar things, but for a simple query I think it's correct).

I think this speedup would apply to a much broader class of boolean queries than ImpactsDISI especially since we're planning on automatically rewriting certain must clauses --> filter clauses as described in #17586.

Describe the solution you'd like

We already have a mechanism to stop iteration early after a certain number of collected docs: the query param terminate_after. This lives in the SearchContext.

Ideally there would be some way for a top-level boolean query to signal to the SearchContext to set this value to trackTotalHitsUpTo if, after rewriting is done, there are only filter clauses. (I think this should also still work if there are must_not clauses). This should be possible at least in theory since QueryBuilder rewriting and conversion to Query is completed before we construct the Collector, which is the thing that needs to accept terminate_after.

I ran a benchmark testing the same all-filter queries with and without terminate_after and found some good speedups. Details in the Additional Context section.

Related component

Search:Performance

Describe alternatives you've considered

This could also be implemented at the Lucene level, but it might be tricker since I think there's no shared SearchContext type of object, and it seems tricky to get queries to talk to the parts of the code building the collectors.

Additional context

Benchmark results on http_logs are below. I didn't implement this suggestion yet, I'm just running some all-filter queries in OSB (branch link) and explicitly setting terminate_after on some of them. I also ran some where terminate_after is 1k instead of the default 10k to see if that'd have a significantly larger impact than 10k.

Query Current p50 (ms) p50 with terminate_after=10k Speedup p50 with terminate_after=1k Speedup
status matches 200 & request matches "images" 851 12.6 68x 10.5 81x
status matches 404 & request matches "images" 40.7 15.5 2.6x 12.4 3.3x
@timestamp between 6/10-6/13 & request matches "images" 454 10.9 42x 10.6 43x
request.raw in list of top 10 terms & @timestamp between 6/10-6/13 216 9.2 23x 8.1 27x

We can see the speedup is greatest when the queries matched a lot of docs, like status=200.

Using 1k instead of the default 10k helps a bit but clearly the main effect is from enabling terminate_after at all.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

Status

🆕 New

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions