Skip to content

Use Collector.setWeight to improve aggregation performance (for special cases) #10954

@msfroh

Description

@msfroh

Lucene added a new setWeight method to the Collector interface a while back (see https://issues.apache.org/jira/browse/LUCENE-10620), specifically to give collectors access to the Weight.count() method.

Weight.count() only has a few cases where it returns things other than -1 (the value meaning "I can't give you a cheap count"), but the cases where it does return are pretty useful -- mostly "match all" or "match none", but for a single term query will return "I match exactly this many", if there are no deletions in the current segment (since it just reads the term's doc freq).

I believe this can be useful to short-circuit some aggregation logic, since aggregations all extend Collector.

These are the special cases that I've been able to think of where the weight.count(leafReaderContext) could hint at smarter computation of aggregations:

  1. If the top-level query matches nothing int the current segment (i.e. weight.count(leafReaderContext) == 0), then count for every bucket is 0. (If the min count is greater than 0, then you don't need to compute any buckets for this segment.)
  2. If the top-level query matches everything in the current segment (i.e. weight.count(leafReaderContext) == leafReaderContext.reader().maxDoc()), then the count of hits in a bucket (from the current segment) is determined entirely by the count of the bucket, which may be cheap to compute (e.g. doc freq for a terms aggregation, maybe read count from the BKD tree for a range aggregation).
  3. If the top-level query has some other positive count, but a bucket matches everything in the current segment (e.g. the documents in the current segment are all from the same day and we're computing a daily date histogram), then the bucket count is weight.count(leafReaderContext).

I didn't give it a lot of thought, so there might be some more that I'm missing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Search:PerformanceenhancementEnhancement or improvement to existing feature or requestv2.13.0Issues and PRs related to version 2.13.0v3.0.0Issues and PRs related to version 3.0.0

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions