Description
It is a common need to run some logic after a segment has been collected. Even though I can't find previous instances of this discussion, I'm pretty sure that this has been raised several times in the past, and the answer was essentially that this logic can easily be implemented on top of Lucene. One good example of this is our own FacetsCollector, which collects the set of matching docs per segment: getLeafCollector appends the set of doc IDs that were collected on the previous segment to the set, and getMatchingDocs takes care of the last segment, since getLeafCollector doesn't get called anymore after the last segment has been collected.
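To make the pattern concrete, here is a minimal sketch of it with no Lucene dependency. The class name, the IntConsumer-based leaf collector and the int[] result type are simplified stand-ins chosen for illustration; this is not FacetsCollector's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntConsumer;

/** Hypothetical stand-in for a collector that buffers doc IDs per segment. */
class SegmentBufferingCollector {
    private final List<int[]> matchingDocs = new ArrayList<>();
    private int[] buffer; // doc IDs of the segment currently being collected
    private int size;

    /** Called once per segment; first flushes whatever the previous segment collected. */
    IntConsumer getLeafCollector() {
        finishCurrentSegment();
        buffer = new int[4];
        size = 0;
        return doc -> {
            if (size == buffer.length) {
                buffer = Arrays.copyOf(buffer, size * 2);
            }
            buffer[size++] = doc;
        };
    }

    /** Flushes the last segment, since getLeafCollector() is not called again after it. */
    List<int[]> getMatchingDocs() {
        finishCurrentSegment();
        return matchingDocs;
    }

    /** The post-collection work: here, trimming and sorting the buffered doc IDs. */
    private void finishCurrentSegment() {
        if (buffer != null) {
            int[] docs = Arrays.copyOf(buffer, size);
            Arrays.sort(docs);
            matchingDocs.add(docs);
            buffer = null;
        }
    }
}
```

Note that the work for the last segment only happens when the caller asks for the results, on whatever thread the caller is using.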
However, this approach is not perfect. If you are leveraging Lucene's concurrent search capabilities, this forces the post-collection logic to run in the current thread for at least one segment per slice, instead of using the executor. This is a missed opportunity for search concurrency, since post-collection logic is not always cheap. For instance, in the case of FacetsCollector it needs to run DocIdSetBuilder.build(), which may need to sort a large array of doc IDs. Having a LeafCollector.postCollect() API or something along these lines would help address this issue, as postCollect() would get called on the IndexSearcher's executor.
I looked at our collectors to get a sense of how many of them could take advantage of a postCollect() hook and found the following ones:
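As a rough illustration of the idea, the sketch below shows what such a hook could look like. Everything in it is an assumption made for the example: the Leaf interface, the SortingLeaf class and the search() driver are hypothetical and Lucene-free, and for simplicity it submits one task per segment rather than one per slice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class PostCollectSketch {

    /** Hypothetical leaf collector with the proposed hook; not an existing Lucene interface. */
    interface Leaf {
        void collect(int doc);

        /** Called once after the last doc of the segment; no-op by default. */
        default void postCollect() {}
    }

    /** Buffers doc IDs while collecting and defers the sort to postCollect(). */
    static final class SortingLeaf implements Leaf {
        private int[] docs = new int[4];
        private int size;
        int[] sorted;             // set by postCollect()
        String postCollectThread; // name of the thread that ran postCollect()

        @Override
        public void collect(int doc) {
            if (size == docs.length) {
                docs = Arrays.copyOf(docs, size * 2);
            }
            docs[size++] = doc;
        }

        @Override
        public void postCollect() {
            int[] copy = Arrays.copyOf(docs, size);
            Arrays.sort(copy);
            sorted = copy;
            postCollectThread = Thread.currentThread().getName();
        }
    }

    /** Collects each segment in its own task and calls postCollect() inside that same task. */
    static List<SortingLeaf> search(int[][] segments, ExecutorService executor) {
        List<SortingLeaf> leaves = new ArrayList<>();
        List<Future<?>> futures = new ArrayList<>();
        for (int[] segment : segments) {
            SortingLeaf leaf = new SortingLeaf();
            leaves.add(leaf);
            futures.add(executor.submit(() -> {
                for (int doc : segment) {
                    leaf.collect(doc);
                }
                leaf.postCollect(); // runs on the executor, not on the calling thread
            }));
        }
        for (Future<?> future : futures) {
            try {
                future.get(); // also makes the fields written by the task visible here
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return leaves;
    }
}
```

With a hook like this, the sort for every segment, including the last one, happens on a pool thread, and the caller only gathers results that are already finished.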
org.apache.lucene.facet.FacetsCollector
org.apache.lucene.search.grouping.BlockGroupingCollector
org.apache.lucene.search.grouping.TermGroupFacetCollector
org.apache.lucene.search.suggest.document.TopSuggestDocsCollector
org.apache.lucene.search.CachingCollector