🐛 fix(search-api): stable result order and configurable per-file hit limit #4935
Merged
vladak merged 3 commits into oracle:master on Apr 22, 2026
Author (Contributor):
@vladak this should be ready for review now, thanks!
vladak (Member) reviewed Apr 16, 2026:
Is this possibly related to #3239?
vladak
requested changes
Apr 16, 2026
…limit

Two bugs caused /api/v1/search to return different results than the HTML search page for the same query, breaking clients migrating from HTML scraping to the REST API. Also adds maxhitsperfile parameter for callers that need to bound response size.

Collectors.groupingBy() uses HashMap so the file order in the JSON results object is non-deterministic. With any maxresults cap, different orderings produce different result sets across calls. Switching to LinkedHashMap preserves the Lucene scoring order consistent with the sort parameter.

SearchEngine applied limit = nhits > 100 to cap matching lines per file when a query matches many documents, a cap the HTML page never applied. For a heavily-referenced symbol this meant the API silently dropped most matching lines per file. The fix replaces the boolean cap with a maxhitsperfile query parameter (default 0 = unlimited, matching the HTML page). Callers that need to bound per-file hits can pass a positive value.

The apiary documents both the result ordering guarantee and the new maxhitsperfile parameter.
Move hit-per-file limiting from post-hoc trimming in SearchEngine into Context.getContext() via a new maxHits parameter. Fix apiary Note line that was parsed as a parameter. Use QueryParameters constants in tests, add URL null check, extract variables, improve test names and comments.
The note text between Parameters and Response sections was parsed as an unrecognized block by drafter. Action descriptions must precede the Parameters section in API Blueprint format.
Member:
Thanks!
Two bugs in /api/v1/search produce results that disagree with the HTML search page for identical queries, making the API unsuitable as a drop-in replacement for clients migrating off HTML scraping.
Search results change between calls with identical parameters
Running the same search twice with maxresults=20 can return different files each time, even when the index has not changed. A pipeline paging through results risks missing files or processing duplicates. This happens because the API returns files in an arbitrary order that varies between JVM runs, so trimming at maxresults picks a different subset on each call. After this fix, files always appear in the order determined by the sort parameter, making repeated calls with the same parameters return the same results.
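The first commit message attributes the unstable order to Collectors.groupingBy() defaulting to a HashMap. A minimal sketch of the LinkedHashMap variant follows; the Hit record and the groupByPath helper are hypothetical stand-ins for illustration, not OpenGrok's actual classes, and records require Java 16 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class GroupHitsInOrder {
    // Hypothetical stand-in for a search hit; not OpenGrok's real Hit class.
    record Hit(String path, int lineNo) {}

    // Groups hits by file path. Passing LinkedHashMap::new as the map factory
    // keeps the keys in first-encounter order, so the files come out in the
    // same order the (already sorted) hit list delivered them.
    static Map<String, List<Hit>> groupByPath(List<Hit> hitsInSortOrder) {
        return hitsInSortOrder.stream().collect(Collectors.groupingBy(
                Hit::path,
                LinkedHashMap::new,
                Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("/src/zeta.c", 10),
                new Hit("/src/alpha.c", 3),
                new Hit("/src/zeta.c", 42),
                new Hit("/src/mid.c", 7));
        // prints [/src/zeta.c, /src/alpha.c, /src/mid.c]
        System.out.println(groupByPath(hits).keySet());
    }
}
```

With the single-argument groupingBy(Hit::path) the key order is whatever the HashMap iteration yields, which the Java documentation leaves unspecified.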
Searches return fewer matching lines per file than expected
When a symbol is referenced many times in a file, the API returns only some of those references. A symbol used 25 times in a file shows up as 10 hits via the API but all 25 via the HTML page, with no indication in the response that lines were dropped. This happens because an internal server-side limit kicks in whenever a query matches more than 100 documents in total, silently capping the per-file hit count. After this fix, the default behavior returns all matching lines per file, consistent with the HTML page.
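The difference between the old and new behavior can be sketched as two small helpers. This is an illustration of the semantics only: the names legacyCap and limitHits are hypothetical, the cap of 10 is taken from the example above rather than from the code, and per the second commit the real limiting happens inside Context.getContext() rather than as post-hoc trimming.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class PerFileLimit {
    // Assumed value, inferred from the "25 references, 10 returned" example.
    static final int LEGACY_CAP = 10;

    // Old behavior as described: the cap switches on only when the query
    // matched more than 100 documents in total.
    static <T> List<T> legacyCap(List<T> lines, int totalMatchingDocs) {
        boolean limit = totalMatchingDocs > 100;
        return limit && lines.size() > LEGACY_CAP ? lines.subList(0, LEGACY_CAP) : lines;
    }

    // New behavior: 0 (or any non-positive value) means unlimited,
    // a positive value caps the number of hits for each file.
    static <T> List<T> limitHits(List<T> lines, int maxHitsPerFile) {
        if (maxHitsPerFile <= 0 || lines.size() <= maxHitsPerFile) {
            return lines;
        }
        return lines.subList(0, maxHitsPerFile);
    }

    public static void main(String[] args) {
        List<Integer> lines = IntStream.rangeClosed(1, 25).boxed().collect(Collectors.toList());
        System.out.println(legacyCap(lines, 101).size()); // prints 10
        System.out.println(limitHits(lines, 0).size());   // prints 25
        System.out.println(limitHits(lines, 10).size());  // prints 10
    }
}
```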
Performance implications. Removing the per-file cap can significantly increase response size for broad queries over large codebases: a query matching thousands of files, each with dozens of hits, will now return every hit in every file returned. The new maxhitsperfile parameter (default 0 = unlimited) lets callers trade completeness for bounded response size: maxhitsperfile=10 caps each file at 10 hits regardless of total match count. The existing maxresults parameter bounds the number of files returned; maxhitsperfile bounds the lines per file. Combining the two puts a predictable upper bound on response size: at most maxresults files, each with at most maxhitsperfile hits.
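As a rough client-side sketch, a caller could combine both parameters like this. The host name, the use of a symbol query field, and the helper names are assumptions for illustration; only the /api/v1/search path and the maxresults and maxhitsperfile parameter names come from this PR.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;

final class BoundedSearchRequest {
    // Builds a search URI with both bounds set. baseUrl is a hypothetical
    // deployment root; "symbol" as the query field is an assumption.
    static URI searchUri(String baseUrl, String symbol, int maxResults, int maxHitsPerFile) {
        String query = "symbol=" + URLEncoder.encode(symbol, StandardCharsets.UTF_8)
                + "&maxresults=" + maxResults
                + "&maxhitsperfile=" + maxHitsPerFile;
        return URI.create(baseUrl + "/api/v1/search?" + query);
    }

    // Upper bound on returned hit lines: files * hits per file.
    // Empty when maxHitsPerFile is 0, because 0 means unlimited.
    static OptionalLong maxHitLines(int maxResults, int maxHitsPerFile) {
        if (maxHitsPerFile <= 0) {
            return OptionalLong.empty();
        }
        return OptionalLong.of((long) maxResults * maxHitsPerFile);
    }

    public static void main(String[] args) {
        // prints https://opengrok.example.org/source/api/v1/search?symbol=main&maxresults=20&maxhitsperfile=10
        System.out.println(searchUri("https://opengrok.example.org/source", "main", 20, 10));
        // prints OptionalLong[200]
        System.out.println(maxHitLines(20, 10));
    }
}
```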