feat(search): support code search by zoekt #33850
There are already so many search engines built into Gitea, and many of them have various bugs. So the questions are:
To be honest, I prefer this zoekt search engine to the existing search engine.
Maybe this can replace bleve, but we need some comparison tests.
That's understandable. So a few months later, someone else feels "yoekt" is better and introduces "yoekt"; a few months after that, someone feels "xoekt" is better and introduces "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains every search engine on the internet. I am not objecting to introducing improvements. But actually it needs to:
So a clear roadmap for the "search engine plan" is necessary.
In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. see https://docs.gitlab.com/user/search
I'm not too worried about this; Gitea should have good community maintenance. It might be because the code search functionality is not exposed by default, so many bugs haven't been discovered.
Well, do you know how many search engines are in Gitea now? And what longstanding bugs do they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search And some bugs haven't been fixed in months, for example: "Search Functionality Issues with Bleve Engine #31565". I don't see "good community maintenance".
You don't need to worry about this: zoekt is a popular code search engine, currently used by code platforms like Gerrit, Sourcegraph, and GitLab. It was written by the Gerrit author and is maintained by Sourcegraph. Zoekt has advantages that traditional search engines (like ES) do not possess: support for regex matching, substring search, etc. I don't think any new open-source code search engine will be able to replace it in the short term.
You are right, where should the roadmap be written? I don't have experience with this. I will supplement its documentation when the zoekt functionality is more complete.
Yep, if zoekt wins, we need to drop some others.
Sure, it's regrettable that this part of the content is unmaintained. However, for the zoekt code search, I can commit to maintaining it thoroughly.
Yeah, I hope this can be divided into at least two steps:
Zoekt may also have some issues, as GitLab has not completely deprecated ES and fully switched to Zoekt...
Force-pushed from 17d7c30 to 212fc79.
To make the code clear, we need to refactor the related code first: "Refactor issue & code search" #33860. Each "indexer" should provide the "search modes" it supports by itself. And we need to remove the "fuzzy" search for code.
Please note that I have many other commitments over the next two weeks, so I may only be able to dedicate time to this MR in a couple of weeks.
Force-pushed from 783ee0e to 374ce10.
Force-pushed from 850a16a to 86ef977.
Force-pushed from 86ef977 to 9906c5f.
Force-pushed from 82d0d38 to e1ae522.
@lunny @wxiaoguang I simply updated to the latest version of zoekt, and it seems to be working properly now. Could you please review the code again? Thank you.
I'm not sure I'm suited to review this. I don't use the "search" feature, so I don't understand how end users really use it. I also don't know whether these search engines work the same way, I don't understand zoekt, and I don't see answers to my questions / concerns (for example: https://github.com/go-gitea/gitea/pull/33850/files#r2046765133). And it seems there are no tests.
Any chance of having this in Gitea?
Somebody review this and get this approved please!
There are still some things missing in this PR.
I'll go to the zoekt community to ask if version management can be supported.
I think these are all harmless minor TODOs that we could perhaps deploy first and address later.
Agree, will change.
I will investigate this issue.
I found this bug: sourcegraph/zoekt#1001, golang/go#75361
You don’t need Zoekt to provide a version check. Similar to how Bleve’s versioning is managed directly by Gitea, we can rely on Gitea itself to track whether the index format has changed based on release notes during upgrades. Of course, it would still be beneficial if Zoekt exposed its own format version, but it’s not strictly required.
modules/indexer/code/zoekt/zoekt.go (Outdated)
        return err
    }

    repoPathPrefix := repo.OwnerName + "%2F" + repo.Name
Won't this leave a lot of garbage behind when a repository is renamed or moved? Why not just use the repository's ID as the file or directory name? And since there will be a lot of files, it would be better to use a two-level directory layout so that no single directory holds too many files.
After thinking about it briefly, this might be related to zoekt's ability to search using the repo:foo/bar syntax. However, using the repository ID might be better here, as it would be friendlier to delete and move operations.
I'm not sure if we can simply organize the index files into a directory structure, because zoekt might need to traverse and search the index files under a single directory. However, I will ask about this in the zoekt community.
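To make the two-level layout suggested above concrete, here is a minimal sketch in Go. It is only an illustration: `shardDirForRepo` and the `indexers/zoekt` root are hypothetical names, not code from this PR, and it assumes the searcher can be pointed at nested directories, which this thread leaves as an open question for the zoekt community.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// shardDirForRepo maps a repository ID to a two-level directory under root,
// so that no single directory accumulates every repository's index files.
// Hypothetical helper: neither the name nor the layout comes from the PR.
func shardDirForRepo(root string, repoID int64) string {
	level1 := fmt.Sprintf("%02x", repoID&0xff)      // lowest byte of the ID
	level2 := fmt.Sprintf("%02x", (repoID>>8)&0xff) // next byte of the ID
	return filepath.Join(root, level1, level2, fmt.Sprintf("%d", repoID))
}

func main() {
	// The path depends only on the ID, so renaming or transferring
	// the repository leaves its index directory unchanged.
	fmt.Println(shardDirForRepo("indexers/zoekt", 33850))
}
```

Because the path is derived from the ID alone, a rename or an ownership transfer would not orphan any index files, which is the concern raised in the review.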
modules/indexer/code/zoekt/zoekt.go (Outdated)
        finalQuery = query.NewAnd(finalQuery, langQuery)
    }

    // TODO: NEEDWORK: IncludePatterns/ExcludePatterns are glob patterns,
I don't think these comments are necessary. They can be part of future PRs. Or it can be recorded in the content of the PR as a next step.
I will remove it.
Alright, then we can track it through the Changelog you mentioned. However, I will push forward the zoekt support for that version as soon as possible.
Force-pushed from e1ae522 to 818dc53.
I think we can keep a Gitea-internal version for the zoekt data format, as we did for bleve: https://github.com/go-gitea/gitea/blob/main/modules/indexer/code/bleve/bleve.go#L75
Oh, I misunderstood what you meant. I thought you were referring to the git tag version, but it turns out you were talking about the data format version.
Signed-off-by: ZheNing Hu <[email protected]>
Force-pushed from 818dc53 to a78e276.
I looked into it, and the zoekt index version number is maintained here: Also, zoekt has implemented a backward compatibility mechanism for data: In short: if the data version is updated, zoekt will reindex.
Don't mind me. I am just glad to see this getting traction.
The data format has two layers of meaning: one defines how Zoekt stores the data, and the other defines how Gitea populates those fields. Because of this, we still need a version indicator. When Gitea adds new fields or modifies existing ones (such as changes to the Repo ID or other attributes), we need a clear way to detect these changes and determine that the indexes must be rebuilt.
In that case, we may need to store this version in the database. I'll investigate how to implement it.
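As a rough illustration of the version indicator discussed above, the check could look something like the sketch below. The constant `giteaZoektIndexVersion` and the function `needsReindex` are hypothetical names, and where the stored version lives (database or metadata file) is left open, as it is in the thread.

```go
package main

import "fmt"

// giteaZoektIndexVersion is a hypothetical Gitea-side version number that
// would be bumped whenever Gitea changes how it populates the index fields.
const giteaZoektIndexVersion = 1

// needsReindex reports whether an existing index must be rebuilt.
// found is false when no version has been recorded yet.
func needsReindex(storedVersion int, found bool) bool {
	return !found || storedVersion != giteaZoektIndexVersion
}

func main() {
	fmt.Println(needsReindex(0, false))                     // no version recorded yet
	fmt.Println(needsReindex(giteaZoektIndexVersion, true)) // index is current
}
```

This only covers the Gitea layer; the storage-format layer would still be handled by zoekt's own version handling mentioned earlier in the thread.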
Abstract
Zoekt is an open-source search engine designed specifically for code search, which uses 3-gram indexing for efficient segmentation. Replacing Elasticsearch/Bleve with Zoekt provides Gitea with precise code search capabilities and support for regular expression searches.
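For readers unfamiliar with 3-gram indexing, the sketch below shows the basic idea of splitting text into overlapping three-character windows. It is a simplified illustration in Go, not Zoekt's actual implementation; the function name `trigrams` is made up for this example.

```go
package main

import "fmt"

// trigrams returns the distinct 3-rune substrings of s in first-seen order.
// An index built on such units can narrow a substring query down to the
// documents that contain all of the query's trigrams.
func trigrams(s string) []string {
	runes := []rune(s)
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+3 <= len(runes); i++ {
		g := string(runes[i : i+3])
		if !seen[g] {
			seen[g] = true
			out = append(out, g)
		}
	}
	return out
}

func main() {
	fmt.Println(trigrams("zoekt")) // prints [zoe oek ekt]
}
```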
Motivation
The existing code search functionality is implemented using Elasticsearch/Bleve. Although Elasticsearch and Bleve excel in general search domains, their disadvantages in code search are obvious:
Proposal
Goals
Support precise substring searches
Support regex searches
Non-Goals
Support multi-branch searches
Support code symbol syntax searches
Competitive Product Analysis
Design
Index
Since Zoekt is written in Go, its API can be integrated directly through its Go package, using indexBuilder.Add() and indexBuilder.MarkFileAsChangedOrRemoved() to add or remove indexed files. The fundamental processes for implementing full and incremental repository indexing in Zoekt do not differ significantly from those in Elasticsearch (ES) or Bleve.
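One way to picture the incremental case is as a diff between two snapshots of a repository. The sketch below is an assumption-laden illustration: `planIncremental` and the path-to-hash maps are hypothetical and do not call Zoekt. In a real integration, the `stale` paths would presumably be passed to indexBuilder.MarkFileAsChangedOrRemoved() and the `add` paths to indexBuilder.Add().

```go
package main

import (
	"fmt"
	"sort"
)

// planIncremental compares two path->content-hash snapshots of a repository.
// stale lists paths whose old index entries are no longer valid (changed or
// deleted); add lists paths whose current content must be indexed (new or
// changed). Hypothetical helper, not part of the PR or of Zoekt.
func planIncremental(prev, curr map[string]string) (stale, add []string) {
	for path, oldHash := range prev {
		if newHash, ok := curr[path]; !ok || newHash != oldHash {
			stale = append(stale, path)
		}
	}
	for path, newHash := range curr {
		if oldHash, ok := prev[path]; !ok || oldHash != newHash {
			add = append(add, path)
		}
	}
	// Map iteration order is random, so sort for deterministic output.
	sort.Strings(stale)
	sort.Strings(add)
	return stale, add
}

func main() {
	prev := map[string]string{"a.go": "1", "b.go": "2", "c.go": "3"}
	curr := map[string]string{"a.go": "1", "b.go": "9", "d.go": "4"}
	stale, add := planIncremental(prev, curr)
	fmt.Println(stale, add)
}
```

A full index is then just the special case where `prev` is empty and every current file ends up in `add`.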
Search
We can use shards.NewDirectorySearcher() or shards.NewDirectorySearcherFast() to build a searcher for searching. The search modes will support:
Since the search is currently limited to a single repository, we will retrieve all the content first and then handle pagination.
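The "retrieve everything, then paginate" approach amounts to slicing the full result list in memory. A minimal sketch, assuming results are already collected into a slice (the function name `paginate` and the use of plain strings for results are illustrative only):

```go
package main

import "fmt"

// paginate returns the items belonging to the given 1-based page.
// It returns nil for out-of-range pages or non-positive arguments.
func paginate(items []string, page, pageSize int) []string {
	if page < 1 || pageSize < 1 {
		return nil
	}
	// Guard against overflow in (page-1)*pageSize for absurdly large pages.
	if page-1 > len(items)/pageSize {
		return nil
	}
	start := (page - 1) * pageSize
	if start >= len(items) {
		return nil
	}
	end := start + pageSize
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	results := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	fmt.Println(paginate(results, 2, 2)) // prints [c.go d.go]
}
```

This is simple but holds every match in memory, which is acceptable while search is scoped to a single repository and may need revisiting if the scope grows.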
Usage
Enable this in app.ini:
Resource Usage
Building the index in Zoekt requires 1.2 times the corpus size in RAM, and the index storage size is about three times the corpus size. Maybe we should expose some of Zoekt's internal Prometheus metrics in the future?
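To put those ratios into numbers, the small Go sketch below applies the 1.2x RAM and 3x storage factors quoted above to a corpus size. The function `estimateZoektResources` is a hypothetical helper for back-of-the-envelope planning, and the factors are taken as stated here rather than measured.

```go
package main

import "fmt"

// estimateZoektResources applies the ratios quoted in this description:
// about 1.2x the corpus size in RAM while building, and about 3x the
// corpus size on disk for the finished index. Integer math avoids
// floating-point rounding surprises.
func estimateZoektResources(corpusBytes int64) (ramBytes, diskBytes int64) {
	ramBytes = corpusBytes * 12 / 10
	diskBytes = corpusBytes * 3
	return ramBytes, diskBytes
}

func main() {
	const gib = int64(1) << 30
	ram, disk := estimateZoektResources(10 * gib) // a 10 GiB corpus
	fmt.Printf("RAM: %d bytes, disk: %d bytes\n", ram, disk)
}
```

For a 10 GiB corpus this works out to roughly 12 GiB of RAM during indexing and 30 GiB of index storage.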
Existing Issues
Try to support #33702