@adlternative adlternative commented Mar 11, 2025

Abstract

Zoekt is an open-source search engine designed specifically for code search; it indexes content as 3-grams (trigrams). Replacing Elasticsearch/Bleve with Zoekt gives Gitea precise code search and support for regular expression searches.

Motivation

The existing code search functionality is implemented using Elasticsearch/Bleve. Although Elasticsearch and Bleve excel at general-purpose search, their disadvantages for code search are obvious:

  1. Unable to support precise match searches, for example, when punctuation marks appear in the search criteria.
  2. Unable to easily support regex match searches.

Proposal

Goals

Support precise substring searches
Support regex searches

Non-Goals

Support multi-branch searches
Support code symbol syntax searches

Competitive Product Analysis

Platform | Search Engine | Supports Regex Search | Supports Full Repository Search
GitHub | Blackbird (Proprietary) | |
GitLab | Elasticsearch / Zoekt | |
grep.app | Closed Source | |
Sourcegraph | Zoekt | |
Gitea (us) | Elasticsearch or Bleve | |

Design

Index

Since Zoekt is written in Go, it can be integrated directly as a Go package, using indexBuilder.Add() and indexBuilder.MarkFileAsChangedOrRemoved() to add or remove indexed files. The basic flow for full and incremental repository indexing with Zoekt does not differ significantly from the flow with Elasticsearch (ES) or Bleve.
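
To make the trigram idea and the add/remove flow concrete, here is a toy in-memory sketch. It is not Zoekt's implementation and does not use Zoekt's API; toyIndex and its methods are hypothetical names invented for illustration. It only shows why trigram postings allow exact substring matches, punctuation included, and how a file can be re-added or removed incrementally.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigrams returns the distinct 3-byte substrings of s.
func trigrams(s string) []string {
	seen := map[string]bool{}
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		g := s[i : i+3]
		if !seen[g] {
			seen[g] = true
			out = append(out, g)
		}
	}
	return out
}

// toyIndex (hypothetical) maps each trigram to the set of files containing it.
type toyIndex struct {
	postings map[string]map[string]bool
	contents map[string]string
}

func newToyIndex() *toyIndex {
	return &toyIndex{postings: map[string]map[string]bool{}, contents: map[string]string{}}
}

// Add indexes a file, replacing any previously indexed version of it.
func (ix *toyIndex) Add(name, content string) {
	ix.Remove(name)
	ix.contents[name] = content
	for _, g := range trigrams(content) {
		if ix.postings[g] == nil {
			ix.postings[g] = map[string]bool{}
		}
		ix.postings[g][name] = true
	}
}

// Remove drops a file and its postings from the index.
func (ix *toyIndex) Remove(name string) {
	old, ok := ix.contents[name]
	if !ok {
		return
	}
	for _, g := range trigrams(old) {
		delete(ix.postings[g], name)
	}
	delete(ix.contents, name)
}

// Search returns the files that contain needle as an exact substring.
// Files missing any trigram of needle are skipped without scanning their
// content; a real index would intersect posting lists instead of looping
// over every file.
func (ix *toyIndex) Search(needle string) []string {
	var hits []string
	grams := trigrams(needle)
	for name, content := range ix.contents {
		candidate := true
		for _, g := range grams {
			if !ix.postings[g][name] {
				candidate = false
				break
			}
		}
		if candidate && strings.Contains(content, needle) {
			hits = append(hits, name)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	ix := newToyIndex()
	ix.Add("a.go", `fmt.Println("hi")`)
	ix.Add("b.go", `log.Println("hi")`)
	fmt.Println(ix.Search(`fmt.Println(`))
	ix.Remove("a.go")
	fmt.Println(ix.Search(`Println(`))
}
```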

Search

We can use shards.NewDirectorySearcher() or shards.NewDirectorySearcherFast() to build a searcher for searching. The search modes will support:

  • exact – Complete match of any content (including punctuation)
  • words – Split by spaces into multiple search conditions and perform an OR query
  • regexp – Regular expression search
  • zoekt – Using the Zoekt search syntax
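
As a rough illustration of what the first three modes mean, the sketch below maps a mode and keyword to a standard-library regular expression. This is an assumption made for explanation only: buildPattern is a hypothetical helper, the PR builds Zoekt queries rather than Go regexps, and the zoekt mode is left out because it needs Zoekt's own query parser.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildPattern (hypothetical) turns a search mode and keyword into a regexp.
// The mode names mirror the list above; the mapping itself is an assumption.
func buildPattern(mode, keyword string) (*regexp.Regexp, error) {
	switch mode {
	case "exact":
		// Escape everything so punctuation is matched literally.
		return regexp.Compile(regexp.QuoteMeta(keyword))
	case "words":
		// Split on whitespace and OR the escaped words together.
		words := strings.Fields(keyword)
		if len(words) == 0 {
			return nil, fmt.Errorf("empty keyword")
		}
		for i, w := range words {
			words[i] = regexp.QuoteMeta(w)
		}
		return regexp.Compile(strings.Join(words, "|"))
	case "regexp":
		// Use the keyword as a regular expression as-is.
		return regexp.Compile(keyword)
	default:
		return nil, fmt.Errorf("unsupported search mode %q", mode)
	}
}

func main() {
	line := `repoPathPrefix := repo.OwnerName + "%2F" + repo.Name`
	for _, q := range [][2]string{
		{"exact", `+ "%2F" +`},
		{"words", "missing OwnerName"},
		{"regexp", `repo\.\w+Name`},
	} {
		re, err := buildPattern(q[0], q[1])
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-6s %q matches: %v\n", q[0], q[1], re.MatchString(line))
	}
}
```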

Since the search is currently limited to a single repository, we will retrieve all the content first and then handle pagination.
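
The "retrieve everything, then paginate" approach can be sketched as a plain slice operation. The paginate helper below is hypothetical and is not taken from the PR; it assumes 1-based page numbers.

```go
package main

import "fmt"

// paginate (hypothetical) returns the items on a 1-based page, or nil when
// the page is out of range.
func paginate[T any](items []T, page, pageSize int) []T {
	if page < 1 || pageSize < 1 {
		return nil
	}
	start := (page - 1) * pageSize
	if start >= len(items) {
		return nil
	}
	end := start + pageSize
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	// Pretend these are all the matches returned for one repository.
	matches := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	fmt.Println(paginate(matches, 1, 2), paginate(matches, 3, 2), paginate(matches, 4, 2))
}
```

This only stays cheap while the result set is bounded by a single repository; a cross-repository search would probably need limits pushed down into the searcher.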

Usage

Enable this in app.ini:

[indexer]
REPO_INDEXER_TYPE = zoekt
REPO_INDEXER_ENABLED = true
REPO_INDEXER_PATH = indexers/repos.zoekt

Resource Usage

Building the index in Zoekt requires 1.2 times the corpus size in RAM, and the index storage size is about three times the corpus size. Maybe we should expose some of Zoekt's internal Prometheus metrics in the future?
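
For a quick sense of scale, the ratios above can be applied to a given corpus size. The estimate function is a hypothetical helper that only encodes the two ratios quoted in this section; real usage will vary.

```go
package main

import "fmt"

// estimate (hypothetical) applies the ratios quoted above:
// RAM needed to build is about 1.2x the corpus, index size about 3x the corpus.
func estimate(corpusBytes int64) (ramBytes, indexBytes int64) {
	return corpusBytes * 12 / 10, corpusBytes * 3
}

func main() {
	const gib = int64(1) << 30
	ram, disk := estimate(10 * gib)
	fmt.Printf("corpus 10 GiB: ~%d GiB RAM to build, ~%d GiB index on disk\n", ram/gib, disk/gib)
}
```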

Existing Issues

Try to support #33702

@GiteaBot added the lgtm/need 2 label Mar 11, 2025
@github-actions bot added the modifies/go and modifies/dependencies labels Mar 11, 2025
@adlternative changed the title from "WIP feat(search): support code search by zoekt" to "WIP: feat(search): support code search by zoekt" Mar 11, 2025
@wxiaoguang
Contributor

There are already so many search engines built into Gitea. Many of them have various bugs.

So the questions are:

  1. Will even more search engines be added, so that Gitea ends up with plenty of built-in search engines?
  2. Will the search engines become unmaintained, with bugs that never get fixed?

@hiifong
Member

hiifong commented Mar 11, 2025

To be honest I prefer this zoekt search engine compared to the existing search engine

@lunny
Member

lunny commented Mar 11, 2025

Maybe this can replace bleve, but we need some comparison tests.

@wxiaoguang
Contributor

wxiaoguang commented Mar 11, 2025

> To be honest I prefer this zoekt search engine compared to the existing search engine

That's understandable. So a few months later, another one feels "yoekt" is better, then introduce "yoekt", then a few months later, someone feels "xoekt" is better, then introduce "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains all search engines on the internet.

I don't object to introducing improvements. But it actually needs to:

  1. Clarify the existing problems & fix existing problems.
  2. Remove unnecessary search engines before introducing new ones.

So a clear roadmap about the "search engine plan" is necessary.

@wxiaoguang wxiaoguang marked this pull request as draft March 11, 2025 05:07
@adlternative
Author

> There are already so many search engines built into Gitea. Many of them have various bugs.
>
> So the questions are:
>
>   1. Will more search engines be added into Gitea to make Gitea have plenty of built-in search engines?

In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. see https://docs.gitlab.com/user/search

>   2. Will the search engines become unmaintained and the bugs will never be fixed?

I'm not too worried about this; Gitea should have good community maintenance. It might be because the code search functionality is not exposed by default, so many bugs haven't been discovered.

@wxiaoguang
Contributor

wxiaoguang commented Mar 11, 2025

> In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. see https://docs.gitlab.com/user/search
> I'm not too worried about this; Gitea should have good community maintenance. It might be because the code search functionality is not exposed by default, so many bugs haven't been discovered.

Well, do you know how many search engines are in Gitea now? And what longstanding bugs do they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search

And some bugs didn't get fixed in months, for example: "Search Functionality Issues with Bleve Engine #31565", I don't see "good community maintenance"

@adlternative
Author

>> To be honest I prefer this zoekt search engine compared to the existing search engine
>
> That's understandable. So a few months later, another one feels "yoekt" is better, then introduce "yoekt", then a few months later, someone feels "xoekt" is better, then introduce "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains all search engines on the internet.

You don't need to worry about this: Zoekt is a popular code search engine, currently used by code platforms like Gerrit, Sourcegraph, and GitLab. It was written by a Gerrit author and is maintained by Sourcegraph. Zoekt has advantages that traditional search engines (like ES) do not have: support for regex matching, substring search, etc. I don't think any new open-source code search engine will be able to replace it in the short term.

> I do not mean objection to introduce improvements. But actually it needs to:
>
>   1. Clarify the existing problems & fix existing problems.
>   2. Remove unnecessary search engine before introducing new ones.
>
> So a clear roadmap about the "search engine plan" is necessary.

You are right, where should the roadmap be written? I don't have experience with this. I will supplement its documentation when the zoekt functionality is more complete

@wxiaoguang
Contributor

> I don't think any new open-source code search engines will be able to replace it in the short term.

Yep, if zoekt wins, we need to drop some others.

@adlternative
Author

>> In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. see https://docs.gitlab.com/user/search
>> I'm not too worried about this; Gitea should have good community maintenance. It might be because the code search functionality is not exposed by default, so many bugs haven't been discovered.
>
> Well, do you know how many search engines are in Gitea now? And what longstanding bugs do they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search
>
> And some bugs didn't get fixed in months, for example: "Search Functionality Issues with Bleve Engine #31565", I don't see "good community maintenance"

Sure, it's regrettable that this part of the content is unmaintained. However, for the zoekt code search, I can commit to maintaining it thoroughly.

@adlternative
Author

>> I don't think any new open-source code search engines will be able to replace it in the short term.
>
> Yep, if zoekt wins, we need to drop some others.

Yeah, I hope this can be divided into at least two steps:

  1. Support zoekt
  2. Deprecate other search engines

Zoekt may also have some issues, as GitLab has not completely deprecated ES and fully switched to Zoekt...

@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 17d7c30 to 212fc79 Compare March 11, 2025 11:16
@wxiaoguang
Contributor

To make the code clear, we need to refactor the related code first: Refactor issue & code search #33860

Each "indexer" should provide the "search modes" they support by themselves. And we need to remove the "fuzzy" search for code.

@adlternative
Author

Please note that I have many other commitments over the next two weeks and may only be able to dedicate time to this MR in a couple of weeks

@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch 3 times, most recently from 783ee0e to 374ce10 Compare April 5, 2025 10:56
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Apr 5, 2025
@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch 3 times, most recently from 850a16a to 86ef977 Compare April 5, 2025 12:15
@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 86ef977 to 9906c5f Compare April 5, 2025 12:24
@adlternative adlternative changed the title WIP: feat(search): support code search by zoekt feat(search): support code search by zoekt Apr 6, 2025
@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch 3 times, most recently from 82d0d38 to e1ae522 Compare November 8, 2025 08:08
@adlternative
Author

@lunny @wxiaoguang I simply updated to the latest version of zoekt, and it seems to be working properly now. Could you please review the code again? Thank you.

@GiteaBot added the lgtm/need 2 label and removed the lgtm/need 1 label Nov 17, 2025
@wxiaoguang
Contributor

wxiaoguang commented Nov 18, 2025

I'm not sure if I'm suited to review. I don't use the "search" feature, so I don't understand how end users really use it. I also don't know whether these search engines all behave the same way, don't understand zoekt, and don't see answers to my questions / concerns (for example: https://github.com/go-gitea/gitea/pull/33850/files#r2046765133). And there seem to be no tests.

@wxiaoguang wxiaoguang removed their request for review November 18, 2025 01:09
@kvaster
Contributor

kvaster commented Nov 28, 2025

Any chance of having this in Gitea?

@seamon67

Somebody review this and get this approved please!

@lunny
Member

lunny commented Nov 29, 2025

There are still some things missing in this PR.

  • There's no version tracking for zoekt similar to what bleve provides. This will make major version upgrades difficult to manage.
  • There are still many TODOs and commented-out code left in the PR.
  • modules/setting/indexer.go:81 the default indexer path should depend on whether the indexer is bleve or zoekt.
  • It seems zoekt needs to be upgraded because it panics with Go 1.25.4.

@adlternative
Author

> There are still some things missing in this PR.
>
>   • There's no version tracking for zoekt similar to what bleve provides. This will make major version upgrades difficult to manage.

I'll go to the zoekt community to ask if version management can be supported.

>   • There are still many TODOs and commented-out code left in the PR.

I think these are all harmless minor TODOs that we could perhaps deploy first and address later.

>   • modules/setting/indexer.go:81 the default indexer path should depend on whether the indexer is bleve or zoekt.

Agree, will change.

>   • It seems zoekt needs to be upgraded because it panics with Go 1.25.4.

I will investigate this issue.

@adlternative
Author

adlternative commented Dec 2, 2025

> There are still some things missing in this PR.
>
>   • There's no version tracking for zoekt similar to what bleve provides. This will make major version upgrades difficult to manage.
>   • There are still many TODOs and commented-out code left in the PR.
>   • modules/setting/indexer.go:81 the default indexer path should depend on whether the indexer is bleve or zoekt.
>   • It seems zoekt needs to be upgraded because it panics with Go 1.25.4.

I found the bug: sourcegraph/zoekt#1001, golang/go#75361.
Zoekt with jsonv2 triggers this stack overflow panic.
I am submitting a PR to fix it.

@lunny
Member

lunny commented Dec 2, 2025

>> There are still some things missing in this PR.
>>
>>   • There's no version tracking for zoekt similar to what bleve provides. This will make major version upgrades difficult to manage.
>
> I'll go to the zoekt community to ask if version management can be supported.
>
>>   • There are still many TODOs and commented-out code left in the PR.
>
> I think these are all harmless minor TODOs that we could perhaps deploy first and address later.
>
>>   • modules/setting/indexer.go:81 the default indexer path should depend on whether the indexer is bleve or zoekt.
>
> Agree, will change.
>
>>   • It seems zoekt needs to be upgraded because it panics with Go 1.25.4.
>
> I will investigate this issue.

You don’t need Zoekt to provide a version check. Similar to how Bleve’s versioning is managed directly by Gitea, we can rely on Gitea itself to track whether the index format has changed based on release notes during upgrades. Of course, it would still be beneficial if Zoekt exposed its own format version, but it’s not strictly required.

return err
}

repoPathPrefix := repo.OwnerName + "%2F" + repo.Name
Member

Won't this leave a lot of garbage behind when a repository is renamed or moved? Why not just use the repository's ID as the file or directory name? And since there will be a lot of files, it's better to use a two-level directory structure so that no single directory holds too many files.

Author

After thinking about it briefly, this might be related to zoekt's ability to search using the repo:foo/bar syntax. However, using repository ID might be better here, as it would be more friendly for deletion/move operations.

Author

I'm not sure if we can simply organize the index files into a directory structure, because zoekt might need to traverse and search the index files under a single directory. However, I will ask about this in the zoekt community.
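
For reference, the two-level layout keyed by repository ID suggested in this thread could look roughly like the sketch below. indexDirForRepo and the 256-bucket scheme are hypothetical, IDs are assumed to be positive, and the sketch does not settle whether Zoekt can search index files spread across subdirectories.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
)

// indexDirForRepo (hypothetical) shards per-repository index directories into
// 256 buckets by repository ID, so renames and moves do not change the path
// and no single directory collects every repository.
func indexDirForRepo(root string, repoID int64) string {
	bucket := fmt.Sprintf("%02x", repoID%256)
	return filepath.Join(root, bucket, strconv.FormatInt(repoID, 10))
}

func main() {
	for _, id := range []int64{1, 255, 256, 33850} {
		fmt.Println(filepath.ToSlash(indexDirForRepo("indexers/repos.zoekt", id)))
	}
}
```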

finalQuery = query.NewAnd(finalQuery, langQuery)
}

// TODO: NEEDWORK: IncludePatterns/ExcludePatterns are glob patterns,
Member

I don't think these comments are necessary. They can be part of future PRs. Or it can be recorded in the content of the PR as a next step.

Author

I will remove it.

@adlternative
Author

>>> There are still some things missing in this PR.
>>>
>>>   • There's no version tracking for zoekt similar to what bleve provides. This will make major version upgrades difficult to manage.
>>
>> I'll go to the zoekt community to ask if version management can be supported.
>>
>>>   • There are still many TODOs and commented-out code left in the PR.
>>
>> I think these are all harmless minor TODOs that we could perhaps deploy first and address later.
>>
>>>   • modules/setting/indexer.go:81 the default indexer path should depend on whether the indexer is bleve or zoekt.
>>
>> Agree, will change.
>>
>>>   • It seems zoekt needs to be upgraded because it panics with Go 1.25.4.
>>
>> I will investigate this issue.
>
> You don't need Zoekt to provide a version check. Similar to how Bleve's versioning is managed directly by Gitea, we can rely on Gitea itself to track whether the index format has changed based on release notes during upgrades. Of course, it would still be beneficial if Zoekt exposed its own format version, but it's not strictly required.

Alright, then we can track it through the Changelog you mentioned. However, I will push forward the zoekt support for that version as soon as possible.

sourcegraph/zoekt#1000

@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from e1ae522 to 818dc53 Compare December 5, 2025 14:28
@lunny
Member

lunny commented Dec 5, 2025

> sourcegraph/zoekt#1000

I think we can keep a Gitea-internal version for the zoekt data format, as we did for bleve: https://github.com/go-gitea/gitea/blob/main/modules/indexer/code/bleve/bleve.go#L75

@adlternative
Author

>> sourcegraph/zoekt#1000
>
> I think we can have a gitea-internal version about the zoekt data format like we did for bleve like this https://github.com/go-gitea/gitea/blob/main/modules/indexer/code/bleve/bleve.go#L75

Oh, I misunderstood what you meant. I thought you were referring to the git tag version, but it turns out you were talking about the data format version.

@adlternative adlternative force-pushed the adl/dev/search/support-zoekt-code-indexer branch from 818dc53 to a78e276 Compare December 7, 2025 14:56
@adlternative
Author

>> sourcegraph/zoekt#1000
>
> I think we can have a gitea-internal version about the zoekt data format like we did for bleve like this https://github.com/go-gitea/gitea/blob/main/modules/indexer/code/bleve/bleve.go#L75

I looked into it, and the zoekt index version number is maintained here:
https://github.com/sourcegraph/zoekt/blob/main/index/toc.go#L31

Also, zoekt has implemented a backward compatibility mechanism for data:
https://github.com/sourcegraph/zoekt/blob/886b229dcd5e7bec0c9918002b77345d27c84e3c/index/builder.go#L397

In short: if the data version is updated, zoekt will reindex.

@seamon67

seamon67 commented Dec 7, 2025

Don't mind me. I am just glad to see this getting traction.
Edit: @adlternative Don't give up. I am rooting for you!

@lunny
Member

lunny commented Dec 7, 2025

>>> sourcegraph/zoekt#1000
>>
>> I think we can have a gitea-internal version about the zoekt data format like we did for bleve like this main/modules/indexer/code/bleve/bleve.go#L75
>
> I looked into it, and the zoekt index version number is maintained here: sourcegraph/zoekt@main/index/toc.go#L31
>
> Also, zoekt has implemented a backward compatibility mechanism for data: sourcegraph/zoekt@886b229/index/builder.go#L397
>
> In short: if the data version is updated, zoekt will reindex.

The data format has two layers of meaning: one defines how Zoekt stores the data, and the other defines how Gitea populates those fields. Because of this, we still need a version indicator. When Gitea adds new fields or modifies existing ones—such as changes to the Repo ID or other attributes—we need a clear way to detect these changes and determine that the indexes must be rebuilt.
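
One way to picture such a Gitea-side version indicator is the sketch below. It is an assumption, not code from this PR or from Gitea's bleve indexer: the constant giteaZoektFormatVersion, the marker file name, and both functions are hypothetical, and the version could equally be stored somewhere else.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// giteaZoektFormatVersion (hypothetical) would be bumped whenever Gitea
// changes how it populates the index, independent of Zoekt's own format.
const giteaZoektFormatVersion = 1

// versionFileName is a hypothetical marker file kept next to the index shards.
const versionFileName = "gitea_format_version"

// needsRebuild reports whether the stored version is unreadable or differs
// from the current one.
func needsRebuild(stored string, current int) bool {
	v, err := strconv.Atoi(strings.TrimSpace(stored))
	return err != nil || v != current
}

// ensureIndexVersion wipes indexDir when the marker is missing or outdated,
// records the current version, and reports whether a full reindex is needed.
func ensureIndexVersion(indexDir string, current int) (bool, error) {
	versionPath := filepath.Join(indexDir, versionFileName)
	data, err := os.ReadFile(versionPath)
	if err != nil && !os.IsNotExist(err) {
		return false, err
	}
	if err == nil && !needsRebuild(string(data), current) {
		return false, nil
	}
	if err := os.RemoveAll(indexDir); err != nil {
		return false, err
	}
	if err := os.MkdirAll(indexDir, 0o755); err != nil {
		return false, err
	}
	return true, os.WriteFile(versionPath, []byte(strconv.Itoa(current)), 0o644)
}

func main() {
	dir, err := os.MkdirTemp("", "zoekt-index-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)
	for _, v := range []int{giteaZoektFormatVersion, giteaZoektFormatVersion, giteaZoektFormatVersion + 1} {
		rebuild, err := ensureIndexVersion(dir, v)
		fmt.Printf("format version %d: rebuild=%v err=%v\n", v, rebuild, err)
	}
}
```

Keeping this check separate from Zoekt's own index format version matches the two layers described above: Zoekt can decide when its storage format requires a reindex, while Gitea decides when its field mapping does.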

@adlternative
Author

>>>> sourcegraph/zoekt#1000
>>>
>>> I think we can have a gitea-internal version about the zoekt data format like we did for bleve like this main/modules/indexer/code/bleve/bleve.go#L75
>>
>> I looked into it, and the zoekt index version number is maintained here: sourcegraph/zoekt@main/index/toc.go#L31
>> Also, zoekt has implemented a backward compatibility mechanism for data: sourcegraph/zoekt@886b229/index/builder.go#L397
>> In short: if the data version is updated, zoekt will reindex.
>
> The data format has two layers of meaning: one defines how Zoekt stores the data, and the other defines how Gitea populates those fields. Because of this, we still need a version indicator. When Gitea adds new fields or modifies existing ones (such as changes to the Repo ID or other attributes), we need a clear way to detect these changes and determine that the indexes must be rebuilt.

In that case, we may need to store this version in the database. I'll investigate how to implement it.

Labels

docs-update-needed The document needs to be updated synchronously lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. modifies/dependencies modifies/go Pull requests that update Go code modifies/translation
