Vector Search support by JKamsker · Pull Request #2646 · litedb-org/LiteDB

JKamsker · 2025-09-24T12:28:20Z

This pull request introduces vector similarity search and benchmarking capabilities to the LiteDB project, along with new CLI tooling for document ingestion and search using embeddings. The main changes include adding vector fields to benchmark models, implementing benchmarks for vector similarity queries, and creating extensible CLI commands for ingesting and searching documents with vector embeddings. The changes are grouped below by theme.

Vector Similarity Benchmarking:

Added a new benchmark class QueryWithVectorSimilarity to LiteDB.Benchmarks/Benchmarks/Queries/QueryWithVectorSimilarity.cs, enabling performance testing of vector similarity queries (both indexed and unindexed) using the new Vectors field.
Updated the FileMetaBase model to include a float[] Vectors property, supporting storage of vector embeddings for similarity search.
Modified the FileMetaGenerator to generate random 128-dimensional vectors for each document, ensuring benchmark data includes vector embeddings.
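
As a rough illustration of the benchmark data described above, here is a minimal sketch of generating a random 128-dimensional `float[]`. The names `RandomVectorSketch` and `NextVector` are hypothetical and are not taken from FileMetaGenerator; only the dimension (128) and the `float[]` element type come from the description. A seeded `Random` is assumed so that benchmark runs stay repeatable.

```csharp
using System;

public static class RandomVectorSketch
{
    // Dimension stated in the PR description.
    public const int Dimensions = 128;

    // Hypothetical helper: returns a new float[] with components roughly in [-1, 1].
    public static float[] NextVector(Random random, int dimensions = Dimensions)
    {
        if (random == null) throw new ArgumentNullException(nameof(random));
        if (dimensions <= 0) throw new ArgumentOutOfRangeException(nameof(dimensions));

        var vector = new float[dimensions];
        for (var i = 0; i < vector.Length; i++)
        {
            // NextDouble() yields a value in [0, 1); rescale it to be centered on zero.
            vector[i] = (float)(random.NextDouble() * 2.0 - 1.0);
        }

        return vector;
    }
}
```

Calling `RandomVectorSketch.NextVector(new Random(42))` twice with the same seed should yield identical vectors on the same runtime, which keeps indexed and unindexed benchmark runs comparable.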

Vector Search CLI Tooling:

Implemented the IngestCommand in LiteDB.Demo.Tools.VectorSearch/Commands/IngestCommand.cs, which ingests documents from a directory, splits them into chunks, generates embeddings, and stores them in a database with chunk-level vector indexing. Includes options for skipping unchanged files, pruning missing files, and progress reporting.
Added the SearchCommand in LiteDB.Demo.Tools.VectorSearch/Commands/SearchCommand.cs, allowing users to search for documents via semantic similarity using embeddings, with support for top-K results, max distance filtering, and customizable output.
Created a shared base class VectorSearchCommandSettings to handle configuration, authentication, and validation for embedding services and database paths in CLI commands.
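
To make the top-K and max-distance options concrete, here is a minimal brute-force sketch of what such a search does conceptually. It is not the SearchCommand implementation and does not use the LiteDB API. The names `TopKSearchSketch`, `CosineDistance`, and `Search` are hypothetical, and cosine distance is assumed as the metric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopKSearchSketch
{
    // Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
    public static double CosineDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Assumption: a zero vector has no direction, so treat it as unrelated.
        if (normA == 0 || normB == 0) return 1.0;

        return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every chunk, drops those beyond maxDistance, and keeps the topK closest.
    public static List<(string Id, double Distance)> Search(
        IEnumerable<(string Id, float[] Embedding)> chunks,
        float[] query,
        int topK,
        double maxDistance)
    {
        return chunks
            .Select(c => (c.Id, Distance: CosineDistance(c.Embedding, query)))
            .Where(r => r.Distance <= maxDistance)
            .OrderBy(r => r.Distance)
            .Take(topK)
            .ToList();
    }
}
```

A linear scan like this costs one distance computation per stored chunk, which is the cost a vector index is meant to avoid on larger collections.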

Benchmarks

Credits

Thanks @hurley451 for the initial work

JKamsker · 2025-09-24T15:15:03Z

@codex review

Copilot

Pull Request Overview

This PR introduces comprehensive vector similarity search capabilities to LiteDB, implementing vector indexing, storage, and query functionality. The implementation includes a new HNSW-based vector index, BsonVector type, vector similarity operations, and supporting CLI tools for document ingestion and search using embeddings.

Adds new BsonVector type with VECTOR_SIM expression support for cosine distance calculations
Implements HNSW-based vector indexing with cosine, Euclidean, and dot product distance metrics
Creates comprehensive vector search API with WhereNear and TopKNear query methods
Includes CLI demo tools for embedding text documents using Google Gemini and performing semantic search
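
For readers unfamiliar with the three metrics named above, the following sketch shows their textbook definitions. It is not the code from this PR, the class name `VectorMetricsSketch` is hypothetical, and the exact semantics of `VECTOR_SIM` are not shown in this conversation. Note that the dot product is a similarity (larger means closer), while the other two are distances (smaller means closer).

```csharp
using System;

public static class VectorMetricsSketch
{
    // Dot product: sum of component-wise products.
    public static double Dot(float[] a, float[] b)
    {
        EnsureSameLength(a, b);
        double sum = 0;
        for (var i = 0; i < a.Length; i++)
        {
            sum += (double)a[i] * b[i];
        }
        return sum;
    }

    // Euclidean distance: square root of the summed squared differences.
    public static double Euclidean(float[] a, float[] b)
    {
        EnsureSameLength(a, b);
        double sum = 0;
        for (var i = 0; i < a.Length; i++)
        {
            var diff = (double)a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    // Cosine distance: 1 - (a . b) / (|a| * |b|).
    public static double CosineDistance(float[] a, float[] b)
    {
        var norms = Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b));
        // Assumption: a zero vector has no direction, so report the "unrelated" distance.
        if (norms == 0) return 1.0;
        return 1.0 - Dot(a, b) / norms;
    }

    private static void EnsureSameLength(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    }
}
```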

Reviewed Changes

Copilot reviewed 65 out of 66 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| LiteDB/Document/BsonVector.cs | New BsonVector type for storing float arrays as BSON values |
| LiteDB/Engine/Services/VectorIndexService.cs | Core HNSW vector index implementation with search, insert, and delete operations |
| LiteDB/Engine/Structures/VectorIndexNode.cs | Vector index node structure with multilevel neighbor management |
| LiteDB/Client/Database/LiteQueryable.cs | Vector query API with WhereNear and TopKNear methods |
| LiteDB.Demo.Tools.VectorSearch/ | Complete CLI demo for document embedding and semantic search |

Comments suppressed due to low confidence (1)

LiteDB/Engine/Pages/BasePage.cs:1

The error message contains grammatical errors. Should be: 'only data/index/vector index pages can be deleted'

using System;

Copilot · 2025-09-24T15:16:48Z

-## License
-
-[MIT](http://opensource.org/licenses/MIT)
+</a>


[nitpick] The license section was removed from the README. Consider whether this was intentional or if the MIT license information should be preserved for clarity.

Suggested change

-</a>
+</a>
+## License
+LiteDB is licensed under the [MIT License](LICENSE).
+Copyright (c) 2014-2024 Mauricio David

Copilot · 2025-09-24T15:16:49Z

-            {
-                sb.AppendLine($"WHERE {string.Join(" AND ", this.Where.Select(x => x.Source))}");
-            }
+


The WHERE clause logic was removed but only replaced with an empty line. This appears to be incomplete - the WHERE clause handling should be restored or properly integrated with the vector search logic.

JKamsker · 2025-09-25

It was moved further down.

Copilot · 2025-09-24T15:16:49Z

+                    firstBlock = block.Position;
+                }
+
+                Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);


Buffer.BlockCopy expects byte arrays but 'vector' is a float array. This should use Buffer.BlockCopy(vector, bytesWritten * sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk) or convert the float array to bytes first.

Suggested change

-                Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
+                Buffer.BlockCopy(vector, bytesWritten / sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk);
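
For context on the Buffer.BlockCopy comments in this review: `System.Buffer.BlockCopy` accepts arrays of any primitive type, and its `srcOffset`, `dstOffset`, and `count` arguments are all measured in bytes, not elements. Whether the call under review needs changing therefore depends on whether `bytesWritten` and `chunk` are tracked in bytes, which the excerpt does not show. The sketch below is a self-contained round trip under that assumption. The names `VectorBlockCopySketch`, `WriteInBlocks`, and `ReadFromBlocks` are hypothetical and do not reflect LiteDB's page or block layout.

```csharp
using System;

public static class VectorBlockCopySketch
{
    // Splits a float[] into byte blocks of at most blockSize bytes each.
    public static byte[][] WriteInBlocks(float[] vector, int blockSize)
    {
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        var totalBytes = vector.Length * sizeof(float);
        var blockCount = (totalBytes + blockSize - 1) / blockSize;
        var blocks = new byte[blockCount][];
        var bytesWritten = 0;

        for (var i = 0; i < blockCount; i++)
        {
            var chunk = Math.Min(blockSize, totalBytes - bytesWritten);
            blocks[i] = new byte[chunk];

            // The source is a float[], but the offset and count are byte counts.
            Buffer.BlockCopy(vector, bytesWritten, blocks[i], 0, chunk);
            bytesWritten += chunk;
        }

        return blocks;
    }

    // Reassembles the float[] by copying each block's bytes back in order.
    public static float[] ReadFromBlocks(byte[][] blocks, int dimensions)
    {
        var vector = new float[dimensions];
        var bytesCopied = 0;

        foreach (var block in blocks)
        {
            // The destination is a float[]; bytesCopied is again a byte offset.
            Buffer.BlockCopy(block, 0, vector, bytesCopied, block.Length);
            bytesCopied += block.Length;
        }

        return vector;
    }
}
```

Because the copy is byte-wise, a block boundary may fall in the middle of a float without corrupting the value, as long as the blocks are read back in the same order.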

Copilot · 2025-09-24T15:16:49Z

+                    throw new LiteException(0, "Vector data block is corrupted.");
+                }
+
+                Buffer.BlockCopy(slice.Array, slice.Offset, vector, bytesCopied, available);


Buffer.BlockCopy is copying bytes to a float array. The destination should be byte array or use unsafe code to properly handle the float array memory layout.

Copilot · 2025-09-24T15:16:50Z


                case BsonType.MinValue: return BsonValue.MinValue;
                case BsonType.MaxValue: return BsonValue.MaxValue;
+                case BsonType.Vector: return buffer.ReadVector(offset);


The ReadVector method returns float[] but this method returns BsonValue. The return should be wrapped: return new BsonVector(buffer.ReadVector(offset));

Suggested change

-                case BsonType.Vector: return buffer.ReadVector(offset);
+                case BsonType.Vector: return new BsonVector(buffer.ReadVector(offset));

Copilot · 2025-09-24T15:16:50Z

+            {
+                page = this.GetPage<VectorIndexPage>(freeVectorPageList);
+
+                ENSURE(page.FreeBytes > bytesLength, "this page shout be space enouth for this new vector node");


Multiple spelling errors in the error message: 'shout' should be 'should' and 'enouth' should be 'enough'

chatgpt-codex-connector

Codex Review: Here are some suggestions.

LiteDB/LiteDB/Engine/Services/SnapShot.cs

Lines 680 to 688 in 99e8720

    
           foreach(var index in _collectionPage.GetCollectionIndexes())
           {
               // add head/tail (same page) to be deleted
               indexPages.Add(index.Head.PageID);
               foreach (var node in indexer.FindAll(index, Query.Ascending))
               {
                   indexPages.Add(node.Page.PageID);

[P1] Skip-list cleanup assumes vector indexes use IndexPage layout

When dropping a collection the engine iterates every entry returned by GetCollectionIndexes() and treats them as skip-list indexes by reading index.Head and calling indexer.FindAll(index, …) to collect IndexPage IDs. Vector indexes stored in the same dictionary have IndexType == 1 and never populate Head/Tail; their metadata lives in VectorIndexMetadata.Root and nodes are backed by VectorIndexPage. Passing a vector index through this block either throws when index.Head.PageID is uint.MaxValue or simply skips releasing vector pages, leaving allocated VectorIndexPage/data blocks behind. The loop needs to exclude vector indexes or invoke a dedicated release routine that walks VectorIndexMetadata.Root instead of the skip-list path.
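
The shape of the suggested fix can be sketched with a simplified, self-contained model. Every type and member below (`IndexKindSketch`, `IndexEntrySketch`, `CollectPagesToRelease`, and the page-ID lists) is hypothetical and stands in for the engine's real structures. Only the rule it illustrates comes from the comment above: skip-list indexes are walked through their head and nodes, while vector indexes (type 1) take a separate release path and never have their head read.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the index type flag; the review states vector indexes use 1.
public enum IndexKindSketch { SkipList = 0, Vector = 1 }

// Hypothetical, flattened view of an index entry for illustration only.
public sealed class IndexEntrySketch
{
    public string Name { get; set; } = "";
    public IndexKindSketch Kind { get; set; }
    // Stays at uint.MaxValue for vector indexes, which never populate a head.
    public uint HeadPageId { get; set; } = uint.MaxValue;
    public List<uint> SkipListNodePageIds { get; } = new List<uint>();
    public List<uint> VectorPageIds { get; } = new List<uint>();
}

public static class DropCollectionSketch
{
    public static HashSet<uint> CollectPagesToRelease(IEnumerable<IndexEntrySketch> indexes)
    {
        var pages = new HashSet<uint>();

        foreach (var index in indexes)
        {
            if (index.Kind == IndexKindSketch.Vector)
            {
                // Dedicated path: release the vector index's own pages, never read the head.
                foreach (var pageId in index.VectorPageIds) pages.Add(pageId);
                continue;
            }

            // Skip-list path: head/tail page plus every node page.
            pages.Add(index.HeadPageId);
            foreach (var pageId in index.SkipListNodePageIds) pages.Add(pageId);
        }

        return pages;
    }
}
```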

syan2018 · 2025-09-30T07:21:41Z

Hello! Will these new features be released in LiteDB 6.0? And can I try them on NuGet as 6.0.0-prerelease?

JKamsker · 2025-09-30T07:59:14Z

> Hello! Will these new features be released in LiteDB 6.0? And can I try them on NuGet as 6.0.0-prerelease?

@syan2018 It will be on NuGet in 15-30 minutes 👍

JKamsker · 2025-09-30T08:00:28Z

@syan2018 https://www.nuget.org/packages/LiteDB/6.0.0-prerelease.52

JKamsker added 30 commits September 20, 2025 22:33

Add vector support

956dc28

Merge branch 'dev' into vectors

5543ae6

Add vector similarity helpers to query API

5d6388a

Clarify vector similarity alias behavior

e038420

Remove useless aliases

bb6c81a

Merge pull request #3 from JKamsker/codex/implement-and-test-vector_s…

fff562b

…im-function Support vector similarity operator in predicates

Fix vector serialization and comparison

219d858

Merge pull request #4 from JKamsker/codex/review-pull-request-#2

9ceab3c

Fix vector serialization and comparison

Clarify vector similarity parser handling

491e1be

Parse VECTOR_SIM as function

abec43a

Merge pull request #5 from JKamsker/codex/review-pull-request-#2

8b004de

Fix vector serialization and comparison

Merge branch 'dev' into vectors

159b684

Add test results upload step to CI workflow

f217a76

Integrate vector indexes into planner and tests

c84ff87

Add persistent vector index storage

c51ae87

Add search-pruning vector index structure

db3f2a7

Honor dot-product thresholds in vector search

411e24d

Implement ANN graph for vector index

282d8e5

Handle BsonVector in JSON writer

62b7d47

Merge pull request #14 from JKamsker/codex/add-bsontype.vector-suppor…

cae4a4f

…t-in-jsonwriter Handle BsonVector in JSON writer

Add vector search tool

f1d7f03

Searchtool fails

d9c532f

Fix vector index mapping for enumerable expressions

088dca6

Support external storage for large vectors

c52c4a1

Fix spectre escaping

9fcc9c3

Merge pull request #17 from JKamsker/codex/fix-embedding-storage-in-i…

241eb0e

…ngestcommand Fix vector index mapping for enumerable expressions

Add chunk-based document indexing

7c48f15

Merge branch 'dev-staging' into codex/add-indexeddocumentchunk-model-…

f98f830

…and-chunk-processing ; Conflicts: ; .github/workflows/ci.yml ; LiteDB.sln ; LiteDB/Engine/Query/QueryOptimization.cs

Add test for vector ordering with composite sort

2713b32

Merge pull request #19 from JKamsker/codex/fix-build-issues-after-merge

64bb18c

Fix vector order lookup with composite ordering

JKamsker added 4 commits September 24, 2025 09:34

Merge pull request #20 from JKamsker/codex/add-unit-tests-for-vector-…

fa69b08

…search-feature Improve vector index tests with MathNet comparisons

Merge pull request #18 from JKamsker/codex/add-indexeddocumentchunk-m…

2f730a6

…odel-and-chunk-processing Introduce chunk-based vector search indexing

Merge pull request #12 from JKamsker/codex/refactor-vectorindex-for-a…

543126a

…nn-design Implement ANN graph for vector index

Merge pull request #11 from JKamsker/codex/add-vector-aware-index-sup…

99e8720

…port Integrate vector indexes into planner and tests

JKamsker requested a review from Copilot September 24, 2025 15:14

Copilot AI reviewed Sep 24, 2025

chatgpt-codex-connector bot reviewed Sep 24, 2025

JKamsker added 14 commits September 25, 2025 19:45

Add regression tests for vector index regressions

e6e3d5d

Add regression tests for vector index issues

05f9f5a

Add regression tests for vector index data and drop cleanup

d21d9ab

Add regression tests for vector index issues

0555167

Merge remote-tracking branch 'origin/codex/add-regression-tests-for-v…

1c68232

…ector-index-feature' into temp/pr-merge-21-24

Merge remote-tracking branch 'origin/codex/add-regression-tests-for-v…

5e94119

…ector-index-feature-sc7x7j' into temp/pr-merge-21-24

Merge remote-tracking branch 'origin/codex/add-regression-tests-for-v…

c46f040

…ector-index-feature-3uorv1' into temp/pr-merge-21-24

Merge remote-tracking branch 'origin/codex/add-regression-tests-for-v…

2f4798a

…ector-index-feature-hhom1x' into temp/pr-merge-21-24

Fix snapshot

16a8ed7

Remove wrong tests

65aa6e0

Merge pull request #28 from JKamsker/vector_index_bug

70978ef

Vector index bug

Move vector APIs to LiteDB.Vector extensions

d053eaa

Merge pull request #33 from JKamsker/codex/refactor-litedb-vector-rel…

a3f2254

…ated-interfaces-and-logic Move vector APIs to LiteDB.Vector extension surface

Improve demo doc

c477ec7

JKamsker changed the title ~~Full vector support~~ Vector Search support Sep 27, 2025

Fix markdown in demo tool

e1f6548

JKamsker closed this Sep 30, 2025

JKamsker deleted the vectors branch September 30, 2025 07:50

JKamsker mentioned this pull request Sep 30, 2025

feat: Vector Search and Similarity Indexing #2678

Merged
