…im-function Support vector similarity operator in predicates
Fix vector serialization and comparison
Fix vector serialization and comparison
…t-in-jsonwriter Handle BsonVector in JSON writer
…ngestcommand Fix vector index mapping for enumerable expressions
…and-chunk-processing ; Conflicts: ; .github/workflows/ci.yml ; LiteDB.sln ; LiteDB/Engine/Query/QueryOptimization.cs
Fix vector order lookup with composite ordering
…search-feature Improve vector index tests with MathNet comparisons
…odel-and-chunk-processing Introduce chunk-based vector search indexing
…nn-design Implement ANN graph for vector index
…port Integrate vector indexes into planner and tests
@codex review
Pull Request Overview
This PR introduces comprehensive vector similarity search capabilities to LiteDB, implementing vector indexing, storage, and query functionality. The implementation includes a new HNSW-based vector index, BsonVector type, vector similarity operations, and supporting CLI tools for document ingestion and search using embeddings.
- Adds new BsonVector type with VECTOR_SIM expression support for cosine distance calculations
- Implements HNSW-based vector indexing with cosine, Euclidean, and dot product distance metrics
- Creates comprehensive vector search API with WhereNear and TopKNear query methods
- Includes CLI demo tools for embedding text documents using Google Gemini and performing semantic search
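For readers unfamiliar with the three metrics named in the overview, here is a minimal sketch of how each is computed. `VectorMetrics` and its method names are illustrative assumptions, not LiteDB's actual implementation, and treating a zero-length vector as maximally distant is a choice made only for this example.

```csharp
using System;

// Illustrative helper (not part of LiteDB): the three metrics named in the overview.
public static class VectorMetrics
{
    // Dot product: larger means more similar for normalized vectors.
    public static double Dot(float[] a, float[] b)
    {
        EnsureSameLength(a, b);
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += (double)a[i] * b[i];
        return sum;
    }

    // Euclidean (L2) distance: square root of the summed squared differences.
    public static double Euclidean(float[] a, float[] b)
    {
        EnsureSameLength(a, b);
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = (double)a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Cosine distance = 1 - cosine similarity.
    // Assumption for this sketch: a zero vector yields distance 1.0.
    public static double CosineDistance(float[] a, float[] b)
    {
        double dot = Dot(a, b);
        double norm = Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b));
        return norm == 0 ? 1.0 : 1.0 - dot / norm;
    }

    private static void EnsureSameLength(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");
    }
}
```

Cosine distance ignores magnitude, so `{1, 0}` and `{2, 0}` are at distance 0, while Euclidean distance between the same pair is 1.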
Reviewed Changes
Copilot reviewed 65 out of 66 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| LiteDB/Document/BsonVector.cs | New BsonVector type for storing float arrays as BSON values |
| LiteDB/Engine/Services/VectorIndexService.cs | Core HNSW vector index implementation with search, insert, and delete operations |
| LiteDB/Engine/Structures/VectorIndexNode.cs | Vector index node structure with multilevel neighbor management |
| LiteDB/Client/Database/LiteQueryable.cs | Vector query API with WhereNear and TopKNear methods |
| LiteDB.Demo.Tools.VectorSearch/ | Complete CLI demo for document embedding and semantic search |
Comments suppressed due to low confidence (1)
LiteDB/Engine/Pages/BasePage.cs:1
- The error message contains grammatical errors. Should be: 'only data/index/vector index pages can be deleted'
using System;
    ## License

    [MIT](http://opensource.org/licenses/MIT)
    </a>
    \ No newline at end of file
[nitpick] The license section was removed from the README. Consider whether this was intentional or if the MIT license information should be preserved for clarity.
    - </a>
    + </a>
    + ## License
    + LiteDB is licensed under the [MIT License](LICENSE).
    + Copyright (c) 2014-2024 Mauricio David
    {
        sb.AppendLine($"WHERE {string.Join(" AND ", this.Where.Select(x => x.Source))}");
    }
The WHERE clause logic was removed but only replaced with an empty line. This appears to be incomplete - the WHERE clause handling should be restored or properly integrated with the vector search logic.
It was moved further down
        firstBlock = block.Position;
    }

    Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
Buffer.BlockCopy expects byte arrays but 'vector' is a float array. This should use Buffer.BlockCopy(vector, bytesWritten * sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk) or convert the float array to bytes first.
    - Buffer.BlockCopy(vector, bytesWritten, block.Buffer.Array, block.Buffer.Offset, chunk);
    + Buffer.BlockCopy(vector, bytesWritten / sizeof(float), block.Buffer.Array, block.Buffer.Offset, chunk);
        throw new LiteException(0, "Vector data block is corrupted.");
    }

    Buffer.BlockCopy(slice.Array, slice.Offset, vector, bytesCopied, available);
Buffer.BlockCopy is copying bytes to a float array. The destination should be byte array or use unsafe code to properly handle the float array memory layout.
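For context on the two `Buffer.BlockCopy` comments: the .NET method accepts arrays of any primitive type, and its offsets and count are always measured in bytes, whatever the element type. If `bytesWritten` and `bytesCopied` count bytes, as their names suggest, a `float[]` can be copied to and from byte blocks without scaling the offset. The sketch below demonstrates that round trip; `BlockCopyDemo` and the plain `byte[][]` blocks are assumptions standing in for the PR's page and buffer types.

```csharp
using System;

// Illustrative only: byte[][] stands in for the data blocks used in the PR.
public static class BlockCopyDemo
{
    // Splits a float[] across fixed-size byte blocks.
    // Every offset and count passed to Buffer.BlockCopy is in bytes.
    public static byte[][] WriteChunks(float[] vector, int blockSize)
    {
        if (blockSize <= 0) throw new ArgumentOutOfRangeException(nameof(blockSize));

        int totalBytes = vector.Length * sizeof(float);
        int blockCount = (totalBytes + blockSize - 1) / blockSize;
        var blocks = new byte[blockCount][];
        int bytesWritten = 0;

        for (int i = 0; i < blockCount; i++)
        {
            int chunk = Math.Min(blockSize, totalBytes - bytesWritten);
            blocks[i] = new byte[chunk];
            // Source is a float[], yet the source offset is a byte offset.
            Buffer.BlockCopy(vector, bytesWritten, blocks[i], 0, chunk);
            bytesWritten += chunk;
        }

        return blocks;
    }

    // Reassembles the float[] from the byte blocks.
    public static float[] ReadChunks(byte[][] blocks, int dimensions)
    {
        var vector = new float[dimensions];
        int bytesCopied = 0;

        foreach (var block in blocks)
        {
            // Destination is a float[], and the destination offset is again in bytes.
            Buffer.BlockCopy(block, 0, vector, bytesCopied, block.Length);
            bytesCopied += block.Length;
        }

        return vector;
    }
}
```

A block size that is not a multiple of `sizeof(float)` still round-trips correctly, because the copy works at byte granularity.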
    case BsonType.MinValue: return BsonValue.MinValue;
    case BsonType.MaxValue: return BsonValue.MaxValue;
    case BsonType.Vector: return buffer.ReadVector(offset);
The ReadVector method returns float[] but this method returns BsonValue. The return should be wrapped: return new BsonVector(buffer.ReadVector(offset));
    - case BsonType.Vector: return buffer.ReadVector(offset);
    + case BsonType.Vector: return new BsonVector(buffer.ReadVector(offset));
    {
        page = this.GetPage<VectorIndexPage>(freeVectorPageList);

        ENSURE(page.FreeBytes > bytesLength, "this page shout be space enouth for this new vector node");
Multiple spelling errors in the error message: 'shout' should be 'should' and 'enouth' should be 'enough'
Codex Review: Here are some suggestions.
LiteDB/LiteDB/Engine/Services/SnapShot.cs
Lines 680 to 688 in 99e8720
[P1] Skip-list cleanup assumes vector indexes use IndexPage layout
When dropping a collection the engine iterates every entry returned by GetCollectionIndexes() and treats them as skip-list indexes by reading index.Head and calling indexer.FindAll(index, …) to collect IndexPage IDs. Vector indexes stored in the same dictionary have IndexType == 1 and never populate Head/Tail; their metadata lives in VectorIndexMetadata.Root and nodes are backed by VectorIndexPage. Passing a vector index through this block either throws when index.Head.PageID is uint.MaxValue or simply skips releasing vector pages, leaving allocated VectorIndexPage/data blocks behind. The loop needs to exclude vector indexes or invoke a dedicated release routine that walks VectorIndexMetadata.Root instead of the skip-list path.
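The shape of the fix suggested in this comment can be sketched as a branch on the index type before pages are collected. Everything below is a hypothetical model: `IndexKind`, `IndexInfo`, and `DropCollectionSketch` are invented for illustration and do not mirror LiteDB's real `CollectionIndex`, indexer, or page services. Only the value 1 for vector indexes is taken from the comment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the engine's index metadata.
public enum IndexKind { SkipList = 0, Vector = 1 }

public sealed class IndexInfo
{
    public string Name = "";
    public IndexKind Kind;
    // In the real engine these would be found by walking Head (skip list)
    // or the vector metadata root; here they are simply precomputed lists.
    public List<uint> SkipListPages = new List<uint>();
    public List<uint> VectorPages = new List<uint>();
}

public static class DropCollectionSketch
{
    // Collects every page that must be released when a collection is dropped,
    // choosing the traversal that matches each index's layout.
    public static HashSet<uint> CollectPagesToRelease(IEnumerable<IndexInfo> indexes)
    {
        var pages = new HashSet<uint>();

        foreach (var index in indexes)
        {
            var source = index.Kind == IndexKind.Vector
                ? index.VectorPages      // dedicated vector release path
                : index.SkipListPages;   // existing skip-list path

            pages.UnionWith(source);
        }

        return pages;
    }
}
```

The point of the branch is that a vector index never goes through the skip-list traversal, so its pages are neither skipped nor read through an unpopulated head pointer.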
…ector-index-feature' into temp/pr-merge-21-24
…ector-index-feature-sc7x7j' into temp/pr-merge-21-24
…ector-index-feature-3uorv1' into temp/pr-merge-21-24
…ector-index-feature-hhom1x' into temp/pr-merge-21-24
Vector index bug
…ated-interfaces-and-logic Move vector APIs to LiteDB.Vector extension surface
Hello! Will these new features be released in LiteDB 6.0? And can I try them on NuGet with 6.0.0-prerelease?
@syan2018 Will be on nuget in 15-30 minutes 👍
Fixes: #2364 and #2580
This pull request introduces vector similarity search and benchmarking capabilities to the LiteDB project, along with new CLI tooling for document ingestion and search using embeddings. The main changes include adding vector fields to benchmark models, implementing benchmarks for vector similarity queries, and creating extensible CLI commands for ingesting and searching documents with vector embeddings. The changes are grouped below by theme.
Vector Similarity Benchmarking:
- Added `QueryWithVectorSimilarity` to `LiteDB.Benchmarks/Benchmarks/Queries/QueryWithVectorSimilarity.cs`, enabling performance testing of vector similarity queries (both indexed and unindexed) using the new `Vectors` field.
- Updated the `FileMetaBase` model to include a `float[] Vectors` property, supporting storage of vector embeddings for similarity search.
- Updated `FileMetaGenerator` to generate random 128-dimensional vectors for each document, ensuring benchmark data includes vector embeddings.

Vector Search CLI Tooling:
- Added `IngestCommand` in `LiteDB.Demo.Tools.VectorSearch/Commands/IngestCommand.cs`, which ingests documents from a directory, splits them into chunks, generates embeddings, and stores them in a database with chunk-level vector indexing. Includes options for skipping unchanged files, pruning missing files, and progress reporting.
- Added `SearchCommand` in `LiteDB.Demo.Tools.VectorSearch/Commands/SearchCommand.cs`, allowing users to search for documents via semantic similarity using embeddings, with support for top-K results, max distance filtering, and customizable output.
- Added `VectorSearchCommandSettings` to handle configuration, authentication, and validation for embedding services and database paths in CLI commands.

Benchmarks
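As a point of reference for the unindexed case, a similarity query without an index amounts to a full scan: compute the distance from the query to every stored vector, filter by a maximum distance, and keep the k closest. The sketch below shows that baseline over random 128-dimensional vectors. `BruteForceSearch`, its method names, and the value range of the random components are assumptions for illustration; this is not the benchmark code from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative baseline (not the PR's benchmark): linear-scan nearest-neighbor search.
public static class BruteForceSearch
{
    // Random vector with components in [-1, 1); 128 matches the dimension named in the PR description.
    public static float[] RandomVector(Random rng, int dimensions = 128)
    {
        var v = new float[dimensions];
        for (int i = 0; i < v.Length; i++) v[i] = (float)(rng.NextDouble() * 2 - 1);
        return v;
    }

    // Cosine distance = 1 - cosine similarity (zero vectors map to 1.0 by assumption).
    public static double CosineDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            na += (double)a[i] * a[i];
            nb += (double)b[i] * b[i];
        }

        double norm = Math.Sqrt(na) * Math.Sqrt(nb);
        return norm == 0 ? 1.0 : 1.0 - dot / norm;
    }

    // Scans every document, drops hits beyond maxDistance, and returns the k closest.
    public static List<(int Id, double Distance)> TopK(
        IReadOnlyList<float[]> documents, float[] query, int k,
        double maxDistance = double.MaxValue)
    {
        return documents
            .Select((vector, id) => (Id: id, Distance: CosineDistance(vector, query)))
            .Where(hit => hit.Distance <= maxDistance)
            .OrderBy(hit => hit.Distance)
            .Take(k)
            .ToList();
    }
}
```

The scan costs one distance computation per stored vector on every query, which is the cost an approximate index such as HNSW is meant to avoid.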
Credits
Thanks @hurley451 for the initial work