Pull Request Overview
This pull request introduces comprehensive vector similarity search and benchmarking capabilities to LiteDB, implementing vector indexes, similarity queries, and CLI tooling for document ingestion and search using embeddings. The changes significantly expand LiteDB's functionality beyond traditional document storage to include modern vector database features.
- Adds vector similarity search with support for multiple distance metrics (cosine, Euclidean, dot product)
- Implements hierarchical navigable small world (HNSW) vector index structures for efficient similarity queries
- Creates extensible CLI tools for document ingestion, chunking, embedding generation, and vector search operations
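
For orientation, the three metrics named above and the exact nearest-neighbor scan that an HNSW index approximates can be sketched in plain C#. This is an illustrative sketch only: `VectorMath` and its members are hypothetical names, it does not call any LiteDB API, and the conventions used here (cosine distance as 1 minus cosine similarity, zero vectors treated as distance 1) are assumptions rather than the PR's confirmed behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LiteDB: reference implementations of the
// three metrics plus a brute-force top-k scan.
public static class VectorMath
{
    // Dot product of two equal-length vectors.
    public static double Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += (double)a[i] * b[i];
        return sum;
    }

    // Euclidean (L2) distance: 0 for identical vectors.
    public static double Euclidean(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = (double)a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Cosine distance, assumed here to be 1 - cosine similarity:
    // 0 = same direction, 1 = orthogonal, 2 = opposite.
    public static double CosineDistance(float[] a, float[] b)
    {
        double normA = Math.Sqrt(Dot(a, a));
        double normB = Math.Sqrt(Dot(b, b));
        if (normA == 0 || normB == 0)
            return 1.0; // assumed convention for zero-length vectors

        return 1.0 - Dot(a, b) / (normA * normB);
    }

    // Exact top-k: score every vector, sort ascending by distance, keep k.
    // An ANN index answers the same question without visiting every vector,
    // at the cost of possibly missing some true neighbors.
    public static List<(int Index, double Distance)> TopK(
        IReadOnlyList<float[]> vectors,
        float[] target,
        int k,
        Func<float[], float[], double> distance)
    {
        return vectors
            .Select((v, i) => (Index: i, Distance: distance(v, target)))
            .OrderBy(p => p.Distance)
            .Take(k)
            .ToList();
    }
}
```

For example, with the vectors `(1, 0)`, `(0, 1)` and `(0.9, 0.1)` and the target `(1, 0)`, calling `VectorMath.TopK(vectors, target, 2, VectorMath.CosineDistance)` returns indices 0 and 2, in that order. The scan is linear in the number of stored vectors, which is the cost an index is meant to avoid.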
Reviewed Changes
Copilot reviewed 69 out of 70 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| LiteDB/Engine/Structures/VectorIndexNode.cs | Core vector index node implementation with multi-level neighbor management |
| LiteDB/Engine/Services/VectorIndexService.cs | Vector similarity search service with HNSW algorithm implementation |
| LiteDB/Document/BsonVector.cs | New BsonVector type for native vector data storage |
| LiteDB/Client/Vector/* | Client-side vector extensions for collections, queries, and repositories |
| LiteDB/Engine/Query/IndexQuery/VectorIndexQuery.cs | Query execution engine for vector similarity searches |
| tests.runsettings & tests.ci.runsettings | Test configuration files with timeout and CI optimization settings |
Fixes: #2364 and #2580
Ref: #2646
This pull request introduces a major new capability to LiteDB: Vector Search. This feature enables storing vector embeddings and performing efficient Approximate Nearest Neighbor (ANN) searches, making LiteDB suitable for a wide range of AI and machine learning applications, such as semantic search, recommendation engines, and Retrieval-Augmented Generation (RAG).
Key Features
Vector Storage & Indexing:
- The `BsonVector` type (`float[]`) is introduced for native and efficient storage of vector embeddings.
- Supported distance metrics:
  - `VectorDistanceMetric.Cosine` (default)
  - `VectorDistanceMetric.Euclidean`
  - `VectorDistanceMetric.DotProduct`
New Fluent Query API:
- Query methods in the `LiteDB.Vector` namespace:
  - `.WhereNear(x => x.Embedding, targetVector, maxDistance)`: Filters documents where the vector is within a specified distance of the target.
  - `.TopKNear(x => x.Embedding, targetVector, k)`: Efficiently retrieves the top K nearest neighbors to a target vector, using the index to avoid full scans and sorting.
SQL / BsonExpression Support:
- The `VECTOR_SIM(field, target_vector)` function and infix operator have been added, allowing vector similarity calculations directly within expressions and SQL queries.
- Example: `SELECT * FROM docs WHERE $.Embedding VECTOR_SIM @0 <= 0.25`
End-to-End Demo Application
To showcase this new feature, a new demo project `LiteDB.Demo.Tools.VectorSearch` has been added. It's a command-line tool that provides a complete workflow for building a semantic search engine:
- Reads text files (`.txt`, `.md`), splits them into chunks, and generates embeddings using the Google Gemini API.
- Uses `.TopKNear()` to find and display the most semantically similar document chunks from the database.
Usage Example
Getting started with vector search is straightforward.
Additional Changes
Benchmarks
Credits
Thanks @hurley451 for the initial work