-
Notifications
You must be signed in to change notification settings - Fork 2k
[ENH] Update Fastembed embedding function with more parameters, add bm25 embedding function #5489
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
Add BM25 Embedding Function and Enhance Fastembed Support This pull request introduces a new embedding function, Key Changes• Added new file Affected Areas• This summary was automatically generated by @propel-code-bot |
chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
Show resolved
Hide resolved
25a830c to
4adf910
Compare
| self._model = Bm25( | ||
| model_name, | ||
| cache_dir, | ||
| k, | ||
| b, | ||
| avg_len, | ||
| language, | ||
| token_max_length, | ||
| disable_stemmer, | ||
| specific_model_path, | ||
| **kwargs, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential runtime error: The Bm25 constructor is called with positional arguments, but if any of the optional parameters (cache_dir, k, b, etc.) are None, this could cause issues depending on how the Bm25 class handles None values. Consider using keyword arguments or filtering out None values:
| self._model = Bm25( | |
| model_name, | |
| cache_dir, | |
| k, | |
| b, | |
| avg_len, | |
| language, | |
| token_max_length, | |
| disable_stemmer, | |
| specific_model_path, | |
| **kwargs, | |
| ) | |
| # Filter out None values for cleaner initialization | |
| kwargs_filtered = {k: v for k, v in { | |
| 'cache_dir': cache_dir, | |
| 'k': k, 'b': b, 'avg_len': avg_len, | |
| 'language': language, | |
| 'token_max_length': token_max_length, | |
| 'disable_stemmer': disable_stemmer, | |
| 'specific_model_path': specific_model_path, | |
| **kwargs | |
| }.items() if v is not None} | |
| self._model = Bm25(model_name, **kwargs_filtered) |
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
Context for Agents
[**BestPractice**]
Potential runtime error: The `Bm25` constructor is called with positional arguments, but if any of the optional parameters (`cache_dir`, `k`, `b`, etc.) are `None`, this could cause issues depending on how the `Bm25` class handles `None` values. Consider using keyword arguments or filtering out `None` values:
```suggestion
# Filter out None values for cleaner initialization
kwargs_filtered = {k: v for k, v in {
'cache_dir': cache_dir,
'k': k, 'b': b, 'avg_len': avg_len,
'language': language,
'token_max_length': token_max_length,
'disable_stemmer': disable_stemmer,
'specific_model_path': specific_model_path,
**kwargs
}.items() if v is not None}
self._model = Bm25(model_name, **kwargs_filtered)
```
⚡ **Committable suggestion**
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 85| model_name (str, optional): Identifier of the Splade model | ||
| List of commonly used models: Qdrant/bm25, prithivida/Splade_PP_en_v1, Qdrant/minicoil-v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Documentation]
The docstring for model_name lists SPLADE and BM25 models as examples. Since this is a generic FastembedSparseEmbeddingFunction, it might be better to provide a more general description or point to the fastembed documentation for available sparse models to avoid confusion if other types of sparse models are supported in the future.
Context for Agents
[**Documentation**]
The docstring for `model_name` lists SPLADE and BM25 models as examples. Since this is a generic `FastembedSparseEmbeddingFunction`, it might be better to provide a more general description or point to the `fastembed` documentation for available sparse models to avoid confusion if other types of sparse models are supported in the future.
File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
Line: 35…m25 embedding function
4adf910 to
0f4a215
Compare
| if self.task == "document": | ||
| embeddings = model.embed( | ||
| list(input), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Potential runtime failure: model.embed() and model.query_embed() can raise exceptions if the model fails to load or process input, but these exceptions are not caught. If the fastembed model encounters an error (corrupted model files, insufficient memory, invalid input format), the function will crash with an unhandled exception.
Consider wrapping the embedding calls with proper error handling:
try:
if self.task == "document":
embeddings = model.embed(list(input))
elif self.task == "query":
embeddings = model.query_embed(list(input))
except Exception as e:
raise ValueError(f"Failed to generate embeddings: {e}")Context for Agents
[**BestPractice**]
Potential runtime failure: `model.embed()` and `model.query_embed()` can raise exceptions if the model fails to load or process input, but these exceptions are not caught. If the fastembed model encounters an error (corrupted model files, insufficient memory, invalid input format), the function will crash with an unhandled exception.
Consider wrapping the embedding calls with proper error handling:
```python
try:
if self.task == "document":
embeddings = model.embed(list(input))
elif self.task == "query":
embeddings = model.query_embed(list(input))
except Exception as e:
raise ValueError(f"Failed to generate embeddings: {e}")
```
File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 107| if task == "document": | ||
| embeddings = model.embed( | ||
| list(input), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Same embedding failure scenario as in __call__ method - model.embed() and model.query_embed() can raise unhandled exceptions. Apply the same error handling pattern here.
Context for Agents
[**BestPractice**]
Same embedding failure scenario as in `__call__` method - `model.embed()` and `model.query_embed()` can raise unhandled exceptions. Apply the same error handling pattern here.
File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 138| self._model = SparseTextEmbedding( | ||
| model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[BestPractice]
Model initialization failure: SparseTextEmbedding(model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs) can fail for various reasons (invalid parameters, missing model files, CUDA issues, etc.) but exceptions are not handled. This will cause the constructor to crash.
Add error handling:
try:
self._model = SparseTextEmbedding(
model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
)
except Exception as e:
raise ValueError(f"Failed to initialize SparseTextEmbedding model: {e}")Context for Agents
[**BestPractice**]
Model initialization failure: `SparseTextEmbedding(model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs)` can fail for various reasons (invalid parameters, missing model files, CUDA issues, etc.) but exceptions are not handled. This will cause the constructor to crash.
Add error handling:
```python
try:
self._model = SparseTextEmbedding(
model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
)
except Exception as e:
raise ValueError(f"Failed to initialize SparseTextEmbedding model: {e}")
```
File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
Line: 66
This PR updates the fastembed embedding function with more parameters to support, and adds bm25 embedding function , which is a thin wrapper around the fastembed bm25 class
Summarize the changes made by this PR.
Test plan
How are these changes tested?
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?