Skip to content

Conversation

@jairad26
Copy link
Contributor

@jairad26 jairad26 commented Sep 16, 2025

This PR updates the fastembed embedding function with more parameters to support, and adds bm25 embedding function , which is a thin wrapper around the fastembed bm25 class
Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@jairad26 jairad26 marked this pull request as ready for review September 16, 2025 23:42
@propel-code-bot
Copy link
Contributor

propel-code-bot bot commented Sep 16, 2025

Add BM25 Embedding Function and Enhance Fastembed Support

This pull request introduces a new embedding function, Bm25EmbeddingFunction, as a wrapper for the BM25 model provided by the fastembed library. In addition, the existing FastembedSparseEmbeddingFunction is enhanced to support a wider range of parameters, allowing for greater customization of the underlying fastembed model. The changes update code, configuration flows, documentation strings, tests, and exports to accommodate the new functionality.

Key Changes

• Added new file chromadb/utils/embedding_functions/bm25_embedding_function.py implementing Bm25EmbeddingFunction, a thin wrapper around fastembed.sparse.bm25.Bm25.
• Extended FastembedSparseEmbeddingFunction (chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py) to accept additional parameters such as cache_dir, threads, cuda, device_ids, lazy_load, and generic keyword arguments.
• Updated error messages across embedding functions for correctness and clarity regarding missing dependencies.
• Modified chromadb/utils/embedding_functions/__init__.py to import and export Bm25EmbeddingFunction, and to include it in built-in and export sets.
• Updated tests (chromadb/test/ef/test_ef.py) to validate the presence of Bm25EmbeddingFunction in the set of built-in embedding functions.
• Expanded docstrings and parameter documentation for clarity in all relevant embedding function files.
• Minor improvements to docstrings in HuggingFaceSparseEmbeddingFunction.

Affected Areas

chromadb/utils/embedding_functions/bm25_embedding_function.py
chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
chromadb/utils/embedding_functions/__init__.py
chromadb/test/ef/test_ef.py
chromadb/utils/embedding_functions/huggingface_sparse_embedding_function.py

This summary was automatically generated by @propel-code-bot

Comment on lines 74 to 85
self._model = Bm25(
model_name,
cache_dir,
k,
b,
avg_len,
language,
token_max_length,
disable_stemmer,
specific_model_path,
**kwargs,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Potential runtime error: The Bm25 constructor is called with positional arguments, but if any of the optional parameters (cache_dir, k, b, etc.) are None, this could cause issues depending on how the Bm25 class handles None values. Consider using keyword arguments or filtering out None values:

Suggested change
self._model = Bm25(
model_name,
cache_dir,
k,
b,
avg_len,
language,
token_max_length,
disable_stemmer,
specific_model_path,
**kwargs,
)
# Filter out None values for cleaner initialization
kwargs_filtered = {k: v for k, v in {
'cache_dir': cache_dir,
'k': k, 'b': b, 'avg_len': avg_len,
'language': language,
'token_max_length': token_max_length,
'disable_stemmer': disable_stemmer,
'specific_model_path': specific_model_path,
**kwargs
}.items() if v is not None}
self._model = Bm25(model_name, **kwargs_filtered)

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Context for Agents
[**BestPractice**]

Potential runtime error: The `Bm25` constructor is called with positional arguments, but if any of the optional parameters (`cache_dir`, `k`, `b`, etc.) are `None`, this could cause issues depending on how the `Bm25` class handles `None` values. Consider using keyword arguments or filtering out `None` values:

```suggestion
# Filter out None values for cleaner initialization
kwargs_filtered = {k: v for k, v in {
    'cache_dir': cache_dir,
    'k': k, 'b': b, 'avg_len': avg_len,
    'language': language,
    'token_max_length': token_max_length,
    'disable_stemmer': disable_stemmer,
    'specific_model_path': specific_model_path,
    **kwargs
}.items() if v is not None}

self._model = Bm25(model_name, **kwargs_filtered)
```

⚡ **Committable suggestion**

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 85

Comment on lines 34 to 35
model_name (str, optional): Identifier of the Splade model
List of commonly used models: Qdrant/bm25, prithivida/Splade_PP_en_v1, Qdrant/minicoil-v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Documentation]

The docstring for model_name lists SPLADE and BM25 models as examples. Since this is a generic FastembedSparseEmbeddingFunction, it might be better to provide a more general description or point to the fastembed documentation for available sparse models to avoid confusion if other types of sparse models are supported in the future.

Context for Agents
[**Documentation**]

The docstring for `model_name` lists SPLADE and BM25 models as examples. Since this is a generic `FastembedSparseEmbeddingFunction`, it might be better to provide a more general description or point to the `fastembed` documentation for available sparse models to avoid confusion if other types of sparse models are supported in the future.

File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
Line: 35

@blacksmith-sh blacksmith-sh bot deleted a comment from jairad26 Sep 17, 2025
Comment on lines +104 to +107
if self.task == "document":
embeddings = model.embed(
list(input),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Potential runtime failure: model.embed() and model.query_embed() can raise exceptions if the model fails to load or process input, but these exceptions are not caught. If the fastembed model encounters an error (corrupted model files, insufficient memory, invalid input format), the function will crash with an unhandled exception.

Consider wrapping the embedding calls with proper error handling:

try:
    if self.task == "document":
        embeddings = model.embed(list(input))
    elif self.task == "query":
        embeddings = model.query_embed(list(input))
except Exception as e:
    raise ValueError(f"Failed to generate embeddings: {e}")
Context for Agents
[**BestPractice**]

Potential runtime failure: `model.embed()` and `model.query_embed()` can raise exceptions if the model fails to load or process input, but these exceptions are not caught. If the fastembed model encounters an error (corrupted model files, insufficient memory, invalid input format), the function will crash with an unhandled exception.

Consider wrapping the embedding calls with proper error handling:

```python
try:
    if self.task == "document":
        embeddings = model.embed(list(input))
    elif self.task == "query":
        embeddings = model.query_embed(list(input))
except Exception as e:
    raise ValueError(f"Failed to generate embeddings: {e}")
```

File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 107

Comment on lines +135 to +138
if task == "document":
embeddings = model.embed(
list(input),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Same embedding failure scenario as in __call__ method - model.embed() and model.query_embed() can raise unhandled exceptions. Apply the same error handling pattern here.

Context for Agents
[**BestPractice**]

Same embedding failure scenario as in `__call__` method - `model.embed()` and `model.query_embed()` can raise unhandled exceptions. Apply the same error handling pattern here.

File: chromadb/utils/embedding_functions/bm25_embedding_function.py
Line: 138

Comment on lines +64 to +66
self._model = SparseTextEmbedding(
model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Model initialization failure: SparseTextEmbedding(model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs) can fail for various reasons (invalid parameters, missing model files, CUDA issues, etc.) but exceptions are not handled. This will cause the constructor to crash.

Add error handling:

try:
    self._model = SparseTextEmbedding(
        model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
    )
except Exception as e:
    raise ValueError(f"Failed to initialize SparseTextEmbedding model: {e}")
Context for Agents
[**BestPractice**]

Model initialization failure: `SparseTextEmbedding(model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs)` can fail for various reasons (invalid parameters, missing model files, CUDA issues, etc.) but exceptions are not handled. This will cause the constructor to crash.

Add error handling:

```python
try:
    self._model = SparseTextEmbedding(
        model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
    )
except Exception as e:
    raise ValueError(f"Failed to initialize SparseTextEmbedding model: {e}")
```

File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
Line: 66

@jairad26 jairad26 enabled auto-merge (squash) September 17, 2025 00:35
@jairad26 jairad26 merged commit 79e525e into main Sep 17, 2025
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants