[ENH] Update Fastembed embedding function with more parameters, add bm25 embedding function #5489

jairad26 · 2025-09-16T23:41:05Z

This PR updates the fastembed embedding function to support more parameters and adds a bm25 embedding function, which is a thin wrapper around the fastembed bm25 class.
Summarize the changes made by this PR.

Improvements & Bug fixes
- ...
New functionality
- ...

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

github-actions · 2025-09-16T23:41:12Z

Reviewer Checklist

- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality

jairad26 · 2025-09-16T23:41:21Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

propel-code-bot · 2025-09-16T23:42:45Z

Add BM25 Embedding Function and Enhance Fastembed Support

This pull request introduces a new embedding function, Bm25EmbeddingFunction, as a wrapper for the BM25 model provided by the fastembed library. In addition, the existing FastembedSparseEmbeddingFunction is enhanced to support a wider range of parameters, allowing for greater customization of the underlying fastembed model. The changes update code, configuration flows, documentation strings, tests, and exports to accommodate the new functionality.

Key Changes

• Added new file chromadb/utils/embedding_functions/bm25_embedding_function.py implementing Bm25EmbeddingFunction, a thin wrapper around fastembed.sparse.bm25.Bm25.
• Extended FastembedSparseEmbeddingFunction (chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py) to accept additional parameters such as cache_dir, threads, cuda, device_ids, lazy_load, and generic keyword arguments.
• Updated error messages across embedding functions for correctness and clarity regarding missing dependencies.
• Modified chromadb/utils/embedding_functions/__init__.py to import and export Bm25EmbeddingFunction, and to include it in built-in and export sets.
• Updated tests (chromadb/test/ef/test_ef.py) to validate the presence of Bm25EmbeddingFunction in the set of built-in embedding functions.
• Expanded docstrings and parameter documentation for clarity in all relevant embedding function files.
• Minor improvements to docstrings in HuggingFaceSparseEmbeddingFunction.

Affected Areas

• chromadb/utils/embedding_functions/bm25_embedding_function.py
• chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py
• chromadb/utils/embedding_functions/__init__.py
• chromadb/test/ef/test_ef.py
• chromadb/utils/embedding_functions/huggingface_sparse_embedding_function.py

This summary was automatically generated by @propel-code-bot
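
For readers unfamiliar with the `k`, `b`, and `avg_len` parameters that appear in the `Bm25(...)` constructor call quoted later in this thread, the following is a minimal sketch of the standard BM25 term-frequency weight. It is not taken from fastembed: the function name `bm25_term_weight` and the default values are assumptions chosen for illustration, the IDF factor is omitted, and the thread does not show how the library applies the formula internally.

```python
def bm25_term_weight(
    tf: int,
    doc_len: int,
    k: float = 1.2,          # assumed illustrative default: saturation strength
    b: float = 0.75,         # assumed illustrative default: length normalization
    avg_len: float = 256.0,  # assumed illustrative default: average document length
) -> float:
    """Standard BM25 term-frequency weight (IDF factor omitted).

    The weight grows with tf but saturates below k + 1, and is reduced
    for documents longer than avg_len when b > 0.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_len)
    return tf * (k + 1.0) / (tf + k * length_norm)


# A term that occurs once in an average-length document gets weight 1.0.
print(bm25_term_weight(tf=1, doc_len=256))
```

Setting `b=0` disables length normalization entirely, and a larger `k` lets repeated occurrences keep adding weight for longer before the curve flattens.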

propel-code-bot · 2025-09-17T00:02:42Z

chromadb/utils/embedding_functions/bm25_embedding_function.py

+        self._model = Bm25(
+            model_name,
+            cache_dir,
+            k,
+            b,
+            avg_len,
+            language,
+            token_max_length,
+            disable_stemmer,
+            specific_model_path,
+            **kwargs,
+        )


[BestPractice]

Potential runtime error: The Bm25 constructor is called with positional arguments, but if any of the optional parameters (cache_dir, k, b, etc.) are None, this could cause issues depending on how the Bm25 class handles None values. Consider using keyword arguments or filtering out None values:

Suggested change

-        self._model = Bm25(
-            model_name,
-            cache_dir,
-            k,
-            b,
-            avg_len,
-            language,
-            token_max_length,
-            disable_stemmer,
-            specific_model_path,
-            **kwargs,
-        )
+        # Filter out None values for cleaner initialization.
+        # Loop variables are named name/value so they do not shadow the BM25 parameter k.
+        kwargs_filtered = {name: value for name, value in {
+            'cache_dir': cache_dir,
+            'k': k, 'b': b, 'avg_len': avg_len,
+            'language': language,
+            'token_max_length': token_max_length,
+            'disable_stemmer': disable_stemmer,
+            'specific_model_path': specific_model_path,
+            **kwargs
+        }.items() if value is not None}
+        self._model = Bm25(model_name, **kwargs_filtered)

⚡ Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
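
To make the review point concrete, here is a small self-contained sketch of the "pass by keyword and drop unset options" pattern. `FakeBm25`, `drop_none`, and `build_model` are hypothetical names invented for this example; `FakeBm25` is a stand-in and does not reproduce the real fastembed signature.

```python
from typing import Any, Dict, Optional


class FakeBm25:
    """Hypothetical stand-in for the wrapped model; not the fastembed class."""

    def __init__(
        self,
        model_name: str,
        k: float = 1.2,
        b: float = 0.75,
        language: str = "english",
    ) -> None:
        self.model_name = model_name
        self.k = k
        self.b = b
        self.language = language


def drop_none(**options: Any) -> Dict[str, Any]:
    """Return only the options whose value is not None."""
    return {name: value for name, value in options.items() if value is not None}


def build_model(
    model_name: str,
    k: Optional[float] = None,
    b: Optional[float] = None,
    language: Optional[str] = None,
) -> FakeBm25:
    # Keyword passing ties each value to the parameter of the same name,
    # and omitting unset options lets the callee's own defaults apply.
    return FakeBm25(model_name, **drop_none(k=k, b=b, language=language))


model = build_model("Qdrant/bm25", k=1.5)
print(model.k, model.b, model.language)
```

Because the filter tests `is not None` rather than truthiness, explicit values such as `0` or `False` are still forwarded.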

File: chromadb/utils/embedding_functions/bm25_embedding_function.py, line 85

propel-code-bot · 2025-09-17T00:02:43Z

chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py

            model_name (str, optional): Identifier of the Splade model
            List of commonly used models: Qdrant/bm25, prithivida/Splade_PP_en_v1, Qdrant/minicoil-v1


[Documentation]

The docstring for model_name lists SPLADE and BM25 models as examples. Since this is a generic FastembedSparseEmbeddingFunction, it might be better to provide a more general description or point to the fastembed documentation for available sparse models to avoid confusion if other types of sparse models are supported in the future.

File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py, line 35

propel-code-bot · 2025-09-17T00:34:23Z

chromadb/utils/embedding_functions/bm25_embedding_function.py

+        if self.task == "document":
+            embeddings = model.embed(
+                list(input),
+            )


[BestPractice]

Potential runtime failure: model.embed() and model.query_embed() can raise exceptions if the model fails to load or process input, but these exceptions are not caught. If the fastembed model encounters an error (corrupted model files, insufficient memory, invalid input format), the function will crash with an unhandled exception.

Consider wrapping the embedding calls with proper error handling:

    try:
        if self.task == "document":
            embeddings = model.embed(list(input))
        elif self.task == "query":
            embeddings = model.query_embed(list(input))
    except Exception as e:
        raise ValueError(f"Failed to generate embeddings: {e}")
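
One refinement worth considering when wrapping exceptions this way is explicit chaining with `raise ... from e`, which records the original error as `__cause__` so the underlying failure stays visible in the traceback. The sketch below is illustrative only: `FailingModel` and `embed_documents` are hypothetical names, and the stand-in model always fails so the behavior can be observed.

```python
from typing import List


class FailingModel:
    """Hypothetical stand-in whose embed() always fails."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        raise RuntimeError("model files are corrupted")


def embed_documents(model: FailingModel, texts: List[str]) -> List[List[float]]:
    try:
        return model.embed(list(texts))
    except Exception as e:
        # "from e" stores the original exception on the new one as __cause__.
        raise ValueError(f"Failed to generate embeddings: {e}") from e


try:
    embed_documents(FailingModel(), ["hello"])
except ValueError as err:
    caught = err

print(type(caught.__cause__).__name__)
```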

File: chromadb/utils/embedding_functions/bm25_embedding_function.py, line 107

propel-code-bot · 2025-09-17T00:34:24Z

chromadb/utils/embedding_functions/bm25_embedding_function.py

+            if task == "document":
+                embeddings = model.embed(
+                    list(input),
+                )


[BestPractice]

Same embedding failure scenario as in __call__ method - model.embed() and model.query_embed() can raise unhandled exceptions. Apply the same error handling pattern here.
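
Since the same handling is requested at two call sites, one option is to route both through a single helper. This is a sketch under assumptions, not the code merged in this PR: `EchoModel` and `run_embedding` are hypothetical names, and the stand-in only mimics the `embed` and `query_embed` method names mentioned in the review.

```python
from typing import List, Sequence


class EchoModel:
    """Hypothetical stand-in exposing the two methods named in the review."""

    def embed(self, texts: List[str]) -> List[str]:
        return [f"doc:{t}" for t in texts]

    def query_embed(self, texts: List[str]) -> List[str]:
        return [f"query:{t}" for t in texts]


def run_embedding(model: EchoModel, task: str, texts: Sequence[str]) -> List[str]:
    """Dispatch on the task and wrap model failures in one place."""
    if task == "document":
        method = model.embed
    elif task == "query":
        method = model.query_embed
    else:
        raise ValueError(f"Unsupported task: {task!r}")
    try:
        # list() also consumes the result if the model returns a generator.
        return list(method(list(texts)))
    except Exception as e:
        raise ValueError(f"Failed to generate embeddings: {e}") from e


print(run_embedding(EchoModel(), "query", ["bm25"]))
```

Both `__call__` and the query path could then delegate to the helper, so the error message and the task validation are defined once.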

File: chromadb/utils/embedding_functions/bm25_embedding_function.py, line 138

propel-code-bot · 2025-09-17T00:34:25Z

chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py

+        self._model = SparseTextEmbedding(
+            model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
+        )


[BestPractice]

Model initialization failure: SparseTextEmbedding(model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs) can fail for various reasons (invalid parameters, missing model files, CUDA issues, etc.) but exceptions are not handled. This will cause the constructor to crash.

Add error handling:

    try:
        self._model = SparseTextEmbedding(
            model_name, cache_dir, threads, cuda, device_ids, lazy_load, **kwargs
        )
    except Exception as e:
        raise ValueError(f"Failed to initialize SparseTextEmbedding model: {e}")

File: chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py, line 66

jairad26 marked this pull request as ready for review September 16, 2025 23:42

propel-code-bot bot reviewed Sep 16, 2025

chromadb/utils/embedding_functions/bm25_embedding_function.py

propel-code-bot bot reviewed Sep 16, 2025

chromadb/utils/embedding_functions/bm25_embedding_function.py

propel-code-bot bot reviewed Sep 16, 2025

chromadb/utils/embedding_functions/fastembed_sparse_embedding_function.py

propel-code-bot bot reviewed Sep 16, 2025

chromadb/utils/embedding_functions/bm25_embedding_function.py

jairad26 force-pushed the jai/add-bm25-ef branch from 25a830c to 4adf910 Compare September 16, 2025 23:53

propel-code-bot bot reviewed Sep 17, 2025

chromadb/utils/embedding_functions/bm25_embedding_function.py

blacksmith-sh bot deleted a comment from jairad26 Sep 17, 2025

[ENH] Update Fastembed embedding function with more parameters, add b…

0f4a215

…m25 embedding function

jairad26 force-pushed the jai/add-bm25-ef branch from 4adf910 to 0f4a215 Compare September 17, 2025 00:25

propel-code-bot bot reviewed Sep 17, 2025

Sicheng-Pan approved these changes Sep 17, 2025

jairad26 enabled auto-merge (squash) September 17, 2025 00:35

jairad26 merged commit 79e525e into main Sep 17, 2025
59 checks passed

3 participants
