[DOC] Add docs for missing embedding functions in python and typescript #5864
@@ -0,0 +1,38 @@
---
id: amazon-bedrock
name: Amazon Bedrock
---

# Amazon Bedrock

Chroma provides a convenient wrapper around Amazon Bedrock's embedding API. This embedding function runs remotely on Amazon Bedrock's servers and requires AWS credentials configured via boto3.

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on the `boto3` Python package, which you can install with `pip install boto3`.

```python
import boto3
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

session = boto3.Session(profile_name="profile", region_name="us-east-1")
bedrock_ef = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name="amazon.titan-embed-text-v1"
)

texts = ["Hello, world!", "How are you?"]
embeddings = bedrock_ef(texts)
```

The `model_name` argument is optional and lets you choose which Amazon Bedrock embedding model to use. If you omit it, Chroma defaults to `amazon.titan-embed-text-v1`; the example above simply sets that default explicitly.
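For instance, a minimal sketch (reusing the `session` from the example above) that relies on the default model:

```python
# model_name omitted: falls back to the default amazon.titan-embed-text-v1
default_bedrock_ef = AmazonBedrockEmbeddingFunction(session=session)
default_embeddings = default_bedrock_ef(["Hello, world!"])
```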
Contributor: [Documentation] In several of the new documentation files, the example code explicitly sets a parameter to its default value, while the following text describes it as optional. This could be slightly confusing for users, as they might think the parameter is required. To improve clarity, consider rephrasing the explanation to acknowledge that the example shows the default being set explicitly, or remove the parameter from the example to demonstrate that it's optional.
This pattern also appears in:
{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
Visit Amazon Bedrock [documentation](https://docs.aws.amazon.com/bedrock/) for more information on available models and configuration.
{% /Banner %}
@@ -0,0 +1,69 @@
---
id: chroma-bm25
name: Chroma BM25
---

# Chroma BM25

Chroma provides a built-in BM25 sparse embedding function. BM25 (Best Matching 25) is a ranking function used to estimate the relevance of documents to a given search query. This embedding function runs locally and does not require any external API keys.

Sparse embeddings are useful for retrieval tasks where you want to match on specific keywords or terms, rather than semantic similarity.

{% Tabs %}

{% Tab label="python" %}

```python
from chromadb.utils.embedding_functions import ChromaBm25EmbeddingFunction

bm25_ef = ChromaBm25EmbeddingFunction(
    k=1.2,
    b=0.75,
    avg_doc_length=256.0,
    token_max_length=40
)

texts = ["Hello, world!", "How are you?"]
sparse_embeddings = bm25_ef(texts)
```

You can customize the BM25 parameters:
- `k`: Controls term frequency saturation (default: 1.2)
- `b`: Controls document length normalization (default: 0.75)
- `avg_doc_length`: Average document length in tokens (default: 256.0)
- `token_max_length`: Maximum token length (default: 40)
- `stopwords`: Optional list of stopwords to exclude
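For reference, these parameters correspond to the standard BM25 scoring function (shown here as background; the exact variant Chroma implements may differ in minor details):

$$
\text{score}(D, Q) = \sum_{q_i \in Q} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length in tokens, and $\text{avgdl}$ is the average document length (the `avg_doc_length` parameter). A larger `k` lets repeated terms keep contributing before saturating, while `b` controls how strongly long documents are penalized.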
{% /Tab %}

{% Tab label="typescript" %}

```typescript
// npm install @chroma-core/chroma-bm25

import { ChromaBm25EmbeddingFunction } from "@chroma-core/chroma-bm25";

const embedder = new ChromaBm25EmbeddingFunction({
  k: 1.2,
  b: 0.75,
  avgDocLength: 256.0,
  tokenMaxLength: 40,
});

// use directly
const sparseEmbeddings = await embedder.generate(["document1", "document2"]);

// or attach to a collection, so documents are embedded on .add and .query
// (assumes `client` is an existing ChromaClient instance)
const collection = await client.createCollection({
  name: "name",
  embeddingFunction: embedder,
});
```
Comment on lines +42 to +60
Contributor: [Documentation] The TypeScript code snippet uses a
{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
BM25 is a classic information retrieval algorithm that works well for keyword-based search. For semantic search, consider using dense embedding functions instead.
{% /Banner %}
@@ -0,0 +1,63 @@
---
id: chroma-cloud-qwen
name: Chroma Cloud Qwen
---

# Chroma Cloud Qwen

Chroma provides a convenient wrapper around Chroma Cloud's Qwen embedding API. This embedding function runs remotely on Chroma Cloud's servers and requires a Chroma API key. You can get an API key by signing up for an account at [Chroma Cloud](https://www.trychroma.com/).

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on the `httpx` Python package, which you can install with `pip install httpx`.
Contributor: [Documentation] Fix capitalization: 'python' should be 'Python'.
```python
from chromadb.utils.embedding_functions import ChromaCloudQwenEmbeddingFunction, ChromaCloudQwenEmbeddingModel
import os

os.environ["CHROMA_API_KEY"] = "YOUR_API_KEY"
qwen_ef = ChromaCloudQwenEmbeddingFunction(
    model=ChromaCloudQwenEmbeddingModel.QWEN3_EMBEDDING_0p6B,
    task="nl_to_code"
)

texts = ["Hello, world!", "How are you?"]
embeddings = qwen_ef(texts)
```

You must pass in a `model` argument and a `task` argument. The `task` parameter specifies the task for which embeddings are being generated. You can optionally provide custom `instructions` for both documents and queries.
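As a minimal sketch of typical usage (assuming a running Chroma client and a hypothetical collection name), you can attach the embedding function to a collection so it is applied automatically on `.add` and `.query`:

```python
import chromadb

client = chromadb.Client()  # assumes a local/in-memory client
collection = client.create_collection(
    name="code-snippets",  # hypothetical collection name
    embedding_function=qwen_ef,
)

collection.add(ids=["1", "2"], documents=texts)
results = collection.query(query_texts=["How do I say hello?"], n_results=1)
```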
{% /Tab %}

{% Tab label="typescript" %}

```typescript
// npm install @chroma-core/chroma-cloud-qwen

import { ChromaCloudQwenEmbeddingFunction, ChromaCloudQwenEmbeddingModel } from "@chroma-core/chroma-cloud-qwen";

const embedder = new ChromaCloudQwenEmbeddingFunction({
  apiKeyEnvVar: "CHROMA_API_KEY", // Or set CHROMA_API_KEY env var
  model: ChromaCloudQwenEmbeddingModel.QWEN3_EMBEDDING_0p6B,
  task: "nl_to_code",
});

// use directly
const embeddings = await embedder.generate(["document1", "document2"]);

// or attach to a collection, so documents are embedded on .add and .query
// (assumes `client` is an existing ChromaClient instance)
const collection = await client.createCollection({
  name: "name",
  embeddingFunction: embedder,
});
```

{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
Visit Chroma Cloud [documentation](https://docs.trychroma.com/) for more information on available models and configuration.
{% /Banner %}
@@ -0,0 +1,63 @@
---
id: chroma-cloud-splade
name: Chroma Cloud Splade
---

# Chroma Cloud Splade

Chroma provides a convenient wrapper around Chroma Cloud's Splade sparse embedding API. This embedding function runs remotely on Chroma Cloud's servers and requires a Chroma API key. You can get an API key by signing up for an account at [Chroma Cloud](https://www.trychroma.com/).

Sparse embeddings are useful for retrieval tasks where you want to match on specific keywords or terms, rather than semantic similarity.

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on the `httpx` Python package, which you can install with `pip install httpx`.
Contributor: [Documentation] Fix capitalization: 'python' should be 'Python'.
```python
from chromadb.utils.embedding_functions import ChromaCloudSpladeEmbeddingFunction, ChromaCloudSpladeEmbeddingModel
import os

os.environ["CHROMA_API_KEY"] = "YOUR_API_KEY"
splade_ef = ChromaCloudSpladeEmbeddingFunction(
    model=ChromaCloudSpladeEmbeddingModel.SPLADE_PP_EN_V1
)

texts = ["Hello, world!", "How are you?"]
sparse_embeddings = splade_ef(texts)
```

The `model` argument is optional. By default, Chroma uses `prithivida/Splade_PP_en_v1`; the example above selects that default explicitly.
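For instance, a minimal sketch that relies on the default model (assuming `CHROMA_API_KEY` is still set as above):

```python
# model omitted: falls back to the default prithivida/Splade_PP_en_v1
default_splade_ef = ChromaCloudSpladeEmbeddingFunction()
default_sparse_embeddings = default_splade_ef(["Hello, world!"])
```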
{% /Tab %}

{% Tab label="typescript" %}

```typescript
// npm install @chroma-core/chroma-cloud-splade

import { ChromaCloudSpladeEmbeddingFunction, ChromaCloudSpladeEmbeddingModel } from "@chroma-core/chroma-cloud-splade";

const embedder = new ChromaCloudSpladeEmbeddingFunction({
  apiKeyEnvVar: "CHROMA_API_KEY", // Or set CHROMA_API_KEY env var
  model: ChromaCloudSpladeEmbeddingModel.SPLADE_PP_EN_V1,
});

// use directly
const sparseEmbeddings = await embedder.generate(["document1", "document2"]);

// or attach to a collection, so documents are embedded on .add and .query
// (assumes `client` is an existing ChromaClient instance)
const collection = await client.createCollection({
  name: "name",
  embeddingFunction: embedder,
});
```

{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
Visit Chroma Cloud [documentation](https://docs.trychroma.com/) for more information on available models and configuration.
{% /Banner %}
@@ -0,0 +1,45 @@
---
id: nomic
name: Nomic
---

# Nomic

Chroma provides a convenient wrapper around Nomic's embedding API. This embedding function runs remotely on Nomic's servers and requires an API key. You can get an API key by signing up for an account at [Nomic](https://atlas.nomic.ai/).

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on the `nomic` Python package, which you can install with `pip install nomic`.
Contributor: [Documentation] Fix capitalization: 'python' should be 'Python'.
```python
from chromadb.utils.embedding_functions import NomicEmbeddingFunction
import os

os.environ["NOMIC_API_KEY"] = "YOUR_API_KEY"
nomic_ef = NomicEmbeddingFunction(
    model="nomic-embed-text-v1",
    task_type="search_document",
    query_config={"task_type": "search_query"}
)

texts = ["Hello, world!", "How are you?"]
embeddings = nomic_ef(texts)
```

You must pass in a `model` argument and a `task_type` argument. The `task_type` can be one of:
- `search_document`: Used to encode large documents in retrieval tasks at indexing time
- `search_query`: Used to encode user queries or questions in retrieval tasks
- `classification`: Used to encode text for text classification tasks
- `clustering`: Used for clustering or reranking tasks

The `query_config` parameter allows you to specify a different task type for queries, which is useful when you want to use `search_document` for documents and `search_query` for queries.
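In practice (a minimal sketch, assuming a running Chroma client and a hypothetical collection name), this means documents and queries are embedded with different task types once the function is attached to a collection:

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="articles",  # hypothetical collection name
    embedding_function=nomic_ef,
)

# Documents added here are embedded with task_type="search_document"
collection.add(ids=["1", "2"], documents=texts)

# Query text is embedded with task_type="search_query", per query_config
results = collection.query(query_texts=["greeting"], n_results=1)
```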
{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
Visit Nomic [documentation](https://docs.nomic.ai/platform/embeddings-and-retrieval/text-embedding) for more information on available models and task types.
{% /Banner %}
@@ -0,0 +1,54 @@
---
id: open-clip
name: OpenCLIP
---

# OpenCLIP

Chroma provides a convenient wrapper around the OpenCLIP library. This embedding function runs locally and supports both text and image embeddings, making it useful for multimodal applications.

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on several Python packages:
Contributor: [Documentation] Fix capitalization: 'python' should be 'Python'.
- `open-clip-torch`: Install with `pip install open-clip-torch`
- `torch`: Install with `pip install torch`
- `pillow`: Install with `pip install pillow`

```python
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
import numpy as np
from PIL import Image

open_clip_ef = OpenCLIPEmbeddingFunction(
    model_name="ViT-B-32",
    checkpoint="laion2b_s34b_b79k",
    device="cpu"
)

# For text embeddings
texts = ["Hello, world!", "How are you?"]
text_embeddings = open_clip_ef(texts)

# For image embeddings
images = [np.array(Image.open("image1.jpg")), np.array(Image.open("image2.jpg"))]
image_embeddings = open_clip_ef(images)

# Mixed embeddings
mixed = ["Hello, world!", np.array(Image.open("image1.jpg"))]
mixed_embeddings = open_clip_ef(mixed)
```

You can pass in optional arguments:
- `model_name`: The name of the OpenCLIP model to use (default: "ViT-B-32")
- `checkpoint`: The checkpoint to use for the model (default: "laion2b_s34b_b79k")
- `device`: Device used for computation, "cpu" or "cuda" (default: "cpu")
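Because text and image embeddings share the same space, a common pattern is cross-modal search over a collection. Here is a minimal sketch (assuming a running Chroma client, a hypothetical collection name, and local image files):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="multimodal",  # hypothetical collection name
    embedding_function=open_clip_ef,
)

# Index images, then retrieve them with a text query
collection.add(
    ids=["img1", "img2"],
    images=[np.array(Image.open("image1.jpg")), np.array(Image.open("image2.jpg"))],
)
results = collection.query(query_texts=["a photo of a dog"], n_results=1)
```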
{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
OpenCLIP is great for multimodal applications where you need to embed both text and images in the same embedding space. Visit [OpenCLIP documentation](https://github.com/mlfoundations/open_clip) for more information on available models and checkpoints.
{% /Banner %}
@@ -0,0 +1,42 @@
---
id: sentence-transformer
name: Sentence Transformer
---

# Sentence Transformer

Chroma provides a convenient wrapper around the Sentence Transformers library. This embedding function runs locally and uses pre-trained models from Hugging Face.

{% Tabs %}

{% Tab label="python" %}

This embedding function relies on the `sentence_transformers` Python package, which you can install with `pip install sentence_transformers`.
Contributor: [Documentation] Fix capitalization: 'python' should be 'Python'.
```python
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

sentence_transformer_ef = SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2",
    device="cpu",
    normalize_embeddings=False
)

texts = ["Hello, world!", "How are you?"]
embeddings = sentence_transformer_ef(texts)
```

You can pass in optional arguments:
- `model_name`: The name of the Sentence Transformer model to use (default: "all-MiniLM-L6-v2")
- `device`: Device used for computation, "cpu" or "cuda" (default: "cpu")
- `normalize_embeddings`: Whether to normalize returned vectors (default: False)
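For example, a sketch that swaps in the higher-quality `all-mpnet-base-v2` model and normalizes the returned vectors (useful when comparing embeddings with cosine similarity):

```python
# Heavier but higher-quality model; vectors are normalized on return
mpnet_ef = SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2",
    device="cpu",
    normalize_embeddings=True,
)
mpnet_embeddings = mpnet_ef(["Hello, world!"])
```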
For a full list of available models, visit [Hugging Face Sentence Transformers](https://huggingface.co/sentence-transformers) or [SBERT documentation](https://www.sbert.net/docs/pretrained_models.html).

{% /Tab %}

{% /Tabs %}

{% Banner type="tip" %}
Sentence Transformers are great for semantic search tasks. Popular models include `all-MiniLM-L6-v2` (fast and efficient) and `all-mpnet-base-v2` (higher quality). Visit [SBERT documentation](https://www.sbert.net/docs/pretrained_models.html) for more model recommendations.
{% /Banner %}