feat: Created DocEmbedder class#5973
@ntkathole @jyejare Could you please review this PR and let me know if any changes are needed?
jyejare left a comment:
Great addition, @patelchaitany; this is a milestone for Feast in RAG. Glad to see the Embedder supports multiple types of data.
A few comments and we should be good to go.
chunker = TextChunker()
text = " ".join([f"word{i}" for i in range(200)])

chunks = chunker.load_parse_and_chunk(source=text, source_id="doc1")
I think the chunker should determine which text in the source to chunk; we should not need to feed it in manually.

Yes, the chunker can decide which field in the DataFrame is the text. We just need to pass the column name to the chunk_dataframe function of the Chunker class.
This test only checks that load_parse_and_chunk returns the required type.
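To make the shape of the result concrete, here is a minimal sketch of word-window chunking over the same 200-word input. The function `chunk_words` and its `chunk_size` and `overlap` parameters are hypothetical and are not the `TextChunker` API from this PR; only the output keys `chunk_id`, `original_id`, and `chunk_index` are borrowed from the column names that appear later in this review.

```python
from typing import Dict, List


def chunk_words(
    text: str, source_id: str, chunk_size: int = 50, overlap: int = 10
) -> List[Dict[str, object]]:
    """Split text into overlapping word windows (hypothetical helper, not Feast API)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[Dict[str, object]] = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(
            {
                "chunk_id": f"{source_id}_{index}",
                "original_id": source_id,
                "text": " ".join(window),
                "chunk_index": index,
            }
        )
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


text = " ".join(f"word{i}" for i in range(200))
chunks = chunk_words(text, source_id="doc1")
```

A test in the spirit of the one above would then assert only on the structure of `chunks` (a list of dicts that all carry the source id), not on the chunking strategy itself.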
def test_supported_modalities(self):
    """After init, supported_modalities returns text and image."""
    embedder = MultiModalEmbedder()
    modalities = embedder.supported_modalities()
The supported modalities could be exposed as a property.
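A minimal sketch of what the property form could look like. `ModalityRegistry` and `register_modality` are hypothetical stand-ins used for illustration, not the `MultiModalEmbedder` implementation from this PR.

```python
from typing import Callable, Dict, List

# A handler turns a batch of inputs into one vector per input.
Handler = Callable[[List[str]], List[List[float]]]


class ModalityRegistry:
    """Hypothetical stand-in for an embedder that tracks per-modality handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register_modality(self, name: str, handler: Handler) -> None:
        """Associate a modality name with the callable that embeds it."""
        self._handlers[name] = handler

    @property
    def supported_modalities(self) -> List[str]:
        """Read-only view of the registered modality names, in registration order."""
        return list(self._handlers)


registry = ModalityRegistry()
registry.register_modality("text", lambda items: [[float(len(item))] for item in items])
registry.register_modality("image", lambda items: [[0.0] for _ in items])
```

Callers would then read `registry.supported_modalities` without parentheses, which signals that the lookup is cheap and has no side effects.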
    assert embedder._image_model is None
    assert embedder._image_processor is None

def test_custom_modality_registration(self):
This test checks that when a new modality is registered, inputs are routed to it correctly.
But I agree we can remove this test.
1ef9d8e to 083eadb
083eadb to bb74079
|
@patelchaitany filename typo - |
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

import numpy as np
Consider lazy-loading numpy too.

We cannot lazy-load numpy because it is needed for type checking.

Gate this behind a type check:
from typing import TYPE_CHECKING
if TYPE_CHECKING:
    import numpy as np
As written, importing feast will fail if numpy is not installed.
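For reference, a minimal sketch of how the gate combines with a deferred runtime import. The helper `to_array` is hypothetical and only illustrates the pattern; the return annotation is quoted so that it is never evaluated when the module is imported.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:
    # Seen only by static type checkers; this branch never runs at import time.
    import numpy as np


def to_array(vectors: List[List[float]]) -> "np.ndarray":
    """Hypothetical helper: numpy is imported only when the function is called."""
    import numpy as np  # deferred runtime import

    return np.asarray(vectors, dtype="float32")
```

With this layout the module imports cleanly without numpy installed, and an ImportError surfaces only if `to_array` is actually called.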
daf292f to fcc85cd
e00ee22 to ee663f5
…ng them into the FeatureView schema.
- Added BaseChunker and TextChunker classes for document chunking.
- Updated pyproject.toml to include sentence-transformers dependency.
- Created a new Jupyter notebook example for using the RAG retriever with document embedding.
Signed-off-by: Chaitany patel <patelchaitany93@gmail.com>
ee663f5 to 282243b
id_column: str,
source_column: str,
type_column: Optional[str] = None,
column_mapping: Optional[dict[str, tuple[str, str]]] = None,
I think column_mapping can simply be a tuple, since only column_mapping[source_column] is used.
{feature_view_name.replace(" ", "_").replace("-", "_")}_source = FileSource(
    name="{feature_view_name}_source",
    path="data/{feature_view_name}.parquet",
    timestamp_field="event_timestamp",
)

# FeatureView
{feature_view_name.replace(" ", "_").replace("-", "_")} = FeatureView(
    name="{feature_view_name}",
    feature_view_name=feature_view_name,
    vector_length=resolved_vector_length,
)
self.apply_repo()
Consider making this explicit, perhaps through a separate setup() method or an auto_apply=True parameter. The constructor always calls apply_repo(), which invokes apply_total, a heavy operation that touches the registry.
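One way to read this suggestion, as a minimal sketch: `RepoBootstrapper` is a hypothetical stand-in rather than the `DocEmbedder` class from this PR, and the body of `setup()` is a placeholder for the real registry work.

```python
class RepoBootstrapper:
    """Hypothetical stand-in for a class whose constructor applies the repo."""

    def __init__(self, repo_path: str, auto_apply: bool = True) -> None:
        self.repo_path = repo_path
        self.applied = False
        if auto_apply:
            # Default keeps the one-step behaviour; callers can opt out.
            self.setup()

    def setup(self) -> None:
        """Run the heavy, registry-touching step explicitly (placeholder body)."""
        self.applied = True


eager = RepoBootstrapper("feature_repo")
deferred = RepoBootstrapper("feature_repo", auto_apply=False)
```

Constructing with `auto_apply=False` leaves the registry untouched until `deferred.setup()` is called, which keeps object construction cheap in tests and scripts that only need the embedder.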
self.embedder = embedder or MultiModalEmbedder()
self.store: Optional[FeatureStore] = None

if isinstance(logical_layer_fn, LogicalLayerFn):
What is the isinstance check for? It is not validating anything.
rag = [
    "transformers>=4.36.0",
    "datasets>=3.6.0",
    "sentence-transformers>=3.0.0"
Suggested change (add the trailing comma):
- "sentence-transformers>=3.0.0"
+ "sentence-transformers>=3.0.0",
        source_column,
        row[type_column] if type_column else None,
    ),
    axis=1,
df.apply with axis=1 is going to be slow for large DataFrames; instead, use a simple loop over the zipped columns:
all_chunks = []
type_values = df[type_column] if type_column else [None] * len(df)
for src, doc_id, doc_type in zip(df[source_column], df[id_column], type_values):
    chunks = self.load_parse_and_chunk(src, str(doc_id), source_column, doc_type)
    all_chunks.extend(chunks)
if not all_chunks:
    return pd.DataFrame(
        columns=["chunk_id", "original_id", source_column, "chunk_index"]
    )
return pd.DataFrame(all_chunks)
What this PR does / why we need it:
This PR adds a Document Embedder capability to Feast, allowing users to go from raw documents to embeddings stored in the online vector store in a single step. It handles chunking, embedding generation, and writing the results to the online store — providing an end-to-end ingestion pipeline for RAG workflows within Feast.
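The flow described above can be pictured with a toy sketch. Everything here is hypothetical: `ingest`, the sentence-splitting `chunk` callable, the length-based `embed` callable, and the plain dict standing in for the online store are illustrations only and do not use the Feast API.

```python
from typing import Callable, Dict, List, Sequence


def ingest(
    documents: Dict[str, str],
    chunk: Callable[[str], List[str]],
    embed: Callable[[Sequence[str]], List[List[float]]],
    store: Dict[str, dict],
) -> int:
    """Chunk each document, embed the chunks, and write one row per chunk."""
    written = 0
    for doc_id, text in documents.items():
        pieces = chunk(text)
        vectors = embed(pieces)
        for index, (piece, vector) in enumerate(zip(pieces, vectors)):
            store[f"{doc_id}_{index}"] = {
                "original_id": doc_id,
                "text": piece,
                "embedding": vector,
            }
            written += 1
    return written


# Toy components: split on full stops, "embed" as (characters, words).
split_sentences = lambda text: [s.strip() for s in text.split(".") if s.strip()]
toy_embed = lambda pieces: [[float(len(p)), float(len(p.split()))] for p in pieces]

online_store: Dict[str, dict] = {}
count = ingest(
    {"doc1": "Feast stores features. It also serves vectors."},
    split_sentences,
    toy_embed,
    online_store,
)
```

In the PR itself these three roles are filled by the chunker, the embedder, and the online store; the sketch only shows how the stages hand data to one another.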
What changed:
sdk/python/feast/chunker.py
Defines the document chunking layer. Provides:
Currently only basic text chunking is implemented. There is room for improvement — future iterations can support more advanced strategies like semantic chunking, sentence-aware splitting, or format-specific chunkers (PDF, HTML, etc.).
sdk/python/feast/embedder.py
Defines the embedding generation layer. Provides:
sdk/python/feast/doc_embedder.py
The high-level orchestrator that coordinates chunking, embedding, and storage. Provides:
sdk/python/feast/__init__.py
Updated to export DocEmbedder, LogicalLayerFn, BaseChunker, TextChunker, ChunkingConfig, BaseEmbedder, MultiModalEmbedder, and EmbeddingConfig as part of Feast's public API.
Which issue(s) this PR fixes:
Create DocEmbedder class along with RAGRetriever #5426
Misc