The txtai index format currently has a number of different components that support persistence, as follows:
| Component | Description |
|---|---|
| ANN | Approximate Nearest Neighbor indexes |
| Database | Content storage |
| Embeddings | Semantic search engine. Integrates the other components. Also stores configuration and index ids. |
| Graph | Graph networks |
| Scoring | Sparse/keyword indexes |
In most cases, an underlying library dictates the storage format. For example, Faiss has its own index format, as does SQLite.
There are cases in the current code base where Python-specific pickle serialization is being used to save content. While the pickle format is fine for local data, it's well documented that loading pickle data from untrusted sources is not recommended, since unpickling can execute arbitrary code.
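As a minimal sketch of the risk (the class below is purely hypothetical, not anything in txtai), any object's `__reduce__` hook can embed a callable in the pickled data that runs at load time:

```python
import os
import pickle

class Malicious:
    """Hypothetical payload: pickle records the callable returned by
    __reduce__, and that callable is invoked when the data is loaded."""

    def __reduce__(self):
        # Executed by pickle.loads(), not by pickle.dumps()
        return (os.system, ("echo arbitrary code executed at load time",))

payload = pickle.dumps(Malicious())

# Loading the payload runs the embedded command
pickle.loads(payload)
```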
The majority of txtai's use cases involve building local indexes, although indexes can also be synced to cloud storage (object storage, the Hugging Face Hub, etc.). Given that, it's best not to use pickle serialization except when working with local and/or temporary data.
The following issues will handle migrating Python-specific pickle serialization to other methods (a sketch of the replacement approach follows the list).
- Add serialization package for handling supported data serialization methods #770
- Add MessagePack serialization as a top level dependency #771
- Modify NumPy and Torch ANN components to use np.load/np.save #772
- Persist Embeddings index ids (only used when content storage is disabled) with MessagePack #773
- Persist Reducer component with skops library #774
- Persist NetworkX graph component with MessagePack #775
- Persist Scoring component metadata with MessagePack #776
- Modify vector transforms to load/save data using np.load/np.save #777
- Refactor embeddings configuration into separate component #778
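To make the general direction concrete, below is a minimal sketch (not txtai's actual API) of persisting data without pickle: MessagePack for plain Python structures and np.save/np.load for NumPy arrays. All function names, file paths and sample data here are illustrative.

```python
import msgpack
import numpy as np

def save_metadata(config, path):
    """Persist a plain dict with MessagePack instead of pickle."""
    with open(path, "wb") as f:
        f.write(msgpack.packb(config))

def load_metadata(path):
    """Load MessagePack data; no arbitrary code can run at load time."""
    with open(path, "rb") as f:
        return msgpack.unpackb(f.read())

def save_vectors(array, path):
    """Persist a plain numeric NumPy array; no pickle needed."""
    np.save(path, array)

def load_vectors(path):
    """Load arrays with allow_pickle=False to reject pickled objects."""
    return np.load(path, allow_pickle=False)

# Illustrative usage
save_metadata({"ids": [0, 1, 2], "dimensions": 384}, "config.msgpack")
print(load_metadata("config.msgpack"))

save_vectors(np.random.rand(3, 384).astype(np.float32), "vectors.npy")
print(load_vectors("vectors.npy").shape)
```

For the scikit-learn based Reducer component (#774), the skops library serves the same purpose, persisting fitted estimators without relying on pickle's load-time code execution.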