Update txtai index format to remove Python-specific serialization #769

@davidmezzetti

The txtai index format currently has a number of different components that support persistence as follows.

| Component | Description |
| --- | --- |
| ANN | Approximate Nearest Neighbor indexes |
| Database | Content storage |
| Embeddings | Semantic search engine. Integrates other components. Has other storage for configuration and index ids. |
| Graph | Graph networks |
| Scoring | Sparse/keyword indexes |

In most cases, an underlying library dictates the storage format. For example, Faiss has its own index format, as does SQLite.

There are cases in the current code base where Python-specific pickle serialization is used to save content. While the pickle format is fine for local data, it's well documented that sharing data in pickle format is not recommended, since loading a pickle file can execute arbitrary code.
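A minimal sketch of the concern, using only the Python standard library. The `Payload` class and the `config` dictionary are hypothetical and are not part of txtai. JSON appears here only as one example of a language-neutral format; this issue does not state which formats will replace pickle.

```python
import json
import pickle


class Payload:
    """Hypothetical stand-in for untrusted content in a shared pickle file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it on load.
        # eval on a harmless expression stands in for something malicious.
        return (eval, ("6 * 7",))


# Loading the bytes calls eval("6 * 7") instead of rebuilding a Payload
result = pickle.loads(pickle.dumps(Payload()))
print(result)

# Hypothetical configuration, not txtai's actual schema
config = {"path": "example-model", "dimensions": 384, "ids": ["a", "b"]}

# JSON round-trips plain data only; parsing never calls stored code
restored = json.loads(json.dumps(config))
print(restored == config)
```

The unpickled value is whatever the stored callable returns, which is why a pickle file from an untrusted source should not be loaded. A data-only format avoids this because parsing it only produces basic types.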

The majority of txtai's use cases involve building local indexes, although indexes can also be synced to cloud storage (object storage, Hugging Face Hub, etc.). It's best not to use pickle serialization except when working with local and/or temporary data.

The following issues will handle migrating Python-specific pickle serialization to other methods.
