Adding document pooling option to the encode function #12
Conversation
In the future we could create a dedicated folder for pooling; it's fine to put it here for now. Amazing feature 👍
Nice work! I wasn't aware of this pooling approach for ColBERT. Does this allow for faster inference and lower storage costs?
It's not surprising you haven't heard about it: it's a project we have been working on with Benjamin and we haven't communicated about it yet (so please do not leak it, although it is already merged into the main ColBERT lib). We'll soon release a blog post and then submit a paper. Basically, we found that you can pool the document token embeddings using their similarity without degrading search performance (up to a certain factor). This lets you store half (or fewer) of the tokens, which greatly reduces the storage cost of ColBERT models (even more than PLAID). It also reduces the number of tokens to score.
I won't share, no worries. |
Following our work with Benjamin, I added an option to pool the document embeddings so that only 1/pool_factor of the original document tokens are kept.
Our results show that documents can be pooled to a pool_factor of 2 without degradation in performance.
Further pooling can be used for different memory usage/performance trade-offs.
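To illustrate the idea, here is a minimal sketch of similarity-based pooling of document token embeddings. It is not the PR's actual implementation: the function name `pool_doc_embeddings` and the choice of hierarchical (Ward) clustering via SciPy are assumptions for illustration; only the `pool_factor` semantics (keep roughly 1/pool_factor of the tokens) come from the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pool_doc_embeddings(embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Pool document token embeddings so roughly 1/pool_factor tokens remain.

    Hypothetical sketch: tokens are grouped by similarity with hierarchical
    clustering, and each cluster is replaced by the mean of its members.
    """
    num_tokens = embeddings.shape[0]
    if pool_factor <= 1 or num_tokens <= 1:
        return embeddings
    num_clusters = max(num_tokens // pool_factor, 1)
    # Normalize rows so that Euclidean distance between unit vectors
    # tracks cosine similarity, the usual metric for ColBERT embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    tree = linkage(normed, method="ward")
    # Cut the dendrogram into at most num_clusters flat clusters.
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")
    # Replace each cluster of token embeddings by its mean vector.
    pooled = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in np.unique(labels)]
    )
    return pooled
```

With `pool_factor=2`, a document of 32 token embeddings is reduced to at most 16 pooled vectors, halving storage while the pooled vectors still cover all token clusters.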