Adding document pooling option to the encode function#12

Merged
raphaelsty merged 1 commit into main from add_pooling on Jun 17, 2024

Conversation

@NohTow
Collaborator

@NohTow NohTow commented Jun 17, 2024

Following our work with Benjamin, I added an option to pool the document embeddings so that only 1/pool_factor of the original document tokens are kept.
Our results show that documents can be pooled with a pool_factor of 2 without degrading performance.
Higher pool factors can be used to trade some performance for lower memory usage.

@NohTow NohTow requested a review from raphaelsty June 17, 2024 14:43
@raphaelsty
Collaborator

In the future we could create a folder dedicated to pooling, but it's fine to put it here for now. Amazing feature 👍

@raphaelsty raphaelsty merged commit 85e97bb into main Jun 17, 2024
@tomaarsen
Collaborator

Nice work! I wasn't aware of this pooling approach for ColBERT. Does this allow for faster inference and lower storage costs?

@NohTow
Collaborator Author

NohTow commented Jun 18, 2024

It's not surprising you haven't heard about it: it's a project we have been working on with Benjamin and we haven't announced it yet (so please do not leak it, although we already merged it into the main ColBERT lib). We'll soon release a blog post and then submit a paper.

Basically, we found that you can pool the document token embeddings based on their similarity without degrading search performance (up to a certain factor). This lets you store half of the tokens (or fewer) and thus greatly reduces the storage cost of ColBERT models (even more than PLAID). It does also reduce the number of tokens to score.
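
To make the idea concrete, here is a minimal sketch of similarity-based token pooling. It is not the code merged in this PR: the function name `pool_token_embeddings` is hypothetical, and the greedy merging of the two most cosine-similar clusters followed by mean pooling is an assumed stand-in for whatever clustering the library actually uses. It only illustrates how a document's token embeddings can be reduced to roughly 1/pool_factor of their original count.

```python
import math

import numpy as np


def pool_token_embeddings(embeddings: np.ndarray, pool_factor: int) -> np.ndarray:
    """Hypothetical helper: merge similar token embeddings until about
    1/pool_factor of them remain, then mean-pool each group."""
    n_tokens = embeddings.shape[0]
    target = max(1, math.ceil(n_tokens / pool_factor))

    # Each cluster starts as a single token index.
    clusters = [[i] for i in range(n_tokens)]

    while len(clusters) > target:
        # Cluster centroids, L2-normalised so a dot product is a cosine similarity.
        centroids = np.stack([embeddings[c].mean(axis=0) for c in clusters])
        centroids = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
        sim = centroids @ centroids.T
        np.fill_diagonal(sim, -np.inf)  # never merge a cluster with itself

        # Merge the most similar pair of clusters (assumed merge rule).
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # One pooled vector per cluster: the mean of its member embeddings.
    return np.stack([embeddings[c].mean(axis=0) for c in clusters])


# Example: a 10-token document with 8-dimensional embeddings, pooled by a factor of 2.
rng = np.random.default_rng(0)
doc = rng.normal(size=(10, 8))
pooled = pool_token_embeddings(doc, pool_factor=2)
print(pooled.shape)  # (5, 8)
```

With a `pool_factor` of 2, the index stores 5 vectors instead of 10 for this document, and late-interaction scoring compares each query token against 5 document vectors instead of 10.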

@tomaarsen
Collaborator

I won't share, no worries.
That sounds quite promising, good stuff.


3 participants