Adding document pooling option to the encode function #12
Conversation
In the future we could create a dedicated folder for pooling; it's fine to put it here for now. Amazing feature 👍
Nice work! I wasn't aware of this pooling approach for ColBERT. Does this allow for faster inference and lower storage costs?
It's not surprising you haven't heard about it: it's a project we have been working on with Benjamin and we haven't communicated about it yet (so please do not leak it, although it is already merged into the main ColBERT lib). We'll soon release a blog post and then submit a paper. Basically, we found that you can pool the document token embeddings using their similarity without degrading search performance (up to a certain factor). This lets you store half (or fewer) of the tokens, which greatly reduces the storage cost of ColBERT models (even more than PLAID). It also reduces the number of tokens to score.
I won't share, no worries. |
Following our work with Benjamin, I added an option to pool the document embeddings so that only 1/pool_factor of the original document tokens are kept.
Our results show that documents can be pooled to a pool_factor of 2 without degradation in performance.
Further pooling can be used for different memory usage/performance trade-offs.
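To illustrate the idea, here is a minimal sketch of similarity-based pooling of document token embeddings. It is not the PR's actual implementation: the function name `pool_doc_embeddings` and the choice of hierarchical (Ward) clustering via SciPy are assumptions for illustration; only the `pool_factor` semantics (keep roughly 1/pool_factor of the tokens) come from the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pool_doc_embeddings(embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Pool document token embeddings so roughly 1/pool_factor tokens remain.

    Hypothetical sketch: tokens are grouped by similarity with hierarchical
    clustering, and each cluster is replaced by the mean of its members.
    """
    num_tokens = embeddings.shape[0]
    if pool_factor <= 1 or num_tokens <= 1:
        return embeddings
    num_clusters = max(num_tokens // pool_factor, 1)
    # Normalize rows so that Euclidean distance between unit vectors
    # tracks cosine similarity, the usual metric for ColBERT embeddings.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    tree = linkage(normed, method="ward")
    # Cut the dendrogram into at most num_clusters flat clusters.
    labels = fcluster(tree, t=num_clusters, criterion="maxclust")
    # Replace each cluster of token embeddings by its mean vector.
    pooled = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in np.unique(labels)]
    )
    return pooled
```

With `pool_factor=2`, a document of 32 token embeddings is reduced to at most 16 pooled vectors, halving storage while the pooled vectors still cover all token clusters.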