Add LAST_TOKEN pooling implementation for text embedding models #4711
Conversation
The LAST_TOKEN enum value exists in PoolingMode but has no implementation in the translators. This adds the actual pooling logic that extracts the embedding of the last non-padding token, which is needed for decoder-only models (GPT-style, Qwen3, etc.) where the final token captures cumulative context through causal attention.

Changes:
- Add LAST_TOKEN case to ONNXSentenceTransformerTextEmbeddingTranslator with lastTokenPool() method using int64 attention mask
- Add lasttoken case to HuggingfaceTextEmbeddingTranslator with lastTokenPool() method using float32 attention mask
- Update pooling mode validation to include lasttoken
- Add unit tests for ONNX and TorchScript models
- Update documentation with pooling method descriptions
- Add release notes entry

Resolves opensearch-project#4709

Signed-off-by: Aneesh Nema <aneesh.nema@databricks.com>
Persistent review updated to latest commit c144310

Persistent review updated to latest commit cc36e5f
    return embeddingSum.div(maskSum);
}

private NDArray lastTokenPool(NDArray embeddings, NDArray inputAttentionMask) {

Review comment: Can this method be reused rather than defined twice?
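One way to address the reuse question above is a shared static utility that both translators call. The sketch below is hypothetical (the class name `PoolingUtil` and plain-array signature are not from the PR, whose translators operate on DJL `NDArray`s); it only illustrates the last-non-padding-token logic under right-padding:

```java
// Hypothetical shared utility; the PR's translators use DJL NDArray, not raw arrays.
public final class PoolingUtil {

    private PoolingUtil() {}

    /**
     * Last-token pooling: for each sequence in the batch, return the embedding
     * of the last non-padding token (the last position where the mask is 1).
     *
     * @param embeddings    [batch][seqLen][hidden] token embeddings
     * @param attentionMask [batch][seqLen], 1 for real tokens, 0 for padding
     */
    public static float[][] lastTokenPool(float[][][] embeddings, long[][] attentionMask) {
        int batch = embeddings.length;
        float[][] pooled = new float[batch][];
        for (int b = 0; b < batch; b++) {
            // Scan the mask to find the index of the last real token.
            int lastIdx = 0;
            for (int t = 0; t < attentionMask[b].length; t++) {
                if (attentionMask[b][t] == 1) {
                    lastIdx = t;
                }
            }
            pooled[b] = embeddings[b][lastIdx];
        }
        return pooled;
    }
}
```

A shared helper like this would also let the int64 vs. float32 mask difference between the two translators be handled once, at the call site.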
Compatible with OpenSearch and OpenSearch Dashboards version 3.4.0

### Enhancements
* Add LAST_TOKEN pooling support for text embedding models ([#4709](https://github.com/opensearch-project/ml-commons/issues/4709))

Review comment: You can skip adding this. Let's remove this change; the release notes are auto-generated!
Release notes are auto-generated; removing entries added in opensearch-project#4710 and opensearch-project#4711.

Signed-off-by: Aneesh Nema <aneesh.nema@databricks.com>
Description
Adds the implementation for LAST_TOKEN pooling in text embedding translators. The LAST_TOKEN enum value already exists in PoolingMode but had no actual implementation in the translators. LAST_TOKEN pooling extracts the embedding of the last non-padding token, which is the correct pooling strategy for decoder-only models (GPT-style, Qwen3, etc.) where the final token captures cumulative context through causal attention.

How it works: the attention mask marks real tokens with 1 and padding with 0, so for each sequence the index of the last non-padding token is derived from the mask and the embedding at that index is returned.

Changes:
- Add LAST_TOKEN case to ONNXSentenceTransformerTextEmbeddingTranslator with a lastTokenPool() method (uses the int64 attention mask via toLongArray())
- Add lasttoken case to HuggingfaceTextEmbeddingTranslator with a lastTokenPool() method (uses the float32 attention mask via toFloatArray())
- Update pooling mode validation to include lasttoken

Validated with: Qwen3-Embedding-0.6B producing correct 1024-dimensional normalized embeddings matching Python inference output.
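The contrast between the existing mean pooling (the `embeddingSum.div(maskSum)` path) and the new last-token pooling can be sketched on a single sequence with plain arrays. This is an illustrative toy, not the translators' DJL-based code; the class name `PoolingDemo` and the values are made up:

```java
public class PoolingDemo {

    /** Mean pooling: average of embeddings at positions where mask == 1. */
    public static float[] meanPool(float[][] embeddings, long[] mask) {
        float[] out = new float[embeddings[0].length];
        long count = 0;
        for (int t = 0; t < mask.length; t++) {
            if (mask[t] == 1) {
                count++;
                for (int h = 0; h < out.length; h++) {
                    out[h] += embeddings[t][h];
                }
            }
        }
        for (int h = 0; h < out.length; h++) {
            out[h] /= count;  // divide summed embeddings by number of real tokens
        }
        return out;
    }

    /** Last-token pooling: embedding at the last position where mask == 1. */
    public static float[] lastTokenPool(float[][] embeddings, long[] mask) {
        int lastIdx = 0;
        for (int t = 0; t < mask.length; t++) {
            if (mask[t] == 1) {
                lastIdx = t;
            }
        }
        return embeddings[lastIdx];
    }
}
```

For a three-position sequence where the last position is padding, mean pooling averages the two real tokens while last-token pooling returns only the second token's embedding, which is the one that has attended to the full context under causal attention.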
Related Issues
Resolves #4709
Check List
- Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.