Automatically add EOS via Tokenizer, integrate Sentence Transformers

#2
by tomaarsen - opened

Hello!

Preface

I already discussed a lot of these changes with @izhx over on the MTEB project here, so he'll already have a good understanding of the changes in this PR.

Pull Request overview

  • Updated the tokenizer to automatically add an EOS token. I ran the code that @izhx and I wrote here: https://github.com/embeddings-benchmark/mteb/pull/2769#issuecomment-2944905730 to add a TemplateProcessing post-processor to the tokenizer, which appends the <|endoftext|> token on which we perform pooling (see the sketch right after this list).
  • Updated the transformers usage snippet accordingly - it's simpler now, but still gives the same results (feel free to compare; there's a small comparison sketch after that snippet).
  • Added the Sentence Transformers configuration files. This model already fits the mold that Sentence Transformers supports, so all we need is some configuration files (see the sketch after the Sentence Transformers snippet below).
  • Added a simple usage script via Sentence Transformers (note: some third parties like LangChain and LlamaIndex also use Sentence Transformers, so those will work too).
  • Added some tags to the model card to make this model easier to find with filtering, etc.
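
For reference, the tokenizer change essentially amounts to attaching a post-processor like the one below. This is a minimal sketch based on the linked MTEB comment; the PR already contains the updated tokenizer files, so there is no need to run this yourself.

from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
eos_token = "<|endoftext|>"
eos_token_id = tokenizer.convert_tokens_to_ids(eos_token)

# Append <|endoftext|> to every encoded sequence so that last-token pooling
# always pools over the EOS token, without manual string concatenation
tokenizer.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"$A {eos_token}",
    pair=f"$A {eos_token} $B:1 {eos_token}:1",
    special_tokens=[(eos_token, eos_token_id)],
)

With this post-processor in place, every encoded sequence already ends in <|endoftext|>, which is why the usage snippets below no longer append it manually.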

How to try this PR?

You can run the following to try this out:

Run this PR with Sentence Transformers
# Requires transformers>=4.51.0

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", revision="refs/pr/2")

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# together with setting `padding_side` to "left":
# model = SentenceTransformer(
#     "Qwen/Qwen3-Embedding-0.6B",
#     revision="refs/pr/2",
#     model_kwargs={"attn_implementation": "flash_attention_2", "device_map": "auto"},
#     tokenizer_kwargs={"padding_side": "left"},
# )

# The queries and documents to embed
queries = [
    "What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

# Encode the queries and documents. Note that queries benefit from using a prompt
# Here we use the prompt called "query" stored under `model.prompts`, but you can
# also pass your own prompt via the `prompt` argument
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.7646, 0.1414],
#         [0.1355, 0.6000]])

(Note the revision argument)
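
For reference, the Sentence Transformers configuration files added in this PR describe a simple three-module pipeline (Transformer, last-token Pooling, Normalize) plus the stored "query" prompt used above. Building the equivalent model programmatically would look roughly like the sketch below; this is for illustration only, as the configuration files are read automatically when loading the model by name as shown above.

from sentence_transformers import SentenceTransformer, models

transformer = models.Transformer(
    "Qwen/Qwen3-Embedding-0.6B",
    tokenizer_args={"padding_side": "left"},
)
pooling = models.Pooling(
    transformer.get_word_embedding_dimension(),
    pooling_mode="lasttoken",  # pool the <|endoftext|> token appended by the tokenizer
)
normalize = models.Normalize()

model = SentenceTransformer(modules=[transformer, pooling, normalize])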

Or

Run this PR with Transformers
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # Pool by taking the hidden state of the last non-padding token,
    # i.e. the <|endoftext|> token that the tokenizer now appends
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        # With left padding, the last position is always the last real token
        return last_hidden_states[:, -1]
    else:
        # With right padding, select the last real token of each sequence
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side='left', revision="refs/pr/2")
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B', revision="refs/pr/2")

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B', attn_implementation="flash_attention_2", torch_dtype=torch.float16, revision="refs/pr/2").cuda()

eod_id = tokenizer.convert_tokens_to_ids("<|endoftext|>")
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
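
To verify the "same results" claim from the overview, a quick check could look like this. It's only a sketch: it assumes both snippets were run in the same session, with the Sentence Transformers variables renamed (e.g. similarity -> st_similarity) so they don't clash with the ones above.

import torch

# `st_similarity` is the (renamed) similarity tensor from the Sentence Transformers
# snippet; `scores` is the score matrix computed just above with plain Transformers
print(torch.allclose(st_similarity, scores, atol=1e-3))
# Expected to print True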

Question

With this PR in place, I can make identical changes to the 4B and 8B models. Please let me know if you would welcome that!

cc @littlebird13 @JustinLin610 @izhx

  • Tom Aarsen
tomaarsen changed pull request status to open

You can also remove the eod_id line in the README :)

Good call!

I checked, the outputs are consistent!

It also helps vLLM to produce the correct outputs. 🤣

Checking other code..
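
On the vLLM remark above: since the EOS token is now appended by the tokenizer itself, vLLM's pooling path picks it up as well. Below is a minimal sketch, assuming a recent vLLM release with the task="embed" pooling API; check your installed version, as this API has changed between releases.

import torch
from vllm import LLM

# Assumes a recent vLLM with pooling ("embed") support
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

input_texts = [
    "What is the capital of China?",
    "The capital of China is Beijing.",
]
outputs = llm.embed(input_texts)
embeddings = torch.tensor([output.outputs.embedding for output in outputs])
print(embeddings.shape)  # torch.Size([2, <embedding dim>])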

littlebird13 changed pull request status to merged

@tomaarsen May I ask which version of the sentence-transformers library supports qwen3-embedding? It would be helpful to include this information in the README.
