sentence_transformers_support
Hello!
Preface
I'm a big fan of these sparse models!
Pull Request Overview
Handle the model in the Sentence Transformers library.
Details
The Sentence Transformers library will soon add support for sparse models through the SparseEncoder class.
We would like to add support for this model, and with this PR it is now properly handled.
We modified as little as possible, so it should work with any other custom loading logic you may have.
You will first need to install the development version of the library:

```
pip install git+https://github.com/arthurbr11/sentence-transformers.git@sparse_implementation
```
Feel free to pass revision="refs/pr/3" to AutoTokenizer, AutoModelForMaskedLM, etc. to test this PR with your own custom code, or with the snippet below (which reproduces your example), before merging:
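In the snippet below, max_active_dims caps how many dimensions of each sparse embedding may be nonzero. Conceptually this amounts to keeping only the strongest activations; here is a hedged, stdlib-only sketch of that idea (toy values, not the library's actual implementation):

```python
# Hypothetical illustration of capping active dimensions in a sparse
# embedding: keep the max_active_dims largest weights, drop the rest.
# This is NOT the library's real code, just the intuition.

def cap_active_dims(sparse_vec, max_active_dims):
    # sparse_vec maps dimension index -> activation weight
    kept = sorted(sparse_vec.items(), key=lambda kv: kv[1], reverse=True)
    return dict(kept[:max_active_dims])

emb = {10: 0.9, 42: 2.5, 7: 1.1, 99: 0.2}
print(cap_active_dims(emb, 2))  # keeps the two strongest: {42: 2.5, 7: 1.1}
```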
```python
from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("ibm-granite/granite-embedding-30m-sparse", revision="refs/pr/3")

# Run inference
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = model.encode_document(docs, max_active_dims=192)
print(docs_embeddings.shape)
# [3, 50265]

queries = ["When was artificial intelligence founded", "Where was Turing born?"]
queries_embeddings = model.encode_query(queries, max_active_dims=50)
print(queries_embeddings.shape)
# [2, 50265]

# Get the similarity scores for the embeddings
similarities = model.similarity(queries_embeddings, docs_embeddings)
print(similarities.shape)
# [2, 3]

for i, query in enumerate(queries):
    best_doc_index = similarities[i].argmax().item()
    print(f"Query: {query}")
    print(f"Best doc: Similarity: {similarities[i][best_doc_index]:.4f}, Doc: {docs[best_doc_index]}")
    intersection = model.intersection(queries_embeddings[i], docs_embeddings[best_doc_index])
    decoded_intersection = model.decode(intersection, top_k=10)
    print("Top 10 tokens influencing the similarity:")
    for token, score in decoded_intersection:
        print(f"Token: {token}, Score: {score:.4f}")

# Query: When was artificial intelligence founded
# Best doc: Similarity: 12.3641, Doc: Artificial intelligence was founded as an academic discipline in 1956.
# Top 10 tokens influencing the similarity:
# Token: ĠAI, Score: 2.7591
# Token: Ġintelligence, Score: 2.2971
# Token: Ġartificial, Score: 1.7654
# Token: Ġfounded, Score: 1.3254
# Token: Ġinvention, Score: 0.9808
# Token: Ġlearning, Score: 0.4847
# Token: Ġcomputer, Score: 0.4789
# Token: Ġrobot, Score: 0.3466
# Token: Ġestablishment, Score: 0.3371
# Token: Ġscientific, Score: 0.2804
# Query: Where was Turing born?
# Best doc: Similarity: 17.1359, Doc: Born in Maida Vale, London, Turing was raised in southern England.
# Top 10 tokens influencing the similarity:
# Token: uring, Score: 2.9761
# Token: ĠTuring, Score: 2.4544
# Token: Ġborn, Score: 2.4314
# Token: ing, Score: 1.7760
# Token: ure, Score: 1.7626
# Token: Ġcomput, Score: 1.3356
# Token: Ġraised, Score: 1.3285
# Token: able, Score: 1.1940
# Token: Ġphilosopher, Score: 0.4118
# Token: Ġmachine, Score: 0.3977
```
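For intuition, the similarity above is a dot product between vocabulary-sized sparse vectors, and the intersection/decode steps surface the shared tokens that drive the score. A minimal self-contained sketch with toy weights (illustrative values, not real model outputs, and not the library's implementation):

```python
# Sparse embeddings modeled as {token: weight} dicts; similarity is a dot
# product over the dimensions both vectors activate. Toy values only.

def dot(q, d):
    # dot product restricted to the tokens present in both sparse vectors
    return sum(w * d[t] for t, w in q.items() if t in d)

def intersection(q, d):
    # element-wise product on shared tokens: the terms contributing to the score
    return {t: w * d[t] for t, w in q.items() if t in d}

def decode(vec, top_k=3):
    # highest-scoring tokens first, mirroring decode(..., top_k=...)
    return sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

query = {"turing": 2.1, "born": 1.8, "where": 0.4}
doc = {"turing": 1.4, "born": 1.3, "london": 1.0, "raised": 0.9}

print(round(dot(query, doc), 2))  # 5.28  (2.1*1.4 + 1.8*1.3)
print([t for t, _ in decode(intersection(query, doc), top_k=2)])  # ['turing', 'born']
```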
cc @tomaarsen
- Arthur Bresnu
Hello!
Tomorrow we're rolling out a big v5.0 release for Sentence Transformers, introducing support for Sparse Embedding models beyond the standard Dense ones.
The code snippet above should be runnable out of the box, and you can use it to verify that the integration works as expected. If you're able to merge this today or tomorrow, we can announce support for the IBM sparse model in our day-0 communications about the release, which we'd love. Please let us know if you have any questions or comments! Also feel free to reach out via the ibm-granite-collab HF <> IBM channel on Slack.
- Tom Aarsen