SPLADE distilbert-base-uncased trained on Python docstring–code pairs

This is a SPLADE Sparse Encoder model finetuned from distilbert/distilbert-base-uncased using the sentence-transformers library. It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

  • Model Type: SPLADE Sparse Encoder
  • Base model: distilbert/distilbert-base-uncased
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 30522 dimensions
  • Similarity Function: Dot Product
  • Language: en
  • License: apache-2.0

Model Sources

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
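
Because SpladePooling max-pools the MLM logits over the vocabulary, each of the 30522 output dimensions corresponds to one token of the DistilBERT vocabulary. The snippet below is a minimal sketch for inspecting which vocabulary tokens an input activates; it assumes encode may return either a dense or a sparse torch tensor and that model.tokenizer exposes the underlying DistilBERT tokenizer (see Usage below for installation).

import torch
from sentence_transformers import SparseEncoder

model = SparseEncoder("pulkitmehtawork/sparse-distilbert-base-uncased-python-code-lightening")

# Encode a single (illustrative) code snippet
embeddings = model.encode(["def add(a, b): return a + b"])

# Assumption: the embeddings may come back as a sparse torch tensor; densify before inspecting
embeddings = torch.as_tensor(embeddings)
if embeddings.is_sparse:
    embeddings = embeddings.to_dense()

# Each of the 30522 dimensions maps to a DistilBERT vocabulary token
values, indices = torch.topk(embeddings[0], k=10)
tokens = model.tokenizer.convert_ids_to_tokens(indices.tolist())
for token, weight in zip(tokens, values.tolist()):
    print(f"{token}\t{weight:.2f}")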

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("pulkitmehtawork/sparse-distilbert-base-uncased-python-code-lightening")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[2148.8340, 1376.2744,  850.4404],
#         [1376.2744, 2056.9260,  898.0439],
#         [ 850.4404,  898.0439, 2509.7507]])
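
Since the similarity function is a dot product over these sparse vectors, a small retrieval loop only needs to encode one query and a handful of documents and rank them with model.similarity. The example below is an illustrative sketch; the query and document strings are made up for demonstration.

from sentence_transformers import SparseEncoder

model = SparseEncoder("pulkitmehtawork/sparse-distilbert-base-uncased-python-code-lightening")

query = "sort a list of numbers in descending order"
documents = [
    "def sort_desc(values): return sorted(values, reverse=True)",
    "def read_file(path): return open(path).read()",
    "def mean(values): return sum(values) / len(values)",
]

# Encode query and documents into the same 30522-dimensional sparse space
query_embedding = model.encode([query])
document_embeddings = model.encode(documents)

# Dot-product scores between the query and every document
scores = model.similarity(query_embedding, document_embeddings)[0]

# Rank documents by score, highest first
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:8.2f}  {doc}")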

Training Details

Framework Versions

  • Python: 3.10.10
  • Sentence Transformers: 5.0.0
  • Transformers: 4.53.0
  • PyTorch: 2.7.0+cu128
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2
