patembed-base

This is a sentence-transformers model trained specifically for patent text embeddings. It is part of the PatenTEB project, which provides state-of-the-art models for patent document understanding and retrieval.

Note: This model uses task-specific instruction prompts during inference for optimal performance; see the prompt sketch under Usage.

Model Details

  • Model Type: Sentence Transformer
  • Base Architecture: Distilled from patembed-large using layers {0,2,4,6,8,10,12,14,16,18,20,22}
  • Parameters: 193M
  • Number of Layers: 12
  • Hidden Size: 1024
  • Embedding Dimension: 768
  • Max Sequence Length: 512 tokens
  • Language: English
  • License: CC BY-NC-SA 4.0

Model Description

This is the primary deployment target of the patembed family, distilled from patembed-large. It retains the teacher's 1024 hidden size and projects to 768-dimensional embeddings.

This model is part of the patembed family, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
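A quick sanity check of the projection described above, assuming the standard sentence-transformers API (the expected values come from the table in Model Details):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-base')

# The pooled hidden states (1024-dim) are projected down to the final
# embedding dimension, so encode() outputs 768-dim vectors
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.max_seq_length)                      # expected: 512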

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('datalyes/patembed-base')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute similarity
from sentence_transformers import util
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")

Using Transformers

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-base')
model = AutoModel.from_pretrained('datalyes/patembed-base')

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)
    embeddings = mean_pooling(model_output, encoded['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
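Since the embeddings above are mean-pooled and L2-normalized, cosine similarity between two encoded texts reduces to a plain dot product. A minimal continuation, reusing the tokenizer, model, and mean_pooling defined above:

texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)
    embeddings = F.normalize(mean_pooling(model_output, encoded['attention_mask']), p=2, dim=1)

# For unit-length vectors, the dot product equals cosine similarity
print(f"Similarity: {(embeddings[0] @ embeddings[1]).item():.4f}")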

Patent Retrieval Example

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-base')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Get ranked results
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")

Intended Use

This model is designed for patent-specific tasks including:

  • Patent search and retrieval
  • Prior art search
  • Patent classification and clustering (see the sketch after this list)
  • Technology landscape analysis
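As a sketch of the clustering use case, assuming scikit-learn is installed (the abstracts and cluster count below are illustrative placeholders):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-base')

# Illustrative placeholder abstracts
abstracts = [
    "A power management system for portable electronic devices...",
    "Method for wireless data transmission in mobile networks...",
    "Chemical composition for battery manufacturing...",
    "Electrolyte additive for lithium-ion battery cells...",
]

embeddings = model.encode(abstracts, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for label, text in zip(labels, abstracts):
    print(f"Cluster {label}: {text[:60]}...")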

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

Citation

If you use this model, please cite our paper:

@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding}, 
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264}
}

Paper: PatenTEB on arXiv: https://arxiv.org/abs/2510.22264

License

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Key Terms:

  • ✅ You can use, share, and adapt the model
  • ✅ You must give appropriate credit
  • ❌ You may not use the model for commercial purposes
  • ⚠️ If you adapt or build upon this model, you must distribute your contributions under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

Contact

  • Authors: Iliass Ayaou, Denis Cavallucci
  • Institution: ICUBE Laboratory, INSA Strasbourg
  • GitHub: PatentTEB/PatentTEB
  • HuggingFace: datalyes