patembed-base

This is a sentence-transformers model trained specifically for patent text embeddings. It is part of the PatenTEB project, which provides state-of-the-art models for patent document understanding and retrieval.

Note: This model uses task-specific instruction prompts during inference for optimal performance; see the prompt sketch under Usage.

Model Details

  • Model Type: Sentence Transformer
  • Base Architecture: Distilled from patembed-large using layers {0,2,4,6,8,10,12,14,16,18,20,22}
  • Parameters: 193M
  • Number of Layers: 12
  • Hidden Size: 1024
  • Embedding Dimension: 768
  • Max Sequence Length: 512 tokens
  • Language: English
  • License: CC BY-NC-SA 4.0

Model Description

This is the primary deployment target of the patembed family, distilled from patembed-large. It retains the teacher's 1024 hidden size and projects to 768-dimensional embeddings.

This model is part of the patembed family, developed through multi-task learning on 13 training tasks from the PatenTEB benchmark. For detailed information about the training methodology, architecture, and comprehensive evaluation results, please refer to our paper.
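A quick sanity check of the projection described above, assuming the standard sentence-transformers API (the expected values come from the table in Model Details):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('datalyes/patembed-base')

# The pooled hidden states (1024-dim) are projected down to the final
# embedding dimension, so encode() outputs 768-dim vectors
print(model.get_sentence_embedding_dimension())  # expected: 768
print(model.max_seq_length)                      # expected: 512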

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('datalyes/patembed-base')

# Encode patent texts
patent_texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
embeddings = model.encode(patent_texts)

# Compute similarity
from sentence_transformers import util
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")

Using Transformers

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('datalyes/patembed-base')
model = AutoModel.from_pretrained('datalyes/patembed-base')

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Tokenize and encode
texts = ["A method for manufacturing semiconductor devices..."]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)
    embeddings = mean_pooling(model_output, encoded['attention_mask'])
    embeddings = F.normalize(embeddings, p=2, dim=1)
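Since the embeddings above are mean-pooled and L2-normalized, cosine similarity between two encoded texts reduces to a plain dot product. A minimal continuation, reusing the tokenizer, model, and mean_pooling defined above:

texts = [
    "A method for manufacturing semiconductor devices...",
    "An apparatus for processing chemical compounds...",
]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded)
    embeddings = F.normalize(mean_pooling(model_output, encoded['attention_mask']), p=2, dim=1)

# For unit-length vectors, the dot product equals cosine similarity
print(f"Similarity: {(embeddings[0] @ embeddings[1]).item():.4f}")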

Patent Retrieval Example

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('datalyes/patembed-base')

# Query patent
query = "Method for reducing power consumption in mobile devices"

# Candidate patents
candidates = [
    "A power management system for portable electronic devices...",
    "Chemical composition for battery manufacturing...",
    "Method for wireless data transmission in mobile networks...",
]

# Encode and retrieve
query_emb = model.encode(query)
candidate_embs = model.encode(candidates)

# Compute similarities
scores = util.cos_sim(query_emb, candidate_embs)[0]

# Get ranked results
results = [(candidates[i], scores[i].item()) for i in range(len(candidates))]
results.sort(key=lambda x: x[1], reverse=True)

for patent, score in results:
    print(f"Score: {score:.4f} - {patent[:100]}...")

Intended Use

This model is designed for patent-specific tasks including:

  • Patent search and retrieval
  • Prior art search
  • Patent classification and clustering (see the sketch after this list)
  • Technology landscape analysis
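As a sketch of the clustering use case, assuming scikit-learn is installed (the abstracts and cluster count below are illustrative placeholders):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('datalyes/patembed-base')

# Illustrative placeholder abstracts
abstracts = [
    "A power management system for portable electronic devices...",
    "Method for wireless data transmission in mobile networks...",
    "Chemical composition for battery manufacturing...",
    "Electrolyte additive for lithium-ion battery cells...",
]

embeddings = model.encode(abstracts, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for label, text in zip(labels, abstracts):
    print(f"Cluster {label}: {text[:60]}...")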

For detailed training methodology, evaluation protocols, and performance analysis, please refer to our paper.

Citation

If you use this model, please cite our paper:

@misc{ayaou2025patentebcomprehensivebenchmarkmodel,
      title={PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding}, 
      author={Iliass Ayaou and Denis Cavallucci},
      year={2025},
      eprint={2510.22264},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.22264}
}

Paper: PatenTEB on arXiv: https://arxiv.org/abs/2510.22264

License

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Key Terms:

  • ✅ You can use, share, and adapt the model
  • ✅ You must give appropriate credit
  • ❌ You may not use the model for commercial purposes
  • ⚠️ If you adapt or build upon this model, you must distribute your contributions under the same license

For full license details: https://creativecommons.org/licenses/by-nc-sa/4.0/

Contact

  • Authors: Iliass Ayaou, Denis Cavallucci
  • Institution: ICUBE Laboratory, INSA Strasbourg
  • GitHub: PatentTEB/PatentTEB
  • HuggingFace: datalyes