---
license: gpl-3.0
language:
- en
base_model:
- google-bert/bert-base-uncased
---
|
# PatentBERT - PyTorch

BERT model specialized for patent classification using the **CPC (Cooperative Patent Classification) system**. (PyTorch version of the original [PatentBERT](https://github.com/jiehsheng/PatentBERT/) model.)
|
|
|
## Specifications

- **Output classes**: 656 (CPC subclass labels)
- **Classification system**: CPC (Cooperative Patent Classification)
- **Architecture**: BERT-base (768 hidden, 12 layers, 12 attention heads)
- **Vocabulary**: 30,522 tokens
- **Format**: SafeTensors
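
As a quick sanity check after downloading, the configuration alone can confirm the label space (a minimal sketch; it assumes the repo id used in the usage example below):

```python
from transformers import AutoConfig

# Load only the configuration (no model weights are downloaded)
config = AutoConfig.from_pretrained("ZoeYou/patentbert-pytorch")

assert config.num_labels == 656           # one logit per CPC subclass
print(list(config.id2label.items())[:3])  # first few (id, CPC code) pairs
```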
|
|
|
## CPC Classes (Real Distribution)

The model predicts classes according to the authentic CPC system used in PatentBERT training:
|
|
|
### Main Sections (Actual Counts)

- **A (84 classes)**: Human Necessities - Agriculture, Food, Health, Sports
- **B (171 classes)**: Performing Operations; Transporting - Manufacturing, Transport
- **C (88 classes)**: Chemistry; Metallurgy - Chemical processes, Materials
- **D (40 classes)**: Textiles; Paper - Fibers, Fabrics, Paper-making
- **E (31 classes)**: Fixed Constructions - Building, Mining, Roads
- **F (101 classes)**: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- **G (81 classes)**: Physics - Optics, Acoustics, Computing, Measuring
- **H (51 classes)**: Electricity - Electronics, Power generation, Communication
- **Y (9 classes)**: General Tagging of New Technological Developments
|
|
|
### Examples of CPC Subclasses

- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- `C07D`: HETEROCYCLIC COMPOUNDS
- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
- `H04L`: TRANSMISSION OF DIGITAL INFORMATION
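
Every subclass code starts with its section letter, so a predicted label can be traced back to one of the nine sections above with a simple lookup (a minimal sketch; `cpc_section` is a hypothetical helper, and the titles are transcribed from this README):

```python
# Section titles transcribed from the list above
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations; Transporting",
    "C": "Chemistry; Metallurgy",
    "D": "Textiles; Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "G": "Physics",
    "H": "Electricity",
    "Y": "General Tagging of New Technological Developments",
}

def cpc_section(subclass: str) -> str:
    """Return the top-level section title for a CPC subclass code such as 'G06F'."""
    return CPC_SECTIONS[subclass[0].upper()]

print(cpc_section("G06F"))  # Physics
```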
|
|
|
## Usage

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')
model.eval()

# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)

# Get the top prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()

# Map the id to its CPC code (transformers stores id2label with integer keys)
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
```
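
Patents often touch several subclasses, so the single top prediction can hide close runners-up; this short sketch (reusing `torch`, `model`, and `predictions` from the example above) lists the five highest-scoring CPC codes:

```python
# Top-5 CPC subclasses by predicted probability (batch index 0)
top = torch.topk(predictions[0], k=5)
for prob, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {prob:.2%}")
```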
|
|
|
## Included Files

- `model.safetensors`: Model weights (420 MB)
- `config.json`: Configuration with integrated CPC labels
- `vocab.txt`: Tokenizer vocabulary
- `tokenizer_config.json`: Tokenizer configuration
- `labels.json`: Complete CPC label mapping (656 authentic labels)
- `README.md`: This documentation
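
If you only need the label set, `labels.json` can be fetched without loading the model (a sketch assuming the file holds the 656 labels as a single JSON object or list; check the actual layout after downloading):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only the label file from the model repository
path = hf_hub_download(repo_id="ZoeYou/patentbert-pytorch", filename="labels.json")

with open(path, encoding="utf-8") as f:
    labels = json.load(f)

print(len(labels))  # expected: 656
```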
|
|
|
## Performance

This model was trained on a large patent corpus to classify documents automatically according to the CPC system, using the same 656 CPC codes as the original PatentBERT training data.
|
|
|
## References

- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
- [Original PatentBERT Paper](https://arxiv.org/abs/1906.02124)
|
|
|
## Citation

If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.

```bibtex
@article{patent_bert,
  author  = "Jieh-Sheng Lee and Jieh Hsiang",
  title   = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
  journal = "World Patent Information",
  volume  = "61",
  pages   = "101965",
  year    = "2020",
}
```
|
|