---
license: gpl-3.0
language:
- en
base_model:
- google-bert/bert-base-uncased
---
# PatentBERT - PyTorch
A BERT model specialized for patent classification under the **CPC (Cooperative Patent Classification) system**. This is a PyTorch version of the original [PatentBERT](https://github.com/jiehsheng/PatentBERT/) model.
## 📊 Specifications
- **Output classes**: 656 (CPC subclass labels)
- **Classification system**: CPC (Cooperative Patent Classification)
- **Architecture**: BERT-base (768 hidden, 12 layers, 12 attention heads)
- **Vocabulary**: 30,522 tokens
- **Format**: SafeTensors
## 🏷️ CPC Classes (Real Distribution)
The model predicts classes according to the authentic CPC system used in PatentBERT training:
### Main Sections (Actual Counts)
- **A (84 classes)**: Human Necessities - Agriculture, Food, Health, Sports
- **B (171 classes)**: Performing Operations; Transporting - Manufacturing, Transport
- **C (88 classes)**: Chemistry; Metallurgy - Chemical processes, Materials
- **D (40 classes)**: Textiles; Paper - Fibers, Fabrics, Paper-making
- **E (31 classes)**: Fixed Constructions - Building, Mining, Roads
- **F (101 classes)**: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- **G (81 classes)**: Physics - Optics, Acoustics, Computing, Measuring
- **H (51 classes)**: Electricity - Electronics, Power generation, Communication
- **Y (9 classes)**: General Tagging of New Technological Developments
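
As a quick sanity check, the per-section counts above can be summed to confirm that they add up to the 656 output classes. The name `SECTION_COUNTS` is only illustrative; the numbers are copied from the list.

```python
# Subclass counts per CPC section, copied from the list above.
SECTION_COUNTS = {
    "A": 84, "B": 171, "C": 88, "D": 40, "E": 31,
    "F": 101, "G": 81, "H": 51, "Y": 9,
}

# The total should match the size of the model's output layer.
total = sum(SECTION_COUNTS.values())
print(total)  # 656
```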
### Example of CPC Subclasses
- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- `C07D`: HETEROCYCLIC COMPOUNDS
- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
- `H04L`: TRANSMISSION OF DIGITAL INFORMATION
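
A CPC subclass code such as `G06F` is made of a section letter, a two-digit class number, and a subclass letter. The sketch below splits a predicted code into those levels, which can help when aggregating predictions by section. The helper `parse_cpc_subclass` and the `CpcSubclass` type are hypothetical names introduced for illustration, not part of this repository.

```python
import re
from typing import NamedTuple


class CpcSubclass(NamedTuple):
    section: str    # e.g. "G"
    cpc_class: str  # e.g. "G06"
    subclass: str   # e.g. "G06F"


# Section letter (A-H or Y), two digits, one subclass letter.
_SUBCLASS_RE = re.compile(r"([A-HY])(\d{2})([A-Z])")


def parse_cpc_subclass(code: str) -> CpcSubclass:
    """Split a four-character CPC subclass code into its hierarchy levels."""
    match = _SUBCLASS_RE.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"Not a CPC subclass code: {code!r}")
    section, digits, letter = match.groups()
    return CpcSubclass(section, section + digits, section + digits + letter)


print(parse_cpc_subclass("G06F"))
```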
## 🚀 Usage
```python
from transformers import BertForSequenceClassification, BertTokenizer
import json
import torch
# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')
# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)
# Get prediction
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()
# Use model labels (CPC codes)
# id2label keys are integers once the config is loaded by transformers
predicted_label = model.config.id2label[predicted_class_id]
print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
```
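
Since a patent may plausibly belong to more than one subclass, it can be useful to inspect the few highest-scoring predictions rather than only the top one. The following is a minimal sketch in plain Python; `top_k_labels` is a hypothetical helper, and the toy probabilities and labels are made up for illustration.

```python
from typing import Dict, List, Tuple


def top_k_labels(
    probs: List[float], id2label: Dict[int, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy data standing in for the model output (illustrative values only).
toy_probs = [0.1, 0.6, 0.3]
toy_labels = {0: "A01B", 1: "G06F", 2: "H04L"}
print(top_k_labels(toy_probs, toy_labels, k=2))  # [('G06F', 0.6), ('H04L', 0.3)]
```

With the model loaded as in the usage example, the same helper could be called as `top_k_labels(predictions[0].tolist(), model.config.id2label, k=5)`.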
## 📁 Included Files
- `model.safetensors`: Model weights (420 MB)
- `config.json`: Configuration with integrated CPC labels
- `vocab.txt`: Tokenizer vocabulary
- `tokenizer_config.json`: Tokenizer configuration
- `labels.json`: Complete CPC label mapping (656 authentic labels)
- `README.md`: This documentation
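
The layout of `labels.json` is not described here. Assuming it is a JSON object that maps class ids to CPC codes, keep in mind that JSON object keys are always strings, so they need converting before they can be indexed with the integer class id. The sketch below shows that conversion on an inline example; `load_id2label` is a hypothetical helper and the two sample entries are invented, not taken from the file.

```python
import json
from typing import Dict


def load_id2label(json_text: str) -> Dict[int, str]:
    """Parse an id -> label JSON object and convert its string keys to ints."""
    raw = json.loads(json_text)
    return {int(class_id): label for class_id, label in raw.items()}


# Invented sample standing in for the contents of labels.json (assumed layout).
example = '{"0": "A01B", "1": "A01C"}'
id2label = load_id2label(example)
print(id2label[0])  # A01B
```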
## 🔬 Performance
This model was trained on a large patent corpus to classify documents automatically according to the CPC system, using the same 656 CPC codes as the original PatentBERT training data.
## 📖 References
- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
- [Original PatentBERT Paper](https://arxiv.org/abs/2103.02557)
## 📝 Citation
If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.
```
@article{patent_bert,
author = "Jieh-Sheng Lee and Jieh Hsiang",
title = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
journal = "World Patent Information",
volume = "61",
number = "101965",
year = "2020",
}
```