---
license: gpl-3.0
language:
- en
base_model:
- google-bert/bert-base-uncased
---
# PatentBERT - PyTorch

A BERT model fine-tuned for patent classification with the **CPC (Cooperative Patent Classification) system**. (This is a PyTorch conversion of the original [PatentBERT](https://github.com/jiehsheng/PatentBERT/) model.)

## 📊 Specifications

- **Output classes**: 656 (CPC subclass labels)
- **Classification system**: CPC (Cooperative Patent Classification)
- **Architecture**: BERT-base (768 hidden, 12 layers, 12 attention heads)
- **Vocabulary**: 30,522 tokens
- **Format**: SafeTensors
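
These figures can be checked without downloading the weights by reading the published configuration (a minimal sketch using the standard `transformers` API; the values in the comments are the ones listed above):

```python
from transformers import AutoConfig

# Fetch only config.json from the Hub (no model weights are downloaded)
config = AutoConfig.from_pretrained("ZoeYou/patentbert-pytorch")

print(config.num_labels)           # 656 CPC subclass labels
print(config.hidden_size)          # 768
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12
print(config.vocab_size)           # 30522
```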

## 🏷️ CPC Classes (Real Distribution)

The model predicts subclasses from the same CPC label set used to train the original PatentBERT; the snippet after the list below recomputes the per-section counts from the model's label map:

### Main Sections (Actual Counts)
- **A (84 classes)**: Human Necessities - Agriculture, Food, Health, Sports
- **B (171 classes)**: Performing Operations; Transporting - Manufacturing, Transport
- **C (88 classes)**: Chemistry; Metallurgy - Chemical processes, Materials
- **D (40 classes)**: Textiles; Paper - Fibers, Fabrics, Paper-making
- **E (31 classes)**: Fixed Constructions - Building, Mining, Roads
- **F (101 classes)**: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- **G (81 classes)**: Physics - Optics, Acoustics, Computing, Measuring
- **H (51 classes)**: Electricity - Electronics, Power generation, Communication
- **Y (9 classes)**: General Tagging of New Technological Developments
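
These per-section counts can be recomputed directly from the model's label map (a minimal sketch; it assumes each `id2label` value is a CPC subclass code such as `A01B`, whose first character is the section letter):

```python
from collections import Counter
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ZoeYou/patentbert-pytorch")

# The first character of each CPC subclass code is its section letter (A-H, Y)
sections = Counter(label[0] for label in config.id2label.values())
for section, count in sorted(sections.items()):
    print(f"Section {section}: {count} subclasses")
```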

### Example of CPC Subclasses

- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- `C07D`: HETEROCYCLIC COMPOUNDS
- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
- `H04L`: TRANSMISSION OF DIGITAL INFORMATION

## 🚀 Usage

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

# Inference example
text = "A method for producing synthetic materials with enhanced thermal properties..."
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits.softmax(dim=-1)

# Get the top prediction for the single input
predicted_class_id = predictions.argmax(dim=-1).item()
confidence = predictions.max().item()

# Map the class ID to its CPC subclass code (id2label keys are ints in transformers)
predicted_label = model.config.id2label[predicted_class_id]

print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
print(f"Confidence: {confidence:.2%}")
```
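
Patent documents often fit several CPC subclasses, so the top-k predictions are usually more informative than the argmax alone. A small extension, continuing from the snippet above:

```python
# Continuing from the snippet above: list the 5 highest-scoring CPC subclasses
top_k = torch.topk(predictions, k=5, dim=-1)

for score, class_id in zip(top_k.values[0], top_k.indices[0]):
    label = model.config.id2label[class_id.item()]
    print(f"{label}: {score.item():.2%}")
```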

## πŸ“ Included Files

- `model.safetensors`: Model weights (420 MB)
- `config.json`: Configuration with integrated CPC labels
- `vocab.txt`: Tokenizer vocabulary
- `tokenizer_config.json`: Tokenizer configuration
- `labels.json`: Complete CPC label mapping (all 656 labels; see the snippet after this list)
- `README.md`: This documentation
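
The label mapping can also be inspected on its own via `huggingface_hub`, without loading the model (a minimal sketch; the exact JSON layout of `labels.json`, assumed here to be a flat mapping, should be checked against the file itself):

```python
import json
from huggingface_hub import hf_hub_download

# Download just labels.json from the repository
path = hf_hub_download(repo_id="ZoeYou/patentbert-pytorch", filename="labels.json")

with open(path, encoding="utf-8") as f:
    labels = json.load(f)

print(len(labels))  # expected: 656 CPC labels
```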

## 🔬 Performance

The model was trained on a large patent corpus to classify documents according to the CPC system, using the same 656 CPC subclass codes as the original PatentBERT training data.

## 📖 References

- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
- [Original PatentBERT Paper](https://arxiv.org/abs/1906.02124)

## πŸ“ Citation

If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.

```bibtex
@article{patent_bert,
  author  = {Jieh-Sheng Lee and Jieh Hsiang},
  title   = {{PatentBERT}: Patent classification with fine-tuning a pre-trained {BERT} model},
  journal = {World Patent Information},
  volume  = {61},
  pages   = {101965},
  year    = {2020},
}
```