	---
	license: gpl-3.0
	language:
	- en
	base_model:
	- google-bert/bert-base-uncased
	---
	# PatentBERT - PyTorch

	A BERT model fine-tuned for patent classification under the CPC (Cooperative Patent Classification) system. This is a PyTorch version of the original [PatentBERT](https://github.com/jiehsheng/PatentBERT/) model.

	## 📊 Specifications

	- Output classes: 656 (CPC subclass labels)
	- Classification system: CPC (Cooperative Patent Classification)
	- Architecture: BERT-base (768 hidden, 12 layers, 12 attention heads)
	- Vocabulary: 30,522 tokens
	- Format: SafeTensors

	## 🏷️ CPC Classes (Real Distribution)

	The model predicts the CPC labels used in the original PatentBERT training data. The counts below give the number of output classes in each CPC section:

	### Main Sections (Actual Counts)
	- A (84 classes): Human Necessities - Agriculture, Food, Health, Sports
	- B (171 classes): Performing Operations; Transporting - Manufacturing, Transport
	- C (88 classes): Chemistry; Metallurgy - Chemical processes, Materials
	- D (40 classes): Textiles; Paper - Fibers, Fabrics, Paper-making
	- E (31 classes): Fixed Constructions - Building, Mining, Roads
	- F (101 classes): Mechanical Engineering; Lighting; Heating; Weapons; Blasting
	- G (81 classes): Physics - Optics, Acoustics, Computing, Measuring
	- H (51 classes): Electricity - Electronics, Power generation, Communication
	- Y (9 classes): General Tagging of New Technological Developments
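
As a quick sanity check, the per-section counts listed above can be summed to confirm that they match the 656 output classes. The dictionary below simply transcribes the list; the variable names are illustrative.

```python
# Per-section label counts, transcribed from the list above.
section_counts = {
    "A": 84, "B": 171, "C": 88, "D": 40, "E": 31,
    "F": 101, "G": 81, "H": 51, "Y": 9,
}

total_labels = sum(section_counts.values())
print(total_labels)  # 656
```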

	### Example of CPC Subclasses

	- `A01B`: SOIL WORKING IN AGRICULTURE OR FORESTRY
	- `B25J`: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
	- `C07D`: HETEROCYCLIC COMPOUNDS
	- `G06F`: ELECTRIC DIGITAL DATA PROCESSING
	- `H04L`: TRANSMISSION OF DIGITAL INFORMATION
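
Each subclass code combines a section letter, a two-digit class number, and a subclass letter, so `G06F` belongs to class `G06` in section `G`. The sketch below splits a code into those levels and rolls subclass probabilities up to section level. `split_cpc_subclass` and `section_probabilities` are hypothetical helpers, not part of this repository, and the probabilities are made up for illustration.

```python
from collections import defaultdict


def split_cpc_subclass(code):
    """Split a four-character CPC subclass code into its hierarchy levels."""
    if len(code) != 4:
        raise ValueError(f"expected a 4-character subclass code, got {code!r}")
    return {"section": code[0], "class": code[:3], "subclass": code}


def section_probabilities(label_probs):
    """Sum subclass probabilities by CPC section (the first letter of the code)."""
    totals = defaultdict(float)
    for code, prob in label_probs.items():
        totals[split_cpc_subclass(code)["section"]] += prob
    return dict(totals)


# Made-up probabilities for illustration only, not real model output.
toy_probs = {"G06F": 0.5, "H04L": 0.25, "G06N": 0.125, "A01B": 0.125}

parts = split_cpc_subclass("G06F")
by_section = section_probabilities(toy_probs)
```

Aggregating this way can be useful when only a coarse technology area is needed, since confusions between neighbouring subclasses in the same section no longer matter.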

	## 🚀 Usage

	```python
	from transformers import BertForSequenceClassification, BertTokenizer
	import json
	import torch

	# Load model and tokenizer
	model = BertForSequenceClassification.from_pretrained('ZoeYou/patentbert-pytorch')
	tokenizer = BertTokenizer.from_pretrained('ZoeYou/patentbert-pytorch')

	# Inference example
	text = "A method for producing synthetic materials with enhanced thermal properties..."
	inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding=True)

	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = outputs.logits.softmax(dim=-1)

	# Get prediction
	predicted_class_id = predictions.argmax().item()
	confidence = predictions.max().item()

	# Use model labels (CPC codes); transformers stores id2label with integer keys
	predicted_label = model.config.id2label[predicted_class_id]

	print(f"Predicted CPC class: {predicted_label} (ID: {predicted_class_id})")
	print(f"Confidence: {confidence:.2%}")
	```
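
Patents often fall under more than one CPC subclass, so it can help to inspect several top-ranked labels rather than only the argmax. The following is a minimal sketch using only the standard library. `top_k_predictions` is a hypothetical helper, and the toy probabilities and label mapping are stand-ins for real model output.

```python
import heapq


def top_k_predictions(probabilities, id2label, k=5):
    """Return the k highest-probability (label, probability) pairs."""
    ranked = heapq.nlargest(k, enumerate(probabilities), key=lambda pair: pair[1])
    return [(id2label[index], prob) for index, prob in ranked]


# Toy stand-ins for the softmax output and the label mapping.
toy_probs = [0.05, 0.60, 0.10, 0.25]
toy_id2label = {0: "A01B", 1: "G06F", 2: "H04L", 3: "C07D"}

top_two = top_k_predictions(toy_probs, toy_id2label, k=2)
print(top_two)  # [('G06F', 0.6), ('C07D', 0.25)]
```

With the model loaded as in the usage example, the same helper could be called as `top_k_predictions(predictions[0].tolist(), model.config.id2label)`, assuming the label mapping is keyed by integer class ID.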

	## 📁 Included Files

	- `model.safetensors`: Model weights (420 MB)
	- `config.json`: Configuration with integrated CPC labels
	- `vocab.txt`: Tokenizer vocabulary
	- `tokenizer_config.json`: Tokenizer configuration
	- `labels.json`: Complete CPC label mapping (656 authentic labels)
	- `README.md`: This documentation
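
The label mapping can also be read without loading the model weights. JSON object keys are always strings, so a mapping read directly from `config.json` needs its keys converted to integers before it can be indexed with a class ID. The sketch below assumes the file has the usual `id2label` field. `load_id2label` is a hypothetical helper, and the two-entry demo file is a stand-in whose ordering is not taken from the real config.

```python
import json
import tempfile
from pathlib import Path


def load_id2label(config_path):
    """Read `id2label` from a config file and convert its keys to integers."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return {int(class_id): label for class_id, label in config["id2label"].items()}


# Demonstrate on a tiny stand-in file; the real config.json is assumed to
# carry the same `id2label` field with all 656 entries.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(
        json.dumps({"id2label": {"0": "A01B", "1": "A01C"}}), encoding="utf-8"
    )
    id2label = load_id2label(demo_path)

print(id2label[0])  # A01B
```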

	## 🔬 Performance

	This model was trained on a large patent corpus to automatically classify documents according to the CPC system, using the exact same 656 CPC codes from the original PatentBERT training data.

	## 📖 References

	- [Cooperative Patent Classification (CPC)](https://www.cooperativepatentclassification.org/)
	- [Original PatentBERT Paper](https://arxiv.org/abs/2103.02557)

	## 📝 Citation

	If you use this model, please cite the original PatentBERT work and mention this PyTorch conversion.
	```bibtex
	@article{patent_bert,
	  author = "Jieh-Sheng Lee and Jieh Hsiang",
	  title = "{PatentBERT: Patent classification with fine-tuning a pre-trained BERT model}",
	  journal = "World Patent Information",
	  volume = "61",
	  number = "101965",
	  year = "2020",
	}
	```