# PlasmidGPT (Addgene GPT-2 Compatible Version)
This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), repackaged for easier integration with the Hugging Face `transformers` library and Hub infrastructure.
## πŸ”¬ About PlasmidGPT
PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share similar characteristics with engineered plasmids while maintaining low sequence identity to training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.
**Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
**Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
**Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
### Key Features
- **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
- **Conditional Generation**: Supports generation based on user-specified starting sequences
- **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
- **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters
## πŸ†š Differences from Original
This version provides:
- βœ… Native HuggingFace `transformers` compatibility (no custom loading required)
- βœ… Standard model format (`model.safetensors` instead of `.pt`)
- βœ… Direct `AutoModel` and `AutoTokenizer` support
- βœ… Simplified installation and usage
## πŸ“¦ Installation
```bash
pip install torch transformers
```
## πŸš€ Quick Start
### Basic Sequence Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the converted model and tokenizer directly from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Condition generation on a user-specified starting sequence
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```
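The decoded string can be saved for downstream tools. A minimal sketch (the filename and record ID are arbitrary placeholders; the `replace` call simply strips any whitespace the detokenizer might insert):

```python
# Write the generated sequence to a FASTA file (filename and record ID are arbitrary).
with open("generated_plasmid.fasta", "w") as fh:
    fh.write(">plasmidgpt_sample_1\n")
    fh.write(generated_sequence.replace(" ", "") + "\n")
```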
### Generate Multiple Sequences
```python
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```
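Sampling is stochastic, so repeated calls return different sequences. If you need reproducible outputs (for example, in tests), seeding the RNGs before calling `generate` is one option:

```python
# Optional: make sampling reproducible across runs.
from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and torch RNGs
```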
### Extract Embeddings
```python
model.config.output_hidden_states = True

with torch.no_grad():
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    # Mean-pool the final hidden layer to get a fixed-size sequence embedding
    hidden_states = outputs.hidden_states[-1]
    embedding = hidden_states.mean(dim=1).cpu().numpy()

print(f"Embedding shape: {embedding.shape}")
```
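For more than a handful of sequences, a padded batch avoids a Python loop. A minimal sketch, assuming the Quick Start `model`, `tokenizer`, and `device` are already set up (if the tokenizer defines no pad token, the EOS token is reused for padding; the input sequences are placeholders):

```python
import torch

sequences = ["ATGGCTAGCGAATTC", "GGCGCGCCTATGAAA"]  # placeholder inputs

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(sequences, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# Mean-pool only over real (non-padding) positions
hidden = out.hidden_states[-1]                # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                       # e.g. (2, 768)
```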
## 🎯 Use Cases
- **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
- **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
- **Feature Prediction**: Predict properties such as lab of origin, species, or vector type (see the classifier sketch after this list)
- **Conditional Generation**: Create sequences starting from specific promoters or genes
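To make the embedding-based items above concrete, here is a minimal sketch of a downstream classifier built on PlasmidGPT embeddings. It assumes scikit-learn is installed and reuses the Quick Start `model`, `tokenizer`, and `device`; the sequences and labels are hypothetical placeholders, and the official prediction heads are available in the [original repository](https://github.com/lingxusb/PlasmidGPT).

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def embed(sequence: str) -> np.ndarray:
    """Mean-pooled last-layer hidden state for a single sequence."""
    ids = tokenizer.encode(sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0).cpu().numpy()

# Hypothetical training data; replace with your own sequences and labels
train_sequences = ["ATGGCTAGCGAATTC", "GGCGCGCCTATGAAA", "TTGACAGCTAGCTCAG"]
train_labels = ["mammalian", "bacterial", "bacterial"]

X = np.stack([embed(s) for s in train_sequences])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict([embed("ATGAAACGCATTAGC")]))
```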
## πŸ“Š Model Details
| Parameter | Value |
|-----------|-------|
| **Architecture** | GPT-2 (Decoder-only Transformer) |
| **Parameters** | 110 million |
| **Layers** | 12 |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 |
| **Context Length** | 2048 tokens |
| **Vocabulary Size** | 30,002 |
| **Training Data** | 153k Addgene plasmid sequences |
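
These figures can be cross-checked against the loaded model; the field names below assume the standard GPT-2 configuration, so adjust them if the remote code differs:

```python
# Inspect the loaded configuration (standard GPT-2 field names assumed)
cfg = model.config
print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions, cfg.vocab_size)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```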
## πŸ“š Citation
If you use this model, please cite the original PlasmidGPT paper:
```bibtex
@article{shao2024plasmidgpt,
title={PlasmidGPT: a generative framework for plasmid design and annotation},
author={Shao, Bin and others},
journal={bioRxiv},
year={2024},
doi={10.1101/2024.09.30.615762},
url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```
## πŸ“„ License
This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.
## πŸ™ Credits
**Original Author:** Bin Shao (lingxusb)
**Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)
**Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.
## πŸ”— Related Resources
- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)
## ⚠️ Notes
- The model generates DNA sequences for research purposes
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for prediction model weights