# PlasmidGPT (Addgene GPT-2 Compatible Version)
This is a compatibility-enhanced version of PlasmidGPT by Bin Shao (lingxusb), repackaged for easier integration with recent versions of the HuggingFace `transformers` library and Hub infrastructure.
## About PlasmidGPT
PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from Addgene. It generates de novo plasmid sequences that share characteristics with engineered plasmids while maintaining low sequence identity to the training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and it learns informative embeddings for both engineered and natural plasmids.
- **Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- **Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
- **Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
## Key Features
- Novel Sequence Generation: Generates novel plasmid sequences rather than replicating training data
- Conditional Generation: Supports generation based on user-specified starting sequences
- Versatile Predictions: Predicts sequence-related attributes including lab of origin, species, and vector type
- Transformer Architecture: Decoder-only transformer with 12 layers and 110 million parameters
## Differences from Original
This version provides:
- Native HuggingFace `transformers` compatibility (no custom loading required)
- Standard model format (`model.safetensors` instead of `.pt`)
- Direct `AutoModel` and `AutoTokenizer` support
- Simplified installation and usage
## Installation
```bash
pip install torch transformers
```
## Quick Start
### Basic Sequence Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the model and tokenizer directly from the HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Condition generation on a user-specified starting sequence
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```
### Generate Multiple Sequences
```python
# Sample several candidate sequences in a single call
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```
### Extract Embeddings
```python
# Ask the model to return hidden states from the forward pass
model.config.output_hidden_states = True

with torch.no_grad():
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    hidden_states = outputs.hidden_states[-1]  # last layer: (batch, seq_len, hidden)

# Mean-pool over the sequence dimension to get one vector per sequence
embedding = hidden_states.mean(dim=1).cpu().numpy()
print(f"Embedding shape: {embedding.shape}")  # (1, 768)
```
## Use Cases
- Plasmid Design: Generate novel plasmid sequences for synthetic biology applications
- Sequence Analysis: Extract meaningful embeddings for downstream ML tasks
- Feature Prediction: Predict properties like lab of origin, species, or vector type
- Conditional Generation: Create sequences starting from specific promoters or genes (see the sketch after this list)
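As an illustration of conditional generation, the sketch below primes the model with the canonical T7 promoter sequence. The promoter choice and sampling settings are assumptions for demonstration, not values prescribed by the original paper.

```python
# Hypothetical example: prime generation with the T7 promoter.
# Promoter and sampling settings are illustrative assumptions.
t7_promoter = 'TAATACGACTCACTATAG'
promoter_ids = tokenizer.encode(t7_promoter, return_tensors='pt').to(device)

outputs = model.generate(
    promoter_ids,
    max_length=400,
    do_sample=True,
    temperature=1.0,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```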
## Model Details
| Parameter | Value |
|---|---|
| Architecture | GPT-2 (Decoder-only Transformer) |
| Parameters | 110 million |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 2048 tokens |
| Vocabulary Size | 30,002 |
| Training Data | 153k Addgene plasmid sequences |
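To confirm these values against the checkpoint you actually loaded, they can be read off the model config; a small sketch assuming the standard `GPT2Config` field names:

```python
# Inspect the loaded config; attribute names are standard GPT2Config fields
cfg = model.config
print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions, cfg.vocab_size)
# Expected per the table above: 12 12 768 2048 30002
```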
## Citation
If you use this model, please cite the original PlasmidGPT paper:
```bibtex
@article{shao2024plasmidgpt,
  title={PlasmidGPT: a generative framework for plasmid design and annotation},
  author={Shao, Bin and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.09.30.615762},
  url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```
## License
This model inherits the license from the original PlasmidGPT repository. Please refer to the original repository for licensing details.
## Credits
- **Original Author:** Bin Shao (lingxusb)
- **Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)
- **Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.
## Related Resources
- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)
## Notes
- The model generates DNA sequences for research purposes
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab, species, vector type), refer to the original repository for prediction model weights