# PlasmidGPT (Addgene GPT-2 Compatible Version)
This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), repackaged for easier integration with the Hugging Face `transformers` library and Hub infrastructure.
## πŸ”¬ About PlasmidGPT
PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share similar characteristics with engineered plasmids while maintaining low sequence identity to training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.
**Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
**Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
**Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
### Key Features
- **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
- **Conditional Generation**: Supports generation based on user-specified starting sequences
- **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
- **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters
## πŸ†š Differences from Original
This version provides:
- βœ… Native HuggingFace `transformers` compatibility (no custom loading required)
- βœ… Standard model format (`model.safetensors` instead of `.pt`)
- βœ… Direct `AutoModel` and `AutoTokenizer` support
- βœ… Simplified installation and usage
## πŸ“¦ Installation
```bash
pip install torch transformers
```
## πŸš€ Quick Start
### Basic Sequence Generation
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Load the converted model and tokenizer directly from the Hub
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Condition generation on a user-specified starting sequence
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)
generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```
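The decoded string can be saved for downstream tools. A minimal sketch (the filename and record ID are arbitrary placeholders; the `replace` call simply strips any whitespace the detokenizer might insert):

```python
# Write the generated sequence to a FASTA file (filename and record ID are arbitrary).
with open("generated_plasmid.fasta", "w") as fh:
    fh.write(">plasmidgpt_sample_1\n")
    fh.write(generated_sequence.replace(" ", "") + "\n")
```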
### Generate Multiple Sequences
```python
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```
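Sampling is stochastic, so repeated calls return different sequences. If you need reproducible outputs (for example, in tests), seeding the RNGs before calling `generate` is one option:

```python
# Optional: make sampling reproducible across runs.
from transformers import set_seed

set_seed(42)  # seeds Python, NumPy, and torch RNGs
```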
### Extract Embeddings
```python
model.config.output_hidden_states = True

with torch.no_grad():
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    # Mean-pool the final hidden layer to get a fixed-size sequence embedding
    hidden_states = outputs.hidden_states[-1]
    embedding = hidden_states.mean(dim=1).cpu().numpy()

print(f"Embedding shape: {embedding.shape}")
```
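For more than a handful of sequences, a padded batch avoids a Python loop. A minimal sketch, assuming the Quick Start `model`, `tokenizer`, and `device` are already set up (if the tokenizer defines no pad token, the EOS token is reused for padding; the input sequences are placeholders):

```python
import torch

sequences = ["ATGGCTAGCGAATTC", "GGCGCGCCTATGAAA"]  # placeholder inputs

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(sequences, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# Mean-pool only over real (non-padding) positions
hidden = out.hidden_states[-1]                # (batch, seq_len, hidden)
mask = batch["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                       # e.g. (2, 768)
```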
## 🎯 Use Cases
- **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
- **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
- **Feature Prediction**: Predict properties such as lab of origin, species, or vector type (see the classifier sketch after this list)
- **Conditional Generation**: Create sequences starting from specific promoters or genes
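To make the embedding-based items above concrete, here is a minimal sketch of a downstream classifier built on PlasmidGPT embeddings. It assumes scikit-learn is installed and reuses the Quick Start `model`, `tokenizer`, and `device`; the sequences and labels are hypothetical placeholders, and the official prediction heads are available in the [original repository](https://github.com/lingxusb/PlasmidGPT).

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def embed(sequence: str) -> np.ndarray:
    """Mean-pooled last-layer hidden state for a single sequence."""
    ids = tokenizer.encode(sequence, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0).cpu().numpy()

# Hypothetical training data; replace with your own sequences and labels
train_sequences = ["ATGGCTAGCGAATTC", "GGCGCGCCTATGAAA", "TTGACAGCTAGCTCAG"]
train_labels = ["mammalian", "bacterial", "bacterial"]

X = np.stack([embed(s) for s in train_sequences])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict([embed("ATGAAACGCATTAGC")]))
```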
## πŸ“Š Model Details
| Parameter | Value |
|-----------|-------|
| **Architecture** | GPT-2 (Decoder-only Transformer) |
| **Parameters** | 110 million |
| **Layers** | 12 |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 |
| **Context Length** | 2048 tokens |
| **Vocabulary Size** | 30,002 |
| **Training Data** | 153k Addgene plasmid sequences |
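
These figures can be cross-checked against the loaded model; the field names below assume the standard GPT-2 configuration, so adjust them if the remote code differs:

```python
# Inspect the loaded configuration (standard GPT-2 field names assumed)
cfg = model.config
print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions, cfg.vocab_size)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```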
## πŸ“š Citation
If you use this model, please cite the original PlasmidGPT paper:
```bibtex
@article{shao2024plasmidgpt,
title={PlasmidGPT: a generative framework for plasmid design and annotation},
author={Shao, Bin and others},
journal={bioRxiv},
year={2024},
doi={10.1101/2024.09.30.615762},
url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```
## πŸ“„ License
This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.
## πŸ™ Credits
**Original Author:** Bin Shao (lingxusb)
**Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)
**Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.
## πŸ”— Related Resources
- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)
## ⚠️ Notes
- The model generates DNA sequences for research purposes
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for prediction model weights