# PlasmidGPT (Addgene GPT-2 Compatible Version)

This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), optimized for easier integration with the Hugging Face `transformers` library and Hub infrastructure.

## About PlasmidGPT

PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share characteristics with engineered plasmids while maintaining low sequence identity to the training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and it learns informative embeddings for both engineered and natural plasmids.

**Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

**Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)

**Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)

### Key Features

- **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
- **Conditional Generation**: Supports generation based on user-specified starting sequences
- **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
- **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters

## Differences from Original

This version provides:

- Native Hugging Face `transformers` compatibility (no custom loading required)
- Standard model format (`model.safetensors` instead of `.pt`)
- Direct `AutoModel` and `AutoTokenizer` support
- Simplified installation and usage

## Installation

```bash
pip install torch transformers
```

## Quick Start

### Basic Sequence Generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Prompt the model with a starting DNA sequence
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

# Sample one continuation; max_length counts tokens (prompt included), not bases
outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```

### Generate Multiple Sequences

```python
# Reuses `model`, `tokenizer`, and `input_ids` from the basic example
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Print the first 100 characters of each sampled sequence
for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```
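
The `temperature`, `top_k`, and `top_p` arguments reshape the next-token distribution before a token is sampled. The sketch below illustrates the idea on a made-up four-token vocabulary using only the standard library. `filter_distribution` and the example logits are hypothetical and written for illustration; this is not the `transformers` implementation, only an approximation of the order in which it applies these filters.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering.

    Illustrative helper (an assumption of this sketch), not a library function.
    """
    # Temperature rescales the logits: values > 1 flatten the distribution,
    # values < 1 sharpen it.
    scaled = [x / temperature for x in logits]

    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely, then keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p (nucleus): keep the shortest prefix whose renormalized
    # cumulative probability reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up values for four tokens
    dist = filter_distribution(toy_logits, temperature=1.2, top_k=3, top_p=0.95)
    token = random.choices(list(dist), weights=list(dist.values()))[0]
    print(dist, "-> sampled token index", token)
```

In this toy setting, raising `temperature` spreads probability toward less likely tokens, while lowering `top_k` or `top_p` discards the tail of the distribution; together they trade diversity against conservatism in the sampled sequences.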

### Extract Embeddings

```python
# Ask the model to return the hidden states of every layer
model.config.output_hidden_states = True

with torch.no_grad():
    # "ATGCGTACG..." is a placeholder; replace it with a full sequence
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    # Last-layer hidden states, shape (batch, tokens, hidden_size)
    hidden_states = outputs.hidden_states[-1]

# Mean-pool over the token dimension to get one vector per sequence
embedding = hidden_states.mean(dim=1).cpu().numpy()
print(f"Embedding shape: {embedding.shape}")
```
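
One common way to use mean-pooled embeddings downstream is to compare two sequences by cosine similarity. The following is a minimal sketch, assuming each embedding has been reduced to a 1-D vector (for example `embedding[0]` from the snippet above). `cosine_similarity` is an illustrative helper, not part of the model's API, and the vectors in the usage example are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length 1-D vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up 4-dimensional vectors standing in for real 768-dimensional embeddings
    emb_a = [0.2, -0.1, 0.4, 0.3]
    emb_b = [0.1, -0.2, 0.5, 0.2]
    print(f"Cosine similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```

Values near 1 indicate embeddings pointing in nearly the same direction; whether that corresponds to functional similarity between plasmids is something to check on your own data.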

## Use Cases

- **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
- **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
- **Feature Prediction**: Predict properties like lab of origin, species, or vector type
- **Conditional Generation**: Create sequences starting from specific promoters or genes

## Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | GPT-2 (Decoder-only Transformer) |
| **Parameters** | 110 million |
| **Layers** | 12 |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 |
| **Context Length** | 2048 tokens |
| **Vocabulary Size** | 30,002 |
| **Training Data** | 153k Addgene plasmid sequences |
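
The 110 million figure can be roughly reproduced from the other rows of the table. The sketch below does the arithmetic under assumptions that are not stated in this card: a standard GPT-2 block layout, an MLP width of four times the hidden size, learned position embeddings, biases on all linear layers, and an output head that shares weights with the token embedding. `gpt2_parameter_count` is a hypothetical helper written for this illustration.

```python
def gpt2_parameter_count(vocab_size, context_length, hidden_size, num_layers):
    """Estimate the parameter count of a standard GPT-2 style decoder.

    Assumes a 4x MLP expansion, biases everywhere, and tied input/output
    embeddings (assumptions of this sketch, not confirmed by the model card).
    """
    d = hidden_size

    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d + context_length * d

    # Self-attention: fused Q/K/V projection and output projection (with biases)
    attention = (d * 3 * d + 3 * d) + (d * d + d)

    # Feed-forward network: expand to 4*d, then project back to d (with biases)
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)

    # Two LayerNorms per block, each with a weight and a bias vector
    layer_norms = 2 * (2 * d)

    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + num_layers * per_layer + final_layer_norm


total = gpt2_parameter_count(
    vocab_size=30_002, context_length=2_048, hidden_size=768, num_layers=12
)
print(f"{total:,} parameters")  # 109,670,400 parameters
```

The number of attention heads does not appear in the calculation because splitting the hidden size across heads leaves the projection matrix shapes unchanged.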

## Citation

If you use this model, please cite the original PlasmidGPT paper:

```bibtex
@article{shao2024plasmidgpt,
  title={PlasmidGPT: a generative framework for plasmid design and annotation},
  author={Shao, Bin and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.09.30.615762},
  url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```

## License

This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.

## Credits

**Original Author:** Bin Shao (lingxusb)

**Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)

**Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.

## Related Resources

- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)

## Notes

- The model generates DNA sequences for research purposes only
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab of origin, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for the prediction model weights
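
As a first-pass sanity check on generated output, a short script can report sequence length, GC content, and any characters outside the DNA alphabet. This is only a sketch and not a substitute for proper validation. `summarize_sequence` is a hypothetical helper, and it assumes that decoded sequences should contain only `A`, `C`, `G`, and `T` once whitespace is removed.

```python
def summarize_sequence(sequence):
    """Return basic statistics for a DNA string (illustrative helper)."""
    # Normalize: drop any whitespace and compare in upper case
    cleaned = "".join(sequence.split()).upper()

    # Anything outside A/C/G/T is reported rather than silently dropped
    invalid = sorted(set(cleaned) - set("ACGT"))

    # GC content as a fraction of all characters in the cleaned sequence
    gc_count = cleaned.count("G") + cleaned.count("C")
    gc_content = gc_count / len(cleaned) if cleaned else 0.0

    return {
        "length": len(cleaned),
        "gc_content": gc_content,
        "invalid_characters": invalid,
    }


if __name__ == "__main__":
    # The prompt from the Quick Start section, used here as a stand-in
    # for a generated sequence
    report = summarize_sequence("ATGGCTAGCGAATTCGGCGCGCCT")
    print(report)
```

A non-empty `invalid_characters` list or an unexpected length is a signal to inspect the sequence before passing it to annotation or synthesis tools.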