McClain Claude committed on
Commit 3c3d7e9 · 1 Parent(s): b1f231f

Add comprehensive README with citations to original PlasmidGPT


🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

.gitattributes DELETED
@@ -1,35 +0,0 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,165 @@
+ # PlasmidGPT (Addgene GPT-2 Compatible Version)
+
+ This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), optimized for easier integration with the modern HuggingFace `transformers` library and Hub infrastructure.
+
+ ## 🔬 About PlasmidGPT
+
+ PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share the characteristics of engineered plasmids while maintaining low sequence identity to the training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.
+
+ **Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+ **Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
+ **Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
+
+ ### Key Features
+
+ - **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
+ - **Conditional Generation**: Supports generation based on user-specified starting sequences
+ - **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
+ - **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters
+
+ ## 🆚 Differences from Original
+
+ This version provides:
+ - ✅ Native HuggingFace `transformers` compatibility (no custom loading required)
+ - ✅ Standard model format (`model.safetensors` instead of `.pt`)
+ - ✅ Direct `AutoModel` and `AutoTokenizer` support
+ - ✅ Simplified installation and usage
+
+ ## 📦 Installation
+
+ ```bash
+ pip install torch transformers
+ ```
+
+ ## 🚀 Quick Start
+
+ ### Basic Sequence Generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "McClain/plasmidgpt-addgene-gpt2",
+     trust_remote_code=True
+ ).to(device)
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "McClain/plasmidgpt-addgene-gpt2",
+     trust_remote_code=True
+ )
+
+ start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
+ input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)
+
+ outputs = model.generate(
+     input_ids,
+     max_length=300,
+     num_return_sequences=1,
+     temperature=1.0,
+     do_sample=True,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(f"Generated sequence: {generated_sequence}")
+ ```
+
+ ### Generate Multiple Sequences
+
+ ```python
+ outputs = model.generate(
+     input_ids,
+     max_length=500,
+     num_return_sequences=5,
+     temperature=1.2,
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ for i, output in enumerate(outputs):
+     sequence = tokenizer.decode(output, skip_special_tokens=True)
+     print(f"Sequence {i+1}: {sequence[:100]}...")
+ ```
+
+ ### Extract Embeddings
+
+ ```python
+ model.config.output_hidden_states = True
+
+ with torch.no_grad():
+     input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
+     outputs = model(input_ids)
+     hidden_states = outputs.hidden_states[-1]
+     embedding = hidden_states.mean(dim=1).cpu().numpy()
+
+ print(f"Embedding shape: {embedding.shape}")
+ ```
+
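+ The pooled embeddings can also be compared directly, for example by cosine similarity. Below is a minimal sketch; the `embed` helper and the two sequences are illustrative placeholders, not from the original documentation:
+
+ ```python
+ import torch.nn.functional as F
+
+ def embed(seq: str) -> torch.Tensor:
+     # Mean-pool the final hidden states into one vector per sequence.
+     with torch.no_grad():
+         ids = tokenizer.encode(seq, return_tensors='pt').to(device)
+         hidden = model(ids, output_hidden_states=True).hidden_states[-1]
+     return hidden.mean(dim=1).squeeze(0)
+
+ # Arbitrary placeholder sequences.
+ a = embed('ATGGCTAGCGAATTCGGCGCGCCT')
+ b = embed('ATGAAACGCATTAGCACCACCATT')
+ print(f"Cosine similarity: {F.cosine_similarity(a, b, dim=0).item():.3f}")
+ ```
+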
+ ## 🎯 Use Cases
+
+ - **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
+ - **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
+ - **Feature Prediction**: Predict properties like lab of origin, species, or vector type
+ - **Conditional Generation**: Create sequences starting from specific promoters or genes
+
+ ## 📊 Model Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **Architecture** | GPT-2 (Decoder-only Transformer) |
+ | **Parameters** | 110 million |
+ | **Layers** | 12 |
+ | **Hidden Size** | 768 |
+ | **Attention Heads** | 12 |
+ | **Context Length** | 2048 tokens |
+ | **Vocabulary Size** | 30,002 |
+ | **Training Data** | 153k Addgene plasmid sequences |
+
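+ These values can be verified against the loaded configuration, which uses the standard GPT-2 config attribute names:
+
+ ```python
+ cfg = model.config
+ print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions, cfg.vocab_size)
+ # Expected, per the table above: 12 12 768 2048 30002
+ ```
+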
+ ## 📚 Citation
+
+ If you use this model, please cite the original PlasmidGPT paper:
+
+ ```bibtex
+ @article{shao2024plasmidgpt,
+   title={PlasmidGPT: a generative framework for plasmid design and annotation},
+   author={Shao, Bin and others},
+   journal={bioRxiv},
+   year={2024},
+   doi={10.1101/2024.09.30.615762},
+   url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
+ }
+ ```
+
+ ## 📄 License
+
+ This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.
+
+ ## 🙏 Credits
+
+ **Original Author:** Bin Shao (lingxusb)
+ **Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)
+ **Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+
+ This compatibility version was created to ease integration with modern ML workflows while preserving all capabilities of the original model.
+
+ ## 🔗 Related Resources
+
+ - [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
+ - [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
+ - [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+ - [Addgene Plasmid Repository](https://www.addgene.org/)
+
+ ## ⚠️ Notes
+
+ - The model generates DNA sequences for research purposes
+ - Generated sequences should be validated before experimental use
+ - The model was trained on Addgene plasmids and performs best on similar sequence types
+ - For prediction tasks (lab of origin, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for the prediction model weights; a generic sketch of the embedding-plus-classifier workflow is shown below
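+
+ As a generic illustration of that workflow (not the original prediction heads), a simple classifier can be fit on the pooled embeddings. A minimal sketch: the `embed` helper comes from the embeddings example above, the labeled data is hypothetical, and scikit-learn is an additional dependency (`pip install scikit-learn`):
+
+ ```python
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+
+ # Hypothetical toy data; real use needs many labeled plasmid sequences.
+ train_seqs = ['ATGGCTAGCGAATTCGGCGCGCCT', 'ATGAAACGCATTAGCACCACCATT']
+ train_labels = ['lab_A', 'lab_B']
+
+ # Stack one pooled embedding per sequence into a feature matrix.
+ X = np.stack([embed(s).cpu().numpy() for s in train_seqs])
+ clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
+ print(clf.predict(X))
+ ```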
config.json DELETED
@@ -1,38 +0,0 @@
- {
-   "activation_function": "gelu_new",
-   "architectures": [
-     "GPT2LMHeadModel"
-   ],
-   "attn_pdrop": 0.1,
-   "bos_token_id": 50256,
-   "embd_pdrop": 0.1,
-   "eos_token_id": 2,
-   "initializer_range": 0.02,
-   "layer_norm_epsilon": 1e-05,
-   "model_type": "gpt2",
-   "n_ctx": 2048,
-   "n_embd": 768,
-   "n_head": 12,
-   "n_inner": null,
-   "n_layer": 12,
-   "n_positions": 2048,
-   "reorder_and_upcast_attn": false,
-   "resid_pdrop": 0.1,
-   "scale_attn_by_inverse_layer_idx": false,
-   "scale_attn_weights": true,
-   "summary_activation": null,
-   "summary_first_dropout": 0.1,
-   "summary_proj_to_labels": true,
-   "summary_type": "cls_index",
-   "summary_use_proj": true,
-   "task_specific_params": {
-     "text-generation": {
-       "do_sample": true,
-       "max_length": 50
-     }
-   },
-   "torch_dtype": "float32",
-   "transformers_version": "4.55.4",
-   "use_cache": true,
-   "vocab_size": 30002
- }
generation_config.json DELETED
@@ -1,7 +0,0 @@
- {
-   "_from_model_config": true,
-   "bos_token_id": 50256,
-   "eos_token_id": 2,
-   "pad_token_id": 3,
-   "transformers_version": "4.55.4"
- }
model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:02365cb1aa0e86be0439567e7a8d15b51234dfdc051025ab1b2e4b157923debb
- size 438696576
special_tokens_map.json DELETED
@@ -1,23 +0,0 @@
- {
-   "bos_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "[SEP]",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": {
-     "content": "[PAD]",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
tokenizer.json DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json DELETED
@@ -1,67 +0,0 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "4": {
-       "content": "[MASK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30000": {
-       "content": "<s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30001": {
-       "content": "</s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<s>",
-   "clean_up_tokenization_spaces": false,
-   "eos_token": "[SEP]",
-   "extra_special_tokens": {},
-   "model_max_length": 1000000000000000019884624838656,
-   "pad_token": "[PAD]",
-   "tokenizer_class": "PreTrainedTokenizerFast"
- }