McClain Claude committed on
Commit 3c3d7e9 · 1 Parent(s): b1f231f

Add comprehensive README with citations to original PlasmidGPT


🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

.gitattributes DELETED
@@ -1,35 +0,0 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,165 @@
+ # PlasmidGPT (Addgene GPT-2 Compatible Version)
+
+ This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), optimized for easier integration with the modern HuggingFace `transformers` library and Hub infrastructure.
+
+ ## 🔬 About PlasmidGPT
+
+ PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share the characteristics of engineered plasmids while maintaining low sequence identity to the training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.
+
+ **Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+ **Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)
+ **Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)
+
+ ### Key Features
+
+ - **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
+ - **Conditional Generation**: Supports generation based on user-specified starting sequences
+ - **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
+ - **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters
+
+ ## 🆚 Differences from Original
+
+ This version provides:
+ - ✅ Native HuggingFace `transformers` compatibility (no custom loading required)
+ - ✅ Standard model format (`model.safetensors` instead of `.pt`)
+ - ✅ Direct `AutoModel` and `AutoTokenizer` support
+ - ✅ Simplified installation and usage
+
+ ## 📦 Installation
+
+ ```bash
+ pip install torch transformers
+ ```
+
+ ## 🚀 Quick Start
+
+ ### Basic Sequence Generation
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "McClain/plasmidgpt-addgene-gpt2",
+     trust_remote_code=True
+ ).to(device)
+ model.eval()
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     "McClain/plasmidgpt-addgene-gpt2",
+     trust_remote_code=True
+ )
+
+ start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
+ input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)
+
+ outputs = model.generate(
+     input_ids,
+     max_length=300,
+     num_return_sequences=1,
+     temperature=1.0,
+     do_sample=True,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(f"Generated sequence: {generated_sequence}")
+ ```
+
+ ### Generate Multiple Sequences
+
+ ```python
+ outputs = model.generate(
+     input_ids,
+     max_length=500,
+     num_return_sequences=5,
+     temperature=1.2,
+     do_sample=True,
+     top_k=50,
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,
+     eos_token_id=tokenizer.eos_token_id
+ )
+
+ for i, output in enumerate(outputs):
+     sequence = tokenizer.decode(output, skip_special_tokens=True)
+     print(f"Sequence {i+1}: {sequence[:100]}...")
+ ```
+
+ ### Extract Embeddings
+
+ ```python
+ model.config.output_hidden_states = True
+
+ with torch.no_grad():
+     input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
+     outputs = model(input_ids)
+     hidden_states = outputs.hidden_states[-1]
+     embedding = hidden_states.mean(dim=1).cpu().numpy()
+
+ print(f"Embedding shape: {embedding.shape}")
+ ```
+
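+ The pooled embeddings can also be compared directly, for example by cosine similarity. Below is a minimal sketch; the `embed` helper and the two sequences are illustrative placeholders, not from the original documentation:
+
+ ```python
+ import torch.nn.functional as F
+
+ def embed(seq: str) -> torch.Tensor:
+     # Mean-pool the final hidden states into one vector per sequence.
+     with torch.no_grad():
+         ids = tokenizer.encode(seq, return_tensors='pt').to(device)
+         hidden = model(ids, output_hidden_states=True).hidden_states[-1]
+     return hidden.mean(dim=1).squeeze(0)
+
+ # Arbitrary placeholder sequences.
+ a = embed('ATGGCTAGCGAATTCGGCGCGCCT')
+ b = embed('ATGAAACGCATTAGCACCACCATT')
+ print(f"Cosine similarity: {F.cosine_similarity(a, b, dim=0).item():.3f}")
+ ```
+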
+ ## 🎯 Use Cases
+
+ - **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
+ - **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
+ - **Feature Prediction**: Predict properties like lab of origin, species, or vector type
+ - **Conditional Generation**: Create sequences starting from specific promoters or genes
+
+ ## 📊 Model Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **Architecture** | GPT-2 (Decoder-only Transformer) |
+ | **Parameters** | 110 million |
+ | **Layers** | 12 |
+ | **Hidden Size** | 768 |
+ | **Attention Heads** | 12 |
+ | **Context Length** | 2048 tokens |
+ | **Vocabulary Size** | 30,002 |
+ | **Training Data** | 153k Addgene plasmid sequences |
+
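+ These values can be verified against the loaded configuration, which uses the standard GPT-2 config attribute names:
+
+ ```python
+ cfg = model.config
+ print(cfg.n_layer, cfg.n_head, cfg.n_embd, cfg.n_positions, cfg.vocab_size)
+ # Expected, per the table above: 12 12 768 2048 30002
+ ```
+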
+ ## 📚 Citation
+
+ If you use this model, please cite the original PlasmidGPT paper:
+
+ ```bibtex
+ @article{shao2024plasmidgpt,
+   title={PlasmidGPT: a generative framework for plasmid design and annotation},
+   author={Shao, Bin and others},
+   journal={bioRxiv},
+   year={2024},
+   doi={10.1101/2024.09.30.615762},
+   url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
+ }
+ ```
+
+ ## 📄 License
+
+ This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.
+
+ ## 🙏 Credits
+
+ **Original Author:** Bin Shao (lingxusb)
+ **Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)
+ **Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+
+ This compatibility version was created to ease integration with modern ML workflows while preserving all capabilities of the original model.
+
+ ## 🔗 Related Resources
+
+ - [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
+ - [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
+ - [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
+ - [Addgene Plasmid Repository](https://www.addgene.org/)
+
+ ## ⚠️ Notes
+
+ - The model generates DNA sequences for research purposes
+ - Generated sequences should be validated before experimental use
+ - The model was trained on Addgene plasmids and performs best on similar sequence types
+ - For prediction tasks (lab of origin, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for the prediction model weights; a generic sketch of the embedding-plus-classifier workflow is shown below
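+
+ As a generic illustration of that workflow (not the original prediction heads), a simple classifier can be fit on the pooled embeddings. A minimal sketch: the `embed` helper comes from the embeddings example above, the labeled data is hypothetical, and scikit-learn is an additional dependency (`pip install scikit-learn`):
+
+ ```python
+ import numpy as np
+ from sklearn.linear_model import LogisticRegression
+
+ # Hypothetical toy data; real use needs many labeled plasmid sequences.
+ train_seqs = ['ATGGCTAGCGAATTCGGCGCGCCT', 'ATGAAACGCATTAGCACCACCATT']
+ train_labels = ['lab_A', 'lab_B']
+
+ # Stack one pooled embedding per sequence into a feature matrix.
+ X = np.stack([embed(s).cpu().numpy() for s in train_seqs])
+ clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
+ print(clf.predict(X))
+ ```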
config.json DELETED
@@ -1,38 +0,0 @@
- {
-   "activation_function": "gelu_new",
-   "architectures": [
-     "GPT2LMHeadModel"
-   ],
-   "attn_pdrop": 0.1,
-   "bos_token_id": 50256,
-   "embd_pdrop": 0.1,
-   "eos_token_id": 2,
-   "initializer_range": 0.02,
-   "layer_norm_epsilon": 1e-05,
-   "model_type": "gpt2",
-   "n_ctx": 2048,
-   "n_embd": 768,
-   "n_head": 12,
-   "n_inner": null,
-   "n_layer": 12,
-   "n_positions": 2048,
-   "reorder_and_upcast_attn": false,
-   "resid_pdrop": 0.1,
-   "scale_attn_by_inverse_layer_idx": false,
-   "scale_attn_weights": true,
-   "summary_activation": null,
-   "summary_first_dropout": 0.1,
-   "summary_proj_to_labels": true,
-   "summary_type": "cls_index",
-   "summary_use_proj": true,
-   "task_specific_params": {
-     "text-generation": {
-       "do_sample": true,
-       "max_length": 50
-     }
-   },
-   "torch_dtype": "float32",
-   "transformers_version": "4.55.4",
-   "use_cache": true,
-   "vocab_size": 30002
- }
generation_config.json DELETED
@@ -1,7 +0,0 @@
- {
-   "_from_model_config": true,
-   "bos_token_id": 50256,
-   "eos_token_id": 2,
-   "pad_token_id": 3,
-   "transformers_version": "4.55.4"
- }
model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:02365cb1aa0e86be0439567e7a8d15b51234dfdc051025ab1b2e4b157923debb
- size 438696576
special_tokens_map.json DELETED
@@ -1,23 +0,0 @@
- {
-   "bos_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "[SEP]",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": {
-     "content": "[PAD]",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
tokenizer.json DELETED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json DELETED
@@ -1,67 +0,0 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "[UNK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "[CLS]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "[SEP]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "[PAD]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "4": {
-       "content": "[MASK]",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30000": {
-       "content": "<s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30001": {
-       "content": "</s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<s>",
-   "clean_up_tokenization_spaces": false,
-   "eos_token": "[SEP]",
-   "extra_special_tokens": {},
-   "model_max_length": 1000000000000000019884624838656,
-   "pad_token": "[PAD]",
-   "tokenizer_class": "PreTrainedTokenizerFast"
- }