---
license: apache-2.0
tags:
- LucaOne
- Biological Foundation Model
- Unified Nucleic Acid and Protein Language Model
- Biology
- AI4Science
- AI4Biology
- Bio
language:
- en
---

# LucaOne/LucaGPLM (old checkpoint: 5.6M)

LucaOne/LucaGPLM: the LUCA Gene-Protein Language Model.

## Installation

Install the package and its pinned dependencies with pip:

```bash
pip install tokenizers==0.19.1
pip install transformers==4.41.2
pip install lucagplm
```

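Because the installation pins exact versions of `tokenizers` and `transformers`, it can be useful to confirm what is actually installed in the current environment. The following is a minimal sketch using only the Python standard library. The names `PINNED` and `check_pins` are hypothetical and are not part of `lucagplm`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the installation commands above.
PINNED = {"tokenizers": "0.19.1", "transformers": "4.41.2"}


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, found == wanted)
    return report


for name, (found, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found} pinned={PINNED[name]} {status}")
```

A `MISMATCH` line only means the environment differs from the pinned versions. Whether other versions also work is not stated here.
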
## Usage

```python
from lucagplm import LucaGPLMModel, LucaGPLMTokenizer

# Load the model and tokenizer
model = LucaGPLMModel.from_pretrained("LucaGroup/LucaOne-default-step5.6M")
tokenizer = LucaGPLMTokenizer.from_pretrained("LucaGroup/LucaOne-default-step5.6M")

# Nucleotide example: seq_type="gene" covers both DNA and RNA sequences
seq = "ATCG"
inputs = tokenizer(seq, seq_type="gene", return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

# Protein example: seq_type="prot"
seq = "NSQTA"
inputs = tokenizer(seq, seq_type="prot", return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```

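The tokenizer call above needs an explicit `seq_type` of `"gene"` or `"prot"`. When processing mixed input, a simple alphabet check can pick the value. This is only a heuristic sketch: `guess_seq_type` and `NUCLEOTIDE_LETTERS` are hypothetical names, not part of `lucagplm`. A peptide made up solely of the letters A, C, G, T, U, or N would be misclassified as a nucleotide sequence.

```python
# Hypothetical helper, not part of lucagplm: guess the seq_type argument
# from the characters that appear in a sequence.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_seq_type(seq: str) -> str:
    """Return "gene" if every character is a nucleotide letter, else "prot"."""
    letters = set(seq.upper())
    if not letters:
        raise ValueError("empty sequence")
    return "gene" if letters <= NUCLEOTIDE_LETTERS else "prot"


print(guess_seq_type("ATCG"))   # gene
print(guess_seq_type("NSQTA"))  # prot
```

The result can be passed straight to the tokenizer, for example `tokenizer(seq, seq_type=guess_seq_type(seq), return_tensors="pt")`. When the sequence type is already known, pass it explicitly instead.
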
## GitHub

For embedding long sequences, please refer to the GitHub repository:

https://github.com/LucaOne/LucaOne
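
As a rough illustration of one common way to handle long inputs, the sketch below splits a sequence into overlapping windows that could each be embedded separately. It does not reproduce the procedure in the repository. `split_into_windows` is a hypothetical helper, and the `window` and `overlap` defaults are placeholder values, not limits taken from the model.

```python
def split_into_windows(seq, window=1000, overlap=100):
    """Split seq into overlapping windows; returns a list of (start, chunk)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not seq:
        return []

    step = window - overlap  # how far the window start advances each time
    windows = []
    start = 0
    while True:
        windows.append((start, seq[start:start + window]))
        if start + window >= len(seq):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


for start, chunk in split_into_windows("ABCDEFGHIJ", window=4, overlap=1):
    print(start, chunk)
```

Each chunk could then be tokenized and passed to the model as in the Usage section. How the per-window outputs should be combined is left open here.
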