|
|
--- |
|
|
tags: |
|
|
- protein language model |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
|
|
|
# PDeepPP model |
|
|
|
|
|
`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts. |
|
|
|
|
|
## Model description |
|
|
|
|
|
`PDeepPP` is a flexible model architecture that integrates transformer-based self-attention with convolutional operations to capture both global (long-range) and local sequence features. The model consists of the following modules (a schematic sketch follows the list):
|
|
|
|
|
1. A **Self-Attention Global Features module** for capturing long-range dependencies. |
|
|
2. A **TransConv1d module**, combining transformers and convolutional layers. |
|
|
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction. |
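
The listing below is a schematic sketch of how these three modules might be composed in PyTorch. The class name, layer choices, and fusion step are illustrative placeholders rather than the released implementation, which ships with the model repository.

```python
import torch
import torch.nn as nn


class PDeepPPSketch(nn.Module):
    """Illustrative composition of the three modules described above (names and layers are placeholders)."""

    def __init__(self, embed_dim=1280, num_classes=2):
        super().__init__()
        # Self-Attention Global Features: long-range dependencies across the sequence
        self.global_attention = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        # TransConv1d: a transformer encoder layer followed by a 1D convolution
        self.transformer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        # PosCNN: stand-in for the position-aware convolutional feature extractor
        self.pos_cnn = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):  # x: (batch, seq_len, embed_dim), e.g. ESM embeddings
        attn_out, _ = self.global_attention(x, x, x)
        trans_out = self.transformer(x)
        trans_out = self.conv(trans_out.transpose(1, 2)).transpose(1, 2)
        pos_out = self.pos_cnn(x.transpose(1, 2)).transpose(1, 2)
        fused = attn_out + trans_out + pos_out        # simple additive fusion, for illustration only
        return self.classifier(fused.mean(dim=1))     # mean-pool over the sequence, then classify
```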
|
|
|
|
|
The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows. |
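
As a purely illustrative example, a combined objective of this kind can be written as a standard classification loss plus an explicit regularization penalty; the actual terms used by `PDeepPP` are defined in its training code.

```python
import torch.nn.functional as F


def combined_loss(logits, labels, model, l2_weight=1e-4):
    """Illustration only: cross-entropy classification loss plus an L2 penalty on the weights."""
    classification_loss = F.cross_entropy(logits, labels)
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
    return classification_loss + l2_weight * l2_penalty
```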
|
|
|
|
|
## Intended uses |
|
|
|
|
|
`PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets: |
|
|
|
|
|
1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues. |
|
|
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses. |
|
|
|
|
|
Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses. |
|
|
|
|
|
--- |
|
|
|
|
|
### Key features |
|
|
|
|
|
- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions. |
|
|
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives. |
|
|
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity. |
|
|
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features; both input styles are sketched below.
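
The snippet below sketches what the two input styles look like in practice. The window length of 33 and the `X` padding character follow the defaults used in the usage example further down; the helper functions themselves are hypothetical and not part of the released code.

```python
PAD_CHAR = "X"   # padding character, as in the usage example below
WINDOW = 33      # matches target_length in the usage example below


def ptm_windows(sequence):
    """PTM mode: fixed-length windows centered on each candidate S/T/Y residue."""
    half = WINDOW // 2
    padded = PAD_CHAR * half + sequence + PAD_CHAR * half
    return [padded[i:i + WINDOW] for i, aa in enumerate(sequence) if aa in "STY"]


def bps_segments(sequence, step=WINDOW):
    """BPS mode: consecutive (here non-overlapping) subsequences of the protein."""
    return [sequence[i:i + WINDOW] for i in range(0, len(sequence), step)]


print(ptm_windows("MKRSSTPLY"))   # one 33-residue window per S/T/Y position
print(bps_segments("MKRSSTPLY"))  # a single segment here, since the toy protein is short
```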
|
|
|
|
|
## How to use |
|
|
|
|
|
To use `PDeepPP`, install the required dependencies: `torch`, `transformers`, and `fair-esm` (which provides the `esm` package used in the example below):
|
|
|
|
|
```bash |
|
|
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118  # CUDA 11.8 build; adjust or drop the index URL for your platform
pip install transformers
pip install fair-esm
|
|
``` |
|
|
Before proceeding, make sure the `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` files are in the same directory as the example script below.
|
|
Here is an example of how to use PDeepPP to process protein sequences and obtain predictions: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
import esm |
|
|
from DataProcessor_pdeeppp import PDeepPPProcessor |
|
|
from Pretraining_pdeeppp import PretrainingPDeepPP |
|
|
from transformers import AutoModel |
|
|
|
|
|
# Global parameter settings |
|
|
device = torch.device("cpu") |
|
|
pad_char = "X" # Padding character |
|
|
target_length = 33 # Target length for sequence padding |
|
|
mode = "BPS" # Mode setting (only configured in example.py) |
|
|
esm_ratio = 1 # Ratio for ESM embeddings |
|
|
|
|
|
# Load the PDeepPP model |
|
|
model_name = "fondress/PDeepPP_neuro" |
|
|
model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model |
|
|
|
|
|
# Initialize the PDeepPPProcessor |
|
|
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length) |
|
|
|
|
|
# Example protein sequences (test sequences) |
|
|
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"] |
|
|
|
|
|
# Preprocess the sequences |
|
|
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter |
|
|
processed_sequences = inputs["raw_sequences"] |
|
|
|
|
|
# Load the ESM model |
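# (esm2_t33_650M_UR50D is the 650M-parameter ESM-2 model; its weights are downloaded and cached on first use)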
|
|
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D() |
|
|
esm_model = esm_model.to(device) |
|
|
esm_model.eval() |
|
|
|
|
|
# Initialize the PretrainingPDeepPP module |
|
|
pretrainer = PretrainingPDeepPP( |
|
|
embedding_dim=1280, |
|
|
target_length=target_length, |
|
|
esm_ratio=esm_ratio, |
|
|
device=device |
|
|
) |
|
|
|
|
|
# Extract the vocabulary and ensure the padding character 'X' is included |
|
|
vocab = set("".join(protein_sequences)) |
|
|
vocab.add(pad_char) # Add the padding character |
|
|
|
|
|
# Generate pretrained features using the PretrainingPDeepPP module |
|
|
pretrained_features = pretrainer.create_embeddings( |
|
|
processed_sequences, vocab, esm_model, esm_alphabet |
|
|
) |
|
|
|
|
|
# Ensure pretrained features are on the same device |
|
|
inputs["input_embeds"] = pretrained_features.to(device) |
|
|
|
|
|
# Perform prediction |
|
|
model.eval() |
|
|
outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input |
|
|
logits = outputs["logits"] |
|
|
|
|
|
# Compute probability distributions and generate predictions |
|
|
softmax = torch.nn.Softmax(dim=-1) # Apply softmax on the last dimension |
|
|
probabilities = softmax(logits) |
|
|
predicted_labels = (probabilities >= 0.5).long() |
|
|
|
|
|
# Print the prediction results for each sequence |
|
|
print("\nPrediction Results:") |
|
|
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    # Assumes a two-class output: column 1 holds the positive-class probability/label
    print(f"Probability: {probabilities[i, 1].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i, 1].item()}")
    print("-" * 50)
|
|
``` |
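
To run the same pipeline in PTM mode, only the mode flag and the input sequences change: inputs should be fixed-length windows centered on the candidate S/T/Y residue. The window below is synthetic and purely for illustration.

```python
# Switching the example above to PTM mode (synthetic input, for illustration only)
mode = "PTM"
candidate_window = "A" * 16 + "S" + "A" * 16   # 33 residues with a candidate serine at the center
protein_sequences = [candidate_window]
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")
```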
|
|
|
|
|
## Training and customization |
|
|
|
|
|
`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as: |
|
|
|
|
|
- **Number of transformer layers** |
|
|
- **Hidden layer size** |
|
|
- **Dropout rate** |
|
|
- **PTM type** and other task-specific parameters |
|
|
|
|
|
Refer to `PDeepPPConfig` for details. |
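
As a minimal sketch, the configuration can be loaded and adjusted through the standard `transformers` API before instantiating the model. The attribute names below (`dropout`, `num_layers`) are placeholders; consult `PDeepPPConfig` for the actual field names.

```python
from transformers import AutoConfig, AutoModel

# Load the configuration shipped with the checkpoint (trust_remote_code is needed for the custom classes)
config = AutoConfig.from_pretrained("fondress/PDeepPP_neuro", trust_remote_code=True)

# Placeholder attribute names -- check PDeepPPConfig for the real hyperparameter fields
config.dropout = 0.3
config.num_layers = 4

model = AutoModel.from_pretrained("fondress/PDeepPP_neuro", config=config, trust_remote_code=True)
```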
|
|
|
|
|
## Citation |
|
|
If you use `PDeepPP` in your research, please cite the associated paper or repository: |
|
|
|
|
|
``` |
|
|
@article{your_reference, |
|
|
title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
|
|
author={Author Name}, |
|
|
journal={Journal Name}, |
|
|
year={2025} |
|
|
} |
|
|
``` |