---
tags:
- protein language model
pipeline_tag: text-classification
---
# PDeepPP model
`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts.
## Model description
`PDeepPP` is a flexible model architecture that integrates transformer-based self-attention with convolutional operations to capture both local and global sequence features. The model consists of:
1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformers and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.
The model is trained with a loss function that combines classification loss and additional regularization terms to enhance generalization and interpretability. It is compatible with Hugging Face's `transformers` library, allowing seamless integration with other tools and workflows.
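As a minimal sketch of how two of these modules might be built from standard PyTorch components (the class names, dimensions, and layer choices below are illustrative assumptions, not the packaged implementation):

```python
import torch
import torch.nn as nn

class SelfAttentionGlobalFeatures(nn.Module):
    """Illustrative: multi-head self-attention for long-range dependencies."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)  # residual connection + layer norm

class PosCNN(nn.Module):
    """Illustrative: position-aware 1D convolution for local features."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Conv1d expects (batch, channels, seq_len), so transpose around the conv
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

# Toy forward pass: 2 sequences of 33 residues with 1280-dim ESM embeddings
x = torch.randn(2, 33, 1280)
x = SelfAttentionGlobalFeatures(1280)(x)
x = PosCNN(1280)(x)
print(x.shape)  # torch.Size([2, 33, 1280])
```

In the released model these components are combined so that attention captures global context while convolutions capture local motifs, with the `TransConv1d` module stacking both operations in a single block.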
## Intended uses
`PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Leveraging its flexible architecture and robust feature extraction capabilities, `PDeepPP` can be applied to a wide range of protein sequence-related analysis tasks. Specifically, the model has been validated on the following datasets:
1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active protein sequences (BPS) to support downstream analyses.
Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture enables users to generalize and extend its capabilities to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.
---
### Key features
- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on fixed-length sequence windows centered on specific residues (S, T, Y) to analyze post-translational modification activity.
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features (both modes are sketched below).
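The difference between the two modes comes down to how candidate windows are drawn from a protein. A small illustration in plain Python (the window size, step, and padding here are assumptions for clarity, not the processor's actual defaults):

```python
# Illustration only: window size, residue set, step, and padding are
# assumptions for clarity, not the processor's actual defaults.

def ptm_windows(sequence, window=33, residues="STY", pad_char="X"):
    """Yield fixed-length windows centered on each candidate PTM residue."""
    half = window // 2
    padded = pad_char * half + sequence + pad_char * half
    for i, aa in enumerate(sequence):
        if aa in residues:
            yield padded[i:i + window]  # window centered on residue i

def bps_windows(sequence, window=33, step=1):
    """Yield overlapping subsequences scanned across the whole protein."""
    if len(sequence) <= window:
        yield sequence  # short proteins pass through as a single window
        return
    for i in range(0, len(sequence) - window + 1, step):
        yield sequence[i:i + window]

# PTM mode centers windows on S/T/Y; BPS mode scans the full sequence
print(list(ptm_windows("MKTAYIAKQR", window=7)))  # ['XMKTAYI', 'KTAYIAK']
```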
## How to use
To use `PDeepPP`, you need to install the required dependencies, including `torch` and `transformers`:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install fair-esm  # provides the `esm` package imported below
```
Before proceeding, ensure that the `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` files are in the same directory as your example script.
Here is an example of how to use PDeepPP to process protein sequences and obtain predictions:
```python
import torch
import esm
from DataProcessor_pdeeppp import PDeepPPProcessor
from Pretraining_pdeeppp import PretrainingPDeepPP
from transformers import AutoModel
# Global parameter settings
device = torch.device("cpu")  # switch to "cuda" if a GPU is available
pad_char = "X"                # padding character
target_length = 33            # target length for sequence padding
mode = "BPS"                  # analysis mode: "PTM" or "BPS"
esm_ratio = 1                 # weighting ratio for ESM embeddings
# Load the PDeepPP model
model_name = "fondress/PDeepPP_neuro"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model
# Initialize the PDeepPPProcessor
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)
# Example protein sequences (test sequences)
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]
# Preprocess the sequences
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter
processed_sequences = inputs["raw_sequences"]
# Load the ESM model
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
esm_model = esm_model.to(device)
esm_model.eval()
# Initialize the PretrainingPDeepPP module
pretrainer = PretrainingPDeepPP(
    embedding_dim=1280,  # embedding size of ESM-2 650M
    target_length=target_length,
    esm_ratio=esm_ratio,
    device=device
)
# Extract the vocabulary and ensure the padding character 'X' is included
vocab = set("".join(protein_sequences))
vocab.add(pad_char) # Add the padding character
# Generate pretrained features using the PretrainingPDeepPP module
pretrained_features = pretrainer.create_embeddings(
processed_sequences, vocab, esm_model, esm_alphabet
)
# Ensure pretrained features are on the same device
inputs["input_embeds"] = pretrained_features.to(device)
# Perform prediction (no gradient tracking needed at inference time)
model.eval()
with torch.no_grad():
    outputs = model(input_embeds=inputs["input_embeds"])  # pretrained features as model input
logits = outputs["logits"]
# Compute probability distributions and generate predictions
probabilities = torch.softmax(logits, dim=-1)  # softmax over the class dimension
# Assuming a two-class output head, column 1 holds the positive-class probability
positive_probs = probabilities[..., 1]
predicted_labels = (positive_probs >= 0.5).long()
# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {positive_probs[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```
## Training and customization
`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:
- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters
Refer to `PDeepPPConfig` for details; a brief loading sketch follows.
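A minimal sketch of loading and adjusting the configuration through `transformers` (the `dropout` attribute below is a hypothetical field name; check `PDeepPPConfig` for the real ones):

```python
from transformers import AutoConfig, AutoModel

# Load the packaged configuration (custom code ships with the model repo)
config = AutoConfig.from_pretrained("fondress/PDeepPP_neuro", trust_remote_code=True)
print(config)  # inspect the available hyperparameters

# Hypothetical override -- actual attribute names are defined in PDeepPPConfig
config.dropout = 0.2

# Build a freshly initialized model from the modified config for fine-tuning
model = AutoModel.from_config(config, trust_remote_code=True)
```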
## Citation
If you use `PDeepPP` in your research, please cite the associated paper or repository:
```bibtex
@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```