---
tags:
- protein language model
pipeline_tag: text-classification
---

# PDeepPP model

`PDeepPP` is a hybrid protein language model that predicts post-translational modification (PTM) sites and extracts biologically relevant features from protein sequences. It combines pretrained `ESM` embeddings with transformer and convolutional neural network (CNN) components.

## Model description

`PDeepPP` combines transformer-based self-attention, which captures global sequence context, with convolutional operations, which capture local sequence patterns. The model consists of:

1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformers and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.

The model is trained with a loss function that combines a classification loss with additional regularization terms intended to improve generalization and interpretability. It is compatible with Hugging Face's `transformers` library and can be loaded with `AutoModel.from_pretrained` using `trust_remote_code=True`.
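As a rough illustration of how a global self-attention branch and a local convolutional branch can be combined, here is a minimal PyTorch sketch. It is not the released `PDeepPP` implementation: the class name `HybridSketch`, the layer sizes, and the pooling choices are all assumptions made for this example, and the PosCNN module and the regularization terms are left out.

```python
import torch
from torch import nn


class HybridSketch(nn.Module):
    """Hypothetical two-branch model: self-attention (global) plus Conv1d (local)."""

    def __init__(self, embed_dim=1280, hidden_dim=128, num_heads=4,
                 kernel_size=3, dropout=0.1):
        super().__init__()
        # Project the pretrained embeddings down to a smaller working size.
        self.project = nn.Linear(embed_dim, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=2 * hidden_dim,
            dropout=dropout,
            batch_first=True,
        )
        # Global branch: every position can attend to every other position.
        self.global_branch = nn.TransformerEncoder(encoder_layer, num_layers=1)
        # Local branch: each output only sees a small window of neighbors.
        self.local_branch = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        # One raw score (logit) per sequence from the concatenated features.
        self.classifier = nn.Linear(2 * hidden_dim, 1)

    def forward(self, input_embeds):
        x = self.project(input_embeds)                      # (batch, length, hidden)
        global_feats = self.global_branch(x).mean(dim=1)    # (batch, hidden)
        local_feats = self.local_branch(x.transpose(1, 2))  # (batch, hidden, length)
        local_feats = local_feats.amax(dim=2)               # (batch, hidden)
        combined = torch.cat([global_feats, local_feats], dim=-1)
        return self.classifier(combined).squeeze(-1)        # (batch,)
```

The sketch expects inputs of shape `(batch, length, embed_dim)`, for example 33 positions of 1280-dimensional ESM-2 embeddings, and returns one score per sequence.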

## Intended uses

`PDeepPP` was developed and validated on PTM and BPS datasets, though its use is not limited to these tasks. The model has been validated on the following datasets:

1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.

Although the model was trained and validated on PTM and BPS datasets, its architecture can be adapted to other protein sequence analysis tasks, such as embedding generation or sequence classification.

---

### Key features

- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.
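The two modes can be pictured with a small pure-Python sketch. This is only an illustration of the idea, not the logic of the bundled `PDeepPPProcessor`: the function names `ptm_windows` and `bps_chunks`, the `stride` parameter, and the padding behavior are assumptions, and `target_length` is assumed to be odd so that a window has a single center.

```python
def ptm_windows(sequence, residues="STY", target_length=33, pad_char="X"):
    """Return (position, window) pairs centered on each candidate residue."""
    half = target_length // 2
    # Pad both ends so residues near the termini still get a full window.
    padded = pad_char * half + sequence + pad_char * half
    windows = []
    for i, residue in enumerate(sequence):
        if residue in residues:
            # In the padded string the residue sits at index i + half,
            # so the slice starting at i is centered on it.
            windows.append((i, padded[i:i + target_length]))
    return windows


def bps_chunks(sequence, target_length=33, stride=None, pad_char="X"):
    """Split a sequence into fixed-length chunks.

    stride == target_length gives non-overlapping chunks (the default);
    a smaller stride gives overlapping chunks. The last chunk is padded.
    """
    stride = stride or target_length
    chunks = []
    start = 0
    while True:
        chunk = sequence[start:start + target_length]
        chunks.append(chunk.ljust(target_length, pad_char))
        if start + target_length >= len(sequence):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    print(ptm_windows("SAT", target_length=5))
    print(bps_chunks("ABCDEFGH", target_length=4, stride=2))
```

In PTM mode each window has the candidate residue at its center; in BPS mode the whole sequence is covered by chunks of equal length.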

## How to use

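To use `PDeepPP`, install the required dependencies: `torch`, `transformers`, and `fair-esm` (which provides the `esm` module used in the example):

```bash
# The index URL below selects CUDA 11.8 wheels; adjust it for your platform
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install fair-esm  # provides the `esm` module imported in the example
```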
Before proceeding, make sure the `DataProcessor_pdeeppp` and `Pretraining_pdeeppp` files are in the same directory as the `example` file, because the example imports both as local modules.
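A quick way to confirm the helper modules are in place is a small check like the one below. It is a sketch under one assumption: that the modules are plain `.py` files named after the imports used in the example (`DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py`). The function name `missing_helper_files` is made up for this illustration.

```python
from pathlib import Path

# Assumed file names, derived from the import statements in the example.
REQUIRED_MODULES = ("DataProcessor_pdeeppp.py", "Pretraining_pdeeppp.py")


def missing_helper_files(directory="."):
    """Return the required module files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_MODULES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_helper_files()
    if missing:
        print("Missing helper files:", ", ".join(missing))
    else:
        print("All helper files found.")
```

An empty result means both files sit next to the script and the imports should resolve.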
Here is an example of how to use PDeepPP to process protein sequences and obtain predictions:

```python
import torch
import esm
from DataProcessor_pdeeppp import PDeepPPProcessor
from Pretraining_pdeeppp import PretrainingPDeepPP
from transformers import AutoModel

# Global parameter settings
device = torch.device("cpu")
pad_char = "X"  # Padding character
target_length = 33  # Target length for sequence padding
mode = "BPS"  # Mode setting (only configured in example.py)
esm_ratio = 1  # Ratio for ESM embeddings

# Load the PDeepPP model
model_name = "fondress/PDeepPP_Antibacterial"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # Directly load the model

# Initialize the PDeepPPProcessor
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)

# Example protein sequences (test sequences)
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]

# Preprocess the sequences
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")  # Dynamic mode parameter
processed_sequences = inputs["raw_sequences"]

# Load the ESM model
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
esm_model = esm_model.to(device)
esm_model.eval()

# Initialize the PretrainingPDeepPP module
pretrainer = PretrainingPDeepPP(
    embedding_dim=1280, 
    target_length=target_length, 
    esm_ratio=esm_ratio, 
    device=device
)

# Extract the vocabulary and ensure the padding character 'X' is included
vocab = set("".join(protein_sequences))
vocab.add(pad_char)  # Add the padding character

# Generate pretrained features using the PretrainingPDeepPP module
pretrained_features = pretrainer.create_embeddings(
    processed_sequences, vocab, esm_model, esm_alphabet
)

# Ensure pretrained features are on the same device
inputs["input_embeds"] = pretrained_features.to(device)

# Perform prediction without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(input_embeds=inputs["input_embeds"])  # Use pretrained features as model input
logits = outputs["logits"]

# Convert scores to probabilities and threshold at 0.5.
# Assumes one raw logit per sequence (the loop below calls .item() on each entry);
# a softmax over a single value always returns 1.0, so sigmoid is used instead.
probabilities = torch.sigmoid(logits).reshape(-1)
predicted_labels = (probabilities >= 0.5).long()

# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {probabilities[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```

## Training and customization

`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters

Refer to `PDeepPPConfig` for details.
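Because the exact attribute names in `PDeepPPConfig` are not listed here, one option is to load the configuration and inspect it. The sketch below uses `AutoConfig.from_pretrained` from `transformers`; the helper names `load_config` and `pick_hyperparameters` and the keyword filter are assumptions for this example, and the keys it finds depend on what the published configuration actually contains.

```python
def load_config(model_name="fondress/PDeepPP_Antibacterial"):
    """Download and return the model configuration (requires network access)."""
    # Imported lazily so the filter below can be used without transformers installed.
    from transformers import AutoConfig

    return AutoConfig.from_pretrained(model_name, trust_remote_code=True)


def pick_hyperparameters(config_dict, keywords=("layer", "hidden", "dropout", "ptm")):
    """Keep only the entries whose key contains one of the keywords."""
    return {
        key: value
        for key, value in config_dict.items()
        if any(word in key.lower() for word in keywords)
    }


if __name__ == "__main__":
    config = load_config()
    for key, value in pick_hyperparameters(config.to_dict()).items():
        print(f"{key}: {value}")
```

Printing the filtered dictionary shows which hyperparameters are available before you decide what to change for fine-tuning.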

## Citation

If you use `PDeepPP` in your research, please cite the associated paper or repository:

```bibtex
@article{your_reference,
  title={PDeepPP: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```