---
tags:
- protein language model
pipeline_tag: text-classification
---

# PDeepPP model

`PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. It combines pretrained embeddings from `ESM` with transformer and convolutional neural network (CNN) components to analyze protein sequences in a range of contexts.

## Model description

`PDeepPP` integrates transformer-based self-attention with convolutional operations so that it captures both local and global sequence features. The model consists of:

1. A **Self-Attention Global Features module** for capturing long-range dependencies.
2. A **TransConv1d module**, combining transformers and convolutional layers.
3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction.

The model is trained with a loss function that combines a classification loss with additional regularization terms to improve generalization and interpretability. It is compatible with Hugging Face's `transformers` library, so it can be loaded and used alongside other tools in that ecosystem.

## Intended uses

`PDeepPP` was developed and validated using PTM and BPS datasets, but its applications are not limited to these specific tasks. Its architecture and feature extraction can be applied to a wide range of protein sequence analysis tasks.

Specifically, the model has been validated on the following datasets:

1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues.
2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses.

Although the model was trained and validated on PTM and BPS datasets, `PDeepPP`’s architecture lets users extend it to other protein sequence analysis tasks, such as embedding generation, sequence classification, or task-specific analyses.

---

### Key features

- **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions.
- **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives.
- **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity.
- **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features.

## How to use

To use `PDeepPP`, install the required dependencies, including `torch` and `transformers`. The usage example also imports `esm`, which is distributed on PyPI as `fair-esm`:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
pip install fair-esm  # provides the `esm` module imported in the example
```

Before proceeding, make sure that the `DataProcessor_pdeeppp.py` and `Pretraining_pdeeppp.py` files are in the same directory as your example script (`example.py`), since the example imports from both.
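The PTM and BPS modes listed under Key features can be pictured with a small, dependency-free sketch. This is not the code in `DataProcessor_pdeeppp`: the function names `ptm_windows` and `bps_windows`, the `stride` parameter, and the exact padding behavior are assumptions made for illustration. The default window length of 33 and the `X` padding character mirror the values used in the usage example in this card.

```python
def ptm_windows(sequence, residues="STY", length=33, pad_char="X"):
    """Return (position, window) pairs centered on each candidate residue.

    Hypothetical helper: assumes an odd window length so the residue
    sits exactly in the middle of the window.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd to center a residue")
    half = length // 2
    padded = pad_char * half + sequence + pad_char * half
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Residue i sits at index i + half in `padded`, so the slice
            # [i, i + length) places it at the center of the window.
            windows.append((i, padded[i:i + length]))
    return windows


def bps_windows(sequence, length=33, stride=None, pad_char="X"):
    """Split a sequence into fixed-length chunks, padding the last one.

    Hypothetical helper: stride == length (the default) gives
    non-overlapping chunks; a smaller stride gives overlapping ones.
    """
    stride = stride or length
    chunks = []
    for start in range(0, max(len(sequence), 1), stride):
        chunk = sequence[start:start + length]
        chunks.append(chunk.ljust(length, pad_char))
        if start + length >= len(sequence):
            break  # the sequence end has been covered
    return chunks


print(ptm_windows("ESHINQKWVCK", length=5))  # [(1, 'XESHI')]
print(bps_windows("ESHINQKWVCK", length=5))  # ['ESHIN', 'QKWVC', 'KXXXX']
```

The short window length of 5 is used only to keep the printed output readable. The real processor may select residues, pad, or chunk differently, so treat this as a mental model of the two modes and not as a substitute for `PDeepPPProcessor`.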
Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions:

```python
import torch
import esm
from DataProcessor_pdeeppp import PDeepPPProcessor
from Pretraining_pdeeppp import PretrainingPDeepPP
from transformers import AutoModel

# Global parameter settings
device = torch.device("cpu")
pad_char = "X"  # Padding character
target_length = 33  # Target length for sequence padding
mode = "BPS"  # Mode setting (only configured in example.py)
esm_ratio = 1  # Ratio for ESM embeddings

# Load the PDeepPP model
model_name = "fondress/PDeepPP_Antibacterial"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # Directly load the model
model = model.to(device)

# Initialize the PDeepPPProcessor
processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length)

# Example protein sequences (test sequences)
protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"]

# Preprocess the sequences
inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt")  # Dynamic mode parameter
processed_sequences = inputs["raw_sequences"]

# Load the ESM model
esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
esm_model = esm_model.to(device)
esm_model.eval()

# Initialize the PretrainingPDeepPP module
pretrainer = PretrainingPDeepPP(
    embedding_dim=1280,
    target_length=target_length,
    esm_ratio=esm_ratio,
    device=device
)

# Extract the vocabulary and ensure the padding character 'X' is included
vocab = set("".join(protein_sequences))
vocab.add(pad_char)  # Add the padding character

# Generate pretrained features using the PretrainingPDeepPP module
pretrained_features = pretrainer.create_embeddings(
    processed_sequences, vocab, esm_model, esm_alphabet
)

# Ensure pretrained features are on the same device
inputs["input_embeds"] = pretrained_features.to(device)

# Perform prediction (no gradients are needed for inference)
model.eval()
with torch.no_grad():
    outputs = model(input_embeds=inputs["input_embeds"])  # Use pretrained features as model input
logits = outputs["logits"]

# Compute probabilities and generate predictions.
# Assumption: `logits` holds one raw score per sequence, as the `.item()`
# calls below require. A sigmoid maps each score to a probability; a softmax
# over a single score would always return 1.0.
probabilities = torch.sigmoid(logits)
predicted_labels = (probabilities >= 0.5).long()

# Print the prediction results for each sequence
print("\nPrediction Results:")
for i, seq in enumerate(processed_sequences):
    print(f"Sequence: {seq}")
    print(f"Probability: {probabilities[i].item():.4f}")
    print(f"Predicted Label: {predicted_labels[i].item()}")
    print("-" * 50)
```

## Training and customization

`PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as:

- **Number of transformer layers**
- **Hidden layer size**
- **Dropout rate**
- **PTM type** and other task-specific parameters

Refer to `PDeepPPConfig` for details.

## Citation

If you use `PDeepPP` in your research, please cite the associated paper or repository:

```
@article{your_reference,
  title={`PDeepPP`: A Hybrid Model for Protein Sequence Analysis},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```