ProteoRift
Abstract
Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering on a single property may exclude low-abundance spectra and uncharacterized peptides, potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network that predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering. This delivers 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics that distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict which mass spectra will score highly against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
Usage
Installation
pip install proteorift
Using Sample Data
from proteorift import ProteoRiftSearch
# Initialize and run with sample data
searcher = ProteoRiftSearch()
results = searcher.search_with_sample_data()
print(f"Results saved to: {results['output_dir']}")
Using Your Own Data
from proteorift import ProteoRiftSearch
# Initialize search
searcher = ProteoRiftSearch()
# Run peptide database search
results = searcher.search(
    mgf_dir="path/to/your/spectra",      # Directory with MGF files
    peptide_db="path/to/your/database",  # Directory with FASTA files
    output_dir="./results"
)
Custom Parameters
searcher = ProteoRiftSearch(
    precursor_tolerance=10,
    precursor_tolerance_type="ppm",
    charge=3,
    length_filter=True,
    device="cuda"
)
results = searcher.search(mgf_dir="...", peptide_db="...")
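To make the precursor_tolerance=10 and precursor_tolerance_type="ppm" settings concrete, the following is a minimal sketch of what a parts-per-million mass window means for candidate filtering. It is not ProteoRift's internal code: the function names ppm_window and mass_filter and the candidate masses are hypothetical, chosen only for illustration.

```python
def ppm_window(precursor_mass, tolerance_ppm=10.0):
    """Return the (low, high) mass bounds for a ppm tolerance."""
    # A ppm tolerance scales with the mass: 10 ppm of 1000 Da is 0.01 Da.
    delta = precursor_mass * tolerance_ppm / 1e6
    return precursor_mass - delta, precursor_mass + delta


def mass_filter(candidates, precursor_mass, tolerance_ppm=10.0):
    """Keep only (name, mass) candidates whose mass falls inside the window."""
    low, high = ppm_window(precursor_mass, tolerance_ppm)
    return [(name, mass) for name, mass in candidates if low <= mass <= high]


# Hypothetical candidates with made-up masses (illustrative values only).
candidates = [
    ("cand_1", 1000.005),
    ("cand_2", 1000.020),
    ("cand_3", 999.995),
]

kept = mass_filter(candidates, precursor_mass=1000.0, tolerance_ppm=10.0)
print([name for name, _ in kept])
```

With a 10 ppm window around 1000.0 Da, the bounds are roughly 999.99 and 1000.01, so cand_2 falls outside the window. Mass-only filtering of this kind is the baseline that the length, missed-cleavage, and modification filters are meant to narrow further.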
Command Line Interface
# Run search with sample data
proteorift search-sample --output-dir ./results
# Run search with your data
proteorift search \
  --mgf-dir path/to/spectra \
  --peptide-db path/to/database \
  --output-dir ./results \
  --tolerance 10 \
  --charge 3
# Download models only
proteorift download-models
Output
ProteoRift generates Percolator-compatible PIN files:
- target.pin - Target peptide-spectrum matches
- decoy.pin - Decoy peptide-spectrum matches
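For quick inspection of these files before handing them to Percolator, a small reader can be sketched as follows. It assumes the general Percolator PIN layout: a tab-delimited file with a header row, an optional DefaultDirection row, and a final Proteins column that may spill over several tab-separated fields. The exact feature columns ProteoRift writes are not specified here, and the helper name read_pin is hypothetical.

```python
import csv


def read_pin(path):
    """Read a Percolator PIN file into a list of dicts.

    The last header column (Proteins) may hold several tab-separated
    entries per row, so it is collected into a list.
    """
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        n_fixed = len(header) - 1  # every column before the final one
        for fields in reader:
            # Skip blank lines and the optional DefaultDirection row.
            if not fields or fields[0] == "DefaultDirection":
                continue
            row = dict(zip(header[:n_fixed], fields[:n_fixed]))
            row[header[-1]] = fields[n_fixed:]
            rows.append(row)
    return rows


# Example usage (paths assume output_dir="./results"):
# targets = read_pin("./results/target.pin")
# decoys = read_pin("./results/decoy.pin")
# print(len(targets), len(decoys))
```

All values are returned as strings; convert feature columns to numbers as needed for downstream analysis.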
Training Data
The model was trained on large-scale mass spectrometry datasets including:
- NIST human peptide libraries
- MassIVE public datasets
- DeepNovo
System Requirements
- GPU Memory: 12GB+ recommended
- Python: 3.8+
- PyTorch: 1.10+
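The requirements above can be checked before running a search with a short script such as the one below. This is a sketch rather than part of the ProteoRift package: check_environment is a hypothetical helper that only reports what it finds, using standard PyTorch calls, and it tolerates PyTorch being absent.

```python
import sys


def check_environment(min_python=(3, 8)):
    """Report Python, PyTorch, and GPU details relevant to the requirements."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_version": None,
        "cuda_available": False,
        "gpu_memory_gb": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; leave the defaults in place.
        return report

    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Total memory of the first visible GPU, converted from bytes to GiB.
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["gpu_memory_gb"] = round(total_bytes / 1024**3, 1)
    return report


print(check_environment())
```

Compare the reported torch_version and gpu_memory_gb against the 1.10+ and 12GB+ recommendations listed above.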
Citation
If you use ProteoRift in your research, please cite the following paper:
Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2
License
This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.