ProteoRift

GitHub | Cite

Abstract

Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering on a single property may exclude non-abundant spectra and uncharacterized peptides, potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network that can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering, delivering 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics that can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
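
As a rough illustration of the property-based filtering described above, the sketch below narrows a candidate list using a predicted length, missed-cleavage count, and modification status. It is not ProteoRift's implementation: the `Candidate` class, the function names, and the tryptic cleavage rule (cut after K or R unless followed by P) are assumptions made for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str   # unmodified amino-acid sequence (hypothetical field)
    modified: bool  # True if the candidate carries a modification


def missed_cleavages(sequence: str) -> int:
    """Count internal tryptic sites: K or R followed by a residue other than P.

    The lookahead requires a following residue, so a C-terminal K/R is not counted.
    Trypsin is an assumption here; the source does not name the enzyme.
    """
    return len(re.findall(r"[KR](?=[^P])", sequence))


def filter_candidates(candidates, pred_length, pred_missed, pred_modified, length_tol=0):
    """Keep only candidates whose properties agree with the predicted ones."""
    return [
        c for c in candidates
        if abs(len(c.sequence) - pred_length) <= length_tol
        and missed_cleavages(c.sequence) == pred_missed
        and c.modified == pred_modified
    ]


pool = [
    Candidate("PEPTIDEK", False),
    Candidate("AKAARGGK", False),  # same length, but two missed cleavages
    Candidate("PEPTIDEK", True),   # same sequence, but modified
]
kept = filter_candidates(pool, pred_length=8, pred_missed=0, pred_modified=False)
```

Each additional predicted property removes candidates that a mass-only filter would have passed on to scoring, which is where the reduction in search space comes from.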

Usage

Installation

pip install proteorift

Using Sample Data

from proteorift import ProteoRiftSearch

# Initialize and run with sample data
searcher = ProteoRiftSearch()
results = searcher.search_with_sample_data()

print(f"Results saved to: {results['output_dir']}")

Using Your Own Data

from proteorift import ProteoRiftSearch

# Initialize search
searcher = ProteoRiftSearch()

# Run peptide database search
results = searcher.search(
    mgf_dir="path/to/your/spectra",      # Directory with MGF files
    peptide_db="path/to/your/database",  # Directory with FASTA files
    output_dir="./results"
)

Custom Parameters

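from proteorift import ProteoRiftSearch

# Override the default search settings at construction time
searcher = ProteoRiftSearch(
    precursor_tolerance=10,          # Tolerance value, in the units given below
    precursor_tolerance_type="ppm",  # Units for precursor_tolerance
    charge=3,
    length_filter=True,
    device="cuda"                    # Run on a CUDA GPU
)

# Replace the "..." placeholders with your own paths
results = searcher.search(mgf_dir="...", peptide_db="...")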
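
To give a feel for what a setting such as precursor_tolerance=10 with precursor_tolerance_type="ppm" means numerically, the sketch below converts a tolerance into a mass window. The function name is hypothetical, and the "Da" option is an assumption added for contrast; only "ppm" appears in the example above.

```python
def precursor_window(mass: float, tolerance: float, tolerance_type: str = "ppm"):
    """Return the (low, high) mass bounds accepted around a precursor mass."""
    if tolerance_type == "ppm":
        # Parts per million: the window scales with the mass itself
        delta = mass * tolerance / 1e6
    elif tolerance_type == "Da":
        # Absolute tolerance in daltons (assumed option, not confirmed by the source)
        delta = tolerance
    else:
        raise ValueError(f"Unknown tolerance type: {tolerance_type!r}")
    return mass - delta, mass + delta


low, high = precursor_window(1000.0, 10, "ppm")
print(f"Accepted precursor mass range: {low:.4f} to {high:.4f}")
```

At 10 ppm, a 1000 Da precursor accepts candidates within 0.01 Da on either side, and the window widens in proportion to the mass.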

Command Line Interface

# Run search with sample data
proteorift search-sample --output-dir ./results

# Run search with your data
proteorift search \
    --mgf-dir path/to/spectra \
    --peptide-db path/to/database \
    --output-dir ./results \
    --tolerance 10 \
    --charge 3

# Download models only
proteorift download-models

Output

ProteoRift generates Percolator-compatible PIN files:

  • target.pin - Target peptide-spectrum matches
  • decoy.pin - Decoy peptide-spectrum matches
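
For quick inspection of these files outside Percolator, a minimal reader might look like the sketch below. It assumes the usual Percolator tab-delimited layout (a header row, an optional DefaultDirection row, and a final Proteins column that may spill into extra tab-separated fields). The exact feature columns ProteoRift writes are not listed here, so the parser does not depend on them; `read_pin` is a hypothetical helper, not part of the package.

```python
import csv
import io


def read_pin(handle):
    """Parse a Percolator PIN stream into a list of dicts keyed by header name."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    n = len(header)
    rows = []
    for fields in reader:
        # Skip blank lines and the optional DefaultDirection row
        if not fields or fields[0] == "DefaultDirection":
            continue
        record = dict(zip(header[:-1], fields[: n - 1]))
        # Any fields past the header width are additional protein IDs
        record[header[-1]] = fields[n - 1:]
        rows.append(record)
    return rows


# Small synthetic example; real files would be opened with open(path, newline="")
sample = (
    "SpecId\tLabel\tScanNr\tscore\tPeptide\tProteins\n"
    "s1\t1\t10\t0.9\tK.PEPTIDEK.A\tP1\tP2\n"
    "s2\t-1\t11\t0.2\tR.AAAK.G\tD1\n"
)
psms = read_pin(io.StringIO(sample))
```

All values are returned as strings; convert the feature columns to numbers as needed for your own analysis.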

Training Data

The model was trained on large-scale mass spectrometry datasets including:

  • NIST human peptide libraries
  • MassIVE public datasets
  • DeepNovo

System Requirements

  • GPU Memory: 12GB+ recommended
  • Python: 3.8+
  • PyTorch: 1.10+
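
A quick way to compare a machine against these requirements is sketched below. The function name and the returned keys are hypothetical. The check only reports what it finds: the 12GB figure is a recommendation, so the measured GPU memory is returned as a number instead of a pass/fail flag.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8), min_torch=(1, 10)):
    """Report whether Python and PyTorch meet the minimum versions, plus GPU memory in GiB."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_ok": False,
        "gpu_memory_gb": None,  # stays None if PyTorch or a CUDA device is missing
    }
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    # Drop local build suffixes such as "+cu118" before comparing versions
    version = torch.__version__.split("+")[0]
    major, minor = (int(part) for part in version.split(".")[:2])
    report["torch_ok"] = (major, minor) >= min_torch

    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        report["gpu_memory_gb"] = total_bytes / 1024**3
    return report


env = check_environment()
print(env)
```

Note that a card marketed as 12GB may report slightly less than 12 GiB here, so treat the number as a guide.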

Citation

If you use ProteoRift in your research, please cite the following paper:

Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2

License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, which include models trained on outputs from the model or datasets created from the model, is prohibited and requires prior approval. Downloading the model requires prior registration on Hugging Face and agreeing to the terms of use. By downloading this model, you agree not to distribute, publish or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the deidentified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.
