PockNet – Fusion Transformer (Selective SWA, multi-seed release)

Model Summary

  • Architecture: Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.
  • Checkpoint: selective_swa_epoch09_12.ckpt, a stochastic-weight-averaged (SWA) blend of checkpoints from epochs 20–30.
  • Evaluation: Release metrics aggregate five independently-seeded SWA runs; per-seed artefacts live under outputs/final_seed_sweep/.
  • Input: Optimised H5 datasets from run_h5_generation_optimized.sh (tabular, esm, neighbour tensors).
  • Output: Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
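
For orientation, here is a minimal sketch of the fusion-plus-k-NN-attention idea summarised above. All module names, feature dimensions (tab_dim, d_model), and the neighbourhood size k are illustrative assumptions rather than the release implementation; the real model lives in the GitHub repository.

import torch
import torch.nn as nn

class FusionKNNAttention(nn.Module):
    """Sketch: fuse tabular + ESM tokens, then attend over k nearest neighbours."""
    def __init__(self, tab_dim=42, esm_dim=2560, d_model=256, n_heads=8, k=16):
        super().__init__()
        self.k = k
        self.tab_proj = nn.Linear(tab_dim, d_model)   # tabular SAS descriptors
        self.esm_proj = nn.Linear(esm_dim, d_model)   # centred ESM2-3B embeddings (2560-d)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)             # per-point ligandability logit

    def forward(self, tab, esm, coords):
        # tab: (N, tab_dim), esm: (N, esm_dim), coords: (N, 3); requires N >= k
        x = self.tab_proj(tab) + self.esm_proj(esm)          # fused token per point
        knn = torch.cdist(coords, coords).topk(self.k, largest=False).indices
        neigh = x[knn]                                       # (N, k, d_model) local context
        out, _ = self.attn(x.unsqueeze(1), neigh, neigh)     # query each point against its k-NN
        return torch.sigmoid(self.head(out.squeeze(1))).squeeze(-1)  # (N,) probabilities

model = FusionKNNAttention()
probs = model(torch.randn(64, 42), torch.randn(64, 2560), torch.randn(64, 3))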

Intended Use & Limitations

Intended Use

  • Structure-based binding-pocket detection for academic or non-commercial research; designed to reproduce and extend P2Rank experiments using BU48 and related datasets.
  • Evaluation via the provided auto-run / predict-dataset orchestration, which ensures calibration, clustering, and reporting match the release scripts.

Limitations

  • Trained on BU48-style protein chains with solvent-accessible surface sampling; transfer to radically different proteins is unverified.
  • Requires pretrained ESM2-3B embeddings; ensure consistent preprocessing (chain-level .pt files) for best results (see the embedding sketch below).
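
For the second point, a minimal sketch of producing a chain-level .pt embedding with the fair-esm package; the release pipeline uses src/tools/generate_esm2_embeddings.py, and the chain identifier and sequence below are placeholders.

import torch
import esm  # pip install fair-esm

# Load ESM2-3B (layer 36 representations); weights download on first use.
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

chain_id = "1a4j_H"                              # placeholder chain identifier
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder sequence
_, _, tokens = batch_converter([(chain_id, sequence)])
with torch.no_grad():
    out = model(tokens, repr_layers=[36])

# Strip BOS/EOS so there is one 2560-d vector per residue; the release
# additionally centres these embeddings before fusion.
per_residue = out["representations"][36][0, 1 : len(sequence) + 1]
torch.save(per_residue, f"{chain_id}.pt")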

Training Data & Procedure

  • Datasets: Training/validation data draw from CHEN11 plus the full set of “joined” P2Rank datasets (directories under data/p2rank-datasets/joined/*), aggregated in data/all_train.ds; BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.
  • Features: src/datagen/extract_protein_features.py (tabular descriptors) + src/datagen/merge_chainfix_complete.py.
  • Embeddings: src/tools/generate_esm2_embeddings.py (ESM2_t36_3B_UR50D).
  • H5 assembly: run_h5_generation_optimized.sh → data/h5/all_train_transformer_v2_optimized.h5 with neighbour tensors and split labels.
  • Training: Preferably launched via python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ....
  • Multi-seed sweep: Seeds {13, 21, 34, 55, 89} plus the reference 2025 run; SWA averages checkpoints from epochs 20–30 (see the averaging sketch after this list).
  • Hardware: 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.
  • Logging: PyTorch Lightning 2.5 + Hydra 1.3, W&B project fusion_pocknet_thesis.
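
A minimal sketch of the checkpoint-averaging step referenced above, assuming Lightning-style .ckpt files with a state_dict key; the paths and helper name are illustrative:

import torch

def average_checkpoints(paths):
    # Naive SWA: arithmetic mean of all parameters/buffers across checkpoints.
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu", weights_only=False)["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Epoch range per the sweep description; the filename pattern is an assumption.
swa_sd = average_checkpoints([f"ckpts/epoch={e}.ckpt" for e in range(20, 31)])
torch.save({"state_dict": swa_sd}, "selective_swa.ckpt")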

Metrics

Point-level (single-seed SWA checkpoint)

Metric    Value    Split
IoU       0.2950   BU48 (test)
PR-AUC    0.414    BU48 (test)
ROC-AUC   0.944    BU48 (test)

Pocket-level (5-seed aggregated release, DBSCAN post-processing)

Metric                  Mean     95% CI    Notes
Mean IoU                0.1276   ±0.0124   Average pocket IoU across BU48
Best IoU (oracle)       0.1580   ±0.0141   Max IoU per protein
GT Coverage             0.8979   ±0.0057   Fraction of GT pockets matched
Avg pockets / protein   6.37     ±0.87     Post-threshold pockets

Success rates (DBSCAN, eps=3.0, min_samples=5, score threshold 0.91; a post-processing sketch follows the list):

  • DCA success@1: 75 %
  • DCC success@1: 39 %
  • DCA success@3: 89 %
  • DCC success@3: 50 %
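
A minimal sketch of that clustering step with scikit-learn, using the release parameters; the array names and returned fields are illustrative assumptions, and the exact release logic lives in the GitHub repository.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pockets(coords, probs, eps=3.0, min_samples=5, threshold=0.91):
    # coords: (N, 3) surface-point coordinates; probs: (N,) model outputs.
    keep = probs >= threshold                       # keep confident points only
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords[keep])
    pockets = []
    for lab in set(labels) - {-1}:                  # label -1 marks DBSCAN noise
        mask = labels == lab
        pockets.append({
            "center": coords[keep][mask].mean(axis=0),
            "score": float(probs[keep][mask].mean()),
            "size": int(mask.sum()),
        })
    return sorted(pockets, key=lambda p: p["score"], reverse=True)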

Refer to outputs/final_seed_sweep/*.csv for the exact release numbers cited in the thesis (Chapters 5–7 and Appendix 91).

How to Use

1. Download with huggingface_hub

from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(repo_id="lal3lu03/PockNet", filename="selective_swa_epoch09_12.ckpt")
print(ckpt_path)  # local file path
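
Since training used PyTorch Lightning, the file should be a Lightning checkpoint; a quick sanity check after download might look like this (the PockNet module itself lives in the GitHub repository):

import torch

ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
print(sorted(ckpt))              # Lightning checkpoints typically expose 'state_dict' etc.
state_dict = ckpt["state_dict"]  # weights to load into the PockNet module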

2. Run the end-to-end pipeline (CLI / Docker)

Preferred CLI workflow:

python src/scripts/end_to_end_pipeline.py predict-dataset \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/bu48_release

Or inside Docker:

make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"

3. Single-protein inference

If you already have an H5 + vectors CSV and want to inspect a single structure:

python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/pocknet_single_1a4j
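
The run writes P2Rank-style pocket CSVs into the output folder. A quick way to inspect them; the filename pattern and columns depend on the release scripts, so treat this as a sketch:

import glob
import pandas as pd

# Pick up whichever pocket CSVs the run produced (pattern is an assumption).
for path in glob.glob("outputs/pocknet_single_1a4j/*.csv"):
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head())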

Files Included in the Hugging Face Repo

  • selective_swa_epoch09_12.ckpt – release checkpoint
  • MODEL_CARD.md – this document

All supporting scripts (src/scripts/end_to_end_pipeline.py, Dockerfile, data-generation tooling, notebooks) and artefacts (outputs/final_seed_sweep/*, figures, thesis sources) remain in the public GitHub repository: https://github.com/hageneder/PockNet. Refer there for full reproducibility instructions, figures, and provenance logs.

Citation

If you use PockNet in your work, please cite:

@misc{lal3lu03_pocknet_2025,
  title   = {PockNet Fusion Transformer Release},
  author  = {Hageneder, Max},
  year    = {2025},
  url     = {https://huggingface.co/lal3lu03/PockNet}
}

License

Apache License 2.0. Refer to the repository LICENSE for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.
