PockNet – Fusion Transformer (Selective SWA, multi-seed release)

Model Summary

  • Architecture: Fusion transformer combining tabular SAS descriptors with centred ESM2-3B residue embeddings, followed by k-NN attention over local neighbourhoods.
  • Checkpoint: selective_swa_epoch09_12.ckpt, a stochastic-weight-averaged (SWA) blend of checkpoints from epochs 20–30.
  • Evaluation: Release metrics aggregate five independently-seeded SWA runs; per-seed artefacts live under outputs/final_seed_sweep/.
  • Input: Optimised H5 datasets from run_h5_generation_optimized.sh (tabular, esm, neighbour tensors).
  • Output: Residue-wise ligandability probabilities plus P2Rank-style pocket CSVs/visualisations.
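
For orientation, here is a minimal sketch of the fusion-plus-k-NN-attention idea summarised above. All module names, feature dimensions (tab_dim, d_model), and the neighbourhood size k are illustrative assumptions rather than the release implementation; the real model lives in the GitHub repository.

import torch
import torch.nn as nn

class FusionKNNAttention(nn.Module):
    """Sketch: fuse tabular + ESM tokens, then attend over k nearest neighbours."""
    def __init__(self, tab_dim=42, esm_dim=2560, d_model=256, n_heads=8, k=16):
        super().__init__()
        self.k = k
        self.tab_proj = nn.Linear(tab_dim, d_model)   # tabular SAS descriptors
        self.esm_proj = nn.Linear(esm_dim, d_model)   # centred ESM2-3B embeddings (2560-d)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)             # per-point ligandability logit

    def forward(self, tab, esm, coords):
        # tab: (N, tab_dim), esm: (N, esm_dim), coords: (N, 3); requires N >= k
        x = self.tab_proj(tab) + self.esm_proj(esm)          # fused token per point
        knn = torch.cdist(coords, coords).topk(self.k, largest=False).indices
        neigh = x[knn]                                       # (N, k, d_model) local context
        out, _ = self.attn(x.unsqueeze(1), neigh, neigh)     # query each point against its k-NN
        return torch.sigmoid(self.head(out.squeeze(1))).squeeze(-1)  # (N,) probabilities

model = FusionKNNAttention()
probs = model(torch.randn(64, 42), torch.randn(64, 2560), torch.randn(64, 3))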

Intended Use & Limitations

Intended Use

  • Structure-based binding-pocket detection for academic or non-commercial research; designed to reproduce and extend P2Rank experiments using BU48 and related datasets.
  • Evaluation via the provided auto-run / predict-dataset orchestration, which ensures calibration, clustering, and reporting match the release scripts.

Limitations

  • Trained on BU48-style protein chains with solvent-accessible surface sampling; transfer to radically different proteins is unverified.
  • Requires pretrained ESM2-3B embeddings; ensure consistent preprocessing (chain-level .pt files) for best results (see the embedding sketch below).
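
For the second point, a minimal sketch of producing a chain-level .pt embedding with the fair-esm package; the release pipeline uses src/tools/generate_esm2_embeddings.py, and the chain identifier and sequence below are placeholders.

import torch
import esm  # pip install fair-esm

# Load ESM2-3B (layer 36 representations); weights download on first use.
model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

chain_id = "1a4j_H"                              # placeholder chain identifier
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # placeholder sequence
_, _, tokens = batch_converter([(chain_id, sequence)])
with torch.no_grad():
    out = model(tokens, repr_layers=[36])

# Strip BOS/EOS so there is one 2560-d vector per residue; the release
# additionally centres these embeddings before fusion.
per_residue = out["representations"][36][0, 1 : len(sequence) + 1]
torch.save(per_residue, f"{chain_id}.pt")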

Training Data & Procedure

  • Datasets: Training/validation data draw from CHEN11 plus the full set of “joined” P2Rank datasets (directories under data/p2rank-datasets/joined/*), aggregated in data/all_train.ds; BU48 (48 apo/holo pairs) is held out exclusively for evaluation/testing.
  • Features: src/datagen/extract_protein_features.py (tabular descriptors) + src/datagen/merge_chainfix_complete.py.
  • Embeddings: src/tools/generate_esm2_embeddings.py (ESM2_t36_3B_UR50D).
  • H5 assembly: run_h5_generation_optimized.sh → data/h5/all_train_transformer_v2_optimized.h5 with neighbour tensors and split labels.
  • Training: Preferably launched via python src/scripts/end_to_end_pipeline.py train-model -o experiment=fusion_transformer_aggressive ....
  • Multi-seed sweep: Seeds {13, 21, 34, 55, 89} plus the reference 2025 run; SWA averages checkpoints from epochs 20–30 (see the averaging sketch after this list).
  • Hardware: 3× NVIDIA V100 (16 GB) for training, single V100 for inference/post-processing.
  • Logging: PyTorch Lightning 2.5 + Hydra 1.3, W&B project fusion_pocknet_thesis.
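
A minimal sketch of the checkpoint-averaging step referenced above, assuming Lightning-style .ckpt files with a state_dict key; the paths and helper name are illustrative:

import torch

def average_checkpoints(paths):
    # Naive SWA: arithmetic mean of all parameters/buffers across checkpoints.
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu", weights_only=False)["state_dict"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k in avg:
                avg[k] += sd[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# Epoch range per the sweep description; the filename pattern is an assumption.
swa_sd = average_checkpoints([f"ckpts/epoch={e}.ckpt" for e in range(20, 31)])
torch.save({"state_dict": swa_sd}, "selective_swa.ckpt")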

Metrics

Point-level (single-seed SWA checkpoint)

Metric    Value    Split
IoU       0.2950   BU48 (test)
PR-AUC    0.414    BU48 (test)
ROC-AUC   0.944    BU48 (test)

Pocket-level (5-seed aggregated release, DBSCAN post-processing)

Metric                  Mean     95% CI    Notes
Mean IoU                0.1276   ±0.0124   Average pocket IoU across BU48
Best IoU (oracle)       0.1580   ±0.0141   Max IoU per protein
GT Coverage             0.8979   ±0.0057   Fraction of GT pockets matched
Avg pockets / protein   6.37     ±0.87     Post-threshold pockets

Success rates (DBSCAN, eps=3.0, min_samples=5, score threshold 0.91; a post-processing sketch follows the list):

  • DCA success@1: 75 %
  • DCC success@1: 39 %
  • DCA success@3: 89 %
  • DCC success@3: 50 %
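
A minimal sketch of that clustering step with scikit-learn, using the release parameters; the array names and returned fields are illustrative assumptions, and the exact release logic lives in the GitHub repository.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_pockets(coords, probs, eps=3.0, min_samples=5, threshold=0.91):
    # coords: (N, 3) surface-point coordinates; probs: (N,) model outputs.
    keep = probs >= threshold                       # keep confident points only
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords[keep])
    pockets = []
    for lab in set(labels) - {-1}:                  # label -1 marks DBSCAN noise
        mask = labels == lab
        pockets.append({
            "center": coords[keep][mask].mean(axis=0),
            "score": float(probs[keep][mask].mean()),
            "size": int(mask.sum()),
        })
    return sorted(pockets, key=lambda p: p["score"], reverse=True)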

Refer to outputs/final_seed_sweep/*.csv for the exact release numbers cited in the thesis (Chapters 5–7 and Appendix 91).

How to Use

1. Download with huggingface_hub

from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(repo_id="lal3lu03/PockNet", filename="selective_swa_epoch09_12.ckpt")
print(ckpt_path)  # local file path
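
Since training used PyTorch Lightning, the file should be a Lightning checkpoint; a quick sanity check after download might look like this (the PockNet module itself lives in the GitHub repository):

import torch

ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
print(sorted(ckpt))              # Lightning checkpoints typically expose 'state_dict' etc.
state_dict = ckpt["state_dict"]  # weights to load into the PockNet module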

2. Run the end-to-end pipeline (CLI / Docker)

Preferred CLI workflow:

python src/scripts/end_to_end_pipeline.py predict-dataset \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/bu48_release

Or inside Docker:

make docker-run ARGS="predict-dataset --checkpoint /ckpts/best.ckpt --h5 /data/h5/all_train_transformer_v2_optimized.h5 --csv /data/vectorsTrain_all_chainfix.csv --output /logs/bu48_release"

3. Single-protein inference

If you already have an H5 + vectors CSV and want to inspect a single structure:

python src/scripts/end_to_end_pipeline.py predict-pdb 1a4j_H \
  --checkpoint /path/to/selective_swa_epoch09_12.ckpt \
  --h5 data/h5/all_train_transformer_v2_optimized.h5 \
  --csv data/vectorsTrain_all_chainfix.csv \
  --output outputs/pocknet_single_1a4j
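
The run writes P2Rank-style pocket CSVs into the output folder. A quick way to inspect them; the filename pattern and columns depend on the release scripts, so treat this as a sketch:

import glob
import pandas as pd

# Pick up whichever pocket CSVs the run produced (pattern is an assumption).
for path in glob.glob("outputs/pocknet_single_1a4j/*.csv"):
    df = pd.read_csv(path)
    print(path, df.shape)
    print(df.head())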

Files Included in the Hugging Face Repo

  • selective_swa_epoch09_12.ckpt – release checkpoint
  • MODEL_CARD.md – this document

All supporting scripts (src/scripts/end_to_end_pipeline.py, Dockerfile, data-generation tooling, notebooks) and artefacts (outputs/final_seed_sweep/*, figures, thesis sources) remain in the public GitHub repository: https://github.com/hageneder/PockNet. Refer there for full reproducibility instructions, figures, and provenance logs.

Citation

If you use PockNet in your work, please cite:

@misc{lal3lu03_pocknet_2025,
  title   = {PockNet Fusion Transformer Release},
  author  = {Hageneder, Max},
  year    = {2025},
  url     = {https://huggingface.co/lal3lu03/PockNet}
}

License

Apache License 2.0. Refer to the repository LICENSE for full terms and ensure compliance with upstream dataset/ESM2 licenses when redistributing.
