Discrepancy in Binding Affinity Data Splits - data leakage between train/val
Hi PeptiVerse Team,
I noticed a discrepancy between the paper and the codebase regarding the data-splitting strategy. Table 1 of the preprint states that "PeptiVerse employs similarity-aware splits to evaluate out-of-distribution generalization." However, the provided script `binding_affinity_split.py` uses the function `make_distribution_matched_split` (line 88), which performs a random stratified split based on affinity-score distributions, with no clustering or sequence-identity checks.
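For context, a distribution-matched stratified split of this kind can be sketched as below. This is illustrative only, not the repo's actual implementation: the function name, the `records`/`affinity` schema, and the bin count are all assumptions.

```python
import random

def distribution_matched_split(records, val_frac=0.2, n_bins=10, seed=0):
    """Stratified random split: bin examples by affinity score and sample
    the validation fraction within each bin, so the train/val affinity
    distributions match. Note there are no sequence-identity checks.
    `records` is a list of dicts with an 'affinity' key (assumed schema).
    """
    rng = random.Random(seed)
    scores = [r["affinity"] for r in records]
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0
    # Group records into equal-width affinity bins.
    bins = {}
    for r in records:
        b = min(int((r["affinity"] - lo) / width), n_bins - 1)
        bins.setdefault(b, []).append(r)
    # Sample the validation fraction independently within each bin.
    train, val = [], []
    for members in bins.values():
        rng.shuffle(members)
        k = round(len(members) * val_frac)
        val.extend(members[:k])
        train.extend(members[k:])
    return train, val
```

A split like this matches the score distributions across train and validation, but near-identical sequences can freely land on both sides.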
I ran a quick check on the `binding_affinity_wt_meta_with_split.csv` file from the repo and found significant overlap between the splits:

Peptides:
- ~25% of validation peptides are exact matches to training peptides
- ~33% have >90% sequence similarity to the training set

Proteins:
- ~39% of validation proteins are exact matches to training proteins
- ~61% have >90% similarity to the training set
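For reproducibility, the check was along these lines. `difflib`'s ratio is used here as a rough stand-in for alignment-based percent identity, and the CSV loading/column handling is omitted, so treat it as a sketch rather than the exact script:

```python
from difflib import SequenceMatcher

def overlap_stats(train_seqs, val_seqs, sim_threshold=0.9):
    """Return (exact_frac, similar_frac) for validation sequences:
    the fraction that exactly match a training sequence, and the
    fraction that are exact matches OR exceed `sim_threshold`
    similarity to some training sequence. difflib's ratio is a
    rough proxy for alignment-based sequence identity.
    """
    train_set = set(train_seqs)
    exact = sum(s in train_set for s in val_seqs)
    similar = 0
    for v in val_seqs:
        if v in train_set or any(
            SequenceMatcher(None, v, t).ratio() > sim_threshold
            for t in train_seqs
        ):
            similar += 1
    return exact / len(val_seqs), similar / len(val_seqs)
```

For real peptide/protein data, a proper identity tool (e.g. an alignment-based clustering utility) would give more defensible numbers than `difflib`, but the shape of the check is the same.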
Just wanted to flag this as it seems to differ from the description in the preprint and might affect how we interpret the model's ability to generalize to new sequences. Could you clarify if this was the intended splitting strategy?
Thanks!
Hello,
Thanks for carefully checking the code and data splits.
For the binding affinity dataset, the split was intentionally performed using affinity score distribution matching, as stated in the Methods, rather than sequence-similarity clustering. There are no identical peptide–protein pairs shared between train and validation.
The overlap you observed comes from:
- Same target protein with different peptide binders, and
- Same peptide binding to different protein targets
As a result, high sequence similarity at the peptide or protein level can occur across splits even though the specific interaction pairs remain distinct. In this setting, the model is evaluated on its ability to rank or regress new interaction partners for previously seen peptides or targets, which reflects one practical use case of affinity prediction.

That said, we agree that the choice of splitting strategy depends strongly on the intended notion of generalization. A stricter similarity-aware split would better isolate performance on entirely novel sequences, and this is a valid alternative evaluation protocol. To support this flexibility, we provide the raw processed data so that users can readily construct custom splits tailored to their specific application or generalization criteria.
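For anyone who wants the stricter protocol, a minimal similarity-aware split could look like the sketch below: greedy single-linkage grouping, with `difflib`'s ratio again as an illustrative proxy for sequence identity, and whole clusters assigned to one side of the split. The function name and thresholds are ours for illustration, not part of the released code.

```python
import random
from difflib import SequenceMatcher

def cluster_split(seqs, val_frac=0.2, identity=0.9, seed=0):
    """Similarity-aware split: greedily group sequences whose similarity
    to a cluster representative meets `identity`, then assign whole
    clusters to train or validation so near-duplicates never cross
    the split. difflib's ratio stands in for true percent identity.
    """
    clusters = []  # each cluster is a list; element 0 is the representative
    for s in seqs:
        for c in clusters:
            if SequenceMatcher(None, s, c[0]).ratio() >= identity:
                c.append(s)
                break
        else:
            clusters.append([s])
    # Fill the validation set cluster-by-cluster up to the target size.
    rng = random.Random(seed)
    rng.shuffle(clusters)
    target = round(len(seqs) * val_frac)
    train, val = [], []
    for c in clusters:
        (val if len(val) < target else train).extend(c)
    return train, val
```

In practice a dedicated clustering tool scales far better than pairwise `difflib` comparisons, but the cluster-then-assign structure is the key point: similar sequences travel together.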
Best regards,
Yinuo