---
license: mit
tags:
- sentence-transformers
- chemistry
- molecular-similarity
- cheminformatics
- ssl
- smiles
- feature-extraction
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---
# miniChembed-prototype
This is an experimental **self-supervised molecular embedding** model trained using the **Barlow Twins** objective on approximately **24K unlabeled SMILES strings**. If validated as effective, it will be scaled to 2.1M molecules. The training data were compiled from public sources including:
- **ChEMBL34** (Zdrazil et al., 2023)
- **COCONUTDB** (Sorokina et al., 2021)
- **SuperNatural3** (Gallo et al., 2023)
The model maps SMILES strings to a **320-dimensional dense vector space**, optimized for **molecular similarity search, clustering, and scaffold analysis without any supervision from bioactivity, property labels, or precomputed fingerprints**.
Unlike fixed fingerprints (e.g., ECFP4), this model learns representations directly from **stochastic SMILES augmentations**, encouraging invariance to syntactic variation while potentially maximizing representational diversity across molecules.
The Barlow Twins objective explicitly minimizes redundancy between embedding dimensions, promoting structured, non-collapsed representations.
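To make the augmentation concrete, here is a minimal sketch of how a positive pair can be built with RDKit's random SMILES enumeration (the helper name is illustrative, not the training script's API):
```python
from rdkit import Chem

def random_smiles_pair(smiles: str) -> tuple[str, str]:
    """Return two stochastic SMILES renderings of the same molecule (a positive pair)."""
    mol = Chem.MolFromSmiles(smiles)
    view_a = Chem.MolToSmiles(mol, doRandom=True)
    view_b = Chem.MolToSmiles(mol, doRandom=True)
    return view_a, view_b

# Cytisine, reused from the usage example below
print(random_smiles_pair("O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3"))
```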
> Note: This is an experimental prototype.
> Feel free to experiment with and edit the training script as you wish!
> Correcting my mistakes, tweaking augmentations, loss weights, optimizer settings, or network architecture could lead to even better representations.
---
## Model Details
### Architecture & Training
| Attribute | Value |
|----------|-------|
| **Base architecture** | Custom RoBERTa-style transformer (6 layers, 320 hidden dim, 4 attention heads, ~8M params) |
| **Initialization** | Random (not pretrained on text or chemistry) |
| **Training objective** | **Barlow Twins**, redundancy-reduction via cross-correlation matrix |
| **Augmentation** | Stochastic SMILES enumeration (`MolToSmiles(..., doRandom=True)`) |
| **Training data** | ~24K unique molecules → augmented into positive pairs |
| **Sequence length** | 514 tokens |
| **Embedding dimension** | 320 |
| **Projection head** | 3-layer MLP with BatchNorm (2048 → 2048 → 2048) |
| **Pooling** | Mean pooling over token embeddings |
| **Similarity metric** | Cosine similarity |
| **Effective batch size** | 64 (physical batch: 16, gradient accumulation: 4×) |
| **Learning rate** | 1e-4 |
| **Optimizer** | **Ranger21** (with warmup/warmdown scheduling) |
| **Weight decay** | 0.01 (applied selectively: no decay on bias/LayerNorm) |
| **Barlow λ** | 5.0 (stronger off-diagonal penalty) |
| **Training duration** | 5 epochs |
| **Hardware** | Single NVIDIA 930MX GPU |
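For reference, a minimal sketch of the Barlow Twins loss implied by the table above (not the exact training code; tensor names are illustrative, λ = 5.0 as listed):
```python
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 5.0) -> torch.Tensor:
    """Redundancy-reduction loss over two projected views of shape (batch, dim)."""
    n = z_a.size(0)
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(dim=0)) / (z_a.std(dim=0) + 1e-6)
    z_b = (z_b - z_b.mean(dim=0)) / (z_b.std(dim=0) + 1e-6)
    # Cross-correlation matrix between the two views.
    c = (z_a.T @ z_b) / n
    # Invariance term: pull the diagonal toward 1.
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()
    # Redundancy-reduction term: push off-diagonal entries toward 0, weighted by lambda.
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```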
### Architecture (SentenceTransformer format)
```python
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
(1): Pooling({'word_embedding_dimension': 320, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
> Note: The model was not initialized from a pretrained language model; it was trained from scratch on SMILES using only the Barlow Twins objective.
---
## Usage
### Installation
```bash
pip install -U sentence-transformers rdkit-pypi
```
### Direct Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gbyuvd/miniChembed-prototype")
# Run inference
sentences = [
'O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3', # Cytisine
"n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4", # Varenicline
"c1ncccc1[C@@H]2CCCN2C", # Nicotine
'Nc1nc2cncc-2co1', # CID: 162789184
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (4, 320)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, 0.2279, -0.1979, -0.3754],
# [ 0.2279, 1.0000, 0.7371, 0.6745],
# [-0.1979, 0.7371, 1.0000, 0.9803],
# [-0.3754, 0.6745, 0.9803, 1.0000]])
```
High cosine similarity suggests structural or topological relatedness learned purely from SMILES variation, not from explicit chemical knowledge or labels.
### Testing Similarity Search
> Tip: For large-scale similarity search, integrate embeddings with Meta's FAISS.
For an example of a FAISS indexing pipeline, see `./examples/faiss.ipynb`.
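A minimal indexing sketch along those lines (assumes `faiss-cpu` is installed; `smiles_list` is a placeholder for your own candidate molecules):
```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gbyuvd/miniChembed-prototype")

# Placeholder candidates; replace with your own library of SMILES.
smiles_list = ["c1ncccc1[C@@H]2CCCN2C", "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4"]
emb = model.encode(smiles_list, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(emb)                     # cosine similarity == inner product on unit vectors

index = faiss.IndexFlatIP(emb.shape[1])     # exact inner-product search over 320-d vectors
index.add(emb)

query = model.encode(["O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"Rank {rank}: SMILES = {smiles_list[i]}, Cosine Similarity = {s:.4f}")
```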
Using cytisine as the query against the 24K-molecule embedded index:

```
Rank 1: SMILES = O=C1OC2C(O)CC1C1C2N(Cc2ccc(F)cc2)C(=S)N1CC1CCCCC1, Cosine Similarity = 0.9944
Rank 2: SMILES = CN1C(CCC(=O)N2CCC(O)CC2)CNC(=O)C2C1CCN2Cc1ncc[nH]1, Cosine Similarity = 0.9940
Rank 3: SMILES = CC1C(=O)OC2C1CCC1(C)Cc3sc(NC(=O)Nc4cccc(F)c4)nc3C(C)C21, Cosine Similarity = 0.9938
Rank 4: SMILES = Cc1ccc(NC(=O)Nc2nc3c(s2)CC2(C)CCC4C(C)C(=O)OC4C2C3C)cc1, Cosine Similarity = 0.9938
Rank 5: SMILES = O=C(CC1CC2OC(CNC3Cc4ccccc4C3)C(O)C2O1)N1CCC(F)(F)C1, Cosine Similarity = 0.9929
```
## Comparison to Traditional Fingerprints
### Overview
| Feature | ECFP4 / MACCS | miniChembed-prototype |
|--------|----------------|------------------------|
| **Representation** | Hand-crafted binary fingerprint | Learned dense embedding |
| **Training data** | None (rule-based) | ~24K unlabeled SMILES |
| **Global semantics** | Captures only local substructures | Learns global invariances via augmentation |
| **Redundancy control** | Not applicable | Explicitly minimized (Barlow objective) |
### Clustering
Preliminary clustering evaluation vs. ECFP4 on 64 molecules with 4 classes:

```
ARI (Embeddings) : 0.084
ARI (ECFP4) : 0.024
Silhouette (Embeddings) : 0.398
Silhouette (ECFP4) : 0.025
```
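The comparison can be reproduced roughly as follows; this sketch uses k-means plus scikit-learn metrics and toy placeholder molecules and labels, not the actual 64-molecule benchmark:
```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sentence_transformers import SentenceTransformer

def clustering_scores(X, labels, n_clusters):
    """Cluster X with k-means; report ARI against known labels and the silhouette score."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(labels, pred), silhouette_score(X, pred)

# Toy placeholders (the reported numbers come from 64 molecules across 4 classes).
smiles_list = [
    "O=C1/C=C\\C=C2/N1C[C@@H]3CNC[C@H]2C3",  # cytisine
    "n1c2cc3c(cc2ncc1)[C@@H]4CNC[C@H]3C4",    # varenicline
    "c1ncccc1[C@@H]2CCCN2C",                  # nicotine
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",             # caffeine
]
labels = np.array([0, 0, 0, 1])

model = SentenceTransformer("gbyuvd/miniChembed-prototype")
emb = model.encode(smiles_list)

ecfp4 = np.array([
    list(AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048))
    for s in smiles_list
], dtype=float)

print("Embeddings (ARI, silhouette):", clustering_scores(emb, labels, n_clusters=2))
print("ECFP4      (ARI, silhouette):", clustering_scores(ecfp4, labels, n_clusters=2))
```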
---
## Training Summary
- **Objective**: Minimize off-diagonal terms in the cross-correlation matrix of augmented views.
- **Key metric**: Barlow Health Score = `mean(same-molecule cosine) – mean(cross-molecule cosine)`
→ Higher = better separation between intra- and inter-molecular similarity (a computation sketch follows the log below).
- **Validation**: Evaluated every 25% of training; best checkpoint selected by health score.
- **Final health**: 0.891 at step 1885, indicating strong disentanglement.
```
Step 1885 | Alignment=0.017 | Uniformity=-1.338
Same-mol cos: 0.983±0.032 | Pairwise: 0.093±0.518
Barlow Health: 0.891
```
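A small sketch of how this health score can be computed, assuming two embedding batches of the same molecules under different random enumerations (function name illustrative):
```python
import torch
import torch.nn.functional as F

def barlow_health(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """mean(same-molecule cosine) - mean(cross-molecule cosine); rows of emb_a/emb_b are paired."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    sim = a @ b.T                                   # (n, n) cosine matrix
    same = sim.diag().mean()                        # intra-molecule similarity
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cross = sim[mask].mean()                        # inter-molecule similarity
    return (same - cross).item()
```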
---
## Limitations
- Trained on **drug-like organic molecules**; performance on inorganics, salts, or polymers is unknown.
- Input must be **valid SMILES**; invalid strings may produce erratic embeddings.
- **Not trained on bioactivity data**, so similarity indicates structural syntax, not biological function.
- Small-scale prototype (~24K); final version will scale to 2.1M molecules if proven effective.
---
## Reproducibility
This model was trained using a custom script based on Sentence Transformers v5.1.0, with the following environment:
- Python: 3.13.0
- Transformers: 4.56.2
- PyTorch: 2.6.0+cu126
- Accelerate: 1.10.1
- Datasets: 4.0.0
- Tokenizers: 0.22.0
Training code, config, and evaluation are available in this repo under `./train/trainbarlow.py` and `./train/config.yaml`.
---
## Reference
Note that the method used here does not use a target network; instead, the two views of each molecule are produced by RDKit-based random enumeration of its SMILES.
```bibtex
@misc{çağatan2024unseeunsupervisednoncontrastivesentence,
title={UNSEE: Unsupervised Non-contrastive Sentence Embeddings},
author={Ömer Veysel Çağatan},
year={2024},
eprint={2401.15316},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2401.15316},
}
```
---
## Citation
If you use this model, please cite:
```bibtex
% SBERT:
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
year = "2019",
url = "https://arxiv.org/abs/1908.10084"
}
% Tokenizer:
@misc{chithrananda2020chembertalargescaleselfsupervisedpretraining,
title={ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction},
author={Seyone Chithrananda and Gabriel Grand and Bharath Ramsundar},
year={2020},
eprint={2010.09885},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2010.09885},
}
% Data:
@article{sorokina2021coconut,
title={COCONUT online: Collection of Open Natural Products database},
author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
journal={Journal of Cheminformatics},
volume={13},
number={1},
pages={2},
year={2021},
doi={10.1186/s13321-020-00478-9}
}
@article{zdrazil2023chembl,
title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
journal={Nucleic Acids Research},
year={2023},
volume={gkad1004},
doi={10.1093/nar/gkad1004}
}
@misc{chembl34,
title={ChEMBL34},
year={2023},
doi={10.6019/CHEMBL.database.34}
}
@article{Gallo2023,
author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
journal = {Nucleic Acids Research},
year = {2023},
month = jan,
day = {6},
volume = {51},
number = {D1},
pages = {D654-D659},
doi = {10.1093/nar/gkac1008}
}
% Optimizer:
@article{wright2021ranger21,
title={Ranger21: a synergistic deep learning optimizer},
author={Wright, Less and Demeure, Nestor},
year={2021},
journal={arXiv preprint arXiv:2106.13731},
}
```