chem-mrl-demo / src /data /README.md
eacortes's picture
push new db or demo version
2ca94d3
# Data Directory
This directory contains molecular data used by the Chem-MRL demo application.
## Dataset Information
### Isomer Design Dataset
The molecular data used in this application is sourced from the **Isomer Design** molecular library.
- **Dataset Source**: [Isomer Design](https://isomerdesign.com/pihkal/home)
- **License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) [![License: CC BY-NC-SA 4.0](https://mirrors.creativecommons.org/presskit/buttons/80x15/svg/by-nc-sa.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
- **License Type**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
### License Terms
This dataset is licensed under CC BY-NC-SA 4.0, which means:
-**Attribution**: You must give appropriate credit to the original source
-**NonCommercial**: You may not use the material for commercial purposes
-**ShareAlike**: If you remix, transform, or build upon the material, you must distribute your contributions under the same license
### Usage in This Project
The dataset is used to:
- Populate the Redis vector database with molecular embeddings
- Provide sample molecules for demonstration purposes
- Enable similarity search functionality through HNSW indexing
### Data Processing
The original SMILES data from Isomer Design has been processed through the following pipeline:
1. **Canonicalization**: SMILES strings were canonicalized using RDKit's implementation to ensure consistent molecular representations
2. **Embedding Generation**: Canonical SMILES were processed using the Chem-MRL model to generate molecular embeddings at various dimensions (2, 4, 32, 128, 512, 1024)
3. **Vector Storage**: The resulting embeddings are stored in the Redis vector database and indexed using HNSW for efficient similarity search operations
### Citation
If you use this data in your research or applications, please cite the original Isomer Design dataset and respect the CC BY-NC-SA 4.0 license terms.