# Data Directory
This directory contains molecular data used by the Chem-MRL demo application.
## Dataset Information
### Isomer Design Dataset
The molecular data used in this application is sourced from the **Isomer Design** molecular library.
- **Dataset Source**: [Isomer Design](https://isomerdesign.com/pihkal/home)
- **License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
- **License Type**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
### License Terms
This dataset is licensed under CC BY-NC-SA 4.0, which means:
- ✅ **Attribution**: You must give appropriate credit to the original source
- ❌ **NonCommercial**: You may not use the material for commercial purposes
- ✅ **ShareAlike**: If you remix, transform, or build upon the material, you must distribute your contributions under the same license
### Usage in This Project
The dataset is used to:
- Populate the Redis vector database with molecular embeddings
- Provide sample molecules for demonstration purposes
- Enable similarity search functionality through HNSW indexing
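To illustrate what the similarity search returns, the following is a minimal sketch of an exact (brute-force) nearest-neighbor lookup, which is the result an HNSW index approximates at lower cost. It assumes cosine similarity as the metric, which this README does not state, and the helper names (`cosine_similarity`, `top_k_similar`) and the toy embeddings are hypothetical. It does not use Redis or the demo's own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Exact k-nearest-neighbor search by cosine similarity (brute force)."""
    scored = [
        (cosine_similarity(query, vec), smiles)
        for smiles, vec in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical 4-dimensional embeddings keyed by SMILES; the values are made up.
toy_embeddings = {
    "CCO": [0.9, 0.1, 0.0, 0.1],
    "CCN": [0.8, 0.2, 0.1, 0.0],
    "c1ccccc1": [0.0, 0.1, 0.9, 0.3],
}

results = top_k_similar([1.0, 0.0, 0.0, 0.0], toy_embeddings, k=2)
for score, smiles in results:
    print(f"{smiles}: {score:.3f}")
```

A brute-force scan like this compares the query against every stored vector, so it is only practical for small collections; an approximate index such as HNSW trades a little accuracy for much faster lookups.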
### Data Processing
The original SMILES data from Isomer Design has been processed through the following pipeline:
1. **Canonicalization**: SMILES strings were canonicalized using RDKit's implementation to ensure consistent molecular representations
2. **Embedding Generation**: Canonical SMILES were processed using the Chem-MRL model to generate molecular embeddings at various dimensions (2, 4, 32, 128, 512, 1024)
3. **Vector Storage**: The resulting embeddings are stored in the Redis vector database and indexed using HNSW for efficient similarity search operations
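The first two steps can be sketched roughly as follows. The `canonicalize` helper uses RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`. How the lower-dimensional embeddings are actually produced is not described here, so `truncate_and_normalize` is an assumption: it takes the leading components of a full-size vector and rescales them to unit length, as is common for models trained for nested embedding sizes. Both helper names and the example vector are hypothetical, and the Chem-MRL model call itself is not shown.

```python
import math


def canonicalize(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    from rdkit import Chem  # imported lazily so the rest runs without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0.0 else list(head)


# Dimensions listed in step 2; the truncation scheme itself is an assumption.
DIMENSIONS = (2, 4, 32, 128, 512, 1024)

# Made-up 1024-dimensional vector standing in for a model output.
full = [3.0, 4.0] + [0.0] * 1022
by_dim = {d: truncate_and_normalize(full, d) for d in DIMENSIONS}

print({d: len(v) for d, v in by_dim.items()})
```

Canonicalizing before embedding means that different spellings of the same molecule (for example `OCC` and `CCO`) map to a single string, so they are not stored as separate entries.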
### Citation
If you use this data in your research or applications, please cite the original Isomer Design dataset and respect the CC BY-NC-SA 4.0 license terms.