Data Directory
This directory contains molecular data used by the Chem-MRL demo application.
Dataset Information
Isomer Design Dataset
The molecular data used in this application is sourced from the Isomer Design molecular library.
- Dataset Source: Isomer Design
- License: CC BY-NC-SA 4.0
- License Type: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
License Terms
This dataset is licensed under CC BY-NC-SA 4.0, which means:
- ✅ Attribution: You must give appropriate credit to the original source
- ❌ NonCommercial: You may not use the material for commercial purposes
- ✅ ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license
Usage in This Project
The dataset is used to:
- Populate the Redis vector database with molecular embeddings
- Provide sample molecules for demonstration purposes
- Enable similarity search functionality through HNSW indexing
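As a rough illustration of what similarity search over embeddings computes, the sketch below ranks a few toy vectors by cosine similarity with an exact brute-force scan. It is not the application's code: the HNSW index in Redis returns approximate nearest neighbours without scanning every vector, and this README does not state which distance metric the index uses, so cosine similarity is an assumption here. The names `cosine_similarity`, `top_k_similar`, and `toy_embeddings` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Exact brute-force search: score every vector, return the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 4-dimensional toy vectors, not real Chem-MRL embeddings.
toy_embeddings = {
    "mol_a": [1.0, 0.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0, 0.0],
    "mol_c": [0.0, 1.0, 0.0, 0.0],
}

results = top_k_similar([1.0, 0.0, 0.0, 0.0], toy_embeddings, k=2)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

A brute-force scan like this costs time linear in the number of stored molecules per query, which is the cost an HNSW index is designed to avoid on larger collections.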
Data Processing
The original SMILES data from Isomer Design was processed through the following pipeline:
- Canonicalization: SMILES strings were canonicalized with RDKit so that each molecule has one consistent representation
- Embedding Generation: The canonical SMILES were passed to the Chem-MRL model to generate molecular embeddings at six dimensions (2, 4, 32, 128, 512, 1024)
- Vector Storage: The resulting embeddings were stored in the Redis vector database and indexed with HNSW for efficient similarity search
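One way a single model can yield embeddings at several sizes is Matryoshka-style truncation: keep the first d components of the full vector and re-normalize. The sketch below shows that idea under the assumption, suggested by the model's name but not confirmed in this README, that the smaller embeddings are prefixes of the 1024-dimensional one. The function `truncate_and_normalize` and the placeholder vector `full` are hypothetical, and no real model output is used.

```python
import math

# Dimensions listed in this README.
DIMENSIONS = (2, 4, 32, 128, 512, 1024)


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if dim > len(embedding):
        raise ValueError(f"requested {dim} dims from a {len(embedding)}-dim vector")
    prefix = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to rescale
    return [x / norm for x in prefix]


# Placeholder 1024-dim vector standing in for a model output (assumption).
full = [float(i % 7 + 1) for i in range(1024)]

# One truncated, unit-length vector per target dimension.
by_dim = {d: truncate_and_normalize(full, d) for d in DIMENSIONS}

for d, vec in by_dim.items():
    print(d, len(vec))
```

Smaller dimensions trade search accuracy for lower memory use and faster comparisons, which is presumably why several sizes are offered in the demo.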
Citation
If you use this data in your research or applications, please cite the original Isomer Design dataset and comply with the CC BY-NC-SA 4.0 license terms.