File size: 2,008 Bytes
2ca94d3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# Data Directory

This directory contains molecular data used by the Chem-MRL demo application.

## Dataset Information

### Isomer Design Dataset

The molecular data used in this application is sourced from the **Isomer Design** molecular library.

- **Dataset Source**: [Isomer Design](https://isomerdesign.com/pihkal/home)
- **License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) [![License: CC BY-NC-SA 4.0](https://mirrors.creativecommons.org/presskit/buttons/80x15/svg/by-nc-sa.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)
- **License Type**: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

### License Terms

This dataset is licensed under CC BY-NC-SA 4.0, which means:

-**Attribution**: You must give appropriate credit to the original source
-**NonCommercial**: You may not use the material for commercial purposes
-**ShareAlike**: If you remix, transform, or build upon the material, you must distribute your contributions under the same license

### Usage in This Project

The dataset is used to:
- Populate the Redis vector database with molecular embeddings
- Provide sample molecules for demonstration purposes
- Enable similarity search functionality through HNSW indexing

### Data Processing

The original SMILES data from Isomer Design has been processed through the following pipeline:

1. **Canonicalization**: SMILES strings were canonicalized using RDKit's implementation to ensure consistent molecular representations
2. **Embedding Generation**: Canonical SMILES were processed using the Chem-MRL model to generate molecular embeddings at various dimensions (2, 4, 32, 128, 512, 1024)
3. **Vector Storage**: The resulting embeddings are stored in the Redis vector database and indexed using HNSW for efficient similarity search operations

### Citation

If you use this data in your research or applications, please cite the original Isomer Design dataset and respect the CC BY-NC-SA 4.0 license terms.