whitead commited on
Commit
cc9e386
·
verified ·
1 Parent(s): 33b392c

Initial README

Browse files
Files changed (1) hide show
  1. README.md +63 -3
README.md CHANGED
@@ -1,3 +1,63 @@
1
- ---
2
- license: apache-2.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ base_model:
6
+ - mistralai/Mistral-Small-3.1-24B-Instruct-2503
7
+ pipeline_tag: text-generation
8
+ tags:
9
+ - smiles
10
+ - chemistry
11
+ - reasoning
12
+ ---
13
+
14
+ # ehter0
15
+
16
+ ether0 is a 24B language model trained to reason in English and output molecular structures as SMILES.
17
+ It is derived from fine-tuning and reinforcement learning training from Mistral-Small-3.1-24B-Instruct-2503.
18
+ Ask questions in English, but they may also include molecules specified as SMILES. The SMILES need not be canonical.
19
+ ehter0 has limited support for IUPAC names.
20
+
21
+ ## Usage
22
+ This model is trained to reason in English and output a molecule.
23
+ It is NOT a general purpose chat model.
24
+ It has been trained specifically for these tasks:
25
+
26
+ * IUPAC-names
27
+ * formulas to structures
28
+ * modifying solubilities by speciifc LogS
29
+ * constrained edits (e.g., do not affect group X or do not affect scaffold)
30
+ * pKA
31
+ * smell/scent
32
+ * human cell receptor binding + mode (e.g., agonist)
33
+ * ADME properties (e.g., MDDK efflux ratio, LD50)
34
+ * GHS classifications (as words, not codes, like "carcinogen")
35
+ * some electronic properties
36
+ * 1-step retrosynthesis
37
+ * reaction outcome prediction
38
+ * natural language caption to molecule
39
+ * natural product elucidation (formula + organism to SMILES)
40
+ * blood-brain barrier permeability
41
+
42
+ For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(O)=OH to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(O)=OH?"
43
+ If you ask it questions that lie significantly beyond those tasks, it can fail.
44
+
45
+ ## Limitations
46
+
47
+ It does not know general synonyms and it has poor textbook knowledge (e.g. it does not perform especially well on chembench).
48
+ For best results, input molecules as SMILES: if you input molecules with their common names, the model may reason using the incorrect smiles, resulting in poor results.
49
+ For example, we have observed that the model often confuses lysine and glutamic acid if you ask questions using their common names, but should correctly reason about their chemistry if you provide their structures as SMILES.
50
+
51
+ ## Training data and details
52
+
53
+ See our [preprint](arxiv.org) for details on data and training process.
54
+
55
+ ## Safety
56
+
57
+ We performed refusal post-training for compounds listed on OPCW schedules 1 and 2. As the model knows pharmacokinetics, it can modulate toxicity.
58
+ As the structure of toxic or narcotic compounds are generally known, we do not consider this a significant safety risk. The model can provide
59
+ no uplift on "tacit knowledge" tasks like purification, scale-up, or processing beyond a web search or similar sized language model.
60
+
61
+ ## License
62
+
63
+ Open-weights (Apache 2.0) for any use.