ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
Abstract
Large language models demonstrate limited mechanistic reasoning in toxicity prediction, necessitating benchmarking approaches that evaluate both predictive accuracy and biological plausibility of generated explanations.
Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.
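To make the distinction between predictive accuracy and reasoning quality concrete, the following is a minimal sketch of how a label check and a pathway check could be scored separately. It is not the ToxReason data format or metric: the `AopRecord` fields, the `ordered_overlap` score, the placeholder compound, and the example pathway are all assumptions introduced here for illustration.

```python
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class AopRecord:
    """Hypothetical benchmark item (not the actual ToxReason schema)."""
    compound: str
    organ: str
    toxic: bool
    # Ordered events: MIE first, intermediate key events, AO last.
    pathway: tuple[str, ...]


def label_correct(gold: AopRecord, predicted_toxic: bool) -> bool:
    """Predictive accuracy for a single item."""
    return gold.toxic == predicted_toxic


def ordered_overlap(gold_pathway: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of gold events recovered in the right order.

    Computed as longest-common-subsequence length divided by the gold length.
    This is an assumed stand-in for a reasoning-quality score.
    """
    gold = [event.lower() for event in gold_pathway]
    pred = [event.lower() for event in predicted]
    if not gold:
        return 0.0
    prev = [0] * (len(pred) + 1)
    for g in gold:
        cur = [0]
        for j, p in enumerate(pred, start=1):
            cur.append(prev[j - 1] + 1 if g == p else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(gold)


# Illustrative item only; the compound name is a placeholder.
gold = AopRecord(
    compound="compound-X",
    organ="heart",
    toxic=True,
    pathway=(
        "hERG channel inhibition",         # MIE
        "delayed cardiac repolarization",  # key event
        "QT prolongation",                 # key event
        "arrhythmia",                      # AO
    ),
)

# Two answers with the same (correct) label but different mechanistic support.
faithful = ["hERG channel inhibition", "QT prolongation", "arrhythmia"]
unfaithful = ["mitochondrial dysfunction", "arrhythmia"]

print(label_correct(gold, True))
print(ordered_overlap(gold.pathway, faithful))
print(ordered_overlap(gold.pathway, unfaithful))
```

Both answers get the label right, but only the first recovers most of the pathway in order, which is the kind of gap a label-only metric cannot see.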
Community
As a follow-up to our previous work CoTox, we introduce ToxReason, a benchmark designed to evaluate whether LLMs can perform mechanistically faithful reasoning in chemical toxicity prediction.
While many models achieve strong predictive performance, it remains unclear whether their reasoning is biologically grounded. To address this, we propose a benchmark grounded in the Adverse Outcome Pathway (AOP) framework, enabling systematic evaluation of reasoning from molecular initiating events to adverse outcomes.
In addition, we present a model, ToxReason-4B, that leverages AOP knowledge to improve toxicity prediction with more reliable and interpretable reasoning.
Get this paper in your agent:
hf papers read 2604.06264
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash