arxiv:2604.06264

ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

Published on Apr 7

Authors:

Abstract

Large language models demonstrate limited mechanistic reasoning in toxicity prediction, necessitating benchmarking approaches that evaluate both predictive accuracy and biological plausibility of generated explanations.

AI-generated summary

Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.
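To make the distinction between prediction accuracy and reasoning quality concrete, the following is a minimal sketch of how a single benchmark item might be scored on both axes. Everything in it is an assumption for illustration: the `AOPExample` record, the placeholder event names, and the use of longest-common-subsequence overlap as a pathway score are hypothetical and are not taken from ToxReason's actual data format or metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AOPExample:
    """Hypothetical record: one chemical, one organ, one reference pathway."""
    chemical: str
    organ: str
    toxic: bool
    pathway: tuple  # ordered events, MIE first and AO last


def lcs_length(a, b):
    """Length of the longest common subsequence of two event sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def score(example, predicted_toxic, predicted_pathway):
    """Return (label_correct, pathway_score) as two separate signals.

    pathway_score is the fraction of reference events recovered in the
    right order; this is an assumed stand-in metric, not the paper's.
    """
    label_correct = predicted_toxic == example.toxic
    if not example.pathway:
        return label_correct, 0.0
    overlap = lcs_length(example.pathway, tuple(predicted_pathway))
    return label_correct, overlap / len(example.pathway)


# Placeholder item; the event names are invented labels, not real biology.
example = AOPExample(
    chemical="compound_x",
    organ="liver",
    toxic=True,
    pathway=("mie_a", "ke_1", "ke_2", "ao_a"),
)

# The model gets the label right but names the wrong MIE and skips ke_1.
result = score(example, True, ["mie_b", "ke_2", "ao_a"])
print(result)  # (True, 0.5)
```

In this toy case the toxicity label is correct while only half of the reference pathway is recovered, which is the kind of gap between prediction and reasoning that the abstract describes.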


Community

bioai96

1 day ago

As a follow-up to our previous work CoTox, we introduce ToxReason, a benchmark designed to evaluate whether LLMs can perform mechanistically faithful reasoning in chemical toxicity prediction.

While many models achieve strong predictive performance, it remains unclear whether their reasoning is biologically grounded. To address this, we propose a benchmark grounded in the Adverse Outcome Pathway (AOP) framework, enabling systematic evaluation of reasoning from molecular initiating events to adverse outcomes.

In addition, we present a model, ToxReason-4B, that leverages AOP knowledge to improve toxicity prediction with more reliable and interpretable reasoning.


