# From Faithfulness to Correctness: Generative Reward Models that Think Critically
[📜 Paper] [🖥️ Code] [🤗 Hugging Face]
In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.
## Thinking-supervised Reward Model (TRM)
Given a query, answer, and supporting documents, TRM first evaluates the faithfulness of each answer sentence to the provided evidence. Based on this faithfulness assessment, TRM then applies a step-by-step reasoning framework to judge sentence-level correctness, explicitly modeling how each reasoning step aligns with both the external sources and the internal logic of the answer.
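As a rough, illustrative sketch (not the repository's official interface), sentence-level scoring with a generative reward model might look like the following. The model ID `QiyaoMa/TRM` comes from this model card, but the prompt template, the expected output format, and the `score_answer` helper are assumptions; the actual templates are defined in the TRM repository, and a chat template may be required in practice.

```python
# Hypothetical sketch: sentence-level faithfulness/correctness scoring with TRM.
# The prompt wording and output format below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QiyaoMa/TRM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def score_answer(query: str, answer_sentences: list[str], documents: list[str]) -> str:
    """Ask the reward model to judge each answer sentence: first its
    faithfulness to the documents, then its factual correctness,
    following a stepwise reasoning format."""
    numbered_answer = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(answer_sentences))
    prompt = (
        "Supporting documents:\n" + "\n\n".join(documents) + "\n\n"
        f"Question: {query}\n"
        f"Answer sentences:\n{numbered_answer}\n\n"
        "For each sentence, first judge whether it is faithful to the documents, "
        "then reason step by step and judge whether it is factually correct."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated judgment, not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```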
## Policy Optimization
TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework: TRM supplies the correctness reward, while an auxiliary reward model accounts for usefulness.
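For illustration only, one simple way to mix the two signals is a weighted sum. The `combined_reward` helper and the `alpha` weight below are assumptions, not the paper's actual reward formulation.

```python
# Hypothetical sketch of combining the two reward signals during RL training.
def combined_reward(
    correctness_scores: list[float],  # per-sentence correctness from TRM, assumed in [0, 1]
    usefulness_score: float,          # scalar usefulness from the auxiliary reward model
    alpha: float = 0.7,               # assumed trade-off weight between the two signals
) -> float:
    # Aggregate sentence-level correctness into one scalar (mean here),
    # then mix it with the usefulness score.
    correctness = sum(correctness_scores) / max(len(correctness_scores), 1)
    return alpha * correctness + (1 - alpha) * usefulness_score
```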
## Getting Started
Please follow the instructions at https://github.com/Martin-qyma/TRM for the detailed implementation.
## Model Tree for QiyaoMa/TRM
- Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B