From Faithfulness to Correctness: Generative Reward Models that Think Critically

[📜 Paper] [🖥️ Code] [🤗 Hugging Face]

In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.

Thinking-supervised Reward Model (TRM)

Given a query, answer, and supporting documents, TRM first evaluates the faithfulness of each answer sentence to the provided evidence. Based on this faithfulness assessment, TRM then applies a step-by-step reasoning framework to judge sentence-level correctness, explicitly modeling how each reasoning step aligns with both the external sources and the internal logic of the answer.
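Below is a minimal sketch of this two-stage, sentence-level scoring flow. It assumes a `generate_fn` callable that wraps TRM text generation (see Getting Started below); the prompt wording and the two-stage structure shown here are illustrative assumptions, not the official TRM templates from the paper or repository.

```python
from typing import Callable

def score_answer(
    query: str,
    answer_sentences: list[str],
    documents: list[str],
    generate_fn: Callable[[str], str],
) -> list[dict]:
    """Evaluate each answer sentence: first faithfulness, then correctness."""
    context = "\n\n".join(documents)
    results = []
    for sentence in answer_sentences:
        # Stage 1: judge whether the sentence is faithful to the supporting documents.
        # (Hypothetical prompt; the actual template is defined in the TRM repository.)
        faithfulness = generate_fn(
            f"Documents:\n{context}\n\nQuestion: {query}\n"
            f"Answer sentence: {sentence}\n"
            "Is this sentence faithful to the documents? Explain, then answer Yes or No."
        )
        # Stage 2: condition on the faithfulness judgment and reason step by step
        # about the sentence's factual correctness.
        correctness = generate_fn(
            f"Documents:\n{context}\n\nQuestion: {query}\n"
            f"Answer sentence: {sentence}\n"
            f"Faithfulness judgment: {faithfulness}\n"
            "Reason step by step, then decide whether the sentence is factually correct."
        )
        results.append({
            "sentence": sentence,
            "faithfulness": faithfulness,
            "correctness": correctness,
        })
    return results
```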

Policy Optimization

TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework, where TRM ensures correctness and an auxiliary reward model addresses usefulness.
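As a rough illustration, the per-sentence TRM judgments and the auxiliary usefulness score could be folded into a single scalar reward for the RL step. The aggregation (mean over sentences) and the weighting `alpha` below are illustrative assumptions, not the combination used in the paper.

```python
def combined_reward(
    correctness_labels: list[bool],   # per-sentence correctness judgments from TRM
    usefulness_score: float,          # scalar score from the auxiliary reward model
    alpha: float = 0.5,               # assumed trade-off weight, for illustration only
) -> float:
    """Blend TRM correctness with auxiliary usefulness into one scalar reward."""
    # Fraction of answer sentences judged correct by TRM.
    correctness = sum(correctness_labels) / max(len(correctness_labels), 1)
    # Weighted combination: correctness from TRM, usefulness from the auxiliary model.
    return alpha * correctness + (1.0 - alpha) * usefulness_score
```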

Getting Started

Please follow the instructions at https://github.com/Martin-qyma/TRM for the detailed implementation.
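For a quick start, the checkpoint published as QiyaoMa/TRM (33B parameters, BF16 Safetensors) can be loaded with a standard `transformers` setup. Treating it as a causal language model, as below, is an assumption; consult the repository for the exact loading and inference code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QiyaoMa/TRM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_fn(prompt: str) -> str:
    """Wrap TRM generation so it can be passed to the scoring sketch above."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=512)
    # Return only the newly generated text, not the prompt.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```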
