# From Faithfulness to Correctness: Generative Reward Models that Think Critically
[📜 Paper] [🖥️ Code] [🤗 Hugging Face]
In this repository, we introduce the Thinking-supervised Reward Model (TRM): a sentence-level generative reward model that equips language models with critical thinking abilities. TRM enables stepwise reasoning—from document faithfulness to factual correctness—for Chinese question answering (QA) tasks with supporting documents.
## Thinking-supervised Reward Model (TRM)
Given a query, answer, and supporting documents, TRM first evaluates the faithfulness of each answer sentence to the provided evidence. Based on this faithfulness assessment, TRM then applies a step-by-step reasoning framework to judge sentence-level correctness, explicitly modeling how each reasoning step aligns with both the external sources and the internal logic of the answer.
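As a rough, illustrative sketch (not the repository's official interface), sentence-level scoring with a generative reward model might look like the following. The model ID `QiyaoMa/TRM` comes from this model card, but the prompt template, the expected output format, and the `score_answer` helper are assumptions; the actual templates are defined in the TRM repository, and a chat template may be required in practice.

```python
# Hypothetical sketch: sentence-level faithfulness/correctness scoring with TRM.
# The prompt wording and output format below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "QiyaoMa/TRM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def score_answer(query: str, answer_sentences: list[str], documents: list[str]) -> str:
    """Ask the reward model to judge each answer sentence: first its
    faithfulness to the documents, then its factual correctness,
    following a stepwise reasoning format."""
    numbered_answer = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(answer_sentences))
    prompt = (
        "Supporting documents:\n" + "\n\n".join(documents) + "\n\n"
        f"Question: {query}\n"
        f"Answer sentences:\n{numbered_answer}\n\n"
        "For each sentence, first judge whether it is faithful to the documents, "
        "then reason step by step and judge whether it is factually correct."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated judgment, not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```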
## Policy Optimization
TRM is further incorporated into policy optimization within a reinforcement learning (RL) framework: TRM supplies the correctness reward, while an auxiliary reward model accounts for usefulness.
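For illustration only, one simple way to mix the two signals is a weighted sum. The `combined_reward` helper and the `alpha` weight below are assumptions, not the paper's actual reward formulation.

```python
# Hypothetical sketch of combining the two reward signals during RL training.
def combined_reward(
    correctness_scores: list[float],  # per-sentence correctness from TRM, assumed in [0, 1]
    usefulness_score: float,          # scalar usefulness from the auxiliary reward model
    alpha: float = 0.7,               # assumed trade-off weight between the two signals
) -> float:
    # Aggregate sentence-level correctness into one scalar (mean here),
    # then mix it with the usefulness score.
    correctness = sum(correctness_scores) / max(len(correctness_scores), 1)
    return alpha * correctness + (1 - alpha) * usefulness_score
```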
## Getting Started
Please follow the instructions at https://github.com/Martin-qyma/TRM for the detailed implementation.
## Model Tree for QiyaoMa/TRM
- Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B