Is RewardAnything also effective for model evaluation and comparison, like LLM-as-a-judge?
Reward modeling and model evaluation are quite similar. Some literature even uses RewardBench as a benchmark for model evaluation.
Yes, it could be used directly as an LLM-as-a-judge model. However, you should use it along with a prompt dataset in your desired domain and ensure the effectiveness, reliability, and integrity of that dataset. We'll include more details on this in our next version.
Thanks for the reply. However, the paper "Improving LLM-as-a-Judge Inference with the Judgment Distribution" notes that chain-of-thought (CoT) prompting makes an LLM-as-a-judge's judgment distribution "sharpen" or "collapse", reducing its dispersion and losing valuable fine-grained preference information. I haven't run it on the benchmark, but after a quick test of your LLM-finetuned reward model I see the same behavior the paper describes: after the prefix "scores": {"model-1": , the next token's logits over 1, 2, 3, 4, 5 collapse onto a single value, so its probability is almost 1. I had to raise the temperature to 10 before the top token's probability dropped from 1 to 0.95, which seems hard to explain.
This should be expected behavior. Since you are probably calculating logprobs given a CoT reasoning chain that already contains the model's thoughts on the quality and scores of the responses, it is very natural for it to keep a consistent rating in the final JSON object.
For instance, if the output looks like:
<think> ... model-1 is bad and should get the lowest score: 1. model-2 should ... </think>
{
"scores": {"model-1":
then the final score would only be consistent if the logprobs all concentrate on the token 1, so according to your description the behavior is expected.
In fact, if you would like to use it to obtain logprobs or a distribution of scores, you should run something like rejection sampling.
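For example, here is a rough sketch of what that could look like: sample several full generations at a non-zero temperature and aggregate the parsed scores. sample_judgment below is a hypothetical helper (not part of this repo) that runs one complete CoT + JSON generation and returns the parsed result.

from collections import Counter

def empirical_score_distribution(sample_judgment, model_id, n_samples=32):
    # Sample several independent CoT + JSON generations and count how often
    # each score in 1..5 is assigned to the given response
    counts = Counter()
    for _ in range(n_samples):
        result = sample_judgment()  # e.g. {"scores": {"model-1": 4, ...}, "best-to-worst": [...]}
        counts[result["scores"][model_id]] += 1
    # Normalize the counts into an empirical distribution over the scores 1..5
    return {s: counts.get(s, 0) / n_samples for s in range(1, 6)}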
Thanks for the reminder!
I experimented on a prompt and its responses multiple times and found that an explicit score appears in the model's thoughts in only a few cases. In most cases, the thoughts only reflect a relationship, like model A being better than model B, yet the logprobs still collapse. Across the multiple runs, though, the CoT steps differ and lead to slight differences in the final scores. Maybe that is the only adequate way to obtain probabilities over the scores 1 to 5.
The DeepSeek GRM paper mentions inference-time scaling for reward modeling, and I guess this would still work for RewardAnything.
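If I understand the idea correctly, the simplest form would be something like sampling several independent judgments and aggregating them. A rough sketch (sample_judgment is the same kind of hypothetical helper as above, returning the parsed JSON):

from collections import defaultdict

def aggregate_judgments(sample_judgment, k=8):
    # Simple inference-time scaling: sum (i.e. vote over) the scores each
    # response receives across k sampled judgments, then re-rank
    totals = defaultdict(int)
    for _ in range(k):
        result = sample_judgment()
        for model_id, score in result["scores"].items():
            totals[model_id] += score
    ranking = sorted(totals, key=totals.get, reverse=True)
    return dict(totals), ranking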
I wonder why you are trying to obtain logprobs though, as it is not the intended usage.
If you are trying to get logprobs for score tokens to do RL with algorithms like GRPO, I would recommend directly parsing the JSON outputs and converting the scores and rankings into a dense reward. In fact, something as simple as this would work:
group_scores = result.get("scores", {})
group_best_to_worst = result.get("best-to-worst", [])
# Map the 1-5 scores onto (0, 1]
group_scores_normalized = {prompt_id: score / 5.0 for prompt_id, score in group_scores.items()}
last_raw_score = None
last_assigned_score = None
for prompt_id in group_best_to_worst:
    raw_score = group_scores_normalized[prompt_id]
    if raw_score == last_raw_score:
        # Penalize tied responses a little bit: the earlier one in best-to-worst keeps
        # its score, and each later tied response gets 0.02 less than the previous one
        group_scores_normalized[prompt_id] = last_assigned_score - 0.02
    last_raw_score = raw_score
    last_assigned_score = group_scores_normalized[prompt_id]
A single LLM call can produce rewards for all responses within the rollout group; this is what we did in our case study, and it turned out to be effective.
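If it helps, those dense rewards can then be turned into GRPO-style group-relative advantages. A minimal sketch, not tied to any particular RL framework, assuming group_scores_normalized comes from the snippet above:

import statistics

# Group-relative advantage: each response's reward minus the group mean,
# divided by the group standard deviation
rewards = [group_scores_normalized[prompt_id] for prompt_id in group_best_to_worst]
mean_reward = statistics.mean(rewards)
std_reward = statistics.pstdev(rewards) or 1.0  # guard against a zero std when all rewards are equal
advantages = {
    prompt_id: (group_scores_normalized[prompt_id] - mean_reward) / std_reward
    for prompt_id in group_best_to_worst
}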
Well, my goal is not RLHF; rather, it is an empirical study on LLM evaluation using my statistical method, which, briefly speaking, performs quantile estimation on the scores of different LLMs and then applies something like an A/B test. The problem is that most SOTA reward models and the LLM-as-a-judge paradigm usually provide only a discrete score, while the paper mentioned above, "Improving LLM-as-a-Judge Inference with the Judgment Distribution", implies that integrating the distribution of scores works.
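To make the idea concrete, this is roughly the kind of estimate I have in mind. It is only an illustrative sketch, assuming per-prompt judge scores for two models have already been collected (e.g. via repeated sampling as discussed above):

import numpy as np

def quantile_diff_ci(scores_a, scores_b, q=0.1, n_boot=10000, alpha=0.05, seed=0):
    # Point estimate and bootstrap confidence interval for the difference of a
    # chosen quantile (e.g. a low quantile for safety-oriented comparisons)
    # between two models' score samples -- an A/B-test-style comparison
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    diffs = []
    for _ in range(n_boot):
        sa = rng.choice(scores_a, size=len(scores_a), replace=True)
        sb = rng.choice(scores_b, size=len(scores_b), replace=True)
        diffs.append(np.quantile(sa, q) - np.quantile(sb, q))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return np.quantile(scores_a, q) - np.quantile(scores_b, q), (lo, hi)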
(P.S. This is my graduate thesis, and my mentor, whose research interests do not include LLMs, questions whether quantile-based evaluation of LLMs really works in cases such as low-quantile performance for safety reasons and high-quantile performance for probing an LLM's best capability.)