---
license: apache-2.0
---


# Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
[[📖 Paper](https://arxiv.org/abs/2506.15068)] [[💻 GitHub](https://github.com/zli12321/long_form_rl)]


## About Open-Ended R1 Training
As open-ended long-form generation gains traction, reliably judging the quality of multi-sentence and paragraph-length outputs has become a major hurdle. Traditional overlap metrics such as ROUGE-L and BERTScore often miss nuances of coherence, style, and relevance, and can be skewed by pretraining biases. This leaves a critical gap in evaluation methods for guiding and training models that produce lengthy, free-form text.



## 🔥 <a name='rb'></a> Reward Model
- RewardBert is specifically targeted at free-form GRPO training, where answers cannot be judged by simple correctness checks.
- We fine-tune [ModernBERT](https://huggingface.co/docs/transformers/en/model_doc/modernbert) on [MOCHA](https://arxiv.org/abs/2010.03636), [Prometheus-preference](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), and [Pedants](https://arxiv.org/abs/2402.11161) to evaluate free-form text generations, and we use RewardBert as the reward signal in GRPO fine-tuning (a minimal sketch follows below).
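
As a rough illustration, the normalized RewardBert score can serve as the scalar reward for each rollout during GRPO training. The sketch below relies on the `compute_batch_scores` API documented later in this card; `reward_fn` and its signature are hypothetical and should be adapted to whatever interface your GRPO trainer expects.

```python
# Hypothetical sketch: RewardBert as a GRPO reward function.
# `reward_fn` and its signature are illustrative, not part of any
# specific GRPO library; adapt them to your trainer's interface.
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')

def reward_fn(reference_answers, completions):
    """Score each model completion against its reference answer.

    Returns the normalized RewardBert scores, used directly as
    the scalar rewards for the GRPO update.
    """
    normalized_scores, _raw_scores = rb.compute_batch_scores(
        reference_answers, completions, batch_size=len(completions)
    )
    return normalized_scores
```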

### Installation
```bash
## For more evaluation metrics, refer to https://github.com/zli12321/qa_metrics
pip install qa-metrics
```

#### Method: `compute_score`
**Parameters**
- `reference_answer` (str): The gold (correct) answer to the question.
- `candidate_answer` (str): The candidate answer to be evaluated.

**Returns**
- `tuple`: A tuple of `(normalized_score, raw_score)`.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
```


#### Method: `compute_batch_scores`
**Parameters**
- `reference_answers` (list of str): A list of gold (correct) answers to the questions.
- `candidate_answer` (list of str): A list of candidate answers to be evaluated.
- `batch_size` (int): The batch size used for prediction (default: 1).

**Returns**
- `tuple`: A tuple of two lists: normalized scores and raw scores.

```python
from qa_metrics.RewardBert import RewardBert

rb = RewardBert(device='cuda')
reference_answers = ["The Frog Prince"]
candidate_answers = ["The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""]
rb.compute_batch_scores(reference_answers, candidate_answers, batch_size=1)
# ([0.29113227128982544], [2.1645290851593018])
```

## Acknowledgements

We sincerely appreciate the contributions of the open-source community. Related projects: [R1-V](https://github.com/Deep-Agent/R1-V), [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1), [Video-R1](https://github.com/tulerfeng/Video-R1), [Qwen-2.5-VL](https://arxiv.org/abs/2502.13923)

## Citations

If you find our work helpful for your research, please consider citing our work.   

```
@misc{li2025semanticallyawarerewardsopenendedr1,
      title={Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation}, 
      author={Zongxia Li and Yapei Chang and Yuhang Zhou and Xiyang Wu and Zichao Liang and Yoo Yeon Sung and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2506.15068},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15068}, 
}
```

## VLMs that use RewardBert as an evaluator

```
@misc{li2025videohalluevaluatingmitigatingmultimodal,
      title={VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations for Synthetic Videos}, 
      author={Zongxia Li and Xiyang Wu and Yubin Qin and Guangyao Shi and Hongyang Du and Dinesh Manocha and Tianyi Zhou and Jordan Lee Boyd-Graber},
      year={2025},
      eprint={2505.01481},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.01481}, 
}
```