dongboklee committed · Commit 982942d · verified · 1 Parent(s): eb0435a

Update README.md

Files changed (1): README.md +1 -1
README.md CHANGED
@@ -15,7 +15,7 @@ language:
 # dORM-14B
 
 
-This model is a generative outcome reward model finetuned from [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), and the [training data](https://huggingface.co/datasets/dongboklee/train) is CoTs generated by [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) (mostly adapted from [VersaPRM](https://github.com/UW-Madison-Lee-Lab/VersaPRM)).
+This model is a generative outcome reward model finetuned from [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B), and the [training data](https://huggingface.co/datasets/dongboklee/train_gORM) is generated by [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) on [this data](https://huggingface.co/datasets/dongboklee/train).
 
 For details:
 - **Paper:** [Rethinking Reward Models for Multi-Domain Test-Time Scaling](https://arxiv.org/abs/2510.00492)