---
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
library_name: transformers
pipeline_tag: text-generation
tags:
- base_model:adapter:deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
- lora
- transformers
- reward-model
license: apache-2.0
language:
- en
---

# gORM-14B

This model is a generative outcome reward model (ORM) finetuned from [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B). Its [training data](https://huggingface.co/datasets/dongboklee/train_gORM) was generated by [QwQ-32B](https://huggingface.co/Qwen/QwQ-32B) from [this data](https://huggingface.co/datasets/dongboklee/train).

For details:

- **Paper:** [Rethinking Reward Models for Multi-Domain Test-Time Scaling](https://huggingface.co/papers/2510.00492)
- **Repository:** [https://github.com/db-Lee/Multi-RM](https://github.com/db-Lee/Multi-RM)

### Direct Use

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenizer and the ids of the " Yes" / " No" tokens used for scoring
tokenizer = AutoTokenizer.from_pretrained("dongboklee/gORM-14B")
yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[-1]
no_id = tokenizer.encode(" No", add_special_tokens=False)[-1]

# Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("dongboklee/gORM-14B")
model.eval()
model.to(device)

# Example question, candidate solution (list of steps), and domain
question = (
    "Question: In Python 3, which of the following function convert a string to an int in python?\n"
    "A. short(x)\nB. float(x)\nC. integer(x [,base])\nD. double(x)\nE. int(x [,base])\n"
    "F. long(x [,base] )\nG. num(x)\nH. str(x)\nI. char(x)\nJ. digit(x [,base])"
)
solution = [
    "To convert a string to an integer in Python 3, we use the built-in function int().",
    "The int() function takes two arguments: the string to be converted and an optional base (default is 10, which is for decimal).",
    "For example: int(\"123\", 10) converts the string \"123\" to the integer 123.",
    "Looking at the options, we can see that the correct function is option E: int(x [,base]).",
    "The answer is (E)."
]
category_name = "computer science"
prefix = "\n\n".join(solution)

# Build the verification prompt
prompt_text = (
    f"You are a {category_name} teacher. Grade the solution, verifying correctness step by step.\n"
    "At the end of Solution verification, when you give your final grade, write it in the form \"Verification: Is the answer correct (Yes/No)? X\", where X is either Yes or No.\n\n"
    f"[{category_name.capitalize()} Problem]\n{question.strip()}\n\n"
    f"[Solution]\n{prefix.strip()}\n"
)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt_text}],
    tokenize=False,
    add_generation_prompt=True,
    add_special_tokens=False
) + "Let's verify step by step:"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate the step-by-step verification
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=8192,
        return_dict_in_generate=True,
        output_logits=True,  # raw per-step logits (available in recent transformers versions)
        pad_token_id=tokenizer.eos_token_id
    )

# Compute the reward: `outputs.logits` is a tuple with one (batch, vocab) tensor per
# generated token; index -2 is the "Yes"/"No" verdict position, assuming the final
# generated token is EOS.
logits = outputs.logits[-2][0]
yes_logit, no_logit = logits[yes_id].item(), logits[no_id].item()
reward = math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))
```
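Since gORM-14B is intended for test-time scaling, a typical use is to score several candidate solutions to the same question and keep the one with the highest reward (best-of-N). The sketch below illustrates that pattern under some assumptions: it reuses the `tokenizer`, `model`, `device`, `yes_id`, and `no_id` objects from the snippet above, and the helper name `score_solution` and the `candidates` list are illustrative, not part of the released code.

```python
import math
import torch


def score_solution(question: str, solution_steps: list[str], category_name: str) -> float:
    """Return P(Yes) for one candidate solution, reusing tokenizer/model/device from above."""
    prefix = "\n\n".join(solution_steps)
    prompt_text = (
        f"You are a {category_name} teacher. Grade the solution, verifying correctness step by step.\n"
        "At the end of Solution verification, when you give your final grade, write it in the form "
        "\"Verification: Is the answer correct (Yes/No)? X\", where X is either Yes or No.\n\n"
        f"[{category_name.capitalize()} Problem]\n{question.strip()}\n\n"
        f"[Solution]\n{prefix.strip()}\n"
    )
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt_text}],
        tokenize=False,
        add_generation_prompt=True,
        add_special_tokens=False
    ) + "Let's verify step by step:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=8192,
            return_dict_in_generate=True,
            output_logits=True,
            pad_token_id=tokenizer.eos_token_id
        )
    # Softmax over the "Yes"/"No" logits at the verdict position (token before EOS).
    logits = outputs.logits[-2][0]
    yes_logit, no_logit = logits[yes_id].item(), logits[no_id].item()
    return math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))


# Hypothetical best-of-N selection: `candidates` would normally hold several sampled
# solutions for the same question; here it just wraps the example solution from above.
candidates = [solution]
rewards = [score_solution(question, cand, category_name) for cand in candidates]
best_solution = candidates[rewards.index(max(rewards))]
```

For the exact evaluation pipeline used in the paper, see the [repository](https://github.com/db-Lee/Multi-RM).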
## Citation

```
@article{multi-rm,
  title   = {Rethinking Reward Models for Multi-Domain Test-Time Scaling},
  author  = {Lee, Dong Bok and Lee, Seanie and Park, Sangwoo and Kang, Minki and Baek, Jinheon and Kim, Dongki and Wagner, Dominik and Jin, Jiongdao and Lee, Heejun and Bocklet, Tobias and Wang, Jinyu and Fu, Jingjing and Hwang, Sung Ju and Bian, Jiang and Song, Lei},
  journal = {arXiv preprint arXiv:2510.00492},
  year    = {2025}
}
```