dPRM-14B

dPRM-14B is a discriminative process reward model fine-tuned from DeepSeek-R1-Distill-Qwen-14B. Its training data consists of chains of thought (CoTs) generated by Llama3.1-8B-Instruct, mostly adapted from VersaPRM.

For details:

Paper: Rethinking Reward Models for Multi-Domain Test-Time Scaling
Repository: https://github.com/db-Lee/Multi-RM

Direct Use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# tokenizer
def get_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token  
    tokenizer.padding_side = 'left' 
    tokenizer.truncation_side = 'left'
    return tokenizer

tokenizer = get_tokenizer('dongboklee/dPRM-14B')
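# ids of the two label tokens: index 0 is "-" (incorrect step), index 1 is "+" (correct step)
candidate_tokens = [
  tokenizer.encode("-", add_special_tokens=False)[-1],
  tokenizer.encode("+", add_special_tokens=False)[-1]
]
# id of the last token of the step separator; step scores are read at these positions
tag_id = tokenizer.encode(" \n\n\n\n", add_special_tokens=False)[-1]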

# model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained('dongboklee/dPRM-14B')
model.eval()
model.to(device)

question = 'Question: In Python 3, which of the following function convert a string to an int in python?\nA. short(x)\nB. float(x)\nC. integer(x [,base])\nD. double(x)\nE. int(x [,base])\nF. long(x [,base] )\nG. num(x)\nH. str(x)\nI. char(x)\nJ. digit(x [,base])'
solution = ["To convert a string to an integer in Python 3, we use the built-in function int().",
            "The int() function takes two arguments: the string to be converted and an optional base (default is 10, which is for decimal).",
            "For example: int(\"123\", 10) converts the string \"123\" to the integer 123.",
            "Looking at the options, we can see that the correct function is option E: int(x [,base]).",
            "The answer is (E)."]
input_text = question + ' \n\n' + ' \n\n\n\n'.join(solution) + ' \n\n\n\n' # solution steps are separated by ' \n\n\n\n'
input_id = torch.tensor([tokenizer.encode(input_text)]).to(device)

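with torch.no_grad():
    # keep only the logits of the two candidate tokens ("-" and "+")
    logits = model(input_id).logits[:,:,candidate_tokens]
    # probability of "+" at every position
    scores = logits.softmax(dim=-1)[:,:,1]
    # one score per step, read at the separator token that closes the step
    step_scores = scores[input_id == tag_id]
    # solution-level score: the minimum over the step scores
    solution_score = min(step_scores.tolist())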
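
The scoring logic in the snippet above can be illustrated without loading the model. The following is a minimal sketch on made-up token ids and logits, not the repository's implementation: the helper names (`plus_probability`, `step_probabilities`, `aggregate_min`, `best_of_n`) and all numbers are hypothetical. It shows that a two-way softmax over the "-" and "+" logits is read only at separator positions, that the step scores are reduced with `min`, and, as one possible use, how such scores could rank several candidate solutions.

```python
import math

def plus_probability(minus_logit, plus_logit):
    """Two-way softmax over ("-", "+"), written as a numerically stable sigmoid."""
    d = plus_logit - minus_logit
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def step_probabilities(token_ids, pair_logits, tag_id):
    """Return P("+") at every position whose token id equals tag_id.

    pair_logits[i] is a hypothetical (minus_logit, plus_logit) pair for position i.
    """
    return [
        plus_probability(minus_logit, plus_logit)
        for tok, (minus_logit, plus_logit) in zip(token_ids, pair_logits)
        if tok == tag_id
    ]

def aggregate_min(step_probs):
    """Solution-level score: the lowest step score."""
    return min(step_probs)

def best_of_n(candidate_scores):
    """Index of the candidate with the highest aggregated score (assumed usage)."""
    return max(range(len(candidate_scores)), key=lambda i: candidate_scores[i])

# Toy data (made up): id 9 plays the role of the step separator token.
TAG_ID = 9
token_ids = [5, 9, 7, 9, 3, 9]
pair_logits = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0), (1.0, 1.0), (0.0, 0.0), (2.0, 0.0)]

toy_step_probs = step_probabilities(token_ids, pair_logits, TAG_ID)
toy_solution_score = aggregate_min(toy_step_probs)

# Ranking two hypothetical candidates by their aggregated scores.
chosen = best_of_n([toy_solution_score, 0.9])
```

With `min` aggregation, a single low-scoring step lowers the score of the whole solution, which is why the third step dominates the toy result here.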

Citation

@article{multi-rm,
  title   = {Rethinking Reward Models for Multi-Domain Test-Time Scaling},
  author  = {Lee, Dong Bok and Lee, Seanie and Park, Sangwoo and Kang, Minki and Baek, Jinheon and Kim, Dongki and Wagner, Dominik and Jin, Jiongdao and Lee, Heejun and Bocklet, Tobias and Wang, Jinyu and Fu, Jingjing and Hwang, Sung Ju and Bian, Jiang and Song, Lei},
  journal = {arXiv preprint arXiv:2510.00492},
  year    = {2025}
}

