Refusal Token Models
Collection
This collection contains models described in the refusal token paper published in COLM 2025.
•
4 items
•
Updated
This model is a fine-tuned version of meta-llama/Meta-Llama-3-8B on UltraChat SFT, CoCoNoT refusals, and CoCoNoT's contrast data as SFT data. Note that this model is not the model found in the paper the original models are not able to be released due to corporate legalities.
For generating a output from this model, please refer to the code found in repo in the coconot_eval
folder. However, for this model, model.generate
or pipeline
are also sufficient.