HUMOR-RM (Keye-VL Version)
Model Summary
HUMOR-RM is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the HUMOR (Hierarchical Understanding and Meme Optimization) framework.
This version is fine-tuned from Keye-VL on a dataset of pairwise meme comparisons ranked by human annotators. It takes two memes that share the same template as input and predicts which one is funnier, providing a consistent proxy for human preference.
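The card does not spell out the training objective, but pairwise reward models of this kind are commonly trained with a Bradley-Terry style loss that pushes the preferred meme's score above the other one. The sketch below is purely illustrative and does not reproduce the released training code:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_funnier: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: -log sigmoid(s_funnier - s_other), averaged over the batch."""
    return -F.logsigmoid(score_funnier - score_other).mean()

# Example: scores of 1.2 vs. 0.1 for the funnier / less funny meme give a loss of ~0.29.
print(pairwise_preference_loss(torch.tensor([1.2]), torch.tensor([0.1])))
```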
Requirements
This model is built on the LLaMA-Factory framework. To run inference, you must have `llamafactory` installed:
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
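A quick sanity check (optional, not part of the official setup) is to confirm that the entry points used by the inference script below resolve after installation:

```python
# Confirm that the LLaMA-Factory modules used by the inference wrapper are importable.
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer

print("llamafactory installed; inference entry points found.")
```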
How to Use
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.
1. Configuration (config.yaml)
Create a config.yaml file pointing to the base model and this adapter:
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo # or Local Path
template: keye # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora
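If you prefer to keep everything in one script, the same settings can be written out programmatically before constructing the scorer. The paths below are placeholders and should point at the Keye-VL base model and this adapter:

```python
import yaml

# Placeholder paths; adjust to your local copies of the base model and adapter.
config = {
    "model_name_or_path": "Kwai-Kolors/Keye-VL",
    "adapter_name_or_path": "path_to_this_repo",
    "template": "keye",  # must match the Keye-VL chat template
    "trust_remote_code": True,
    "finetuning_type": "lora",
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```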
2. Python Inference Code
import torch
import yaml
from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.hparams import get_infer_args
from llamafactory.model import AutoModelForBinaryClassification, load_tokenizer
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)

        # Force the reward-model (classification) configuration.
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer, processor, and chat template.
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load the Keye-VL base model.
        print("Loading Keye-VL base...")
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach the binary-classification (reward) head and load its weights.
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)

        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded humor adapter.")

        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Build a single-turn conversation containing both memes.
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using the Keye-VL template and its multimodal plugin.
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass through the reward model.
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # Image preprocessing depends on the Keye-VL version.
        }
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [score_pair_0, score_pair_1]; the exact interpretation depends on the
        # head configuration, but it is usually read as the preference for meme A over B.
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("config.yaml")
    scores = scorer.score("meme_a.jpg", "meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
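Pairwise judges can be sensitive to the order in which the two memes appear in the prompt. A common mitigation, sketched below under the assumption that the raw logits are comparable across calls, is to score both orderings and average; the `symmetric_compare` helper is illustrative, not part of the released code:

```python
import numpy as np

def symmetric_compare(scorer, img_a, img_b):
    """Score both orderings and average so that index 0 always refers to image A."""
    s_ab = np.asarray(scorer.score(img_a, img_b))        # logits for the (A, B) ordering
    s_ba = np.asarray(scorer.score(img_b, img_a))[::-1]  # flip so index 0 is A again
    avg = (s_ab + s_ba) / 2
    return ("A" if avg[0] > avg[1] else "B"), avg

# Usage:
# winner, scores = symmetric_compare(scorer, "meme_a.jpg", "meme_b.jpg")
```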
Intended Use
- Group-wise Ranking: Evaluating a set of generated captions for a single meme template to select the best punchline (see the ranking sketch after this list).
- RLHF/RLAIF: Providing reward signals for Reinforcement Learning training of meme generators.
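For group-wise ranking, one simple (if quadratic-cost) recipe is a round-robin tournament over all candidate pairs, ordering candidates by their number of pairwise wins. The helper below is a sketch built on the `MemeScorer` class above, not part of the released code; the same pairwise logits can also be turned into a scalar reward for RLHF/RLAIF, for example by comparing each generated meme against a fixed reference.

```python
from itertools import combinations

def rank_candidates(scorer, image_paths):
    """Round-robin ranking: each candidate meme earns one point per pairwise win."""
    wins = {path: 0 for path in image_paths}
    for a, b in combinations(image_paths, 2):
        logits = scorer.score(a, b)
        wins[a if logits[0] > logits[1] else b] += 1
    # Most pairwise wins first.
    return sorted(image_paths, key=lambda p: wins[p], reverse=True)

# Usage:
# ranking = rank_candidates(scorer, ["cand_1.jpg", "cand_2.jpg", "cand_3.jpg"])
```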
Training Data
The model was trained on the HUMOR-Preference Dataset, which consists of 5 difficulty tiers of meme pairs:
- Wrong Text: Original text vs. random text.
- Wrong Location: Correct text vs. a misplaced text box.
- Boring: Original vs. a non-humorous description.
- Detailed Boring: Original vs. subtle text changes that kill the joke.
- Generated: Fine-grained comparisons between model-generated memes.
Citation
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}