HUMOR-RM (Keye-VL Version)
Model Summary
HUMOR-RM is a pairwise reward model designed to evaluate and rank the humor quality of internet memes. It serves as the preference model in the HUMOR (Hierarchical Understanding and Meme Optimization) framework.
This version is fine-tuned from Keye-VL on a dataset of pairwise meme comparisons ranked by human annotators. It takes two memes that share the same template as input and predicts which one is funnier, providing a consistent proxy for human preference.
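The card does not spell out the training objective, but pairwise reward models of this kind are commonly trained with a Bradley-Terry style loss that pushes the preferred meme's score above the other one. The sketch below is purely illustrative and does not reproduce the released training code:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_funnier: torch.Tensor, score_other: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: -log sigmoid(s_funnier - s_other), averaged over the batch."""
    return -F.logsigmoid(score_funnier - score_other).mean()

# Example: scores of 1.2 vs. 0.1 for the funnier / less funny meme give a loss of ~0.29.
print(pairwise_preference_loss(torch.tensor([1.2]), torch.tensor([0.1])))
```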
Requirements
This model is built on the LLaMA-Factory framework. To run inference, you must have `llamafactory` installed:
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .
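A quick sanity check (optional, not part of the official setup) is to confirm that the entry points used by the inference script below resolve after installation:

```python
# Confirm that the LLaMA-Factory modules used by the inference wrapper are importable.
from llamafactory.hparams import get_infer_args
from llamafactory.model import load_tokenizer

print("llamafactory installed; inference entry points found.")
```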
How to Use
Since this model uses a custom classification head on top of Keye-VL, we recommend using the provided wrapper class for inference.
1. Configuration (config.yaml)
Create a config.yaml file pointing to the base model and this adapter:
model_name_or_path: Kwai-Kolors/Keye-VL
adapter_name_or_path: path_to_this_repo # or Local Path
template: keye # Important: Must match Keye-VL template
trust_remote_code: true
finetuning_type: lora
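If you prefer to keep everything in one script, the same settings can be written out programmatically before constructing the scorer. The paths below are placeholders and should point at the Keye-VL base model and this adapter:

```python
import yaml

# Placeholder paths; adjust to your local copies of the base model and adapter.
config = {
    "model_name_or_path": "Kwai-Kolors/Keye-VL",
    "adapter_name_or_path": "path_to_this_repo",
    "template": "keye",  # must match the Keye-VL chat template
    "trust_remote_code": True,
    "finetuning_type": "lora",
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```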
2. Python Inference Code
import torch
import yaml
from llamafactory.data import get_template_and_fix_tokenizer
from llamafactory.hparams import get_infer_args
from llamafactory.model import AutoModelForBinaryClassification, load_tokenizer
from llamafactory.model.model_utils.classification_head import prepare_classification_model
from llamafactory.model.patcher import patch_classification_model
from transformers import AutoModel


class MemeScorer:
    def __init__(self, config_path):
        with open(config_path) as f:
            config = yaml.safe_load(f)

        # Force the reward-model (classification) configuration.
        config.update({"stage": "rm_class", "finetuning_type": "lora"})
        model_args, data_args, _, _ = get_infer_args(config)

        # 1. Load tokenizer, processor, and chat template.
        tokenizer_mod = load_tokenizer(model_args)
        self.tokenizer = tokenizer_mod["tokenizer"]
        self.processor = tokenizer_mod.get("processor")
        self.template = get_template_and_fix_tokenizer(self.tokenizer, data_args)

        # 2. Load the Keye-VL base model.
        print("Loading Keye-VL base...")
        self.model = AutoModel.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
            torch_dtype=torch.float16,
        )

        # 3. Attach the binary-classification (reward) head and load its weights.
        prepare_classification_model(self.model)
        self.model = AutoModelForBinaryClassification.from_pretrained(self.model)
        patch_classification_model(self.model)

        if model_args.adapter_name_or_path:
            self.model.load_classification_head(model_args.adapter_name_or_path[0])
            print("Loaded humor adapter.")

        self.model.eval()

    def score(self, img1_path, img2_path, prompt="Which meme is funnier?"):
        # Build a single-turn conversation containing both memes.
        messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": ""}]
        images = [img1_path, img2_path]

        # Tokenize using the Keye-VL template and its multimodal plugin.
        proc_msgs = self.template.mm_plugin.process_messages(messages, images, [], [], self.processor)
        input_ids, _ = self.template.mm_plugin.process_token_ids([], [], images, [], [], self.tokenizer, self.processor)
        encoded = self.template.encode_multiturn(self.tokenizer, proc_msgs, None, None)
        input_ids += encoded[0][0]

        # Forward pass through the reward model.
        inputs = {
            "input_ids": torch.tensor([input_ids]).to(self.model.device),
            "attention_mask": torch.tensor([[1] * len(input_ids)]).to(self.model.device),
            "images": [images],  # Image preprocessing depends on the Keye-VL version.
        }
        with torch.no_grad():
            logits = self.model(**inputs).logits.cpu().numpy()[0]

        # Logits: [score_pair_0, score_pair_1]; the exact interpretation depends on the
        # head configuration, but it is usually read as the preference for meme A over B.
        return logits


# Usage
if __name__ == "__main__":
    scorer = MemeScorer("config.yaml")
    scores = scorer.score("meme_a.jpg", "meme_b.jpg")
    print(f"Scores: {scores} (Winner: {'A' if scores[0] > scores[1] else 'B'})")
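Pairwise judges can be sensitive to the order in which the two memes appear in the prompt. A common mitigation, sketched below under the assumption that the raw logits are comparable across calls, is to score both orderings and average; the `symmetric_compare` helper is illustrative, not part of the released code:

```python
import numpy as np

def symmetric_compare(scorer, img_a, img_b):
    """Score both orderings and average so that index 0 always refers to image A."""
    s_ab = np.asarray(scorer.score(img_a, img_b))        # logits for the (A, B) ordering
    s_ba = np.asarray(scorer.score(img_b, img_a))[::-1]  # flip so index 0 is A again
    avg = (s_ab + s_ba) / 2
    return ("A" if avg[0] > avg[1] else "B"), avg

# Usage:
# winner, scores = symmetric_compare(scorer, "meme_a.jpg", "meme_b.jpg")
```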
Intended Use
- Group-wise Ranking: Evaluating a set of generated captions for a single meme template to select the best punchline (see the ranking sketch after this list).
- RLHF/RLAIF: Providing reward signals for Reinforcement Learning training of meme generators.
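For group-wise ranking, one simple (if quadratic-cost) recipe is a round-robin tournament over all candidate pairs, ordering candidates by their number of pairwise wins. The helper below is a sketch built on the `MemeScorer` class above, not part of the released code; the same pairwise logits can also be turned into a scalar reward for RLHF/RLAIF, for example by comparing each generated meme against a fixed reference.

```python
from itertools import combinations

def rank_candidates(scorer, image_paths):
    """Round-robin ranking: each candidate meme earns one point per pairwise win."""
    wins = {path: 0 for path in image_paths}
    for a, b in combinations(image_paths, 2):
        logits = scorer.score(a, b)
        wins[a if logits[0] > logits[1] else b] += 1
    # Most pairwise wins first.
    return sorted(image_paths, key=lambda p: wins[p], reverse=True)

# Usage:
# ranking = rank_candidates(scorer, ["cand_1.jpg", "cand_2.jpg", "cand_3.jpg"])
```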
Training Data
The model was trained on the HUMOR-Preference Dataset, which consists of 5 difficulty tiers of meme pairs:
- Wrong Text: Original text vs. random text.
- Wrong Location: Correct text vs. a misplaced text box.
- Boring: Original vs. a non-humorous description.
- Detailed Boring: Original vs. subtle text changes that kill the joke.
- Generated: Fine-grained comparisons between model-generated memes.
Citation
@article{li2025perception,
  title={From Perception to Punchline: Empowering VLM with the Art of In-the-wild Meme},
  author={Li, Xueyan and Xue, Yingyi and Jiang, Mengjie and Zhu, Qingzi and Niu, Yazhe},
  journal={arXiv preprint arXiv:2512.24555},
  year={2025}
}