---
language:
  - en
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/LICENSE
pipeline_tag: feature-extraction
tags:
  - chat
  - cogvlm2
inference: false
---

# VisionReward-Image

## Introduction

We present VisionReward, a general strategy for aligning visual generation models, covering both image and video generation, with human preferences through a fine-grained and multi-dimensional framework. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, which are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. This repository hosts the VisionReward-Image model.
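
As a rough illustration of the linear weighting described above, the sketch below combines hypothetical yes/no judgments with hypothetical per-question weights; the actual checklist questions and weights used by VisionReward are the ones referenced in the usage sections below.

```python
# Illustrative only: hypothetical questions, judgments, and weights,
# not the released VisionReward checklist or weight file.
# Each checklist answer (yes -> 1, no -> 0) is linearly weighted and summed
# into a single interpretable score.
judgments = {"Is the subject clear?": 1, "Is the image overexposed?": 0, "Is the background detailed?": 1}
weights = {"Is the subject clear?": 0.8, "Is the image overexposed?": -0.5, "Is the background detailed?": 0.3}

score = sum(weights[q] * answer for q, answer in judgments.items())
print(f"toy VisionReward-style score: {score:.2f}")  # 1.10 in this toy example
```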

## Merging and Extracting Checkpoint Files

Use the following commands to merge the split checkpoint files into a single .tar archive and then extract it:

```bash
cat ckpts/split_part_* > ckpts/visionreward_image.tar
tar -xvf ckpts/visionreward_image.tar
```
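
If you would rather do the same merge and extraction from Python (for example on a system without `cat` and `tar`), a minimal equivalent sketch is shown below; it assumes the same ckpts/split_part_* layout as the commands above.

```python
# Python equivalent of the cat + tar commands above (assumes the same file layout).
import glob
import shutil
import tarfile

merged = "ckpts/visionreward_image.tar"

# Concatenate the split parts in sorted order into one tar archive.
with open(merged, "wb") as out:
    for part in sorted(glob.glob("ckpts/split_part_*")):
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)

# Extract the merged archive into the current directory, like `tar -xvf`.
with tarfile.open(merged) as archive:
    archive.extractall()
```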

## Using this model

You can quickly install the Python package dependencies and run model inference by following the instructions in our GitHub repository.

## Usage

### VQA (Visual Question Answering)

You can run the following commands for a checklist query. Available image and video questions can be found in VisionReward_Image/VisionReward_image_qa.txt and VisionReward_Video/VisionReward_video_qa.txt, respectively.

```bash
# For Image QA
python inference-image.py --bf16 --question [[your_question]]
# Input: image_path + prompt + question
# Output: yes/no

# For Video QA
python inference-video.py --question [[your_question]]
# Input: video_path + prompt + question
# Output: yes/no
```
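
To run the full checklist rather than a single question, you could iterate over the questions file and collect the yes/no answers, as in the sketch below. The `ask` function is a hypothetical stand-in for however you invoke a single checklist query in your setup (for example, by wrapping `inference-image.py` with the `--question` flag shown above); it is not part of the released code, and the image path and prompt are placeholders.

```python
# Sketch: iterate over the released image checklist and collect yes/no answers.
def ask(image_path: str, prompt: str, question: str) -> bool:
    # Hypothetical placeholder: replace with your own call into
    # `python inference-image.py --bf16 --question ...` (e.g. via subprocess).
    return False

with open("VisionReward_Image/VisionReward_image_qa.txt", encoding="utf-8") as f:
    questions = [line.strip() for line in f if line.strip()]

answers = {q: ask("example.png", "a cat sleeping on a sofa", q) for q in questions}
print(sum(answers.values()), "of", len(questions), "questions answered 'yes'")
```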

### Scoring with VisionReward

You can also calculate scores for images and videos with the following commands. The corresponding weights are in VisionReward_Image/weight.json and VisionReward_Video/weight.json.

```bash
# Scoring an Image
python inference-image.py --bf16 --score
# Input: image_path + prompt
# Output: score

# Scoring a Video
python inference-video.py --score
# Input: video_path + prompt
# Output: score
```

### Compare Two Videos

It's also possible to directly compare the quality of two videos, leveraging the weights in VisionReward_Video/weight.json.

```bash
python inference-video.py --compare
# Input: video_path1 + video_path2 + prompt
# Output: better_video
```
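
If you need pairwise comparison in your own pipeline, one straightforward reading of this mode is to score both videos under the same prompt and keep the higher-scoring one. The sketch below assumes a hypothetical `score` helper (for example, wrapping `inference-video.py --score`) and is not the repository's exact implementation.

```python
# Sketch: compare two videos by scoring each one and keeping the higher score.
def score(video_path: str, prompt: str) -> float:
    # Hypothetical placeholder: replace with your own call into
    # `python inference-video.py --score`.
    return 0.0

def better_video(video_a: str, video_b: str, prompt: str) -> str:
    return video_a if score(video_a, prompt) >= score(video_b, prompt) else video_b

print(better_video("video1.mp4", "video2.mp4", "a dog running on the beach"))
```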

This model uses fp32-precision parameters and requires the sat (SwissArmyTransformer) library for inference. For the bf16 (bfloat16) version of the model, please refer to https://huggingface.co/THUDM/VisionReward-Image-bf16