VARCO-VISION-2.0-1.7B

Introduction

VARCO-VISION-2.0 is a multimodal AI model capable of understanding both images and text to answer user queries. It supports multi-image inputs, enabling effective processing of complex content such as documents, tables, and charts. The model demonstrates strong comprehension in both Korean and English, with significantly improved text generation capabilities and a deeper understanding of Korean cultural context. Compared to its predecessor, performance has been notably enhanced across various benchmarks, and its usability in real-world scenarios, such as everyday Q&A and information summarization, has also improved.

In addition to the 14B full-scale model, a lightweight 1.7B version is available for on-device use, making it accessible on personal devices such as smartphones and PCs. VARCO-VISION-2.0 is a powerful open-source AI model built for Korean users and is freely available for a wide range of applications.

🚨News🎙️

  • 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B-OCR at link
  • 📰 2025-07-28: We released VARCO-VISION-2.0-1.7B at link
  • 📰 2025-07-18: We updated the checkpoint of VARCO-VISION-2.0-14B for improved performance.
  • 📰 2025-07-16: We released VARCO-VISION-2.0-14B at link
  • 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at link

Key Features

  • Multi-image Understanding: Newly added support for multi-image inputs enables the model to analyze multiple images simultaneously and make more holistic and context-aware decisions.
  • Korean Language Specialization: The model is further specialized for Korean, with a deeper understanding of Korean language, context, and culture. Korean text generation has been significantly improved, resulting in more natural, fluent, and accurate responses.
  • OCR with Text Localization: Unlike typical models that only recognize and generate text from images, VARCO-VISION-2.0 can also identify the position of the text and provide bounding boxes around it. This makes it especially useful for document understanding, signage interpretation, and structured visual data.
  • Enhanced Safety: The model now offers improved handling of harmful or sexually explicit content, ensuring safer and more reliable interactions.

VARCO-VISION-2.0 Family

| Model Name | Base Models (Vision / Language) | HF Link |
| --- | --- | --- |
| VARCO-VISION-2.0-14B | siglip2-so400m-patch16-384 / Qwen3-14B | link |
| VARCO-VISION-2.0-1.7B | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| VARCO-VISION-2.0-1.7B-OCR | siglip2-so400m-patch16-384 / Qwen3-1.7B | link |
| GME-VARCO-VISION-Embedding | Qwen2-VL-7B-Instruct | link |

Model Architecture

VARCO-VISION-2.0 follows the architecture of LLaVA-OneVision.

Evaluation

We used VLMEvalKit for evaluation whenever possible and wrote our own implementations only for benchmarks the toolkit does not support, to keep comparisons with other open-source models fair. For benchmarks that rely on LLM-based evaluation (e.g., LLaVABench), results may not be exactly reproducible because the behavior of the underlying LLM can vary.

Korean Benchmark

| Benchmark | InternVL3-2B | Ovis2-2B | VARCO-VISION-2.0-1.7B |
| --- | --- | --- | --- |
| K-MMBench_DEV | 76.9 | 68.4 | 77.3 |
| K-MMStar | 50.1 | 10.9 | 45.3 |
| K-SEED | 69.2 | 34.5 | 70.5 |
| K-LLaVABench | 47.6 | 67.2 | 70.2 |
| K-DTCBench | 68.8 | 44.6 | 60.0 |
| AVERAGE | 62.5 | 45.1 | 64.7 |
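
The AVERAGE row appears to be the unweighted mean of the five benchmark scores in each column. A minimal Python sketch that recomputes it from the table values (the name `korean_scores` and the helper `mean_score` are illustrative and not part of the evaluation code):

```python
# Scores copied from the Korean benchmark table, in row order:
# K-MMBench_DEV, K-MMStar, K-SEED, K-LLaVABench, K-DTCBench.
korean_scores = {
    "InternVL3-2B": [76.9, 50.1, 69.2, 47.6, 68.8],
    "Ovis2-2B": [68.4, 10.9, 34.5, 67.2, 44.6],
    "VARCO-VISION-2.0-1.7B": [77.3, 45.3, 70.5, 70.2, 60.0],
}


def mean_score(scores):
    """Unweighted mean, rounded to one decimal place like the table."""
    return round(sum(scores) / len(scores), 1)


averages = {model: mean_score(scores) for model, scores in korean_scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")
```

The recomputed values match the published row (62.5, 45.1, 64.7), which suggests each benchmark carries equal weight in the average.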

English Benchmark

| Benchmark | InternVL3-2B | Ovis2-2B | VARCO-VISION-2.0-1.7B |
| --- | --- | --- | --- |
| MMStar | 61.1 | 56.7 | 52.5 |
| MMMU_VAL | 48.7 | 45.6 | 44.3 |
| MathVista | 57.6 | 64.1 | 55.8 |
| OCRBench | 83.1 | 87.3 | 80.5 |
| AI2D | 78.6 | 82.7 | 75.1 |
| HallusionBench | 41.9 | 50.2 | 45.2 |
| MMVet | 67.0 | 58.3 | 54.5 |
| SEEDBench_IMG | 75.0 | 74.4 | 74.1 |
| LLaVABench | 72.1 | 76.6 | 72.4 |
| RealWorldQA | 65.1 | 66.0 | 65.4 |
| POPE | 90.1 | 87.8 | 87.5 |
| ScienceQA_TEST | 95.8 | 91.2 | 84.2 |
| SEEDBench2_Plus | 64.8 | 67.4 | 64.3 |
| BLINK | 53.1 | 47.9 | 46.3 |
| TextVQA_VAL | 78.6 | 80.0 | 75.7 |
| ChartQA_TEST | 76.0 | 81.4 | 72.3 |
| Q-Bench1_VAL | 71.9 | 76.3 | 73.3 |
| A-Bench_VAL | 74.3 | 76.2 | 72.2 |
| DocVQA_TEST | 88.2 | 91.9 | 81.6 |
| InfoVQA_TEST | 66.9 | 71.7 | 60.6 |
| AVERAGE | 70.5 | 71.7 | 66.9 |

Cultural Benchmark

| Benchmark | InternVL3-2B | Ovis2-2B | VARCO-VISION-2.0-1.7B |
| --- | --- | --- | --- |
| K-Viscuit | 60.0 | 64.1 | 60.7 |
| PangeaBench (ko) | 66.2 | 63.1 | 60.7 |
| PangeaBench | 58.4 | 59.2 | 56.0 |

Text-only Benchmark

| Benchmark | InternVL3-2B | Ovis2-2B | VARCO-VISION-2.0-1.7B |
| --- | --- | --- | --- |
| MMLU | 59.9 | 12.9 | 57.9 |
| MT-Bench | 6.28 | 6.14 | 7.33 |
| KMMLU | 38.0 | 31.1 | 17.5 |
| KoMT-Bench | 2.91 | 3.44 | 5.17 |
| LogicKor | 2.56 | 3.12 | 4.99 |

Note: Some models show unusually low performance on the MMLU benchmark. This is primarily due to their failure to correctly follow the expected output format when only few-shot exemplars are provided in the prompts. Please take this into consideration when interpreting the results.

OCR Benchmark

| Benchmark | PaddleOCR | EasyOCR | VARCO-VISION-2.0-1.7B |
| --- | --- | --- | --- |
| CORD | 91.4 | 77.8 | 92.1 |
| ICDAR2013 | 92.0 | 85.0 | 94.8 |
| ICDAR2015 | 73.7 | 57.9 | 71.2 |

Usage

To use this model, we recommend installing transformers version 4.53.1 or higher. While it may work with earlier versions, using 4.53.1 or above is strongly recommended, especially to ensure optimal performance for the multi-image feature.
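
Before loading the model, it may be worth confirming that the installed transformers release meets the recommended minimum. The sketch below uses only the standard library; `parse_release` and `meets_minimum` are hypothetical helpers, and the parsing assumes plain dotted release numbers such as 4.53.1 (suffixes like `.dev0` are ignored):

```python
import re
from importlib import metadata

RECOMMENDED = "4.53.1"  # minimum version recommended in the text above


def parse_release(version: str) -> tuple:
    """Turn '4.53.1' or '4.54.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, required: str = RECOMMENDED) -> bool:
    """True if the installed release is at least the required one."""
    return parse_release(installed) >= parse_release(required)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "older than recommended"
        print(f"transformers {installed}: {status}")
```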

The basic usage is identical to LLaVA-OneVision:

import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_name = "NCSOFT/VARCO-VISION-2.0-1.7B"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B/resolve/main/demo.jpg"},
            # Korean prompt: "Please output the color and text of each box accurately, one line per box."
            {"type": "text", "text": "각 박스마다 한 줄씩 색상과 글자를 정확하게 출력해주세요."},
        ],
    },
]

# Apply the chat template, tokenize the text, and load the image in one call.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
# generate() returns prompt + completion; drop the prompt tokens before decoding.
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=True)
print(output)
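
`generate` returns the prompt tokens followed by the newly generated ones, which is why the snippet above slices each output by the length of its input. A toy illustration with plain Python lists (the token IDs are made up; the real values are tensors, but the slicing logic is the same):

```python
# Made-up token IDs standing in for inputs.input_ids and generate_ids.
input_ids = [[101, 7, 8, 9], [101, 5, 6, 0]]  # two prompts of equal (padded) length
generate_ids = [
    [101, 7, 8, 9, 42, 43],  # prompt 1 followed by two new tokens
    [101, 5, 6, 0, 77, 78],  # prompt 2 followed by two new tokens
]

# Drop the first len(prompt) tokens of every sequence, keeping only the answer.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generate_ids)]
print(trimmed)  # [[42, 43], [77, 78]]
```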

Multi-image inference

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            # Korean prompt: "Identify the similarities between the images."
            {"type": "text", "text": "이미지 간의 유사점을 파악하세요."},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=True)
print(output)

Batch inference

All inputs in a batch must have the same modality structure (for example, text-only with text-only, single-image with single-image, and multi-image with multi-image) to ensure correct batch inference.

conversation_1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            # Korean prompt: "Please describe the image."
            {"type": "text", "text": "이미지를 설명해주세요."},
        ],
    },
]
conversation_2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            # Korean prompt: "What is shown in this image?"
            {"type": "text", "text": "이 이미지에 표시된 것은 무엇인가요?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    [conversation_1, conversation_2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    padding=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.batch_decode(generate_ids_trimmed, skip_special_tokens=True)
print(output)
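
Because every batch must share one modality structure, a mixed workload has to be split before it reaches `apply_chat_template`. One possible approach is to bucket conversations by the number of images they contain, which is a conservative reading of the rule above. `count_images` and `group_by_image_count` are hypothetical helpers, not part of transformers, and the sketch assumes the list-of-dicts message format used in the examples:

```python
from collections import defaultdict


def count_images(conversation):
    """Number of image entries across all messages of one conversation."""
    return sum(
        1
        for message in conversation
        for item in message["content"]
        if item.get("type") == "image"
    )


def group_by_image_count(conversations):
    """Map image count -> list of conversations with that many images."""
    groups = defaultdict(list)
    for conversation in conversations:
        groups[count_images(conversation)].append(conversation)
    return dict(groups)


# Placeholder conversations: text-only, single-image, and two-image.
text_only = [{"role": "user", "content": [{"type": "text", "text": "Hello"}]}]
single = [{"role": "user", "content": [
    {"type": "image", "image": "/path/to/image1.jpg"},
    {"type": "text", "text": "Describe the image."},
]}]
double = [{"role": "user", "content": [
    {"type": "image", "image": "/path/to/image1.jpg"},
    {"type": "image", "image": "/path/to/image2.jpg"},
    {"type": "text", "text": "Compare the images."},
]}]

batches = group_by_image_count([single, text_only, double, single])
print({n: len(convs) for n, convs in batches.items()})  # {1: 2, 0: 1, 2: 1}
```

Each resulting group can then be passed to `apply_chat_template` as its own batch with `padding=True`.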

OCR inference

from PIL import Image
# Image.open expects a filesystem path (or file object), not a file:// URI.
image = Image.open("/path/to/image.jpg")
# Image upscaling for OCR performance boost
w, h = image.size
target_size = 2304
if max(w, h) < target_size:
    scaling_factor = target_size / max(w, h)
    new_w = int(w * scaling_factor)
    new_h = int(h * scaling_factor)
    image = image.resize((new_w, new_h))
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            # OCR task prompt; the output is decoded with special tokens kept below.
            {"type": "text", "text": "<ocr>"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=1024)
generate_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)
]
output = processor.decode(generate_ids_trimmed[0], skip_special_tokens=False)
print(output)
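
With `skip_special_tokens=False`, the decoded string keeps the markup that carries the localization results. The output format is not shown here, so the sketch below assumes each recognized span is emitted as `<char>text</char><bbox>x1, y1, x2, y2</bbox>` with coordinates normalized to the range 0 to 1. If the actual output differs, the pattern needs adjusting. `parse_ocr` and the sample string are illustrative only:

```python
import re

# Assumed format (not confirmed by the text above):
#   <char>TEXT</char><bbox>x1, y1, x2, y2</bbox>
OCR_PATTERN = re.compile(r"<char>(.*?)</char>\s*<bbox>(.*?)</bbox>", re.DOTALL)


def parse_ocr(output, width, height):
    """Return (text, (left, top, right, bottom)) pairs in pixel coordinates."""
    results = []
    for text, box in OCR_PATTERN.findall(output):
        x1, y1, x2, y2 = (float(value) for value in box.split(","))
        pixel_box = (
            round(x1 * width),
            round(y1 * height),
            round(x2 * width),
            round(y2 * height),
        )
        results.append((text, pixel_box))
    return results


sample = "<char>EXIT</char><bbox>0.1, 0.2, 0.5, 0.4</bbox>"  # made-up model output
print(parse_ocr(sample, width=1000, height=500))
```

Scaling by the size of the original image, rather than the upscaled one, gives boxes that can be drawn directly on the source file, provided the coordinates are indeed normalized.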