Raon-VisionEncoder

Raon-VisionEncoder is a 1.14B-parameter vision-language foundation model from KRAFTON for image and text feature extraction. It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex. It is built on OpenCLIP, with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.

Pretrained Models

| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|---|---|---|---|---|---|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
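
As a rough illustration of what the 256-patch NaFlex default implies at a 16x16 patch size, the sketch below picks the largest patch grid that follows an image's aspect ratio while staying within the patch budget. The function name `naflex_grid` and the binary-search resizing rule are assumptions made for illustration; the processor shipped with the model may round or resize differently.

```python
import math


def naflex_grid(height, width, patch_size=16, max_patches=256):
    """Return (rows, cols) of a patch grid that fits within max_patches.

    Hypothetical helper: it searches for the largest uniform resize factor
    whose patch grid still fits the budget. This is not the model's
    actual preprocessing code.
    """

    def grid_at(scale):
        # Number of patches needed to cover the resized image in each direction.
        rows = max(1, math.ceil(height * scale / patch_size))
        cols = max(1, math.ceil(width * scale / patch_size))
        return rows, cols

    lo, hi = 0.0, 100.0  # search range for the resize factor
    for _ in range(60):
        mid = (lo + hi) / 2
        rows, cols = grid_at(mid)
        if rows * cols <= max_patches:
            lo = mid  # fits the budget, try a larger image
        else:
            hi = mid  # too many patches, shrink
    return grid_at(lo)


if __name__ == "__main__":
    for h, w in [(512, 512), (1024, 512), (300, 1000)]:
        rows, cols = naflex_grid(h, w)
        print(f"{h}x{w} -> {rows}x{cols} grid, {rows * cols} patches")
```

Under these assumptions a 512x512 image maps to a 16x16 grid (exactly 256 patches), while a 1024x512 image maps to a 22x11 grid (242 patches), so a non-square image keeps its shape instead of being squashed to a square.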

Requirements

pip install torch torchvision timm transformers huggingface-hub safetensors ftfy

Quick Start

import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

    # Compute similarity with learned scale and bias
    logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
    probs = logits.softmax(dim=-1)
    print(probs)
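
The scoring step above can be reproduced on toy vectors to see what each term does. The sketch below uses only the standard library; the feature vectors and the `logit_scale` and `logit_bias` values are made-up placeholders, not the model's learned parameters, and the helper names are hypothetical. It also computes a per-pair sigmoid, which is how SigLIP-style models are commonly scored; whether that is the intended readout for this model is an assumption.

```python
import math


def pair_logits(img_feat, txt_feats, logit_scale, logit_bias):
    """exp(logit_scale) * <img, txt> + logit_bias for each text feature."""
    scale = math.exp(logit_scale)
    return [
        scale * sum(a * b for a, b in zip(img_feat, txt)) + logit_bias
        for txt in txt_feats
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # Toy unit-length features standing in for encode_image / encode_text output.
    img = [1.0, 0.0]
    txts = [[0.8, 0.6], [0.0, 1.0]]

    # Placeholder parameters, chosen only for illustration.
    logits = pair_logits(img, txts, logit_scale=math.log(10.0), logit_bias=-5.0)

    print("softmax over candidates:", softmax(logits))
    print("independent sigmoid scores:", [sigmoid(x) for x in logits])
```

Assuming the bias is a single scalar, it is added to every logit equally and therefore cancels under softmax; it only changes the result when each image-text pair is scored independently with a sigmoid.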

API Reference

| Method | Input | Output |
|---|---|---|
| model.encode_image(**inputs) | Processor output (image) | [B, 1152] normalized image features |
| model.encode_text(**inputs) | Processor output (text) | [B, 1152] normalized text features |
| model.logit_scale | - | Learned temperature parameter, stored in log space (apply .exp() before use) |
| model.logit_bias | - | Learned bias parameter |
| model.get_processor(repo_id) | Hugging Face repo ID | Processor instance |
| processor(images=img) | PIL Image | Preprocessed image dict |
| processor(text=["a cat"]) | List of strings | Tokenized text dict |
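
Because both encoders return normalized features of the same width, image-text retrieval reduces to ranking candidates by dot product. The sketch below shows that ranking step in plain Python on small toy vectors; `rank_by_cosine` is a hypothetical helper, and in practice the inputs would be rows taken from the `encode_image` and `encode_text` outputs rather than hand-written lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (no-op for already normalized features)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_cosine(query_feat, gallery_feats, top_k=3):
    """Return the top_k (index, cosine similarity) pairs, best match first."""
    query = l2_normalize(query_feat)
    scored = []
    for idx, feat in enumerate(gallery_feats):
        cand = l2_normalize(feat)
        scored.append((idx, sum(a * b for a, b in zip(query, cand))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-D features; real features from this model are 1152-dimensional.
    image_query = [1.0, 0.0]
    caption_gallery = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
    for idx, score in rank_by_cosine(image_query, caption_gallery, top_k=2):
        print(f"caption {idx}: cosine {score:.2f}")
```

The same function works in the other direction (text query against an image gallery), since the two feature spaces are shared.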

License

This repository is licensed under the Apache License 2.0. Third-party notices are listed in the NOTICE file.

© 2026 KRAFTON
