Raon-VisionEncoder is a 1.14B-parameter vision-language foundation model by KRAFTON for image and text feature extraction. It supports zero-shot image classification, image-text retrieval, and native aspect ratio inference via NaFlex. It is built on OpenCLIP with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.

Pretrained Models

Model	Params (Inference)	Vision	Text	Patch Size	NaFlex Default Patches
LocCa ViT-SO400M-16-SigLIP2	1.14B	0.43B	0.71B	16x16	256
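
To make the last two columns concrete: a budget of 256 patches at 16x16 pixels covers 65,536 pixels, which is a 16x16 grid for a square image. The sketch below shows one way a patch grid could be chosen for a non-square image under that budget. The function `naflex_grid` and its rounding rule are hypothetical illustrations, not the processor's actual resizing logic.

```python
import math

def naflex_grid(width, height, patch=16, max_patches=256):
    """Hypothetical helper: pick a (rows, cols) patch grid that roughly keeps
    the image's aspect ratio while staying within the patch budget."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale factor that would make the resized area equal to the full budget.
    scale = math.sqrt(max_patches * patch * patch / (width * height))
    # Round down so that rows * cols never exceeds the budget.
    cols = max(1, math.floor(width * scale / patch))
    rows = max(1, math.floor(height * scale / patch))
    # Extreme aspect ratios can hit the 1-patch floor; clamp to the budget.
    cols = min(cols, max_patches // rows)
    rows = min(rows, max_patches // cols)
    return rows, cols

square = naflex_grid(256, 256)      # a square image uses the full 16x16 grid
landscape = naflex_grid(640, 480)   # a 4:3 image gets more columns than rows
print(square, landscape)
```

The returned grid multiplied by the patch size gives the resized resolution, so the aspect ratio is approximately preserved instead of being squashed to a square.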

Requirements

pip install torch torchvision timm transformers huggingface-hub safetensors ftfy

Quick Start

import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text (convert to RGB so grayscale or RGBA files are handled)
img_inputs = processor(images=Image.open("assets/photo.jpg").convert("RGB"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

    # Scale and shift the similarities with the learned temperature and bias,
    # then softmax over the text prompts to get one probability per label
    logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
    probs = logits.softmax(dim=-1)
    print(probs)
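
The scoring step can be traced without loading the model. The following dependency-free sketch uses made-up similarity, scale, and bias values (all assumptions, not values from the checkpoint) and assumes `logit_bias` is a single scalar. It contrasts the softmax used above, which yields a distribution over the candidate prompts, with a per-pair sigmoid, which would be the natural reading if the model was trained with a SigLIP-style loss; the card does not state this, so treat that as an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up stand-ins for one image scored against ["a cat", "a dog"].
cos_sims = [0.12, 0.03]     # img_feat @ txt_feat.T for one image
logit_scale_exp = 100.0     # stands in for model.logit_scale.exp()
logit_bias = -10.0          # stands in for model.logit_bias (assumed scalar)

logits = [logit_scale_exp * s + logit_bias for s in cos_sims]

# Softmax: probabilities over the prompts, summing to 1.
label_probs = softmax(logits)

# Sigmoid: an independent match probability for each image-text pair.
pair_probs = [sigmoid(z) for z in logits]

print(label_probs, pair_probs)
```

A bias shared by every logit cancels inside softmax, so it affects only the sigmoid scores. Softmax suits picking one label from a closed set, while per-pair sigmoid scores can all be low when no prompt matches the image.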

API Reference

Method	Input	Output
`model.encode_image(**inputs)`	Processor output (image)	`[B, 1152]` normalized image features
`model.encode_text(**inputs)`	Processor output (text)	`[B, 1152]` normalized text features
`model.logit_scale`	-	Learned log-temperature parameter (apply `.exp()` before scaling similarities)
`model.logit_bias`	-	Learned bias parameter
`model.get_processor(repo_id)`	HuggingFace repo ID	Processor instance
`processor(images=img)`	PIL Image	Preprocessed image dict
`processor(text=["a cat"])`	list of strings	Tokenized text dict
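
Because both encoders return normalized features, a dot product between an image feature and a text feature is their cosine similarity, and image-text retrieval reduces to sorting candidates by that score. A minimal sketch with toy 3-dimensional vectors standing in for the `[B, 1152]` features; `l2_normalize` and `rank_by_similarity` are hypothetical helpers, not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as the encoders do for their outputs."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_similarity(query, candidates, top_k=2):
    """Return (index, score) pairs for the top_k candidates by dot product."""
    scores = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Toy stand-ins: one text query and three image features.
text_query = l2_normalize([1.0, 0.0, 0.0])
image_feats = [
    l2_normalize([0.9, 0.1, 0.0]),  # close to the query
    l2_normalize([0.0, 1.0, 0.0]),  # orthogonal to the query
    l2_normalize([0.5, 0.5, 0.0]),  # partially aligned
]

top = rank_by_similarity(text_query, image_feats)
print(top)
```

With real features the same ranking comes from `img_feat @ txt_feat.T` followed by a sort or top-k along the candidate dimension.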

License

This repository is licensed under the Apache License 2.0. Third-party notices are listed in the NOTICE file.
