ko-vdr-preview

Korean visual document retrieval — 6 MTEB multimodal tasks (text→image).

A LoRA fine-tune of Qwen/Qwen3-VL-Embedding-2B trained on a mixed Korean/English VDR corpus with hard negatives mined by Qwen3-VL-Embedding-8B. Supports Matryoshka embeddings down to 128 dimensions (default: 2048).

Summary of Findings

  • Significant improvement over 2B: ko-vdr-preview substantially outperforms the Qwen3-VL-2B baseline (~0.48 vs ~0.35 avg nDCG@10).
  • Closing the gap with 8B: performance is close to the Qwen3-VL-8B model (0.4845 vs 0.5013 avg nDCG@10), approaching its accuracy at a fraction of the parameter count.

Usage

Install Dependencies

pip install -U "sentence-transformers>=5.4.1"

Python code

Then load the model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("johnandru/ko-vdr-preview")

# Run inference
queries = [
    # "Please determine whether the average monthly non-statutory welfare cost per employee
    # at firms with 30 or more regular workers is higher than at firms with 10–29 workers."
    '30인 이상 상용근로자를 보유한 기업의 1인당 평균 월별 법정외 복지비용이 10~29인 규모 기업보다 높은지 판단해 주세요'
]
documents = [
    'ko-vdr-public/3818.png',
    'ko-vdr-public/7753.png',
    'ko-vdr-public/3760.png'
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 2048] [3, 2048]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
```

Matryoshka truncation

The model supports shortened embeddings via Matryoshka training. Supported dimensions: 2048, 1536, 1024, 768, 512, 256, 128.

```python
model = SentenceTransformer("johnandru/ko-vdr-preview", truncate_dim=512)
```
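Why truncation works can be seen with a toy example: Matryoshka training concentrates information in the leading dimensions, so cosine similarity computed on a prefix of the vector approximates the full-dimension score. A minimal pure-Python sketch (toy vectors, not real model outputs):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 8-dim "embeddings"; Matryoshka training front-loads information,
# so a prefix of the vector remains a usable embedding on its own.
q = [0.9, 0.4, 0.1, 0.05, 0.02, 0.01, 0.01, 0.01]
d = [0.8, 0.5, 0.2, 0.04, 0.03, 0.01, 0.02, 0.01]

full = cosine(q, d)
truncated = cosine(q[:4], d[:4])  # keep only the leading dimensions
print(round(full, 4), round(truncated, 4))  # the two scores nearly coincide
```

With real Matryoshka embeddings the prefix is a trained sub-embedding, so the retrieval quality at 512 or 256 dimensions degrades gracefully rather than collapsing.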

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen3-VL-Embedding-2B |
| Fine-tuning method | LoRA (r=32, alpha=32, no dropout) |
| LoRA target modules | q_proj, k_proj, v_proj, up_proj, down_proj, gate_proj |
| Embedding dimension | 2048 (Matryoshka: 1536 / 1024 / 768 / 512 / 256 / 128) |
| Precision | bfloat16 |
| Attention | Flash Attention 2 |
| Max image pixels | 1280 × 28 × 28 |
| Framework | sentence-transformers==5.4.1, peft>=0.19.1 |
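For reference, the LoRA settings above map onto a standard peft LoraConfig; this is an illustrative sketch, not the project's actual training configuration:

```python
from peft import LoraConfig

# Mirrors the hyperparameters listed in the table above
# (r=32, alpha=32, no dropout, six target projections).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj",
                    "up_proj", "down_proj", "gate_proj"],
)
```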

Training

Data

Training used a multi-source Korean/English VDR dataset with hard negatives mined offline:

| Source | Language | Type |
| --- | --- | --- |
| NomaDamas/ko-vdr-train-public-v2.0 | Korean | Query–page pairs |
| whybe-choi/ko-vdr-train-private-v0.1 | Korean | Query–page pairs |
| vidore/colpali_train_set | English | Query–page pairs |
| tomaarsen/llamaindex-vdr-en-train-preprocessed | English | Query–page pairs |
| Ko/En text retrieval corpus | Korean + English | Text pairs |

Hard negatives were mined with Qwen/Qwen3-VL-Embedding-8B using absolute_margin=0.05 and 7 negatives per pair (top sampling).
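The margin filter works as in sentence-transformers' mine_hard_negatives: candidates scoring within absolute_margin of the positive are discarded as likely false negatives, and the top-scoring survivors are kept. A minimal sketch (toy scores; the doc IDs are hypothetical):

```python
def filter_hard_negatives(pos_score, candidates, absolute_margin=0.05, num_negatives=7):
    """Keep the highest-scoring candidates that are still clearly less
    similar than the positive: score < pos_score - absolute_margin."""
    eligible = [(doc, s) for doc, s in candidates if s < pos_score - absolute_margin]
    eligible.sort(key=lambda x: x[1], reverse=True)  # "top" sampling: hardest first
    return [doc for doc, _ in eligible[:num_negatives]]

# Toy candidate pool scored by the teacher model (Qwen3-VL-Embedding-8B here).
cands = [("d1", 0.97), ("d2", 0.88), ("d3", 0.84), ("d4", 0.60)]
print(filter_hard_negatives(pos_score=0.95, candidates=cands))
# → ['d2', 'd3', 'd4']  ("d1" scores above pos - margin and is dropped)
```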

Loss

MatryoshkaLoss(SelfGuideCachedMultipleNegativesRankingLoss): InfoNCE with cosine similarity (scale=20), cached mini-batches (mini_batch_size=4), and Matryoshka multi-granularity weighting.
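The InfoNCE core of that loss can be sketched in a few lines (toy cosine scores; the real loss additionally caches mini-batch gradients and applies Matryoshka per-dimension weighting):

```python
import math

def info_nce(pos_sim, neg_sims, scale=20.0):
    """InfoNCE: -log softmax of the scaled positive score against
    the positive plus all in-batch negatives."""
    logits = [scale * pos_sim] + [scale * s for s in neg_sims]
    m = max(logits)  # stabilize the softmax
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - scale * pos_sim  # -log p(positive)

# A well-separated positive yields a near-zero loss; with scale=20,
# even modest cosine gaps become sharp after scaling.
print(round(info_nce(0.9, [0.3, 0.2, 0.1]), 4))
```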

Evaluation

Task abbreviations

| Short | MTEB task |
| --- | --- |
| SDS-T2IT | SDSKoPubVDRT2ITRetrieval |
| SDS-T2I | SDSKoPubVDRT2IRetrieval |
| KV-Cyber | KoVidore2CybersecurityRetrieval |
| KV-Econ | KoVidore2EconomicRetrieval |
| KV-Energy | KoVidore2EnergyRetrieval |
| KV-Hr | KoVidore2HrRetrieval |

Results - nDCG@10

| Rank | Model | SDS-T2IT | SDS-T2I | KV-Cyber | KV-Econ | KV-Energy | KV-Hr | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.6999 | 0.6136 | 0.6857 | 0.2008 | 0.5415 | 0.2661 | 0.5013 |
| 2 | johnandru/ko-vdr-preview (ours) | 0.6732 | 0.5623 | 0.6540 | 0.2139 | 0.5061 | 0.2975 | 0.4845 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.6605 | 0.2923 | 0.5359 | 0.1246 | 0.3565 | 0.1498 | 0.3533 |
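For reference, nDCG@10 with binary relevance (as used in these MTEB retrieval tasks) can be computed as follows; a minimal sketch with hypothetical page IDs:

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the DCG
    of an ideal ranking that puts all relevant docs first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

# One relevant page retrieved at rank 2.
print(round(ndcg_at_k(["p7", "p3", "p9"], {"p3"}), 4))  # → 0.6309
```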

Results - Recall@10

| Rank | Model | SDS-T2IT | SDS-T2I | KV-Cyber | KV-Econ | KV-Energy | KV-Hr | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen/Qwen3-VL-Embedding-8B | 0.9033 | 0.7817 | 0.7527 | 0.2975 | 0.6059 | 0.3433 | 0.6141 |
| 2 | johnandru/ko-vdr-preview (ours) | 0.8533 | 0.7500 | 0.7538 | 0.2868 | 0.5940 | 0.3847 | 0.6038 |
| 3 | Qwen/Qwen3-VL-Embedding-2B | 0.8650 | 0.4317 | 0.6012 | 0.1858 | 0.4166 | 0.1962 | 0.4494 |

Notes

  • All Qwen3-VL-Embedding family models were loaded with max_pixels = 1280 * 28 * 28, bf16, and flash-attention-2.
  • Prompt usage: the Qwen3-VL-Embedding 2B/8B baselines and our LoRA fine-tune all use the training prompt "Represent the user's input." (matches train.py).
  • The LoRA fine-tune uses a peft 0.19.1 workaround in loader.py to inject lora_B weights, since transformers 5.5.4 silently dropped them in from_pretrained for headless models (see PR huggingface/transformers#45428).