SigLIP 2 - Fine-tuned for Spectrum Icons

This repository hosts a fine-tuned checkpoint derived from google/siglip2-base-patch16-naflex. The model keeps the SigLIP 2 architecture and tokenizer of the base checkpoint and is optimized for image-text retrieval and caption alignment on Spectrum iconography.

Model Sources

  • Base model: google/siglip2-base-patch16-naflex

Training Data

  • Spectrum icon set captions (internal).

Training Configuration

Phase 1

  • num_train_epochs: 32.0
  • learning_rate: 3e-05
  • per_device_train_batch_size: 144
  • gradient_accumulation_steps: 1
  • warmup_ratio: 0.05
  • weight_decay: 0.05
  • save_strategy: steps
  • eval_strategy: steps

Phase 2

  • num_train_epochs: 8.0
  • learning_rate: 1e-05
  • per_device_train_batch_size: 144
  • gradient_accumulation_steps: 1
  • warmup_ratio: 0.02
  • weight_decay: 0.05
  • save_strategy: steps
  • eval_strategy: steps
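
For reference, the listed hyperparameters map directly onto a Hugging Face TrainingArguments object. A minimal sketch, assuming the standard transformers Trainer was used for fine-tuning (the output_dir is hypothetical; the actual training script is not published):

from transformers import TrainingArguments

# Phase 1 as listed above; Phase 2 reuses the same fields with
# num_train_epochs=8.0, learning_rate=1e-05, warmup_ratio=0.02.
training_args = TrainingArguments(
    output_dir="siglip2-spectrum-icons-phase1",  # hypothetical path
    num_train_epochs=32.0,
    learning_rate=3e-5,
    per_device_train_batch_size=144,
    gradient_accumulation_steps=1,
    warmup_ratio=0.05,
    weight_decay=0.05,
    save_strategy="steps",
    eval_strategy="steps",
)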

How to Use

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image

# Load the fine-tuned checkpoint and its processor.
processor = AutoProcessor.from_pretrained("JianLiao/siglip2-spectrum-icons-naflex", use_fast=False)
model = AutoModel.from_pretrained("JianLiao/siglip2-spectrum-icons-naflex", dtype=torch.float16, attn_implementation="sdpa")

# Score one icon against several candidate captions.
image = Image.open("./image.png").convert("RGB")
inputs = processor(
    text=[
        "display forecast",
        "Crystal ball with a small sparkle",
        "show prediction",
        "Minimalist fortune-telling orb on a stand",
        "Monochrome magic globe with star accent",
    ],
    images=[image],
    return_tensors="pt",
    padding="max_length",
    max_num_patches=256,
)

with torch.no_grad():
    outputs = model(**inputs)

# Pool, L2-normalize, and compare embeddings via cosine similarity.
image_embeds = outputs.vision_model_output.pooler_output
text_embeds = outputs.text_model_output.pooler_output
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = text_embeds @ image_embeds.T
print(similarity)

CPU example output:

tensor([[0.1677],
        [0.0732],
        [0.1676],
        [0.1084],
        [0.1381]], dtype=torch.float16)

Captions 1 and 3 ("display forecast", "show prediction") rank highest for the icon; caption 5 remains competitive while staying descriptive. Sample icon: crystal-ball
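
Because SigLIP 2 is trained with a sigmoid loss, match scores are better read as independent per-pair probabilities than as a softmax ranking. A minimal follow-up sketch, assuming the checkpoint exposes the standard SigLIP 2 output fields (logits_per_text folds in the learned temperature and bias):

# SigLIP's sigmoid objective yields an independent match probability
# per text-image pair, rather than a distribution over candidates.
logits_per_text = outputs.logits_per_text  # shape: (num_texts, num_images)
probs = torch.sigmoid(logits_per_text)
print(probs)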

Limitations

  • Tuned for icon imagery; performance on natural images is not evaluated.
  • Captions are domain-specific and concise; long-form text may not align well.

Intended Use

  • Icon search and retrieval: rank Spectrum-style icons by text queries (design intent or UI labels).
  • Caption verification: check alignment between proposed captions and icon visuals in QA pipelines.
  • Embedding export: produce text/image embeddings for downstream vector search in design tooling (see the sketch below).
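
For the embedding-export use case, the dedicated feature heads avoid running both towers when only one modality is needed. A minimal sketch, assuming the standard Siglip2Model get_text_features/get_image_features API and the same hypothetical image path as above:

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("JianLiao/siglip2-spectrum-icons-naflex", use_fast=False)
model = AutoModel.from_pretrained("JianLiao/siglip2-spectrum-icons-naflex")

# Encode each modality separately, e.g. when populating a vector database.
text_inputs = processor(text=["display forecast"], return_tensors="pt", padding="max_length")
image_inputs = processor(images=[Image.open("./image.png").convert("RGB")], return_tensors="pt", max_num_patches=256)

with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
    image_embeds = model.get_image_features(**image_inputs)

# L2-normalize before indexing so cosine similarity is a plain dot product.
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)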

Changelog

  • 2025-11-26: Initial upload fine-tuned from google/siglip2-base-patch16-naflex.