Instructions to use Xerv-AI/tarn with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Xerv-AI/tarn with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Xerv-AI/tarn")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Xerv-AI/tarn")
model = AutoModelForImageTextToText.from_pretrained("Xerv-AI/tarn")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
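
The slice in the final `print` call is there because `generate` returns the prompt tokens followed by the newly generated ones for this kind of model. A minimal sketch with plain Python lists shows the same indexing; the token ids are made up for illustration and do not come from the real tokenizer:

```python
# Hypothetical token ids; real ids come from the processor and the model.
prompt_ids = [101, 7592, 2088, 102]          # stands in for inputs["input_ids"][0]
generated = prompt_ids + [2023, 2003, 1037]  # stands in for outputs[0]: prompt + new tokens

# Same indexing idea as outputs[0][inputs["input_ids"].shape[-1]:]
new_tokens = generated[len(prompt_ids):]
print(new_tokens)
```

Only `new_tokens` would be passed to the decoder, so the printed answer does not repeat the prompt.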

Local Apps

vLLM

How to use Xerv-AI/tarn with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Xerv-AI/tarn"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
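
The same request can be assembled from Python with only the standard library. This is a sketch that mirrors the curl payload above; the helper name `build_chat_request` is hypothetical, and the request is built but not sent so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(model, prompt, image_url, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "Xerv-AI/tarn",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```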

Use Docker

docker model run hf.co/Xerv-AI/tarn

SGLang

How to use Xerv-AI/tarn with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Xerv-AI/tarn" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
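
Since the endpoint is OpenAI-compatible, the reply text should sit under `choices[0].message.content` in the JSON response. The sketch below extracts it; `extract_reply` is a hypothetical helper, and the sample dictionary is hand-written to show the shape, not real model output:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat-completions response body."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample shaped like an OpenAI-style response (not real model output).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<model reply>"},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample))
```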

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Xerv-AI/tarn" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Unsloth Studio

How to use Xerv-AI/tarn with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Xerv-AI/tarn to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Xerv-AI/tarn to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Xerv-AI/tarn to start chatting

Load model with FastModel

# Install first (shell command): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="Xerv-AI/tarn",
    max_seq_length=2048,
)

Docker Model Runner
How to use Xerv-AI/tarn with Docker Model Runner:
```
docker model run hf.co/Xerv-AI/tarn
```

🌌 tarn (tarn-2b-vision-reasoning)

Developed by Xerv-AI, `tarn` is a compact 2-billion-parameter multimodal vision-language model built on the Qwen 3.5 VL architecture. It combines visual perception with chain-of-thought reasoning and targets resource-constrained hardware, local deployments, and high-throughput streaming services that need deep contextual visual comprehension.

📋 Table of Contents

Model Overview
Intended Architectural Uses & Scope
Memory & VRAM Footprint Benchmarks
Step-by-Step Google Colab Implementation
Streaming & Production Pipeline Setup
Training Topology & Data Lineage
Ethical Guardrails & Systemic Limitations

🧠 Model Overview

Unlike basic classification vision systems, `tarn` incorporates a native Chain-of-Thought (CoT) reasoning matrix. When faced with an image-text query, it executes an internal multi-layered analytical pass to self-correct and map spatial elements before formatting its final output.

Key Technical Enhancements

Architectural Blueprint: Fine-tuned via Low-Rank Adaptation (LoRA) over the `unsloth/Qwen3.5-2B` base framework, maintaining architectural elasticity.
Dynamic Resolution Windowing: Supports bounded image tokenization via adjustable `min_pixels` and `max_pixels` scaling layers, eliminating sudden GPU out-of-memory (OOM) faults.
Advanced Token Processing: Utilizes specialized multimodal token sequence embeddings to seamlessly align image feature vectors into the foundational language space.

🎯 Intended Architectural Uses & Scope

Recommended Core Tasks

Visual Problem-Solving: Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
Nuanced Image-Text Analysis: Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
Complex Physics & Abstract Querying: Responding to interleaved queries that require text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).

Out-of-Scope Deployments

Medical diagnostic automation without expert human verification loops.
Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
Generation of biometric verification data or high-stakes demographic filtering.

📊 Memory & VRAM Footprint Benchmarks

Because Qwen 3.5's vision encoder produces more patches as image resolution grows, generation without pixel bounds can cause extreme VRAM spikes. `tarn` addresses this by bounding image resolution with dynamic spatial constraints (`min_pixels` and `max_pixels`).

| Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
|---|---|---|---|---|
| Float16 (`fp16`) | None | ~4.55 GB | ~14.6 GB (OOM Risk) | ~9.83 GB (Safe for T4) |
| Int4 (`4-bit`) | BitsAndBytes | ~1.85 GB | ~6.20 GB | ~3.95 GB |
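
The figures above can be turned into a small lookup for choosing a precision given the free VRAM on a device. This is only a sketch: the helper `pick_precision` and the 1 GB `headroom_gb` margin are assumptions added for illustration, and the numbers are copied from the table rather than measured here:

```python
# Figures copied from the table above, in GB.
VRAM_GB = {
    "fp16": {"load": 4.55, "unbounded": 14.6, "bounded": 9.83},
    "int4": {"load": 1.85, "unbounded": 6.20, "bounded": 3.95},
}


def pick_precision(free_vram_gb, bounded=True, headroom_gb=1.0):
    """Return the highest-precision option whose peak use fits, or None."""
    column = "bounded" if bounded else "unbounded"
    for precision in ("fp16", "int4"):  # prefer higher precision first
        if VRAM_GB[precision][column] + headroom_gb <= free_vram_gb:
            return precision
    return None


print(pick_precision(15.0))                 # T4-class budget with bounded pixels
print(pick_precision(15.0, bounded=False))  # same budget without pixel bounds
print(pick_precision(4.0))                  # very small budget
```

Under these assumptions a 15 GB budget fits `fp16` only when the pixel bounds are applied, which matches the "OOM Risk" note in the table.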

💡 Core Recommendation: For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15GB VRAM), bound the image resolution between $256 \times 28 \times 28$ and $512 \times 28 \times 28$ pixels (the `min_pixels` and `max_pixels` settings) to keep peak VRAM use stable and predictable.
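
To make the pixel bounds concrete, the sketch below computes the two budgets and rescales an image size so its area falls inside them. The rounding rule (dimensions snapped to multiples of 28, aspect ratio roughly preserved) is an assumption chosen for illustration and may differ from what the actual processor does; `fit_to_budget` is a hypothetical helper:

```python
import math

PATCH = 28  # assumed side length in pixels, taken from the 28 x 28 factor above


def pixel_budget(num_patches: int, patch: int = PATCH) -> int:
    """Number of pixels covered by num_patches square patches."""
    return num_patches * patch * patch


MIN_PIXELS = pixel_budget(256)  # 200704
MAX_PIXELS = pixel_budget(512)  # 401408


def fit_to_budget(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS, patch=PATCH):
    """Return (width, height) in multiples of patch with area inside the budget."""
    w = max(patch, round(width / patch) * patch)
    h = max(patch, round(height / patch) * patch)
    if w * h > max_pixels:
        # Shrink both sides by the same factor, rounding down to stay under the cap.
        scale = math.sqrt((width * height) / max_pixels)
        w = max(patch, math.floor(width / scale / patch) * patch)
        h = max(patch, math.floor(height / scale / patch) * patch)
    elif w * h < min_pixels:
        # Grow both sides by the same factor, rounding up to reach the floor.
        scale = math.sqrt(min_pixels / (width * height))
        w = math.ceil(width * scale / patch) * patch
        h = math.ceil(height * scale / patch) * patch
    return w, h


w, h = fit_to_budget(1920, 1080)
print(w, h, (w // PATCH) * (h // PATCH))  # resized size and patch count
```

For a 1920 x 1080 input the result stays at or below 512 patches, which is the point of the upper bound. Extreme aspect ratios are not handled carefully in this sketch.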

🚀 Step-by-Step Google Colab Implementation

To verify and run this model within a standard hardware sandbox environment, execute the blocks below.

1. Environment Initialization

Ensure your runtime is pointing to a hardware accelerator backend (T4 GPU). Install the bleeding-edge architecture updates from source:

# Force-install source versions supporting the qwen3_5 structural configuration
pip install -q git+https://github.com/huggingface/transformers.git
pip install -q accelerate bitsandbytes torchvision qwen-vl-utils

Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.

2. Loading the Model Weights

import torch
from transformers import pipeline
model_id = "Xerv-AI/tarn"
print("Initializing tarn architecture pipelines...")
pipe = pipeline(
    "image-text-to-text", 
    model=model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)
print("tarn is loaded and standing by.")
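
The benchmark table lists an Int4 BitsAndBytes configuration, but the loading code above only covers `fp16`. One possible way to load 4-bit weights is sketched below. The NF4 quantization type is an assumption (the card does not state which 4-bit format was benchmarked), `load_tarn_4bit` is a hypothetical helper, and the function is only defined here, not called, because it needs a CUDA GPU with `bitsandbytes` installed:

```python
def load_tarn_4bit(model_id="Xerv-AI/tarn"):
    """Build the pipeline with 4-bit weights; needs a CUDA GPU and bitsandbytes."""
    # Imports are local so this snippet can be read and imported without the heavy dependencies.
    import torch
    from transformers import BitsAndBytesConfig, pipeline

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4, not stated in this card
        bnb_4bit_compute_dtype=torch.float16,  # matches the fp16 setting used above
    )
    return pipeline(
        "image-text-to-text",
        model=model_id,
        model_kwargs={"quantization_config": quant_config},
        device_map="auto",
    )
```

Calling `pipe = load_tarn_4bit()` would then stand in for the `fp16` pipeline construction.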

⚡ Streaming & Production Pipeline Setup

For real-time user-facing conversational products, buffering the full generation before display hurts user experience. Use the `TextStreamer` implementation below to stream output token by token to standard output:

from transformers import TextStreamer
# Attach the text streamer interface to the pipeline core
streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
# Build a composite multimodal user payload
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image", 
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
            },
            {
                "type": "text", 
                "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
            }
        ]
    },
]
print("=== Initiating Real-Time Telemetry Stream ===")
outputs = pipe(
    text=messages, 
    max_new_tokens=1024, # Extend depth capability safely
    min_pixels=256*28*28, # Set baseline feature extraction map
    max_pixels=512*28*28, # Cap peak VRAM consumption upper bound
    generate_kwargs={"streamer": streamer}
)
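
`TextStreamer` prints to standard output, which is awkward when the text has to go somewhere else, such as a web response. Streamers in `transformers` are driven through `put` (called with the prompt first, then with each batch of new tokens) and `end`. The sketch below mimics that interface with a stand-in class that collects text instead of printing it. It is driven by a hand-written loop and a toy vocabulary rather than by `model.generate`, and `CollectingStreamer` is a hypothetical name:

```python
class CollectingStreamer:
    """Duck-typed stand-in for a streamer: collects text instead of printing it."""

    def __init__(self, decode, skip_prompt=True):
        self.decode = decode            # callable mapping token ids to text
        self.skip_prompt = skip_prompt
        self.chunks = []
        self.finished = False
        self._seen_prompt = False

    def put(self, token_ids):
        # The first call carries the prompt; skip it when requested.
        if self.skip_prompt and not self._seen_prompt:
            self._seen_prompt = True
            return
        self._seen_prompt = True
        self.chunks.append(self.decode(token_ids))

    def end(self):
        # Signals that generation has finished.
        self.finished = True


# Toy vocabulary and hand-driven loop standing in for model.generate.
vocab = {0: "<prompt>", 1: "A ", 2: "small ", 3: "lake."}
streamer = CollectingStreamer(lambda ids: "".join(vocab[i] for i in ids))
streamer.put([0])          # prompt tokens, skipped because skip_prompt=True
for token in (1, 2, 3):    # one put() call per generated token
    streamer.put([token])
streamer.end()
print("".join(streamer.chunks))
```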

🧬 Training Topology & Data Lineage

The training protocol of tarn was heavily engineered to break the paradigm of superficial visual question answering. It is optimized through a two-stage distillation and alignment process.

1. Dataset Dependencies

xerv-ai/tart (344k records): Provides core alignments on basic physics, electromagnetism, electrostatics, and real-world everyday sensory scenarios. It grounds the model's factual accuracy in high-density core domains.
Phase-Technologies/claude-reasoning-super (47.8k records): Instructs the model's internal decoder to prioritize complex hidden steps. Instead of outputting an immediately available guess, it structures the response using logical markdown hierarchies, self-corrections, and explicit calculations.

2. Hyperparameter Settings

Optimizer: AdamW (Learning Rate: $2 \times 10^{-4}$)
Weight Decay Coefficient: 0.01
LR Scheduler: Linear warmup followed by cosine decay.
LoRA Rank ($r$): 64
LoRA Alpha ($\alpha$): 16
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
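
The settings above can be made concrete with a short sketch. It assumes the common LoRA convention in which the adapter update is scaled by $\alpha / r$, and it uses a textbook linear-warmup plus cosine-decay formula. The step counts are hypothetical because the card does not state the warmup length or total training steps:

```python
import math

PEAK_LR = 2e-4
LORA_R, LORA_ALPHA = 64, 16
LORA_SCALE = LORA_ALPHA / LORA_R  # 0.25 under the alpha / r convention (assumed)


def lr_at(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Learning rate with linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule length: 1000 steps with 100 warmup steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100))
print("LoRA scale:", LORA_SCALE)
```

With rank 64 and alpha 16 the scale factor is 0.25, so the adapter update is damped relative to the more common setting where alpha is equal to or larger than the rank.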

🛡️ Ethical Guardrails & Systemic Limitations

Hallucination Vectors: As with all generative vision systems, compressing multi-dimensional visual input into discrete text can cause hallucinations when the image resolution is constrained too low (e.g., misreading small fonts or densely packed numbers).
Bias Propagation: `tarn` can inherit societal, technical, and taxonomic biases present in the open-source web crawls that form its foundations.
Sycophancy Risks: Due to alignment patterns, if a prompt aggressively asserts a falsehood ("Why is there a dog in this picture of an ocean?"), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.

📜 Citation & Attributions

@misc{tarn2026,
  author       = {Soham Pal and the Xerv-AI Research Team},
  title        = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
  year         = {2026},
  publisher    = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
}

If you integrate tarn or your custom structural derivatives into enterprise frameworks, please attribute Xerv-AI accordingly. For additional questions or model contributions, open a pull request directly in the community repository channel.

Downloads last month: 62

Safetensors
Model size: 2B params
Tensor type: F32, BF16

Model tree for Xerv-AI/tarn

Base model: Qwen/Qwen3.5-2B-Base
Finetuned: Qwen/Qwen3.5-2B
Finetuned: unsloth/Qwen3.5-2B
Finetuned (111): this model
