🌌 tarn (tarn-2b-vision-reasoning)

Developed by Xerv-AI, tarn is a compact 2-billion-parameter multimodal vision-language model built upon the Qwen 3.5 VL architecture. By combining visual perception with chain-of-thought reasoning, tarn targets resource-constrained hardware, local deployments, and high-throughput streaming services that need contextual visual comprehension.

📋 Table of Contents

  1. Model Overview
  2. Intended Architectural Uses & Scope
  3. Memory & VRAM Footprint Benchmarks
  4. Step-by-Step Google Colab Implementation
  5. Streaming & Production Pipeline Setup
  6. Training Topology & Data Lineage
  7. Ethical Guardrails & Systemic Limitations
  8. Citation & Attributions

🧠 Model Overview

Unlike basic classification vision systems, tarn incorporates a native Chain-of-Thought (CoT) reasoning matrix. When faced with an image-text query, it executes an internal multi-layered analytical pass to self-correct and map spatial elements before formatting its final output.

Key Technical Enhancements

  • Architectural Blueprint: Fine-tuned via Low-Rank Adaptation (LoRA) over the unsloth/Qwen3.5-2B base framework, maintaining architectural elasticity.
  • Dynamic Resolution Windowing: Supports bounded image tokenization via adjustable min_pixels and max_pixels scaling layers, eliminating sudden GPU out-of-memory (OOM) faults.
  • Advanced Token Processing: Utilizes specialized multimodal token sequence embeddings to seamlessly align image feature vectors into the foundational language space.

🎯 Intended Architectural Uses & Scope

Recommended Core Tasks

  • Visual Problem-Solving: Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
  • Nuanced Image-Text Analysis: Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
  • Complex Physics & Abstract Querying: Responding to interleaved queries that require text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).

Out-of-Scope Deployments

  • Medical diagnostic automation without expert human verification loops.
  • Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
  • Generation of biometric verification data or high-stakes demographic filtering.

📊 Memory & VRAM Footprint Benchmarks

Because the number of vision patches in Qwen 3.5 grows with image resolution, unconstrained inference on large images can cause sharp VRAM spikes. tarn mitigates this by bounding the image resolution through the min_pixels and max_pixels settings.

| Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
|---|---|---|---|---|
| Float16 (fp16) | None | ~4.55 GB | ~14.6 GB (OOM Risk) | ~9.83 GB (Safe for T4) |
| Int4 (4-bit) | BitsAndBytes | ~1.85 GB | ~6.20 GB | ~3.95 GB |

💡 Core Recommendation: For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15 GB VRAM), set min_pixels to $256 \times 28 \times 28$ and max_pixels to $512 \times 28 \times 28$ so that peak VRAM usage stays within a predictable bound.
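
To make the pixel budget concrete, the sketch below works out how an image might be rescaled to fit between the two limits. It is a minimal illustration rather than the processor's actual code: the helper name `fit_to_pixel_budget` is hypothetical, and it assumes that image sides are rounded to multiples of 28 and that one 28×28 block corresponds to roughly one visual token, as the $N \times 28 \times 28$ notation suggests. The real Qwen 3.5 processor may use a different patch size or rounding rule.

```python
import math

MIN_PIXELS = 256 * 28 * 28  # 200,704 pixels (lower bound from the recommendation)
MAX_PIXELS = 512 * 28 * 28  # 401,408 pixels (upper bound from the recommendation)


def fit_to_pixel_budget(width, height, min_pixels=MIN_PIXELS,
                        max_pixels=MAX_PIXELS, factor=28):
    """Return (new_width, new_height), both multiples of `factor`, whose
    area lies within [min_pixels, max_pixels] while roughly keeping the
    aspect ratio. Hypothetical helper, not part of any library."""
    new_w = max(factor, round(width / factor) * factor)
    new_h = max(factor, round(height / factor) * factor)
    if new_w * new_h > max_pixels:
        # Shrink: scale both sides down, rounding down to stay under the cap
        scale = math.sqrt((width * height) / max_pixels)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
    elif new_w * new_h < min_pixels:
        # Grow: scale both sides up, rounding up to reach the floor
        scale = math.sqrt(min_pixels / (width * height))
        new_w = math.ceil(width * scale / factor) * factor
        new_h = math.ceil(height * scale / factor) * factor
    return new_w, new_h


w, h = fit_to_pixel_budget(1920, 1080)
blocks = (w // 28) * (h // 28)  # number of 28x28 blocks after resizing
print(w, h, blocks)
```

Under these assumptions, a 1920×1080 frame is reduced to 840×448, which is 480 blocks of 28×28 pixels and therefore below the 512-block cap. Very elongated images are an edge case this sketch does not handle carefully.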


🚀 Step-by-Step Google Colab Implementation

To run this model in a standard Google Colab environment, execute the blocks below in order.

1. Environment Initialization

Ensure your runtime uses a GPU hardware accelerator (T4 GPU). Install the latest transformers from source, which is needed for the qwen3_5 configuration:

# Install transformers from source for qwen3_5 configuration support
pip install -q git+https://github.com/huggingface/transformers.git
pip install -q accelerate bitsandbytes torchvision qwen-vl-utils

Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.
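
After the restart, a quick check that the packages are importable can save a confusing failure later. The snippet below is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the import names are assumed to match the pip packages listed above (with `qwen-vl-utils` imported as `qwen_vl_utils`).

```python
import importlib.util

# Import names for the packages installed above; note that the pip name
# "qwen-vl-utils" is imported as "qwen_vl_utils".
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes",
            "torchvision", "qwen_vl_utils"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

If the list is not empty, rerun the install cell and restart the session again before loading the model.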

2. Loading the Model Weights

import torch
from transformers import pipeline

model_id = "Xerv-AI/tarn"

print("Initializing tarn architecture pipelines...")
# Load the weights in fp16 and let accelerate place them on the available GPU
pipe = pipeline(
    "image-text-to-text",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)
print("tarn is loaded and standing by.")
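
The benchmark table lists an Int4 BitsAndBytes configuration, while the loading step above covers fp16 only. One possible way to load the model in 4-bit is sketched below. It assumes the checkpoint works with bitsandbytes quantization passed through the pipeline's `model_kwargs`; the helper name `load_tarn_4bit` is hypothetical, and the NF4 quantization type is an assumption, since the card does not say which 4-bit format was benchmarked.

```python
def load_tarn_4bit(model_id="Xerv-AI/tarn"):
    """Build an image-text-to-text pipeline with 4-bit weights.

    Hypothetical helper; imports are done lazily so that defining the
    function does not require a GPU runtime.
    """
    import torch
    from transformers import BitsAndBytesConfig, pipeline

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4-bit
        bnb_4bit_quant_type="nf4",             # assumption: NF4 quantization
        bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 (T4-friendly)
    )
    return pipeline(
        "image-text-to-text",
        model=model_id,
        model_kwargs={"quantization_config": quant_config},
        device_map="auto",
    )


# Usage (downloads the weights and needs a CUDA GPU):
# pipe = load_tarn_4bit()
```

The returned object can be used in place of the fp16 pipeline; actual memory use may differ from the table depending on the quantization settings.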

⚡ Streaming & Production Pipeline Setup

For real-time, user-facing conversational products, waiting for the full response before displaying it hurts the user experience. Use the TextStreamer setup below to print the output token by token to standard output as it is generated:

from transformers import TextStreamer

# Create a streamer that prints decoded tokens to stdout as they are generated
streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)

# Build a multimodal user message (one image plus one text prompt)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image", 
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
            },
            {
                "type": "text", 
                "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
            }
        ]
    },
]
print("=== Initiating Real-Time Telemetry Stream ===")
outputs = pipe(
    text=messages, 
    max_new_tokens=1024, # Leave room for the reasoning trace and the final answer
    min_pixels=256*28*28, # Lower bound on the image pixel budget
    max_pixels=512*28*28, # Upper bound on the image pixel budget; caps peak VRAM
    generate_kwargs={"streamer": streamer}
)
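
TextStreamer prints to standard output, which suits notebooks. A web service usually needs the chunks as an iterator it can forward to the client. The sketch below shows one common pattern under stated assumptions: it uses `TextIteratorStreamer` from transformers and runs the pipeline call in a background thread. The helper name `stream_reply` is hypothetical, and it assumes the pipeline accepts the same call arguments as in the example above.

```python
import threading


def stream_reply(pipe, messages, streamer=None, **call_kwargs):
    """Yield generated text chunks while the pipeline runs in a worker thread.

    Hypothetical helper. `streamer` defaults to a transformers
    TextIteratorStreamer built from the pipeline's tokenizer.
    """
    if streamer is None:
        from transformers import TextIteratorStreamer
        streamer = TextIteratorStreamer(
            pipe.tokenizer, skip_prompt=True, skip_special_tokens=True
        )
    worker = threading.Thread(
        target=pipe,
        kwargs={"text": messages,
                "generate_kwargs": {"streamer": streamer},
                **call_kwargs},
    )
    worker.start()
    for chunk in streamer:  # blocks until the next chunk is available
        yield chunk
    worker.join()


# Usage with the `pipe` and `messages` objects defined earlier:
# for chunk in stream_reply(pipe, messages, max_new_tokens=1024,
#                           min_pixels=256*28*28, max_pixels=512*28*28):
#     print(chunk, end="", flush=True)
```

If generation fails inside the worker thread, the iterator would wait indefinitely; `TextIteratorStreamer` accepts a `timeout` argument that can be set to avoid this.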

🧬 Training Topology & Data Lineage

The training protocol of tarn was designed to move beyond superficial visual question answering. The model is optimized through a two-stage distillation and alignment process.

1. Dataset Dependencies

  • xerv-ai/tart (344k records): Provides core alignment data on basic physics, electromagnetism, electrostatics, and everyday real-world sensory scenarios. It grounds the model's factual accuracy in these core domains.
  • Phase-Technologies/claude-reasoning-super (47.8k records): Teaches the model to work through intermediate reasoning steps. Instead of outputting an immediate guess, the model structures its response using logical markdown hierarchies, self-corrections, and explicit calculations.

2. Hyperparameter Settings

  • Optimizer: AdamW (Learning Rate: $2 \times 10^{-4}$)
  • Weight Decay: 0.01
  • LR Scheduler: Linear warmup followed by cosine decay.
  • LoRA Rank (r): 64
  • LoRA Alpha ($\alpha$): 16
  • Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
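
To show what these settings imply numerically, here is a minimal sketch of the learning-rate schedule and the LoRA scaling factor. The peak learning rate, rank, and alpha come from the list above. The warmup length, total step count, and decay to exactly zero are hypothetical, since the card does not state them, and the scaling assumes standard LoRA ($\alpha / r$) rather than a variant such as rank-stabilized LoRA.

```python
import math

PEAK_LR = 2e-4      # from the list above
LORA_RANK = 64      # from the list above
LORA_ALPHA = 16     # from the list above
WARMUP_STEPS = 100  # hypothetical; not stated in this card
TOTAL_STEPS = 1000  # hypothetical; not stated in this card


def lora_scaling(alpha=LORA_ALPHA, rank=LORA_RANK):
    """Scaling applied to the low-rank update under standard LoRA (alpha / r)."""
    return alpha / rank


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print("LoRA scaling:", lora_scaling())
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
```

With rank 64 and alpha 16, the adapter update is multiplied by 0.25 under the standard formulation, so the low-rank correction is applied at a quarter of its raw magnitude.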

🛡️ Ethical Guardrails & Systemic Limitations

  • Hallucination Vectors: As with all generative vision systems, compressing visual information into discrete text can cause hallucinations, especially if the image resolution is constrained too low (e.g., misreading small fonts or densely packed numbers).
  • Bias Propagation: tarn can inherit societal, technical, and taxonomic biases present in the open-source web crawl data underlying its base model.
  • Sycophancy Risks: Due to alignment patterns, if a prompt aggressively asserts a falsehood ("Why is there a dog in this picture of an ocean?"), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.

📜 Citation & Attributions

@misc{tarn2026,
  author       = {Soham Pal and the Xerv-AI Research Team},
  title        = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
  year         = {2026},
  publisher    = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
}

If you integrate tarn or your own derivatives of it into enterprise products, please attribute Xerv-AI accordingly. For additional questions or model contributions, open a pull request in the community repository.
