VIPER-L1: A Family of Small Multimodal-LLMs
Note: This model is still being improved, so we recommend fine-tuning it on your own use case!
Overview
Viper-L1 is an open-source small multimodal large language model (Multimodal-LLM) designed for efficient multimodal reasoning and deployment on consumer GPUs. It is built upon the Liquid Model architecture (~1.2B parameters), enabling a powerful yet lightweight foundation for personal research, on-device applications, and internal experimentation.
Key Features
- Efficient Training & Inference: Trained on 2× H100 GPUs in about 2 days, thanks to our lightweight multimodal fusion and liquid transformer design. Inference runs smoothly even on an RTX 4070 GPU.
- Multimodal Connector (Sense Integration Module): Inspired by human perception, Viper-L1 introduces a connector that fuses signals from different sensory encoders (vision, audio, etc.), enabling deeper cross-modal alignment and improved reasoning (a minimal sketch follows this list).
- Hybrid Architecture: Combines the semantic strength of Transformers with the efficiency of Liquid Neural Networks, resulting in a compact yet expressive multimodal model.
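The exact design of the Sense Integration Module is not documented in this card; the sketch below only illustrates the common connector pattern it alludes to, where per-modality encoder features are projected into the language model's embedding space by a small MLP. The class name `SenseConnector` and the dimensions are assumptions for illustration, not the actual Viper-L1 implementation.

```python
import torch
import torch.nn as nn

class SenseConnector(nn.Module):
    """Hypothetical connector: projects per-modality encoder features into the LM embedding space."""

    def __init__(self, encoder_dim: int = 768, lm_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, encoder_feats: torch.Tensor) -> torch.Tensor:
        # encoder_feats: (batch, num_tokens, encoder_dim), e.g. patch features from the vision encoder
        # returns:       (batch, num_tokens, lm_dim) token-like embeddings for the language backbone
        return self.proj(encoder_feats)
```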
Progress
- Released: Viper-L1 model checkpoint
- Coming soon: Fully documented training and inference scripts
- Coming soon: Fully documented post-training recipes (LoRA, DPO, GRPO)
Stay tuned for our next updates on model fine-tuning and multimodal reasoning enhancements.
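Until those post-training scripts land, the snippet below sketches one common way to attach LoRA adapters to the language backbone with the Hugging Face peft library. The target module names and hyperparameters are assumptions; check the actual layer names inside ViperLMForCausalLM before reusing them.

```python
# Minimal LoRA sketch (assumption: attention projections are named "q_proj" / "v_proj";
# adjust target_modules to the real module names in ViperLMForCausalLM).
from peft import LoraConfig, get_peft_model
from model import ViperLMForCausalLM  # your local class

model = ViperLMForCausalLM.from_pretrained("path/to/viper-l1-checkpoint")  # hypothetical path

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # hypothetical layer names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights should be trainable
```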
Architecture
The overall architecture is shown below:
Main Components:
- Vision Encoder: Extracts compact visual embeddings
- Multimodal Connector: Fuses sensory inputs efficiently
- Language Backbone (LFM2-350M-based): Performs semantic reasoning and response generation
The current Viper-L1 (1.2B parameters) was trained on ~4 million images using 2× H100 GPUs for 2 days.
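For intuition, the sketch below shows how components like these are usually wired in connector-style VLMs: the vision encoder emits patch features, the connector projects them into the language model's embedding space, and the projected features replace the `<image>` placeholder token in the embedded prompt before the backbone runs. This is an illustrative assumption about the data flow, not the verified Viper-L1 forward pass.

```python
import torch

def splice_image_embeddings(
    text_embeds: torch.Tensor,    # (seq_len, lm_dim) embedded prompt tokens
    image_embeds: torch.Tensor,   # (num_patches, lm_dim) connector output for one image
    input_ids: torch.Tensor,      # (seq_len,) prompt token ids
    image_token_id: int,          # id of the <image> placeholder token
) -> torch.Tensor:
    """Replace the single <image> placeholder with the projected patch embeddings."""
    pos = (input_ids == image_token_id).nonzero(as_tuple=True)[0].item()
    return torch.cat([text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], dim=0)
```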
Benchmark Results
| Benchmark | Task | Split | Metric | Viper-L1 (CoT) |
|---|---|---|---|---|
| RealWorldQA | VQA | Test | Accuracy | 33.73% |
| Other benchmarks | VQA | Test | Accuracy | In progress |
Notes: CoT = Chain-of-Thought prompting enabled during inference. Exact generation settings (temperature, top-p, max tokens) can influence results; see the inference snippet below to replicate typical settings.
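For example, a CoT-style query can be issued by asking for step-by-step reasoning directly in the question text and decoding with settings like those above. The exact CoT prompting scheme behind the reported score is not documented here, so treat this call (which uses the generate_answer helper defined in the Usage section below) as illustrative only; the paths are placeholders.

```python
# Illustrative CoT-style call; paths are placeholders you must fill in.
answer = generate_answer(
    ckpt_dir="path/to/viper-l1-checkpoint",
    tokenizer_path="path/to/tokenizer",
    processor_path="path/to/processor",
    image_path="example.jpg",
    question="How many people are visible in this image? Let's think step by step.",
    temperature=0.2,
    top_p=0.9,
    max_new_tokens=128,
)
print(answer)
```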
Usage
To get started with inference, follow the setup in the main repository:
Viper-VLM Repository
Example inference script: infer_viper.sh
Alternatively, you can use the following functions for inference:
import os
import argparse
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor
from model import ViperLMForCausalLM # your local class
IMAGE_TOKEN_ID = 64400
def build_messages(question: str, include_image: bool = True):
    # Mirror CCDataset._format_prompt()
    user_content = ("<image> " if include_image else "") + (question or "")
    return [
        {"role": "user", "content": user_content},
        # no assistant turn here; apply_chat_template(add_generation_prompt=True) adds the assistant prefix
    ]
@torch.inference_mode()
def generate_answer(
    ckpt_dir: str,
    tokenizer_path: str,
    processor_path: str,
    image_path: str,
    question: str,
    device: str = "cuda",
    dtype: str = "bf16",
    max_new_tokens: int = 128,
    temperature: float = 0.2,
    top_p: float = 0.9,
    repetition_penalty: float = 1.05,
):
    # --- device / dtype ---
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    use_bf16 = (dtype.lower() == "bf16")
    use_fp16 = (dtype.lower() == "fp16")
    amp_dtype = torch.bfloat16 if use_bf16 else (torch.float16 if use_fp16 else torch.float32)

    # --- tokenizer / processor ---
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    # left padding is the usual choice for decoder-only generation
    if not hasattr(tokenizer, "padding_side") or tokenizer.padding_side != "left":
        tokenizer.padding_side = "left"
    processor = AutoProcessor.from_pretrained(processor_path)

    # --- model ---
    model = ViperLMForCausalLM.from_pretrained(
        ckpt_dir,
        torch_dtype=amp_dtype if device.type == "cuda" else torch.float32,
    ).to(device)
    model.eval()
    if getattr(model.config, "pad_token_id", None) is None:
        model.config.pad_token_id = tokenizer.pad_token_id

    # resolve the image token id; keep it consistent with training
    image_token_id = getattr(model.config, "image_token_id", None)
    if image_token_id is None and "<image>" in tokenizer.get_vocab():
        image_token_id = tokenizer.convert_tokens_to_ids("<image>")
    if image_token_id is None:
        image_token_id = IMAGE_TOKEN_ID  # fall back to the id used during training

    # --- text input with the SAME chat template as training ---
    messages = build_messages(question=question, include_image=True)
    enc = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # adds the assistant header the model expects before generation
        tokenize=True,
        return_tensors="pt",
    )
    if isinstance(enc, torch.Tensor):
        input_ids = enc
        attention_mask = torch.ones_like(enc, dtype=torch.long)
    else:
        input_ids = enc["input_ids"]
        attention_mask = enc.get("attention_mask")
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids, dtype=torch.long)
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    # --- image preprocessing (match training) ---
    img = Image.open(image_path).convert("RGB")
    proc = processor(images=[img], return_tensors="pt")  # pass a list, as in training
    pixel_values = proc.get("pixel_values", None)
    if pixel_values is None:
        raise ValueError("Processor did not return 'pixel_values'. Check processor_path.")
    pixel_values = pixel_values.to(device)  # (1, 3, H, W)

    # --- generate ---
    gen_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0.0,
        "temperature": max(temperature, 1e-6),
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.pad_token_id,
        # IMPORTANT: use the same argument names your model.forward saw in training
        # (rename "image_inputs" / "image_token_id" if your forward expects different names)
        "image_inputs": pixel_values,
        "image_token_id": image_token_id,
        "use_cache": False,
    }
    if device.type == "cuda" and (use_bf16 or use_fp16):
        with torch.autocast(device_type="cuda", dtype=amp_dtype):
            out = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                **gen_kwargs,
            )
    else:
        out = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **gen_kwargs,
        )

    # --- decode only the newly generated tokens ---
    generated = out[0]
    prompt_len = input_ids.size(1)
    new_tokens = generated[prompt_len:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
    return answer.strip()
if __name__ == "__main__":
    # Fill in the paths and question below before running.
    ckpt_dir = ""
    tokenizer_path = ""
    processor_path = ""
    image_path = ""
    question = ""
    device = "cuda"

    ans = generate_answer(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        processor_path=processor_path,
        image_path=image_path,
        question=question,
        device=device,
        dtype="bf16",  # generate_answer recognizes "bf16" / "fp16"; anything else falls back to fp32
        max_new_tokens=128,
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.0,
    )
    print("\n====== Answer ======\n")
    print(ans)
Acknowledgements
We gratefully thank the following foundational projects for inspiring and enabling our research:
- Liquid Model: Base architecture for dynamic neural computation
- SigLIP: Vision encoder powering multimodal understanding
Their open-source contributions have made Viper-L1 possible.
Contact
If you're interested in collaboration or research discussions, contact us or open an issue in the repository.
Model tree for huyquoctrinh/VIPER-L1-Prepend-CoT
Base model: LiquidAI/LFM2-350M