LFM2-VL-1.6B-jp (Japanese)

Model Description

LFM2-VL-1.6B-jp is a Japanese fine-tuned variant of LiquidAI/LFM2-VL-1.6B, optimized for Japanese vision-language tasks. This model maintains the efficiency and performance characteristics of the original LFM2-VL 1.6B architecture while specializing in Japanese language understanding and image description. With 1.6B parameters, this model offers enhanced capabilities compared to the 450M variant while remaining lightweight and suitable for edge deployment.

  • Developed by: Alfaxad
  • Base Model: LiquidAI/LFM2-VL-1.6B
  • Model type: Vision-Language Model (Multimodal)
  • Language: Japanese (日本語)
  • License: LFM Open License v1.0
  • Finetuned from: LiquidAI/LFM2-VL-1.6B (1.6B parameters)

Key Features

  • Japanese Language Support: Specialized for Japanese image understanding and description tasks
  • Enhanced Capabilities: 1.6B parameters provide improved reasoning and generation quality
  • Advanced Vision Encoder: SigLIP2 NaFlex shape-optimized (400M) for better visual understanding
  • Low Latency: 2× faster inference speed on GPUs compared to similar-sized VLMs
  • Multi-turn Conversations: Trained on conversational data for interactive vision-language tasks
  • Native Resolution Processing: Handles images up to 512×512 pixels without upscaling, with intelligent tiling for larger images

Model Details

Property | Value
Parameters (LM only) | 1.2B
Vision encoder | SigLIP2 NaFlex shape-optimized (400M)
Total parameters | ~1.6B
Backbone layers | hybrid conv+attention
Context (text) | 32,768 tokens
Image tokens | dynamic, user-tunable
Vocab size | 65,536
Precision | bfloat16

Training Data

The model was fine-tuned on approximately 98,000 multi-turn conversational samples from:

  • Dataset: llm-jp/ja-vg-vqa-conversation (see the loading snippet after this list)
  • Content: Japanese visual question-answering conversations
  • Format: Multi-turn dialogues with image context
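
The dataset is publicly available on the Hugging Face Hub and can be inspected directly. A quick look is sketched below; split names and column layout are whatever the dataset itself publishes, not something this card defines.

from datasets import load_dataset

# Quick look at the fine-tuning data. Split names and features are defined
# by the dataset on the Hub, so print them rather than assuming a layout.
ds = load_dataset("llm-jp/ja-vg-vqa-conversation")
print(ds)  # shows available splits and their features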

Intended Use

Primary Use Cases

  • Japanese image captioning and detailed description
  • Visual question answering in Japanese with enhanced reasoning
  • Multi-turn conversations about images in Japanese
  • Japanese document understanding and OCR tasks
  • Complex visual reasoning tasks in Japanese
  • Edge AI applications requiring Japanese language support

Recommended Applications

  • Japanese e-commerce product analysis and description
  • Japanese accessibility tools for visual content
  • Japanese educational applications requiring visual understanding
  • Japanese content moderation and detailed analysis
  • Japanese chatbots with advanced visual understanding
  • Japanese document processing and information extraction

How to Use

Installation

pip install -U transformers pillow

Basic Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Alfaxad/LFM2-VL-1.6B-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation in Japanese
image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "この画像について詳しく説明してください。"},
        ],
    },
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
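
Note that decoding the full output sequence also returns the prompt text. If you only want the newly generated answer, a common pattern is to slice off the prompt tokens before decoding (reusing inputs and outputs from the snippet above):

# Keep only the newly generated tokens; the prompt occupies the first
# input_ids.shape[1] positions of each output sequence.
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0]
print(answer)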

Multi-turn Conversation Example

# Multi-turn conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "この画像には何が写っていますか?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "この画像には赤い車が道路に駐車されています。"},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "車のメーカーはわかりますか?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)

Recommended Generation Parameters

  • Temperature: 0.1
  • min_p: 0.15
  • repetition_penalty: 1.05
  • min_image_tokens: 64
  • max_image_tokens: 256
  • do_image_splitting: True
  • max_new_tokens: 128-512 (depending on task complexity)
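
These values can be wired into the earlier snippets as shown below. The sampling parameters go to model.generate; the image-token settings are assumed to be accepted when loading the processor, following the base LFM2-VL card, so verify against your transformers version if they are not picked up there.

# Image-token settings: assumed to be processor-level options per the base
# LFM2-VL card; adjust if your transformers version exposes them elsewhere
# (e.g. directly on the image processor).
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_image_tokens=64,
    max_image_tokens=256,
    do_image_splitting=True,
)

# Build `inputs` with processor.apply_chat_template() as in Basic Usage, then:
outputs = model.generate(
    **inputs,
    do_sample=True,            # sampling must be enabled for temperature/min_p to apply
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)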

Chat Template

The model uses a ChatML-like format:

<|startoftext|><|im_start|>system
あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
<|im_start|>user
<image>この画像を詳しく説明してください。<|im_end|>
<|im_start|>assistant
この画像には...<|im_end|>
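
To confirm the exact prompt string your installed processor builds (and that it matches the template above), you can render a conversation as text instead of token IDs:

# Render the chat template as plain text rather than tokenized tensors.
prompt_text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt_text)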

Architecture Highlights

  • Hybrid backbone: LFM2-1.2B language model paired with SigLIP2 NaFlex shape-optimized vision encoder (400M)
  • Native resolution processing: Handles images up to 512×512 pixels without upscaling
  • Tiling strategy: Splits large images into non-overlapping 512×512 patches with thumbnail encoding for global context (illustrated in the sketch after this list)
  • Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens efficiently
  • Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff
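
As a rough illustration of the tiling behaviour described in the list above, the number of vision-encoder passes for a given image size can be estimated as follows. The exact padding and thumbnail handling live in the processor and may differ; this is a sketch, not the processor's actual logic.

import math

def estimate_image_passes(width: int, height: int, tile: int = 512) -> int:
    """Rough count of vision-encoder passes under the 512x512 tiling strategy."""
    if width <= tile and height <= tile:
        return 1  # native resolution, no splitting
    # Non-overlapping tiles plus one thumbnail for global context.
    return math.ceil(width / tile) * math.ceil(height / tile) + 1

print(estimate_image_passes(512, 512))   # 1
print(estimate_image_passes(1024, 768))  # 4 tiles + 1 thumbnail = 5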

Training Details

Training Procedure

  • Base Model: LiquidAI/LFM2-VL-1.6B
  • Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
  • Framework: Hugging Face TRL (Transformer Reinforcement Learning)
  • Training Data: ~98,000 multi-turn conversations
  • Training Regime: bfloat16 mixed precision

Training Hyperparameters

  • Training approach: LoRA (Low-Rank Adaptation) fine-tuning
  • Dataset size: ~98,000 samples
  • Data format: Multi-turn conversational VQA
  • Language focus: Japanese
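
The exact LoRA configuration was not published. As a rough sketch, the adapter setup might look like the following, where the rank, scaling, dropout, and target modules are all illustrative assumptions; the actual training loop would additionally use TRL's SFTTrainer with a collator that packs images alongside the conversation text.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Illustrative LoRA setup only -- these hyperparameters are assumptions,
# not the values used to train this checkpoint.
base = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2-VL-1.6B",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
lora_config = LoraConfig(
    r=16,                         # assumed rank
    lora_alpha=32,                # assumed scaling factor
    lora_dropout=0.05,            # assumed dropout
    target_modules="all-linear",  # assumed; attaches adapters to all linear layers
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable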

Performance Considerations

As a fine-tuned variant of LFM2-VL-1.6B:

  • Enhanced Capabilities: The 1.6B model offers improved reasoning, more detailed descriptions, and better handling of complex visual scenarios compared to the 450M variant
  • Optimized for Japanese: Best performance on Japanese language tasks
  • Resource Efficient: Still lightweight enough for edge devices while providing enhanced capabilities
  • Speed vs Quality: Strikes a good balance between inference speed and output quality
  • Recommended Use: Can be used out-of-the-box for many Japanese VLM tasks, though further fine-tuning on specific use cases will maximize performance

Comparison with 450M Variant

Aspect | LFM2-VL-450M-jp | LFM2-VL-1.6B-jp
Parameters | 450M total | 1.6B total
Vision Encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M)
Use Case | Highly constrained devices | More capable while still lightweight
Output Quality | Good for simple tasks | Better for complex reasoning
Inference Speed | Faster | Still fast, slightly slower
Memory Usage | Lower | Higher but manageable

Choosing between variants:

  • Use 450M for: Maximum speed, minimal resource usage, simple image descriptions
  • Use 1.6B for: Better quality outputs, complex reasoning, detailed analysis, professional applications

Limitations

  • Language Specialization: Primarily designed for Japanese; performance on other languages may be limited
  • Domain Specificity: Performance is optimized for the types of conversations present in the training data
  • Safety: Not intended for safety-critical decisions without additional validation
  • Complex Reasoning: While improved over 450M, may still struggle with highly complex multi-step reasoning compared to much larger models
  • Cultural Context: Trained on Japanese data; cultural nuances should be considered

Citation

If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:

@misc{lfm2-vl-1.6b-jp,
  author = {Alfaxad},
  title = {LFM2-VL-1.6B-jp: Japanese Fine-tuned Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Alfaxad/LFM2-VL-1.6B-jp}
}

@misc{liquid-lfm2-vl,
  author = {Liquid AI},
  title = {LFM2-VL: Efficient Vision-Language Models},
  year = {2025},
  url = {https://huggingface.co/LiquidAI/LFM2-VL-1.6B}
}

Acknowledgments

  • Base Model: Liquid AI for the LFM2-VL architecture
  • Training Data: llm-jp for the ja-vg-vqa-conversation dataset
  • Framework: Hugging Face for transformers and TRL libraries

Contact

For questions or issues regarding this model, please open an issue on the model's Hugging Face page or contact the model developer.
