LFM2-VL-1.6B-jp (Japanese)
Model Description
LFM2-VL-1.6B-jp is a Japanese fine-tuned variant of LiquidAI/LFM2-VL-1.6B, optimized for Japanese vision-language tasks. This model maintains the efficiency and performance characteristics of the original LFM2-VL 1.6B architecture while specializing in Japanese language understanding and image description. With 1.6B parameters, this model offers enhanced capabilities compared to the 450M variant while remaining lightweight and suitable for edge deployment.
- Developed by: Alfaxad
- Base Model: LiquidAI/LFM2-VL-1.6B
- Model type: Vision-Language Model (Multimodal)
- Language: Japanese (日本語)
- License: LFM Open License v1.0
- Finetuned from: LiquidAI/LFM2-VL-1.6B (1.6B parameters)
Key Features
- Japanese Language Support: Specialized for Japanese image understanding and description tasks
- Enhanced Capabilities: 1.6B parameters provide improved reasoning and generation quality over the 450M variant
- Advanced Vision Encoder: SigLIP2 NaFlex shape-optimized (400M) for better visual understanding
- Low Latency: 2× faster inference speed on GPUs compared to similar-sized VLMs
- Multi-turn Conversations: Trained on conversational data for interactive vision-language tasks
- Native Resolution Processing: Handles images up to 512×512 pixels without upscaling, with intelligent tiling for larger images
Model Details
Property | Value |
---|---|
Parameters (LM only) | 1.2B |
Vision encoder | SigLIP2 NaFlex shape-optimized (400M) |
Total parameters | ~1.6B |
Backbone layers | hybrid conv+attention |
Context (text) | 32,768 tokens |
Image tokens | dynamic, user-tunable |
Vocab size | 65,536 |
Precision | bfloat16 |
Training Data
The model was fine-tuned on approximately 98,000 multi-turn conversational samples from:
- Dataset: llm-jp/ja-vg-vqa-conversation
- Content: Japanese visual question-answering conversations
- Format: Multi-turn dialogues with image context
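The training corpus can be inspected directly with the datasets library. The sketch below assumes the default "train" split; check the dataset card for the exact split and column names.
from datasets import load_dataset

# Fine-tuning corpus: Japanese multi-turn VQA conversations
# (the "train" split name is an assumption; see the dataset card).
ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")

# Each sample pairs an image with a multi-turn Japanese conversation.
print(ds[0])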
Intended Use
Primary Use Cases
- Japanese image captioning and detailed description
- Visual question answering in Japanese with enhanced reasoning
- Multi-turn conversations about images in Japanese
- Japanese document understanding and OCR tasks
- Complex visual reasoning tasks in Japanese
- Edge AI applications requiring Japanese language support
Recommended Applications
- Japanese e-commerce product analysis and description
- Japanese accessibility tools for visual content
- Japanese educational applications requiring visual understanding
- Japanese content moderation and detailed analysis
- Japanese chatbots with advanced visual understanding
- Japanese document processing and information extraction
How to Use
Installation
pip install -U transformers pillow
Basic Usage
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Alfaxad/LFM2-VL-1.6B-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation in Japanese
image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            # "Please describe this image in detail."
            {"type": "text", "text": "この画像について詳しく説明してください。"},
        ],
    },
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
Multi-turn Conversation Example
# Multi-turn conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            # "What is shown in this image?"
            {"type": "text", "text": "この画像には何が写っていますか?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            # "A red car is parked on the road in this image."
            {"type": "text", "text": "この画像には赤い車が道路に駐車されています。"},
        ],
    },
    {
        "role": "user",
        "content": [
            # "Can you tell the make of the car?"
            {"type": "text", "text": "車のメーカーはわかりますか?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
Recommended Generation Parameters
- Temperature: 0.1
- min_p: 0.15
- repetition_penalty: 1.05
- min_image_tokens: 64
- max_image_tokens: 256
- do_image_splitting: True
- max_new_tokens: 128-512 (depending on task complexity)
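As a minimal sketch, the sampling settings above map onto model.generate() as shown below, applied to the inputs from the Basic Usage example. do_sample=True is an added assumption needed for temperature and min_p to take effect; the image-token settings (min_image_tokens, max_image_tokens, do_image_splitting) are processor-side options rather than generate() arguments (see Architecture Highlights).
# Sampling settings from the list above, applied to the Basic Usage inputs.
# do_sample=True is required for temperature and min_p to have any effect.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]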
Chat Template
The model uses a ChatML-like format:
<|startoftext|><|im_start|>system
あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
<|im_start|>user
<image>この画像を詳しく説明してください。<|im_end|>
<|im_start|>assistant
この画像には...<|im_end|>
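To verify the rendered prompt against this format, the processor's chat template can be applied without tokenization:
# Render the conversation to a prompt string instead of token IDs
# to inspect the ChatML-like format shown above.
prompt = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)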
Architecture Highlights
- Hybrid backbone: LFM2-1.2B language model paired with SigLIP2 NaFlex shape-optimized vision encoder (400M)
- Native resolution processing: Handles images up to 512×512 pixels without upscaling
- Tiling strategy: Splits large images into non-overlapping 512×512 patches with thumbnail encoding for global context
- Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens efficiently
- Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff
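As a hedged sketch of this speed/quality tradeoff, the image-token settings listed under Recommended Generation Parameters can be supplied when loading the processor. Passing them as from_pretrained keyword arguments assumes they are forwarded to the image processor; verify the exact parameter names against the processor configuration.
from transformers import AutoProcessor

# Sketch only: tune image-token limits for a speed/quality tradeoff.
# These kwargs are assumed to be forwarded to the image processor;
# confirm the parameter names in the model's preprocessor_config.json.
processor = AutoProcessor.from_pretrained(
    "Alfaxad/LFM2-VL-1.6B-jp",
    trust_remote_code=True,
    min_image_tokens=64,      # lower bound on tokens per image
    max_image_tokens=256,     # upper bound; fewer tokens means faster inference
    do_image_splitting=True,  # tile large images into 512x512 patches
)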
Training Details
Training Procedure
- Base Model: LiquidAI/LFM2-VL-1.6B
- Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
- Framework: Hugging Face TRL (Transformer Reinforcement Learning)
- Training Data: ~98,000 multi-turn conversations
- Training Regime: bfloat16 mixed precision
Training Hyperparameters
- Training approach: LoRA (Low-Rank Adaptation) fine-tuning
- Dataset size: ~98,000 samples
- Data format: Multi-turn conversational VQA
- Language focus: Japanese
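The card does not publish the LoRA rank, alpha, or target modules. The sketch below shows a generic LoRA setup with peft using illustrative values only, not the configuration used to train this checkpoint.
from peft import LoraConfig, get_peft_model

# Illustrative values only -- rank, alpha, dropout, and target modules
# are assumptions, not the settings used for this checkpoint.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the loaded base model (see How to Use) so that only the
# low-rank adapter weights are trainable during SFT.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()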
Performance Considerations
As a fine-tuned variant of LFM2-VL-1.6B:
- Enhanced Capabilities: The 1.6B model offers improved reasoning, more detailed descriptions, and better handling of complex visual scenarios compared to the 450M variant
- Optimized for Japanese: Best performance on Japanese language tasks
- Resource Efficient: Still lightweight enough for edge devices while providing enhanced capabilities
- Speed vs Quality: Offers a stronger balance of inference speed and output quality than the 450M variant
- Recommended Use: Can be used out-of-the-box for many Japanese VLM tasks, though further fine-tuning on specific use cases will maximize performance
Comparison with 450M Variant
Aspect | LFM2-VL-450M-jp | LFM2-VL-1.6B-jp |
---|---|---|
Parameters | 450M total | 1.6B total |
Vision Encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M) |
Use Case | Highly constrained devices | More capable while still lightweight |
Output Quality | Good for simple tasks | Better for complex reasoning |
Inference Speed | Faster | Still fast, slightly slower |
Memory Usage | Lower | Higher but manageable |
Choosing between variants:
- Use 450M for: Maximum speed, minimal resource usage, simple image descriptions
- Use 1.6B for: Better quality outputs, complex reasoning, detailed analysis, professional applications
Limitations
- Language Specialization: Primarily designed for Japanese; performance on other languages may be limited
- Domain Specificity: Performance is optimized for the types of conversations present in the training data
- Safety: Not intended for safety-critical decisions without additional validation
- Complex Reasoning: While improved over 450M, may still struggle with highly complex multi-step reasoning compared to much larger models
- Cultural Context: Trained on Japanese data; cultural nuances should be considered
Citation
If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:
@misc{lfm2-vl-1.6b-jp,
  author    = {Alfaxad},
  title     = {LFM2-VL-1.6B-jp: Japanese Fine-tuned Vision-Language Model},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Alfaxad/LFM2-VL-1.6B-jp}
}

@misc{liquid-lfm2-vl,
  author = {Liquid AI},
  title  = {LFM2-VL: Efficient Vision-Language Models},
  year   = {2025},
  url    = {https://huggingface.co/LiquidAI/LFM2-VL-1.6B}
}
Acknowledgments
- Base Model: Liquid AI for the LFM2-VL architecture
- Training Data: llm-jp for the ja-vg-vqa-conversation dataset
- Framework: Hugging Face for transformers and TRL libraries
Contact
For questions or issues regarding this model, please open an issue on the model's Hugging Face page or contact the model developer.
Additional Resources
- Original Model: LiquidAI/LFM2-VL-1.6B
- Smaller Variant: Alfaxad/LFM2-VL-450M-jp
- Training Dataset: llm-jp/ja-vg-vqa-conversation
- LFM2-VL Blog Post: Liquid AI Blog
- Original Paper/Documentation: LFM2 Blog Post