LFM2-VL-1.6B-jp (Japanese)

Model Description

LFM2-VL-1.6B-jp is a Japanese fine-tuned variant of LiquidAI/LFM2-VL-1.6B, optimized for Japanese vision-language tasks. This model maintains the efficiency and performance characteristics of the original LFM2-VL 1.6B architecture while specializing in Japanese language understanding and image description. With 1.6B parameters, this model offers enhanced capabilities compared to the 450M variant while remaining lightweight and suitable for edge deployment.

  • Developed by: Alfaxad
  • Base Model: LiquidAI/LFM2-VL-1.6B
  • Model type: Vision-Language Model (Multimodal)
  • Language: Japanese (日本語)
  • License: LFM Open License v1.0
  • Finetuned from: LiquidAI/LFM2-VL-1.6B (1.6B parameters)

Key Features

  • Japanese Language Support: Specialized for Japanese image understanding and description tasks
  • Enhanced Capabilities: 1.6B parameters provide improved reasoning and generation quality
  • Advanced Vision Encoder: SigLIP2 NaFlex shape-optimized (400M) for better visual understanding
  • Low Latency: 2× faster inference speed on GPUs compared to similar-sized VLMs
  • Multi-turn Conversations: Trained on conversational data for interactive vision-language tasks
  • Native Resolution Processing: Handles images up to 512×512 pixels without upscaling, with intelligent tiling for larger images

Model Details

Property | Value
Parameters (LM only) | 1.2B
Vision encoder | SigLIP2 NaFlex shape-optimized (400M)
Total parameters | ~1.6B
Backbone layers | hybrid conv+attention
Context (text) | 32,768 tokens
Image tokens | dynamic, user-tunable
Vocab size | 65,536
Precision | bfloat16

Training Data

The model was fine-tuned on approximately 98,000 multi-turn conversational samples from:

  • Dataset: llm-jp/ja-vg-vqa-conversation (see the loading snippet after this list)
  • Content: Japanese visual question-answering conversations
  • Format: Multi-turn dialogues with image context
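
The dataset is publicly available on the Hugging Face Hub and can be inspected directly. A quick look is sketched below; split names and column layout are whatever the dataset itself publishes, not something this card defines.

from datasets import load_dataset

# Quick look at the fine-tuning data. Split names and features are defined
# by the dataset on the Hub, so print them rather than assuming a layout.
ds = load_dataset("llm-jp/ja-vg-vqa-conversation")
print(ds)  # shows available splits and their features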

Intended Use

Primary Use Cases

  • Japanese image captioning and detailed description
  • Visual question answering in Japanese with enhanced reasoning
  • Multi-turn conversations about images in Japanese
  • Japanese document understanding and OCR tasks
  • Complex visual reasoning tasks in Japanese
  • Edge AI applications requiring Japanese language support

Recommended Applications

  • Japanese e-commerce product analysis and description
  • Japanese accessibility tools for visual content
  • Japanese educational applications requiring visual understanding
  • Japanese content moderation and detailed analysis
  • Japanese chatbots with advanced visual understanding
  • Japanese document processing and information extraction

How to Use

Installation

pip install -U transformers pillow

Basic Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Alfaxad/LFM2-VL-1.6B-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation in Japanese
image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "この画像について詳しく説明してください。"},
        ],
    },
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
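
Note that decoding the full output sequence also returns the prompt text. If you only want the newly generated answer, a common pattern is to slice off the prompt tokens before decoding (reusing inputs and outputs from the snippet above):

# Keep only the newly generated tokens; the prompt occupies the first
# input_ids.shape[1] positions of each output sequence.
prompt_len = inputs["input_ids"].shape[1]
answer = processor.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)[0]
print(answer)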

Multi-turn Conversation Example

# Multi-turn conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "この画像には何が写っていますか?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "この画像には赤い車が道路に駐車されています。"},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "車のメーカーはわかりますか?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)

Recommended Generation Parameters

  • Temperature: 0.1
  • min_p: 0.15
  • repetition_penalty: 1.05
  • min_image_tokens: 64
  • max_image_tokens: 256
  • do_image_splitting: True
  • max_new_tokens: 128-512 (depending on task complexity)
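
These values can be wired into the earlier snippets as shown below. The sampling parameters go to model.generate; the image-token settings are assumed to be accepted when loading the processor, following the base LFM2-VL card, so verify against your transformers version if they are not picked up there.

# Image-token settings: assumed to be processor-level options per the base
# LFM2-VL card; adjust if your transformers version exposes them elsewhere
# (e.g. directly on the image processor).
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    min_image_tokens=64,
    max_image_tokens=256,
    do_image_splitting=True,
)

# Build `inputs` with processor.apply_chat_template() as in Basic Usage, then:
outputs = model.generate(
    **inputs,
    do_sample=True,            # sampling must be enabled for temperature/min_p to apply
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)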

Chat Template

The model uses a ChatML-like format:

<|startoftext|><|im_start|>system
あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
<|im_start|>user
<image>この画像を詳しく説明してください。<|im_end|>
<|im_start|>assistant
この画像には...<|im_end|>
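
To confirm the exact prompt string your installed processor builds (and that it matches the template above), you can render a conversation as text instead of token IDs:

# Render the chat template as plain text rather than tokenized tensors.
prompt_text = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt_text)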

Architecture Highlights

  • Hybrid backbone: LFM2-1.2B language model paired with SigLIP2 NaFlex shape-optimized vision encoder (400M)
  • Native resolution processing: Handles images up to 512×512 pixels without upscaling
  • Tiling strategy: Splits large images into non-overlapping 512×512 patches with thumbnail encoding for global context (illustrated in the sketch after this list)
  • Efficient token mapping: 2-layer MLP connector with pixel unshuffle reduces image tokens efficiently
  • Inference-time flexibility: User-tunable maximum image tokens and patch count for speed/quality tradeoff
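
As a rough illustration of the tiling behaviour described in the list above, the number of vision-encoder passes for a given image size can be estimated as follows. The exact padding and thumbnail handling live in the processor and may differ; this is a sketch, not the processor's actual logic.

import math

def estimate_image_passes(width: int, height: int, tile: int = 512) -> int:
    """Rough count of vision-encoder passes under the 512x512 tiling strategy."""
    if width <= tile and height <= tile:
        return 1  # native resolution, no splitting
    # Non-overlapping tiles plus one thumbnail for global context.
    return math.ceil(width / tile) * math.ceil(height / tile) + 1

print(estimate_image_passes(512, 512))   # 1
print(estimate_image_passes(1024, 768))  # 4 tiles + 1 thumbnail = 5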

Training Details

Training Procedure

  • Base Model: LiquidAI/LFM2-VL-1.6B
  • Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
  • Framework: Hugging Face TRL (Transformer Reinforcement Learning)
  • Training Data: ~98,000 multi-turn conversations
  • Training Regime: bfloat16 mixed precision

Training Hyperparameters

  • Training approach: LoRA (Low-Rank Adaptation) fine-tuning
  • Dataset size: ~98,000 samples
  • Data format: Multi-turn conversational VQA
  • Language focus: Japanese
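
The exact LoRA configuration was not published. As a rough sketch, the adapter setup might look like the following, where the rank, scaling, dropout, and target modules are all illustrative assumptions; the actual training loop would additionally use TRL's SFTTrainer with a collator that packs images alongside the conversation text.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

# Illustrative LoRA setup only -- these hyperparameters are assumptions,
# not the values used to train this checkpoint.
base = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2-VL-1.6B",
    torch_dtype="bfloat16",
    trust_remote_code=True,
)
lora_config = LoraConfig(
    r=16,                         # assumed rank
    lora_alpha=32,                # assumed scaling factor
    lora_dropout=0.05,            # assumed dropout
    target_modules="all-linear",  # assumed; attaches adapters to all linear layers
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable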

Performance Considerations

As a fine-tuned variant of LFM2-VL-1.6B:

  • Enhanced Capabilities: The 1.6B model offers improved reasoning, more detailed descriptions, and better handling of complex visual scenarios compared to the 450M variant
  • Optimized for Japanese: Best performance on Japanese language tasks
  • Resource Efficient: Still lightweight enough for edge devices while providing enhanced capabilities
  • Speed vs Quality: Strikes a good balance between inference speed and output quality
  • Recommended Use: Can be used out-of-the-box for many Japanese VLM tasks, though further fine-tuning on specific use cases will maximize performance

Comparison with 450M Variant

Aspect | LFM2-VL-450M-jp | LFM2-VL-1.6B-jp
Parameters | 450M total | 1.6B total
Vision Encoder | SigLIP2 NaFlex base (86M) | SigLIP2 NaFlex shape-optimized (400M)
Use Case | Highly constrained devices | More capable while still lightweight
Output Quality | Good for simple tasks | Better for complex reasoning
Inference Speed | Faster | Still fast, slightly slower
Memory Usage | Lower | Higher but manageable

Choosing between variants:

  • Use 450M for: Maximum speed, minimal resource usage, simple image descriptions
  • Use 1.6B for: Better quality outputs, complex reasoning, detailed analysis, professional applications

Limitations

  • Language Specialization: Primarily designed for Japanese; performance on other languages may be limited
  • Domain Specificity: Performance is optimized for the types of conversations present in the training data
  • Safety: Not intended for safety-critical decisions without additional validation
  • Complex Reasoning: While improved over 450M, may still struggle with highly complex multi-step reasoning compared to much larger models
  • Cultural Context: Trained on Japanese data; cultural nuances should be considered

Citation

If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:

@misc{lfm2-vl-1.6b-jp,
  author = {Alfaxad},
  title = {LFM2-VL-1.6B-jp: Japanese Fine-tuned Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Alfaxad/LFM2-VL-1.6B-jp}
}

@misc{liquid-lfm2-vl,
  author = {Liquid AI},
  title = {LFM2-VL: Efficient Vision-Language Models},
  year = {2025},
  url = {https://huggingface.co/LiquidAI/LFM2-VL-1.6B}
}

Acknowledgments

  • Base Model: Liquid AI for the LFM2-VL architecture
  • Training Data: llm-jp for the ja-vg-vqa-conversation dataset
  • Framework: Hugging Face for transformers and TRL libraries

Contact

For questions or issues regarding this model, please open an issue on the model's Hugging Face page or contact the model developer.
