LFM2-VL-450M-jp (Japanese)

Model Description

LFM2-VL-450M-jp is a Japanese fine-tuned variant of LiquidAI/LFM2-VL-450M, optimized for Japanese vision-language tasks. This model maintains the efficiency and low-latency characteristics of the original LFM2-VL architecture while specializing in Japanese language understanding and image description.

  • Developed by: Alfaxad
  • Base Model: LiquidAI/LFM2-VL-450M
  • Model type: Vision-Language Model (Multimodal)
  • Language: Japanese (日本語)
  • License: LFM Open License v1.0
  • Finetuned from: LiquidAI/LFM2-VL-450M (450M parameters)

Key Features

  • Japanese Language Support: Specialized for Japanese image understanding and description tasks
  • Efficient Architecture: Retains the base model's compact footprint of roughly 450M parameters (350M language model + 86M vision encoder)
  • Low Latency: Optimized for edge AI applications and resource-constrained environments
  • Multi-turn Conversations: Trained on conversational data for interactive vision-language tasks
  • Native Resolution Processing: Handles images up to 512×512 pixels without upscaling

Model Details

| Property | Value |
|----------|-------|
| Parameters (LM only) | 350M |
| Vision encoder | SigLIP2 NaFlex base (86M) |
| Backbone layers | Hybrid conv + attention |
| Context length (text) | 32,768 tokens |
| Image tokens | Dynamic, user-tunable |
| Vocabulary size | 65,536 |
| Precision | bfloat16 |
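
As a quick sanity check on the figures above, total parameters can be counted after loading the checkpoint. This is a minimal sketch; the total should land near the ~450M headline figure (language model plus vision encoder plus the remaining multimodal glue):

import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Alfaxad/LFM2-VL-450M-jp",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Sum over all parameters, LM and vision encoder included.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e6:.0f}M")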

Training Data

The model was fine-tuned on approximately 98,000 multi-turn conversational samples from:

  • Dataset: llm-jp/ja-vg-vqa-conversation
  • Content: Japanese visual question-answering conversations
  • Format: Multi-turn dialogues with image context
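
To take a quick look at the corpus, the dataset can be loaded from the Hub. A small sketch; the "train" split name is an assumption about the dataset layout:

from datasets import load_dataset

ds = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")
print(len(ds))       # roughly 98,000 samples, per the figure above
print(ds[0].keys())  # conversation turns plus the associated image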

Intended Use

Primary Use Cases

  • Japanese image captioning and description
  • Visual question answering in Japanese
  • Multi-turn conversations about images in Japanese
  • Japanese document understanding and OCR tasks
  • Edge AI applications requiring Japanese language support

Recommended Applications

  • Japanese e-commerce product descriptions
  • Japanese accessibility tools for visual content
  • Japanese educational applications
  • Japanese content moderation and analysis
  • Japanese chatbots with visual understanding

How to Use

Installation

pip install -U transformers accelerate torch pillow

Basic Usage

from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# Load model and processor
model_id = "Alfaxad/LFM2-VL-450M-jp"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation in Japanese
image = load_image("your_image_url_or_path.jpg")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "この画像には何が写っていますか?"},
        ],
    },
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
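
Note that batch_decode returns the full chat transcript, prompt included. To keep only the model's reply, slice off the prompt tokens before decoding (a small sketch reusing inputs and outputs from above):

# Decode only the newly generated tokens, skipping the echoed prompt.
new_tokens = outputs[:, inputs["input_ids"].shape[-1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(response)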

Recommended Generation Parameters

  • Temperature: 0.1
  • min_p: 0.15
  • repetition_penalty: 1.05
  • min_image_tokens: 64
  • max_image_tokens: 256
  • do_image_splitting: True
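
The sampling settings (temperature, min_p, repetition_penalty) are arguments to generate(), while the image-token settings belong to the processor. A minimal sketch wiring them together, reusing model, processor, and conversation from Basic Usage; that the processor accepts these image kwargs through apply_chat_template is an assumption, and they can also be passed to a direct processor call:

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
    min_image_tokens=64,       # lower bound on tokens per image
    max_image_tokens=256,      # upper bound on tokens per image
    do_image_splitting=True,   # tile large images into patches
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,            # required for temperature/min_p to take effect
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
)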

Chat Template

The model uses a ChatML-like format; in the example below, the system prompt reads "You are a helpful multimodal assistant by Liquid AI." and the user asks "Please describe this image.":

<|startoftext|><|im_start|>system
あなたはLiquid AIによる有用なマルチモーダルアシスタントです。<|im_end|>
<|im_start|>user
<image>この画像を説明してください。<|im_end|>
<|im_start|>assistant
この画像には...<|im_end|>
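
To inspect the rendered prompt without tokenizing (handy for debugging), the chat template can be applied with tokenize=False:

prompt = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)  # prints the <|im_start|>/<|im_end|> structure shown above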

Training Details

Training Procedure

  • Base Model: LiquidAI/LFM2-VL-450M
  • Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
  • Framework: Hugging Face TRL (Transformer Reinforcement Learning)
  • Training Data: ~98,000 multi-turn conversations
  • Training Regime: bfloat16 mixed precision

Training Hyperparameters

  • Training approach: LoRA (Low-Rank Adaptation) fine-tuning
  • Dataset size: ~98,000 samples
  • Data format: Multi-turn conversational VQA
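
The exact hyperparameters are not published. The sketch below shows only the shape of such a run with TRL and PEFT; every concrete value (LoRA rank, learning rate, batch size, target modules) is an illustrative placeholder, not the released recipe, and a working vision-language run additionally needs a collator that pairs images with the chat-formatted text:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForImageTextToText
from trl import SFTConfig, SFTTrainer

model = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2-VL-450M", trust_remote_code=True
)
dataset = load_dataset("llm-jp/ja-vg-vqa-conversation", split="train")

# Hypothetical LoRA configuration; rank and target modules are placeholders.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = SFTConfig(
    output_dir="lfm2-vl-450m-jp-sft",
    per_device_train_batch_size=4,  # placeholder
    learning_rate=2e-4,             # placeholder
    num_train_epochs=1,             # placeholder
    bf16=True,                      # matches the stated bfloat16 regime
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    # NOTE: a real VLM run also needs a data_collator that renders the
    # chat template and batches pixel values alongside the token IDs.
)
trainer.train()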

Performance Considerations

As a fine-tuned variant of LFM2-VL-450M:

  • Optimized for Japanese: Best performance on Japanese language tasks
  • Resource Efficient: Suitable for edge devices and constrained environments
  • Recommended Use: Fine-tune further on specific Japanese use cases for optimal performance

Note: This is a specialized model for Japanese. For English tasks, consider using the original LiquidAI/LFM2-VL-450M.

Limitations

  • Language Specialization: Primarily designed for Japanese; performance on other languages may be limited
  • Model Size: As a 450M parameter model, it may not match the capabilities of larger models on complex reasoning tasks
  • Domain Specificity: Performance is optimized for the types of conversations present in the training data
  • Safety: Not intended for safety-critical decisions without additional validation
  • Narrow Use Cases: Best results when fine-tuned on specific downstream tasks

Ethical Considerations

  • Bias: The model may reflect biases present in the training data (ja-vg-vqa-conversation dataset)
  • Misuse Potential: Should not be used for generating misleading or harmful content
  • Privacy: Do not process images containing sensitive personal information without appropriate consent
  • Cultural Context: Trained on Japanese data; cultural nuances should be considered

Citation

If you use this model, please cite both the original LFM2-VL model and this fine-tuned variant:

@misc{lfm2-vl-450m-jp,
  author = {Alfaxad},
  title = {LFM2-VL-450M-jp: Japanese Fine-tuned Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Alfaxad/LFM2-VL-450M-jp}
}

@misc{liquid-lfm2-vl,
  author = {Liquid AI},
  title = {LFM2-VL: Efficient Vision-Language Models},
  year = {2025},
  url = {https://huggingface.co/LiquidAI/LFM2-VL-450M}
}

Acknowledgments

  • Base Model: Liquid AI for the LFM2-VL architecture
  • Training Data: llm-jp for the ja-vg-vqa-conversation dataset
  • Framework: Hugging Face for transformers and TRL libraries
