---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
base_model:
- microsoft/Phi-4-mini-reasoning
- kevin510/fast-vit-hd
library_name: transformers
tags:
- vision-language
- multimodal
- friday
- custom_code
- bf16
---
|
|
|
# Friday-VLM |
|
|
|
Friday-VLM is a multimodal (image + text) LLM fine-tuned on image and text instruction data.
The architecture and configuration are defined by custom code in this repository, so the model must be loaded with `trust_remote_code=True`.
|
|
|
--- |
|
|
|
# Model variants |
|
|
|
| Repo ID | Precision | File format | Typical VRAM* | Size on disk* |
|---------|-----------|-------------|---------------|---------------|
| `kevin510/friday` | **bf16** (full) | `safetensors` | 100 % | 100 % |
| `kevin510/friday-fp4` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |

\*Relative to the full bf16 checkpoint.
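
Loading the quantized variant looks the same apart from the repo ID. The sketch below assumes the 4-bit weights and quantization config are picked up automatically from the repo; `bitsandbytes` must be installed and a CUDA device available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Identical API to the full-precision model; only the repo ID changes.
tok = AutoTokenizer.from_pretrained("kevin510/friday-fp4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday-fp4",
    trust_remote_code=True,
    device_map="auto",  # bitsandbytes 4-bit weights need a CUDA device
)
```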
|
|
|
--- |
|
|
|
|
|
# Dependencies |
|
|
|
```bash
conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm
```
|
|
|
# Quick start |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

# The custom model and config classes live in the repo, so trust_remote_code is required.
tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

# Prompt format: a <|user|> turn containing the <image> placeholder, then <|assistant|>.
prompt = "Describe this image."
user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding
        images=[image],   # PIL images are passed directly to the custom generate
    )

print(tok.decode(out[0], skip_special_tokens=False))
```
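
For interactive use, tokens can be streamed to stdout as they are generated. The sketch below uses transformers' `TextStreamer` and reuses `tok`, `model`, `inputs`, and `image` from the quick start; it assumes the custom `generate` forwards standard keyword arguments such as `streamer`, which has not been verified here.

```python
from transformers import TextStreamer

streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        **inputs,
        images=[image],
        max_new_tokens=256,
        do_sample=True,     # sample instead of greedy decoding
        temperature=0.7,
        streamer=streamer,  # prints decoded tokens as they arrive
    )
```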
|
|
|
# Architecture at a glance |
|
|
|
```
FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision adapter (6144 → 3072)

(vision tokens, 3072-d) ─┐
                         ├─▶ Phi-4-mini-reasoning (2.7 B params, hidden = 3072)
(text tokens, 3072-d) ───┘    (standard self-attention only; the language tower
                               is frozen during fine-tuning)
```
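
As a point of reference, a vision adapter with the shape shown above could look like the following PyTorch sketch. The layer dimensions come from the diagram; the GELU activation and the overall structure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """2-layer MLP projecting 6144-d S2 patch embeddings to the 3072-d LLM space."""

    def __init__(self, in_dim: int = 6144, out_dim: int = 3072):
        super().__init__()
        # GELU is an assumption; the released adapter may use a different activation.
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 6144) -> (batch, num_patches, 3072)
        return self.proj(patch_embeddings)
```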
|
|
|
|
|
|
|
|
|
# Limitations & Responsible AI |
|
|
|
Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases. |
|
All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment. |
|
|
|
# Citation |
|
|
|
```bibtex
@misc{friday2025,
  title  = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author = {Your Name et al.},
  year   = {2025},
  url    = {https://huggingface.co/kevin510/friday}
}
```