---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
base_model:
- microsoft/Phi-4-mini-reasoning
- kevin510/fast-vit-hd
library_name: transformers
tags:
- vision-language
- multimodal
- friday
- custom_code
- bf16
---

# Friday-VLM

Friday-VLM is a multimodal (image + text) LLM fine-tuned on image-and-text instruction data.
The model architecture and configuration code live in this repository, so callers must load it with
`trust_remote_code=True`.

---

# Model variants

| Repo ID | Precision | File format | Typical VRAM* | Size on disk* |
|---------|-----------|-------------|---------------|---------------|
| `kevin510/friday`     | **bf16** (full precision)    | `safetensors` | 100 % | 100 % |
| `kevin510/friday-fp4` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |

\*Relative to the full bf16 checkpoint (`kevin510/friday`).
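
The quantized variant loads through the same API; the snippet below is a minimal sketch that only swaps the repo ID and assumes the fp4 weights are resolved automatically from `kevin510/friday-fp4`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same loading path as the bf16 checkpoint; only the repository ID changes.
tok_fp4 = AutoTokenizer.from_pretrained("kevin510/friday-fp4", trust_remote_code=True)
model_fp4 = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday-fp4",
    trust_remote_code=True,
    device_map="auto",
)
```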

---


# Dependencies

```bash
conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm
```
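
A quick sanity check that the core packages import cleanly and that a GPU is visible (exact versions are not pinned by this card):

```python
import torch
import transformers

# Print installed versions and whether CUDA is usable.
print("transformers", transformers.__version__)
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```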

# Quick start

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

# The <image> placeholder marks where the vision tokens are spliced into the prompt.
prompt = "Describe this image."
user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,   # greedy decoding
        images=[image],    # PIL images are handled by the model's remote code
    )

print(tok.decode(out[0], skip_special_tokens=False))
```
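
To print only the model's reply rather than the full prompt echo, slice off the prompt tokens before decoding. This is a minimal sketch that assumes the custom `generate` returns the prompt and continuation concatenated, as the standard implementation does:

```python
# Decode only the newly generated tokens, dropping the prompt prefix.
prompt_len = inputs["input_ids"].shape[1]
reply = tok.decode(out[0][prompt_len:], skip_special_tokens=True)
print(reply)
```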

# Architecture at a glance

```
FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision adapter (6144 → 3072)

vision tokens (3072-d) ─┐
                        ├─▶ Φ-4-mini-reasoning (2.7 B params, hidden = 3072)
text tokens (3072-d) ───┘   (standard self-attention only; language tower frozen during fine-tuning)
```
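
For intuition, the vision adapter above is a small projection from the 6144-d S2 features into the language model's 3072-d embedding space. The sketch below is illustrative only; the GELU activation and layer layout are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """2-layer MLP mapping 6144-d S2 patch features to the 3072-d LM hidden size (illustrative sketch)."""

    def __init__(self, in_dim: int = 6144, out_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 6144) -> (batch, num_patches, 3072)
        return self.proj(patch_features)
```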




# Limitations & Responsible AI

Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases.
All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment.

# Citation

```bibtex
@misc{friday2025,
  title   = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author  = {Your Name et al.},
  year    = {2025},
  url     = {https://huggingface.co/kevin510/friday}
}
```