---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
base_model:
- microsoft/Phi-4-mini-reasoning
- kevin510/fast-vit-hd
library_name: transformers
tags:
- vision-language
- multimodal
- friday
- custom_code
- bf16
---
|
|
|
# Friday-VLM |
|
|
|
Friday-VLM is a multimodal (image + text) LLM fine-tuned on image and text instruction data.
The architecture and configuration are defined by custom code in this repository, so the model must be loaded with `trust_remote_code=True`.
|
|
|
--- |
|
|
|
# Model variants |
|
|
|
| Repo ID | Precision | File format | Typical VRAM* | Size on disk* |
|---------|-----------|-------------|---------------|---------------|
| `kevin510/friday` | **bf16** (full) | `safetensors` | 100 % | 100 % |
| `kevin510/friday-fp4` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |

\*Relative to the full bf16 checkpoint.
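
Loading the quantized variant looks the same apart from the repo ID. The sketch below assumes the 4-bit weights and quantization config are picked up automatically from the repo; `bitsandbytes` must be installed and a CUDA device available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Identical API to the full-precision model; only the repo ID changes.
tok = AutoTokenizer.from_pretrained("kevin510/friday-fp4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday-fp4",
    trust_remote_code=True,
    device_map="auto",  # bitsandbytes 4-bit weights need a CUDA device
)
```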
|
|
|
--- |
|
|
|
|
|
# Dependencies |
|
|
|
```bash
conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm
```
|
|
|
# Quick start |
|
|
|
```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

# The custom model and config classes live in the repo, so trust_remote_code is required.
tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    trust_remote_code=True,
    device_map="auto",
)
model.eval()

# Prompt format: a <|user|> turn containing the <image> placeholder, then <|assistant|>.
prompt = "Describe this image."
user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding
        images=[image],   # PIL images are passed directly to the custom generate
    )

print(tok.decode(out[0], skip_special_tokens=False))
```
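
For interactive use, tokens can be streamed to stdout as they are generated. The sketch below uses transformers' `TextStreamer` and reuses `tok`, `model`, `inputs`, and `image` from the quick start; it assumes the custom `generate` forwards standard keyword arguments such as `streamer`, which has not been verified here.

```python
from transformers import TextStreamer

streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        **inputs,
        images=[image],
        max_new_tokens=256,
        do_sample=True,     # sample instead of greedy decoding
        temperature=0.7,
        streamer=streamer,  # prints decoded tokens as they arrive
    )
```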
|
|
|
# Architecture at a glance |
|
|
|
```
FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision adapter (6144 → 3072)

(vision tokens, 3072-d) ─┐
                         ├─▶ Phi-4-mini-reasoning (2.7 B params, hidden = 3072)
(text tokens, 3072-d) ───┘    (standard self-attention only; the language tower
                               is frozen during fine-tuning)
```
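
As a point of reference, a vision adapter with the shape shown above could look like the following PyTorch sketch. The layer dimensions come from the diagram; the GELU activation and the overall structure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """2-layer MLP projecting 6144-d S2 patch embeddings to the 3072-d LLM space."""

    def __init__(self, in_dim: int = 6144, out_dim: int = 3072):
        super().__init__()
        # GELU is an assumption; the released adapter may use a different activation.
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 6144) -> (batch, num_patches, 3072)
        return self.proj(patch_embeddings)
```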
|
|
|
|
|
|
|
|
|
# Limitations & Responsible AI |
|
|
|
Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases. |
|
All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment. |
|
|
|
# Citation |
|
|
|
```bibtex
@misc{friday2025,
  title  = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author = {Your Name et al.},
  year   = {2025},
  url    = {https://huggingface.co/kevin510/friday}
}
```