FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Paper: 2510.10921
FG-CLIP 2 is a foundation model for fine-grained vision-language understanding in both English and Chinese.
Visualize image similarity to labels
Classify images based on given labels
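
The two demos above both reduce to comparing an image embedding against text embeddings of candidate labels. The following is a minimal sketch of that idea, not FG-CLIP 2's actual API: the function names, the toy vectors, and the temperature value are all assumptions made for illustration, and the softmax normalization is one common choice that the model itself may not use. In practice the embeddings would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(image_embedding, label_embeddings, temperature=0.01):
    """Turn image-to-label similarities into a probability per label.

    label_embeddings maps label text to its (hypothetical) text embedding.
    temperature is an assumed value; smaller values sharpen the distribution.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[l]) for l in labels]
    logits = [s / temperature for s in sims]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}

# Toy vectors standing in for encoder outputs (hypothetical data).
image_vec = [0.9, 0.1, 0.0]
label_vecs = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}

probs = classify(image_vec, label_vecs)
best_label = max(probs, key=probs.get)
```

The raw cosine similarities are what a similarity visualization would display, while the normalized probabilities are what a label classifier would report.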