Which one should I use, Florence-2-base-ft or Florence-2-base?
From the description, I can't understand the main difference between Florence-2-base-ft and Florence-2-base.
My question is: if I want to fine-tune Florence-2 on a specific object detection dataset, which one should I use, the ft version or the non-ft version?
For now I use Florence-2-base-ft, but the results are not good enough: I found that CNN-based methods can beat my fine-tuned Florence-2-base-ft quite easily, so I suspect something is going wrong during my fine-tuning phase.
Hey, I have the same question. Did you find the answer, and what is the difference between the models?
same question, what's the difference?
Here is what GPT5-thinking summarized from the model description:
Short version: same architecture/size; different training stage.
- Florence-2-base = the pre-trained checkpoint trained on Microsoft’s FLD-5B (5.4B annotations on 126M images). It’s the raw foundation model.
- Florence-2-base-ft = the same model further fine-tuned on a curated mixture of downstream tasks (captioning, grounding/REC, detection, OCR, etc.) to make a single “generalist” model that performs better out of the box on those tasks.
What that means in practice
- Measured gains: Microsoft’s model card shows the ft model improving typical benchmarks vs. the base model’s zero-shot numbers: for example, COCO Caption CIDEr 140.0 for base-ft (after fine-tuning) vs. 133.0 for base (zero-shot), and COCO detection mAP 41.4 (base-ft) vs. 34.7 (base, zero-shot).
- Under the hood: both are the same seq-to-seq VLM; the ft checkpoint is just an extra stage of supervised multi-task tuning layered on top of the FLD-5B pretraining described in the technical report.
Which to pick?
- If you just want plug-and-play captions/OD/OCR, use Florence-2-base-ft; that’s what it’s for (see the inference sketch right after this list).
- If you plan to fine-tune on your own data, starting from base is reasonable; some community users also report cases where base behaves better for niche prompts, but that’s anecdotal (a fine-tuning sketch is further down).
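For the plug-and-play case, this is roughly how the ft checkpoint gets used for detection, following the usage shown on the model card (a minimal sketch; the COCO image URL is just a placeholder, swap in your own image):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base-ft"  # the fine-tuned "generalist" checkpoint

# Florence-2 ships its modeling code with the checkpoint, hence trust_remote_code
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

# Placeholder example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Object detection is selected purely by the task prompt
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Converts the generated location tokens back into boxes and labels
parsed = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
print(parsed)  # {'<OD>': {'bboxes': [...], 'labels': [...]}}
```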
That’s the whole difference: pre-trained vs. the same model after multi-task fine-tuning—same size, same prompts, better out-of-box task performance on common datasets with the ft checkpoint.
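And if you go the fine-tune-from-base route (the OP's case), a stripped-down training step looks something like the sketch below. This is only how I'd wire it up with Transformers, not an official recipe: the <OD> target string format (class name followed by quantized <loc_*> tokens, with made-up coordinates here), the learning rate, and the single-example step are all assumptions you'd need to adapt to your dataset and batching.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base"  # start from the pre-trained checkpoint

model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def train_step(image, target_text):
    """One supervised step on a single (image, target) pair.

    target_text is the detection target in Florence-2's text format,
    e.g. "cat<loc_52><loc_118><loc_410><loc_806>" -- the quantized
    coordinates here are made up for illustration.
    """
    inputs = processor(text="<OD>", images=image, return_tensors="pt").to(device)
    labels = processor.tokenizer(target_text, return_tensors="pt").input_ids.to(device)

    # The model returns a standard seq-to-seq LM loss when labels are provided
    outputs = model(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        labels=labels,
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

If fine-tuned detection still loses badly to a CNN baseline, the label formatting (converting boxes to <loc_*> tokens correctly) and the learning rate are the usual things to double-check first.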
If the MS team has any comments, feel free to pitch in.