Which one should I use, Florence-2-base-ft or Florence-2-base?

#23, opened by CrimisionLFIre

From the description, I can't understand the main difference between Florence-2-base-ft and Florence-2-base.
My question is: if I want to fine-tune Florence-2 on a specific object detection dataset, which one should I use, the ft version or the one without ft?

For now I use Florence-2-base-ft, but the results are not good enough: I found that a CNN-based method can beat Florence-2-base-ft very easily, so I suspect something went wrong during my fine-tuning phase.

Hey, I've got the same question. Did you find the answer, and what is the difference between the models?

Same question, what's the difference?

Here is what GPT-5 Thinking summarized from the model description:

Short version: same architecture/size; different training stage.

  • Florence-2-base = the pre-trained checkpoint, trained on Microsoft's FLD-5B (5.4B annotations on 126M images). It's the raw foundation model.
  • Florence-2-base-ft = the same model further fine-tuned on a curated mixture of downstream tasks (captioning, grounding/REC, detection, OCR, etc.) to make a single "generalist" model that performs better out of the box on those tasks. In code the two load identically; see the sketch below.
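
The two are interchangeable in code: same classes, same processor, different repo id. A minimal loading sketch (Florence-2 ships custom modeling code, so `trust_remote_code=True` is required, per the model card):

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Same architecture, different weights: swap only the repo id.
ckpt = "microsoft/Florence-2-base"       # pre-trained foundation checkpoint
# ckpt = "microsoft/Florence-2-base-ft"  # multi-task fine-tuned checkpoint

model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
```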

What that means in practice

  • Measured gains: Microsoft's model card shows the ft checkpoint improving on the base model's zero-shot numbers. For example, COCO Caption CIDEr is 140.0 for base-ft (after its multi-task fine-tuning) vs. 133.0 for base (zero-shot), and COCO detection mAP is 41.4 (base-ft) vs. 34.7 (base, zero-shot).
  • Under the hood: both are the same seq-to-seq VLM; the ft checkpoint is just an extra stage of supervised multi-task tuning layered on top of the FLD-5B pretraining described in the technical report. The inference interface is identical for both; see the sketch below.
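
Because the interface is shared, an object-detection call looks the same for either checkpoint. Here is a sketch adapted from the model card's `<OD>` example (the car image URL is the card's sample; substitute your own):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base-ft"  # or "microsoft/Florence-2-base"

model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

# Sample image from the model card; use your own in practice.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# "<OD>" is Florence-2's object-detection task prompt.
inputs = processor(text="<OD>", images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Converts the generated location tokens back into pixel-space boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed)  # {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
```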

Which to pick?

  • If you just want plug-and-play captions/OD/OCR, use Florence-2-base-ft. That’s what it’s for.
  • If you plan to fine-tune on your own data, starting from base is reasonable; some community users also report cases where base behaves better for niche prompts, but that's anecdotal. A minimal fine-tuning sketch follows below.
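
For the fine-tuning route, the common community pattern (e.g. Hugging Face's Florence-2 fine-tuning blog post) is to train the seq-to-seq loss directly: feed the `<OD>` prompt plus the image, and supervise against a target string that encodes each box as Florence-2's quantized `<loc_###>` tokens. A heavily simplified sketch; `my_dataset` is hypothetical, and the exact target format and hyperparameters are assumptions you'll need to adapt:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base"  # starting point; "-ft" also works
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

# Hypothetical dataset: (PIL image, target string) pairs, where the target
# encodes each box in Florence-2's format, e.g.
# "cat<loc_102><loc_240><loc_580><loc_710>" with coordinates binned to 0-999.
my_dataset = []  # replace with your own data

def collate(batch):
    images, targets = zip(*batch)
    inputs = processor(text=["<OD>"] * len(images), images=list(images),
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(list(targets), return_tensors="pt",
                                 padding=True,
                                 return_token_type_ids=False).input_ids
    return inputs, labels

loader = DataLoader(my_dataset, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for epoch in range(3):
    for inputs, labels in loader:
        outputs = model(input_ids=inputs["input_ids"].to(device),
                        pixel_values=inputs["pixel_values"].to(device),
                        labels=labels.to(device))
        outputs.loss.backward()  # standard seq-to-seq cross-entropy
        optimizer.step()
        optimizer.zero_grad()
```

If a plain CNN detector still wins easily after something like this, one thing worth ruling out is the target format: the boxes need to be encoded as the quantized `<loc_###>` tokens the model was pretrained with, not raw pixel coordinates written out as text.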

That's the whole difference: pre-trained vs. the same model after multi-task fine-tuning. Same size, same prompts, better out-of-the-box task performance on common datasets with the ft checkpoint.

If the MS team has any comments, feel free to pitch in.
