Which one should I use, Florence-2-base-ft or Florence-2-base?

#23, opened by CrimisionLFIre

From the description, I can't understand the main difference between Florence-2-base-ft and Florence-2-base.
My question is: if I want to fine-tune Florence-2 on a specific object detection dataset, which one should I use, the ft version or the one without ft?

For now I use Florence-2-base-ft, but the results are not good enough: I found that a CNN-based method can beat Florence-2-base-ft very easily, so I suspect something went wrong during my fine-tuning phase.

Hey, I've got the same question. Did you find the answer, and what is the difference between the models?

Same question, what's the difference?

Here is what GPT-5 Thinking summarized from the model description:

Short version: same architecture/size; different training stage.

  • Florence-2-base = the pre-trained checkpoint, trained on Microsoft's FLD-5B (5.4B annotations on 126M images). It's the raw foundation model.
  • Florence-2-base-ft = the same model further fine-tuned on a curated mixture of downstream tasks (captioning, grounding/REC, detection, OCR, etc.) to make a single "generalist" model that performs better out of the box on those tasks. In code the two load identically; see the sketch below.
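
The two are interchangeable in code: same classes, same processor, different repo id. A minimal loading sketch (Florence-2 ships custom modeling code, so `trust_remote_code=True` is required, per the model card):

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Same architecture, different weights: swap only the repo id.
ckpt = "microsoft/Florence-2-base"       # pre-trained foundation checkpoint
# ckpt = "microsoft/Florence-2-base-ft"  # multi-task fine-tuned checkpoint

model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
```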

What that means in practice

  • Measured gains: Microsoft's model card shows the ft checkpoint improving on the base model's zero-shot numbers. For example, COCO Caption CIDEr is 140.0 for base-ft (after its multi-task fine-tuning) vs. 133.0 for base (zero-shot), and COCO detection mAP is 41.4 (base-ft) vs. 34.7 (base, zero-shot).
  • Under the hood: both are the same seq-to-seq VLM; the ft checkpoint is just an extra stage of supervised multi-task tuning layered on top of the FLD-5B pretraining described in the technical report. The inference interface is identical for both; see the sketch below.
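
Because the interface is shared, an object-detection call looks the same for either checkpoint. Here is a sketch adapted from the model card's `<OD>` example (the car image URL is the card's sample; substitute your own):

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base-ft"  # or "microsoft/Florence-2-base"

model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

# Sample image from the model card; use your own in practice.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# "<OD>" is Florence-2's object-detection task prompt.
inputs = processor(text="<OD>", images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Converts the generated location tokens back into pixel-space boxes and labels.
parsed = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed)  # {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}}
```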

Which to pick?

  • If you just want plug-and-play captions/OD/OCR, use Florence-2-base-ft. That’s what it’s for.
  • If you plan to fine-tune on your own data, starting from base is reasonable; some community users also report cases where base behaves better for niche prompts, but that's anecdotal. A minimal fine-tuning sketch follows below.
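
For the fine-tuning route, the common community pattern (e.g. Hugging Face's Florence-2 fine-tuning blog post) is to train the seq-to-seq loss directly: feed the `<OD>` prompt plus the image, and supervise against a target string that encodes each box as Florence-2's quantized `<loc_###>` tokens. A heavily simplified sketch; `my_dataset` is hypothetical, and the exact target format and hyperparameters are assumptions you'll need to adapt:

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "microsoft/Florence-2-base"  # starting point; "-ft" also works
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)

# Hypothetical dataset: (PIL image, target string) pairs, where the target
# encodes each box in Florence-2's format, e.g.
# "cat<loc_102><loc_240><loc_580><loc_710>" with coordinates binned to 0-999.
my_dataset = []  # replace with your own data

def collate(batch):
    images, targets = zip(*batch)
    inputs = processor(text=["<OD>"] * len(images), images=list(images),
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(list(targets), return_tensors="pt",
                                 padding=True,
                                 return_token_type_ids=False).input_ids
    return inputs, labels

loader = DataLoader(my_dataset, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for epoch in range(3):
    for inputs, labels in loader:
        outputs = model(input_ids=inputs["input_ids"].to(device),
                        pixel_values=inputs["pixel_values"].to(device),
                        labels=labels.to(device))
        outputs.loss.backward()  # standard seq-to-seq cross-entropy
        optimizer.step()
        optimizer.zero_grad()
```

If a plain CNN detector still wins easily after something like this, one thing worth ruling out is the target format: the boxes need to be encoded as the quantized `<loc_###>` tokens the model was pretrained with, not raw pixel coordinates written out as text.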

That's the whole difference: pre-trained vs. the same model after multi-task fine-tuning. Same size, same prompts, better out-of-the-box task performance on common datasets with the ft checkpoint.

If the MS team has any comments, feel free to pitch in.
