
Hard to believe, but 256M seems slower than internvl-1B?

#25
by josefph - opened

As the title says, it's hard to believe that smolvlm-256M-instruct is slower than internvl-1B. Even after inspecting the input embeddings and parameter counts, I still can't figure out why:

internvl-1B >
inp_embed : (1, 547, 896)
trainable params: 17,596,416 || all params: 647,260,288 || trainable%: 2.7186

smolvlm-256M >
inp_embed : (1, 171, 576)
trainable params: 9,768,960 || all params: 172,742,976 || trainable%: 5.6552
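For reference, the counts above can be reproduced with a small helper (a sketch; the real run would load the two VLM checkpoints from the Hub, while here a tiny stand-in module is used so the snippet runs on its own):

```python
import torch
from torch import nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts, in the same style as
    PEFT's print_trainable_parameters() output quoted above."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Tiny stand-in model; a real comparison would load each VLM checkpoint
# with AutoModelForVision2Seq.from_pretrained(...) instead.
m = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
for p in m[0].parameters():
    p.requires_grad = False  # freeze the first layer, as LoRA-style tuning does

trainable, total = count_params(m)
print(f"trainable params: {trainable:,} || all params: {total:,} || "
      f"trainable%: {100 * trainable / total:.4f}")
```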

Has anyone seen a similar issue?
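One caveat worth checking before comparing: parameter count alone doesn't determine latency. The number of image tokens, the generated sequence length, the LM head's vocabulary size, and kernel efficiency on the given hardware all matter, so a fair comparison needs identical generation settings, warmup iterations, and (on GPU) explicit synchronization. A minimal timing helper along those lines (`benchmark` is a hypothetical name; the trivial workload stands in for a `model.generate(**inputs)` call):

```python
import time
import statistics

def benchmark(fn, warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock latency of fn() in ms, after warmup runs.
    For GPU models, call torch.cuda.synchronize() inside fn so the
    timing covers the actual kernel execution, not just the launch."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

# Trivial workload as a placeholder; wrap each model's generate() call
# with the same prompt, image, and max_new_tokens for a fair comparison.
print(f"{benchmark(lambda: sum(range(10_000))):.3f} ms")
```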

[attached screenshot: AutoDriving latency comparison, in ms]
