Model Card Feedback

#1
by marksaroufim - opened
pytorch org • edited May 1

There are a lot of frontloaded installation instructions, which will delay people getting their first inference out. Can you separate them out until they're actually needed?
Eval without the baseline is not super meaningful
Curious why we focused only on bs=1; we had marlin and hqq kernels to help with bringing that up a bit
You show speedup numbers but likely people will be interested more in peak VRAM savings
For vLLM, can you push serving instructions up and benchmarking instructions down? The overall flow should be: here's how to play with this model, and then here's how to benchmark it (a minimal inference snippet is sketched below)
A100 is a bit of a strange benchmarking choice; why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100?
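
As a concrete example of the "play with it first" flow, here is a minimal vLLM offline-inference sketch; the model id is a placeholder to be replaced with the actual quantized checkpoint from this card, and the sampling settings are illustrative only:

```python
from vllm import LLM, SamplingParams

# Placeholder id: substitute the quantized checkpoint this model card describes.
llm = LLM(model="your-org/your-quantized-model")

# Generate a short completion just to confirm the model loads and runs.
outputs = llm.generate(
    ["Explain int4 weight-only quantization in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Serving and benchmarking instructions can then follow this quick-start block.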

pytorch org

Eval without the baseline is not super meaningful

Yeah, we are still running the baseline eval.
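
For reference, a minimal sketch of running the same eval on both the baseline and the quantized checkpoint with lm-evaluation-harness (model ids and tasks are placeholders; the actual eval setup may differ):

```python
import lm_eval

# Placeholder ids: the bf16 baseline and the quantized checkpoint under comparison.
for model_id in ["your-org/your-baseline-model", "your-org/your-quantized-model"]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",
        tasks=["mmlu"],
    )
    print(model_id, results["results"])
```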

Curious why we focused only on bs=1; we had marlin and hqq kernels to help with bringing that up a bit

We also tried gemlite int4wo, and it doesn't seem to work well with batch size 128, though batch sizes 4, 8, and 16 could be OK. There were also some issues with compile that need to be resolved right now (Hicham is working on it). In general, the int4wo quant method is not optimized for large batch sizes, so we decided to just release this for now, and we can release an int8 dynamic quantized model for large-batch-size serving.
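
For context, a minimal sketch of how the two torchao configs mentioned here are typically applied (the config names follow recent torchao releases, and the model id is a placeholder; this is not necessarily the exact recipe used for this card):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int8_weight,
)

# Placeholder model id; any HF causal LM can be quantized the same way.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-baseline-model", torch_dtype=torch.bfloat16, device_map="cuda"
)

# int4 weight-only (tinygemm path): tuned for latency-bound, bs=1 decoding.
quantize_(model, int4_weight_only(group_size=128))

# For large-batch serving, int8 dynamic activation + int8 weight is the better fit:
# quantize_(model, int8_dynamic_activation_int8_weight())
```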

You show speedup numbers but likely people will be interested more in peak VRAM savings

Good point, I'll add memory savings results.
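
A minimal sketch of how peak VRAM could be measured for such a table (the generation step in the middle is a placeholder for the actual workload):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... load the model and run generation for the configuration under test ...

peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"peak VRAM: {peak_gib:.2f} GiB")
```

Reporting this for both the bf16 baseline and the quantized model gives the savings number directly.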

A100 is a bit of a strange benchmarking choice; why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100?

Not aware of which consumer GPU we should use; tinygemm is optimized for A100, I think. We do have H100 benchmarks for float8dq as well.

jerryzh168 changed discussion status to closed
