Update README.md

README.md +6 -6

@@ -17,7 +17,7 @@ base_model:
 pipeline_tag: text-generation
 ---
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 12-20% speedup on A100 GPUs.
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:
@@ -281,11 +281,11 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect this to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
 ## Results (A100 machine)
-| Benchmark (Latency)              |                |                          |
-|----------------------------------|----------------|--------------------------|
-|                                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq     |
-| latency (batch_size=1)           | 2.46s          | 2.2s (12% speedup)       |
-| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (20% speedup) |
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token lengths. Please stay tuned for models optimized for larger batch sizes or longer token lengths.

 pipeline_tag: text-generation
 ---
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 1.12x-1.2x speedup on A100 GPUs.
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:
 Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect this to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
 ## Results (A100 machine)
+| Benchmark (Latency)              |                |                            |
+|----------------------------------|----------------|----------------------------|
+|                                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq       |
+| latency (batch_size=1)           | 2.46s          | 2.2s (1.12x speedup)       |
+| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (1.20x speedup) |
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token lengths. Please stay tuned for models optimized for larger batch sizes or longer token lengths.
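
The change from percentages to multipliers can be checked against the table with a little arithmetic. The sketch below is not part of the README; the helper names (`speedup_from_latency`, `speedup_from_throughput`, `as_percent`) are hypothetical, and the inputs are the rounded figures shown in the results table. For latency, lower is better, so the multiplier is baseline divided by quantized; for serving throughput, higher is better, so it is quantized divided by baseline.

```python
def speedup_from_latency(baseline_s: float, quantized_s: float) -> float:
    """Multiplier for a lower-is-better metric (seconds)."""
    return baseline_s / quantized_s


def speedup_from_throughput(baseline_rps: float, quantized_rps: float) -> float:
    """Multiplier for a higher-is-better metric (requests per second)."""
    return quantized_rps / baseline_rps


def as_percent(multiplier: float) -> float:
    """Convert a multiplier such as 1.12 into a percentage gain such as 12."""
    return (multiplier - 1.0) * 100.0


# Rounded values copied from the results table (A100, batch_size=1 / num_prompts=1).
latency_x = speedup_from_latency(2.46, 2.2)
serving_x = speedup_from_throughput(0.87, 1.05)

print(f"latency: {latency_x:.2f}x ({as_percent(latency_x):.0f}% faster)")
print(f"serving: {serving_x:.2f}x ({as_percent(serving_x):.0f}% faster)")
```

With the rounded table values this gives roughly 1.12x for latency and about 1.21x for serving. The serving figure is slightly above the reported 1.20x; that gap may simply come from rounding of the underlying measurements, which the diff does not show.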