jerryzh168 committed
Commit 6cedb9c · verified · 1 Parent(s): 006f2f9

Update README.md

Files changed (1):
1. README.md (+6 -6)
README.md CHANGED

@@ -17,7 +17,7 @@ base_model:
  pipeline_tag: text-generation
  ---

- [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight-only quantization, using the [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by the PyTorch team. Use it directly or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 67% VRAM reduction and a 12-20% speedup on A100 GPUs.
+ [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight-only quantization, using the [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by the PyTorch team. Use it directly or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 67% VRAM reduction and a 1.12x-1.2x speedup on A100 GPUs.

  # Inference with vLLM
  Install vllm nightly and torchao nightly to get some recent changes:

@@ -281,11 +281,11 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
  Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect this to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.

  ## Results (A100 machine)
- | Benchmark (Latency)     |                |                          |
- |-------------------------|----------------|--------------------------|
- |                         | Phi-4 mini-Ins | phi4-mini-int4wo-hqq     |
- | latency (batch_size=1)  | 2.46s          | 2.2s (12% speedup)       |
- | serving (num_prompts=1) | 0.87 req/s     | 1.05 req/s (20% speedup) |
+ | Benchmark (Latency)     |                |                            |
+ |-------------------------|----------------|----------------------------|
+ |                         | Phi-4 mini-Ins | phi4-mini-int4wo-hqq       |
+ | latency (batch_size=1)  | 2.46s          | 2.2s (1.12x speedup)       |
+ | serving (num_prompts=1) | 0.87 req/s     | 1.05 req/s (1.20x speedup) |

  Note that the latency result (benchmark_latency) is reported in seconds, and the serving result (benchmark_serving) in requests per second.
  Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
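For context on the "int4 weight only quantization, using hqq" wording in the updated description: with torchao, this kind of checkpoint can be produced roughly as sketched below. The group size, dtype, and device choices are illustrative assumptions, not necessarily the exact settings used for this repository.

```python
# Rough sketch (assumed settings): int4 weight-only quantization with HQQ via torchao.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int4_weight_only

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# use_hqq=True selects the HQQ algorithm for choosing quantization parameters,
# which typically recovers more accuracy than plain round-to-nearest int4.
quantize_(model, int4_weight_only(group_size=128, use_hqq=True))
```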
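The "serve using vLLM" part of the description maps onto vLLM's standard APIs; a minimal offline-inference sketch follows. The repository id is a placeholder for wherever the quantized checkpoint is published, and the sampling settings are arbitrary; the same checkpoint can also be exposed over HTTP with `vllm serve <model-id>`.

```python
# Minimal vLLM offline-inference sketch; "<org>/phi4-mini-int4wo-hqq" is a
# placeholder model id, not a confirmed repository name.
from vllm import LLM, SamplingParams

llm = LLM(model="<org>/phi4-mini-int4wo-hqq")
sampling = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["What is int4 weight-only quantization?"], sampling)
print(outputs[0].outputs[0].text)
```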
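The second hunk's header quotes `print(f"Peak Memory Usage: {mem:.02f} GB")` from the card's benchmarking code, which is not otherwise shown in this diff. Below is an assumed reconstruction of the kind of peak-VRAM measurement that line belongs to, using standard PyTorch CUDA memory statistics.

```python
# Assumed reconstruction of a peak-VRAM measurement around generation;
# only the final print line appears verbatim in the diff's hunk header.
import torch

torch.cuda.reset_peak_memory_stats()
# ... run model.generate(...) on the quantized model here ...
mem = torch.cuda.max_memory_reserved() / 1e9  # bytes -> GB
print(f"Peak Memory Usage: {mem:.02f} GB")
```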