Update README.md
README.md
CHANGED
@@ -205,24 +205,23 @@ and decode tokens per second will be more important than time to first token.
 |                                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
 |----------------------------------|----------------|---------------------------|
 | latency (batch_size=1)           | 2.46s          | 2.2s (12% speedup)        |
-| latency (batch_size=128)         | 6.55s          | 17s (60% slowdown)        |
 | serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (20% speedup)  |
-| serving (num_prompts=1000)       | 24.15 req/s    | 5.64 req/s (77% slowdown) |
 
 Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
 Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; stay tuned for models optimized for larger batch sizes or longer token lengths.
 
-Need to install vllm nightly to get some recent changes
-```
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
-
 ## Download dataset
 Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
 
 Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
 ## benchmark_latency
 
+You need to install the vLLM nightly build to pick up some recent changes:
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
+
 Run the following under the `vllm` source code root folder:
 
 ### baseline
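The hunk assumes the benchmark scripts are run from inside a vLLM checkout, so alongside the nightly wheel you also need the repository itself; a minimal sketch:

```
git clone https://github.com/vllm-project/vllm.git
cd vllm
```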
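The `### baseline` command itself falls outside the hunk. A baseline latency run matching the table's batch_size=1 row might look like the sketch below; the flags follow `benchmarks/benchmark_latency.py`, and the Hugging Face model id and token lengths are assumptions, not values taken from the diff:

```
# Baseline (unquantized) latency at batch size 1.
# The model id and token lengths below are assumed, not from the diff.
python benchmarks/benchmark_latency.py \
    --model microsoft/Phi-4-mini-instruct \
    --batch-size 1 \
    --input-len 256 \
    --output-len 256
```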
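The serving rows (num_prompts=1) come from `benchmark_serving`, which replays the ShareGPT file downloaded above against a running server; a sketch under the same assumed model id:

```
# Terminal 1: start an OpenAI-compatible server (model id assumed, not from the diff)
vllm serve microsoft/Phi-4-mini-instruct

# Terminal 2: replay a single ShareGPT prompt against it
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model microsoft/Phi-4-mini-instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```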