Update README.md
README.md
CHANGED
@@ -205,24 +205,23 @@ and decode tokens per second will be more important than time to first token.
 |                                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
 |----------------------------------|----------------|---------------------------|
 | latency (batch_size=1)           | 2.46s          | 2.2s (12% speedup)        |
-| latency (batch_size=128)         | 6.55s          | 17s (60% slowdown)        |
 | serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (20% speedup)  |
-| serving (num_prompts=1000)       | 24.15 req/s    | 5.64 req/s (77% slowdown) |
 
 Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
 Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; stay tuned for models optimized for larger batch sizes or longer token lengths.
 
-Need to install vllm nightly to get some recent changes
-```
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
-
 ## Download dataset
 Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
 
 Other datasets can be found at: https://github.com/vllm-project/vllm/tree/main/benchmarks
 ## benchmark_latency
 
+You need to install the vLLM nightly build to pick up some recent changes:
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
+
 Run the following under the `vllm` source code root folder:
 
 ### baseline
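The hunk assumes the benchmark scripts are run from inside a vLLM checkout, so alongside the nightly wheel you also need the repository itself; a minimal sketch:

```
git clone https://github.com/vllm-project/vllm.git
cd vllm
```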
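The `### baseline` command itself falls outside the hunk. A baseline latency run matching the table's batch_size=1 row might look like the sketch below; the flags follow `benchmarks/benchmark_latency.py`, and the Hugging Face model id and token lengths are assumptions, not values taken from the diff:

```
# Baseline (unquantized) latency at batch size 1.
# The model id and token lengths below are assumed, not from the diff.
python benchmarks/benchmark_latency.py \
    --model microsoft/Phi-4-mini-instruct \
    --batch-size 1 \
    --input-len 256 \
    --output-len 256
```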
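The serving rows (num_prompts=1) come from `benchmark_serving`, which replays the ShareGPT file downloaded above against a running server; a sketch under the same assumed model id:

```
# Terminal 1: start an OpenAI-compatible server (model id assumed, not from the diff)
vllm serve microsoft/Phi-4-mini-instruct

# Terminal 2: replay a single ShareGPT prompt against it
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model microsoft/Phi-4-mini-instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```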