jerryzh168 committed
Commit fe70e1e · verified · 1 Parent(s): 7ecf1b2

Update README.md

Files changed (1)
  1. README.md +6 -7
README.md CHANGED
@@ -205,24 +205,23 @@ and decode tokens per second will be more important than time to first token.
 |----------------------------------|----------------|--------------------------|
 | | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
 | latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
-| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
 | serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
-| serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|
 
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token length, please stay tuned for models optimized for larger batch sizes or longer token length.
 
-Need to install vllm nightly to get some recent changes
-```
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
-
 ## Download dataset
 Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
 
 Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
 ## benchmark_latency
 
+Need to install vllm nightly to get some recent changes
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
+
+
 Run the following under `vllm` source code root folder:
 
 ### baseline
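The hunk ends at the `### baseline` heading, so the baseline command itself is not part of this diff. As a rough sketch of what a batch-size-1 latency run could look like with vLLM's `benchmarks/benchmark_latency.py` script (the model id `microsoft/Phi-4-mini-instruct` and all flag values below are assumptions, not taken from this commit):

```
# Sketch only: flag names follow vLLM's benchmarks/benchmark_latency.py;
# the model id and values are assumed, not taken from this commit.
python benchmarks/benchmark_latency.py \
    --model microsoft/Phi-4-mini-instruct \
    --batch-size 1 \
    --input-len 256 \
    --output-len 256
```

Pointing `--model` at the quantized checkpoint instead (the phi4-mini-int4wo-hqq column in the table above) would produce the matching side of the latency comparison.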
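The serving numbers (`num_prompts=1`) correspond to vLLM's `benchmarks/benchmark_serving.py`, which replays dataset prompts against a running server. A minimal sketch, assuming the ShareGPT file downloaded above and the same placeholder model id; the exact flags behind the README's numbers are not shown in this diff:

```
# Sketch only: start a vLLM server, then replay ShareGPT prompts against it.
vllm serve microsoft/Phi-4-mini-instruct &

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model microsoft/Phi-4-mini-instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```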