Qwen
/

Text Generation
Transformers
Safetensors
qwen3_next
conversational
jklj077 commited on
Commit
e502dd4
·
verified ·
1 Parent(s): 61c2c4f

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +20 -19
README.md CHANGED
@@ -29,7 +29,7 @@ We are seeing strong performance in terms of both parameter efficiency and infer
29
 
30
  ![Qwen3-Next-80B-A3B-Thinking Benchmark Comparison](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-Next/Qwen3-Next-80B-A3B-Thinking.001.jpeg)
31
 
32
- For more details, please refer to our blog post [Qwen3-Next](https://qwenlm.github.io/blog/qwen3_next/).
33
 
34
  ## Model Overview
35
 
@@ -48,9 +48,9 @@ For more details, please refer to our blog post [Qwen3-Next](https://qwenlm.gith
48
  - Training Stage: Pretraining (15T tokens) & Post-training
49
  - Number of Parameters: 80B in total and 3B activated
50
  - Number of Paramaters (Non-Embedding): 79B
51
- - Number of Layers: 48
52
  - Hidden Dimension: 2048
53
- - Hybrid Layout: 12 \* (3 \* (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
 
54
  - Gated Attention:
55
  - Number of Attention Heads: 16 for Q and 2 for KV
56
  - Head Dimension: 256
@@ -173,7 +173,7 @@ print("content:", content)
173
 
174
  > [!Tip]
175
  > Depending on the inference settings, you may observe better efficiency with [`flash-linear-attention`](https://github.com/fla-org/flash-linear-attention#installation) and [`causal-conv1d`](https://github.com/Dao-AILab/causal-conv1d).
176
- > See the above links for detailed instructions and requirements.
177
 
178
  ## Deployment
179
 
@@ -184,57 +184,58 @@ For deployment, you can use the latest `sglang` or `vllm` to create an OpenAI-co
184
  [SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
185
  SGLang could be used to launch a server with OpenAI-compatible API service.
186
 
187
- SGLang has supported Qwen3-Next in its `main` branch, which can be installed from source:
188
  ```shell
189
- pip install 'sglang[all] @ git+https://github.com/sgl-project/sglang.git@main#subdirectory=python'
190
  ```
 
191
 
192
  The following command can be used to create an API endpoint at `http://localhost:30000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.
193
  ```shell
194
- SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking --port 30000 --tp-size 4 --context-length 262144 --reasoning-parser deepseek-r1 --mem-fraction-static 0.8
195
  ```
196
 
197
  The following command is recommended for MTP with the rest settings the same as above:
198
  ```shell
199
- SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking --port 30000 --tp-size 4 --context-length 262144 --reasoning-parser deepseek-r1 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
200
  ```
201
 
202
- > [!Note]
203
- > The environment variable `SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1` is required at the moment.
204
-
205
  > [!Note]
206
  > The default context length is 256K.
207
  > If you encounter out-of-memory (OOM) issues, you may consider reducing the context length to a smaller value.
208
  > However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072.
209
 
 
 
210
  ### vLLM
211
 
212
  [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
213
  vLLM could be used to launch a server with OpenAI-compatible API service.
214
 
215
- vLLM has supported Qwen3-Next in its `main` branch, which can be installed from source:
216
  ```shell
217
- pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
218
  ```
 
219
 
220
  The following command can be used to create an API endpoint at `http://localhost:8000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.
221
  ```shell
222
- VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser deepseek_r1
223
  ```
224
 
225
  The following command is recommended for MTP with the rest settings the same as above:
226
  ```shell
227
- VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
228
  ```
229
 
230
- > [!Note]
231
- > The environment variable `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` is required at the moment.
232
-
233
  > [!Note]
234
  > The default context length is 256K.
235
  > If you encounter out-of-memory (OOM) issues, you may consider reducing the context length to a smaller value.
236
  > However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.
237
 
 
 
 
238
  ## Agentic Use
239
 
240
  Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
@@ -252,7 +253,7 @@ llm_cfg = {
252
 
253
  # Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
254
  # functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
255
- # `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --served-model-name Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144`.
256
  #
257
  # llm_cfg = {
258
  # 'model': 'Qwen3-Next-80B-A3B-Thinking',
 
29
 
30
  ![Qwen3-Next-80B-A3B-Thinking Benchmark Comparison](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-Next/Qwen3-Next-80B-A3B-Thinking.001.jpeg)
31
 
32
+ For more details, please refer to our blog post [Qwen3-Next](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list).
33
 
34
  ## Model Overview
35
 
 
48
  - Training Stage: Pretraining (15T tokens) & Post-training
49
  - Number of Parameters: 80B in total and 3B activated
50
  - Number of Paramaters (Non-Embedding): 79B
 
51
  - Hidden Dimension: 2048
52
+ - Number of Layers: 48
53
+ - Hybrid Layout: 12 \* (3 \* (Gated DeltaNet -> MoE) -> 1 \* (Gated Attention -> MoE))
54
  - Gated Attention:
55
  - Number of Attention Heads: 16 for Q and 2 for KV
56
  - Head Dimension: 256
 
173
 
174
  > [!Tip]
175
  > Depending on the inference settings, you may observe better efficiency with [`flash-linear-attention`](https://github.com/fla-org/flash-linear-attention#installation) and [`causal-conv1d`](https://github.com/Dao-AILab/causal-conv1d).
176
+ > See the links for detailed instructions and requirements.
177
 
178
  ## Deployment
179
 
 
184
  [SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for large language models and vision language models.
185
  SGLang could be used to launch a server with OpenAI-compatible API service.
186
 
187
+ `sglang>=0.5.2` is required for Qwen3-Next, which can be installed using:
188
  ```shell
189
+ pip install 'sglang[all]>=0.5.2'
190
  ```
191
+ See [its documentation](https://docs.sglang.ai/get_started/install.html) for more details.
192
 
193
  The following command can be used to create an API endpoint at `http://localhost:30000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.
194
  ```shell
195
+ python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking --port 30000 --tp-size 4 --context-length 262144 --reasoning-parser deepseek-r1 --mem-fraction-static 0.8
196
  ```
197
 
198
  The following command is recommended for MTP with the rest settings the same as above:
199
  ```shell
200
+ python -m sglang.launch_server --model-path Qwen/Qwen3-Next-80B-A3B-Thinking --port 30000 --tp-size 4 --context-length 262144 --reasoning-parser deepseek-r1 --mem-fraction-static 0.8 --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
201
  ```
202
 
 
 
 
203
  > [!Note]
204
  > The default context length is 256K.
205
  > If you encounter out-of-memory (OOM) issues, you may consider reducing the context length to a smaller value.
206
  > However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072.
207
 
208
+ Please also refer to SGLang's usage guide on [Qwen3-Next](https://docs.sglang.ai/basic_usage/qwen3.html).
209
+
210
  ### vLLM
211
 
212
  [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
213
  vLLM could be used to launch a server with OpenAI-compatible API service.
214
 
215
+ `vllm>=0.10.2` is required for Qwen3-Next, which can be installed using:
216
  ```shell
217
+ pip install 'vllm>=0.10.2'
218
  ```
219
+ See [its documentation](https://docs.vllm.ai/en/stable/getting_started/installation/index.html) for more details.
220
 
221
  The following command can be used to create an API endpoint at `http://localhost:8000/v1` with maximum context length 256K tokens using tensor parallel on 4 GPUs.
222
  ```shell
223
+ vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser deepseek_r1
224
  ```
225
 
226
  The following command is recommended for MTP with the rest settings the same as above:
227
  ```shell
228
+ vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144 --reasoning-parser deepseek_r1 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
229
  ```
230
 
 
 
 
231
  > [!Note]
232
  > The default context length is 256K.
233
  > If you encounter out-of-memory (OOM) issues, you may consider reducing the context length to a smaller value.
234
  > However, since the model may require longer token sequences for reasoning, we strongly recommend using a context length greater than 131,072 when possible.
235
 
236
+ Please also refer to vLLM's usage guide on [Qwen3-Next](https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html).
237
+
238
+
239
  ## Agentic Use
240
 
241
  Qwen3 excels in tool calling capabilities. We recommend using [Qwen-Agent](https://github.com/QwenLM/Qwen-Agent) to make the best use of agentic ability of Qwen3. Qwen-Agent encapsulates tool-calling templates and tool-calling parsers internally, greatly reducing coding complexity.
 
253
 
254
  # Using OpenAI-compatible API endpoint. It is recommended to disable the reasoning and the tool call parsing
255
  # functionality of the deployment frameworks and let Qwen-Agent automate the related operations. For example,
256
+ # `vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking --served-model-name Qwen3-Next-80B-A3B-Thinking --port 8000 --tensor-parallel-size 4 --max-model-len 262144`.
257
  #
258
  # llm_cfg = {
259
  # 'model': 'Qwen3-Next-80B-A3B-Thinking',