add tp1 deployment #8
by Mingke977 · opened

docs/deploy_guidance.md +16 -3 CHANGED
@@ -7,7 +7,7 @@
 
 ## vLLM Deployment
 
-Here is the example to serve this model on a H200 single node
+Here is an example to serve this model on a H200 single node via vLLM:
 
 1. pull the Docker image.
 ```bash
@@ -15,6 +15,12 @@ docker pull jdopensource/joyai-llm-vllm:v0.13.0-joyai_llm_flash
 ```
 2. launch JoyAI-LLM Flash model with dense MTP.
 ```bash
+# TP1 for memory efficiency
+vllm serve ${MODEL_PATH} --tp 1 --trust-remote-code \
+  --tool-call-parser qwen3_coder --enable-auto-tool-choice \
+  --speculative-config $'{"method": "mtp", "num_speculative_tokens": 3}'
+
+# TP8 for extreme speed and long context
 vllm serve ${MODEL_PATH} --tp 8 --trust-remote-code \
   --tool-call-parser qwen3_coder --enable-auto-tool-choice \
   --speculative-config $'{"method": "mtp", "num_speculative_tokens": 3}'
@@ -24,7 +30,7 @@ vllm serve ${MODEL_PATH} --tp 8 --trust-remote-code \
 
 ## SGLang Deployment
 
-Similarly, here is the example to run
+Similarly, here is an example to run on a H200 single node via SGLang:
 
 1. pull the Docker image.
 ```bash
@@ -33,10 +39,17 @@ docker pull jdopensource/joyai-llm-sglang:v0.5.8-joyai_llm_flash
 2. launch JoyAI-LLM Flash model with dense MTP.
 
 ```bash
-python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 8 --trust-remote-code \
+# TP1 for memory efficiency
+python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 1 --trust-remote-code \
   --tool-call-parser qwen3_coder \
   --speculative-algorithm EAGLE --speculative-draft-model-path ${MTP_MODEL_PATH} \
   --speculative-num-steps 2 --speculative-eagle-topk 2 --speculative-num-draft-tokens 3
+
+# TP8 for extreme speed and long context
+python3 -m sglang.launch_server --model-path ${MODEL_PATH} --tp-size 8 --trust-remote-code \
+  --tool-call-parser qwen3_coder \
+  --speculative-algorithm EAGLE \
+  --speculative-num-steps 2 --speculative-eagle-topk 2 --speculative-num-draft-tokens 3
 ```
 **Key notes:**
 - `--tool-call-parser qwen3_coder`: Required when enabling tool usage.
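As a quick sanity check after either vLLM launch, the server can be probed through its OpenAI-compatible endpoint. This is a reviewer's sketch, not part of the diff: it assumes vLLM's default port 8000 and that the served model name matches `${MODEL_PATH}` (the default when `--served-model-name` is not set).

```bash
# Hypothetical smoke test (not in this PR): assumes the default vLLM port 8000
# and that the served model is registered under the name in ${MODEL_PATH}.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```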
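The same kind of check applies to the SGLang launches, since SGLang also exposes an OpenAI-compatible API, by default on port 30000. Again a sketch with assumed defaults rather than content from the diff:

```bash
# Hypothetical smoke test (not in this PR): assumes SGLang's default port 30000;
# adjust the URL if --port is passed to sglang.launch_server.
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```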
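Since the key notes call out `--tool-call-parser qwen3_coder`, a request carrying a `tools` array is what actually exercises that flag. The `get_weather` schema below is a made-up illustration, not something defined by this PR:

```bash
# Illustrative tool-call request; the get_weather schema is hypothetical.
# The qwen3_coder parser converts the model's tool-call output into the
# structured "tool_calls" field of the response.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${MODEL_PATH}"'",
        "messages": [{"role": "user", "content": "What is the weather in Beijing?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string", "description": "City name"}},
              "required": ["city"]
            }
          }
        }]
      }'
```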