---
license: llama3.1
---
## Introduction
This is a vLLM-compatible fp8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
For details of the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
## Quickstart
To run this fp8 model with the vLLM framework, prepare the model as described below and then launch one of the run scripts.
### Model Preparation
1. Build the ROCm vLLM Docker image using this [dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container.
```sh
docker build -f Dockerfile.rocm -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
```
2. Clone the baseline [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-70B-Instruct-fp8-quark-vllm) and, inside its folder, run the following command to merge the split `llama-*.safetensors` files into a single `llama.safetensors`.
```sh
python merge.py
```
4. Once the merged `llama.safetensors` is created, copy it and `llama.json` into the snapshot directory of [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct) with the commands below. The snapshot commit hash (`1d54af340dc8906a2d21146191a9c184c35e47bd` here) may be different on your system.
```sh
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-70B-Instruct/snapshots/1d54af340dc8906a2d21146191a9c184c35e47bd/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-70B-Instruct/snapshots/1d54af340dc8906a2d21146191a9c184c35e47bd/.
```
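Because the snapshot hash can differ between downloads, the copy step can also be scripted so that the snapshot directory is looked up rather than hard-coded. The following is a minimal sketch, assuming the `models--…/snapshots/<hash>/` layout used in the commands above; the function name `copy_into_snapshot` is hypothetical and is not part of this repository.

```python
import shutil
from pathlib import Path


def copy_into_snapshot(src_dir, model_cache_dir,
                       filenames=("llama.json", "llama.safetensors")):
    """Copy the fp8 files into the single snapshot folder under model_cache_dir.

    Assumes the layout <model_cache_dir>/snapshots/<commit-hash>/.
    Raises if zero or several snapshots exist, so the target is never guessed.
    """
    snapshots_root = Path(model_cache_dir) / "snapshots"
    snapshots = sorted(p for p in snapshots_root.iterdir() if p.is_dir())
    if len(snapshots) != 1:
        raise RuntimeError(
            f"expected exactly one snapshot, found {len(snapshots)}"
        )
    target = snapshots[0]
    for name in filenames:
        # copy2 keeps file metadata, similar to a plain `cp -p`
        shutil.copy2(Path(src_dir) / name, target / name)
    return target
```

It could be called from the fp8 model folder as `copy_into_snapshot(".", Path.home() / "models--meta-llama--Meta-Llama-3.1-70B-Instruct")`. Failing when more than one snapshot is present is a deliberate choice here: it avoids silently copying the weights next to the wrong revision.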
### Running fp8 model
```sh
# single GPU
python run_vllm_fp8.py
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
```
```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"
model_name = "models--meta-llama--Meta-Llama-3.1-70B-Instruct/snapshots/1d54af340dc8906a2d21146191a9c184c35e47bd/"

tp = 1  # single GPU
# tp = 8  # uncomment for 8 GPUs

model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="float16",
    quantization="fp8",
    quantized_weights_path="/llama.safetensors",
)
sampling_params = SamplingParams(
    top_k=1,  # greedy decoding (top_k takes an integer)
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
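`print(result)` prints the full request objects, which include prompt and token metadata. To see only the generated text, the completions can be pulled out of the result list. This is a small sketch, assuming the usual shape of vLLM's return value (a list of request outputs, each with an `outputs` list whose entries carry a `text` attribute); the helper name `generated_texts` is hypothetical.

```python
def generated_texts(results):
    """Return the text of the first completion for each request output.

    Relies only on the `.outputs[0].text` attributes, so it works on any
    object that has the same shape as a vLLM request output.
    """
    return [r.outputs[0].text for r in results]
```

For example, `for text in generated_texts(result): print(text)` could replace the final `print(result)` in the script.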
### Running fp16 model (For comparison)
```sh
# single GPU
python run_vllm_fp16.py
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
```
```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"
model_name = "models--meta-llama--Meta-Llama-3.1-70B-Instruct/snapshots/1d54af340dc8906a2d21146191a9c184c35e47bd/"

tp = 1  # single GPU
# tp = 8  # uncomment for 8 GPUs

model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="bfloat16",
)
sampling_params = SamplingParams(
    top_k=1,  # greedy decoding (top_k takes an integer)
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
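Both scripts use the same prompt, `top_k` set to 1, and a fixed `max_tokens=200` with `ignore_eos=True`, so the two generations can be compared token by token. The sketch below counts how many leading tokens agree. It assumes the token ids of each run were collected separately (for instance from `result[0].outputs[0].token_ids`); the function name `matching_prefix_len` is hypothetical. Some divergence between the quantized and unquantized model is to be expected, so this is only a rough sanity check, not an accuracy metric.

```python
def matching_prefix_len(tokens_a, tokens_b):
    """Count the leading positions at which two token-id sequences agree."""
    count = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        count += 1
    return count
```

A result equal to `max_tokens` means the two runs produced identical output for this prompt; a small value points to an early divergence that may be worth inspecting.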
## fp8 gemm_tuning
Will update soon.
#### License
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.