---
license: apache-2.0
metrics:
- accuracy
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
---
# Quark Team FP8 Mixtral-8x7B Model Overview
## Model Information For MLPerf
- **Model Name**: Mixtral-8x7B
- **Version**: MLPerf v5.1
- **Commit**: Closed Division Commit
- **Supported Hardware Microarchitecture**: AMD MI300/MI325
- **ROCm**: 6.4.1
- **Operating System(s)**: Linux
- **Transformers**: 4.46.3
- **Quark:** [0.9](https://quark.docs.amd.com/latest/install.html)
## Calibration Dataset
This model was built from the mistralai Mixtral model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for FP8 quantization.
The calibration dataset consists of **1024 samples** drawn from the mixed dataset provided by [mlcommons/inference](https://github.com/mlcommons/inference/tree/master/language/mixtral-8x7b#get-dataset), which includes (see the inspection sketch after this list):
- **325 GSM8k samples**
- **325 MBXP samples**
- **374 OpenOrca samples**
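
As a quick sanity check, the mix can be inspected directly. This is a minimal sketch assuming the MLPerf calibration pickle deserializes to a pandas DataFrame with a `dataset` column, as in the mlcommons/inference Mixtral harness; adjust the column name if your copy differs.

```python
# Sketch: inspect the calibration mix. Assumes the MLPerf pickle is a
# pandas DataFrame with a "dataset" column (an assumption based on the
# mlcommons/inference Mixtral harness).
import pandas as pd

df = pd.read_pickle(
    "./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
)
print(len(df))                       # expected: 1024 calibration samples
print(df["dataset"].value_counts())  # expected: GSM8K / MBXP / OpenOrca split
```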
## Quantized Tensors
The following tensors are quantized in each decoder:
- **Expert MLP Inputs and Weights** (excluding the router)
- **Linear QKV Inputs and Weights**
- **KV Cache Entries**
## Ignored Layers
The following layers are ignored during quantization; the patterns are shell-style wildcards matched against module names (see the sketch after this list):
- `*.gate`
- `*.o_proj`
- `lm_head`
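
A minimal sketch of how such wildcard patterns select modules, assuming shell-style (`fnmatch`) matching against Hugging Face Mixtral module names; both the matching rule and the module paths shown here are illustrative, not Quark internals:

```python
# Sketch: map shell-style exclude patterns onto Mixtral module names.
# Module paths follow the Hugging Face Mixtral naming convention; the
# fnmatch rule is an assumption about how --exclude_layers patterns apply.
from fnmatch import fnmatch

modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "lm_head",
]
patterns = ["lm_head", "*.gate", "*.o_proj"]

for name in modules:
    excluded = any(fnmatch(name, p) for p in patterns)
    print(f"{name}: {'skip' if excluded else 'quantize'}")
```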
## Algorithms
The AutoSmoothQuant algorithm is applied during weight-activation quantization to improve the accuracy of the quantized model. It builds on the SmoothQuant idea of migrating activation outliers into the weights via a per-channel rescaling, sketched below.
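
For intuition, a minimal sketch of the SmoothQuant-style rescaling that AutoSmoothQuant automates; the fixed `alpha` and the absence of a per-layer search are simplifications here:

```python
# Sketch of the SmoothQuant rescaling that AutoSmoothQuant automates:
# per-input-channel scales move activation outliers into the weights
# while keeping the matmul output unchanged. alpha is illustrative;
# AutoSmoothQuant searches such hyperparameters automatically.
import torch

def smooth(x, w, alpha=0.5):
    # x: (tokens, in_features) activations, w: (in_features, out_features)
    act_max = x.abs().amax(dim=0).clamp(min=1e-5)  # per-channel |X| max
    w_max = w.abs().amax(dim=1).clamp(min=1e-5)    # per-channel |W| max
    s = act_max.pow(alpha) / w_max.pow(1 - alpha)  # migration strength
    return x / s, w * s.unsqueeze(1)               # (X/s)(sW) == XW

x = torch.randn(8, 16)
w = torch.randn(16, 32)
x_s, w_s = smooth(x, w)
assert torch.allclose(x @ w, x_s @ w_s, atol=1e-4)  # output is preserved
```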
## Quantization Scripts
```bash
cd examples/torch/language_modeling/llm_ptq/

MODEL_DIR="mistralai/Mixtral-8x7B-Instruct-v0.1"
# Calibration pickle obtained per the mlcommons/inference dataset instructions.
DATASET="./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
OUTPUT_DIR="amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3"

python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
    --output_dir "${OUTPUT_DIR}" \
    --dataset "${DATASET}" \
    --data_type float16 \
    --multi_gpu \
    --quant_scheme w_fp8_a_fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 1024 \
    --seq_len 1024 \
    --min_kv_scale 1.0 \
    --model_export hf_format \
    --custom_mode fp8 \
    --quant_algo autosmoothquant \
    --exclude_layers "lm_head" "*.gate" "*.o_proj"
```
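
For reference, a minimal sketch of the per-tensor FP8 (e4m3) scale arithmetic behind `w_fp8_a_fp8` and `--kv_cache_dtype fp8`; the interpretation of `--min_kv_scale 1.0` as a lower bound on the KV-cache scale is an assumption:

```python
# Sketch of per-tensor FP8 (e4m3) quantization as used for weights,
# activations, and the KV cache. 448 is the largest finite value of
# torch.float8_e4m3fn; the min-scale clamp mirrors the assumed intent
# of --min_kv_scale 1.0.
import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

def fp8_quantize(t, min_scale=None):
    scale = t.abs().amax() / FP8_MAX
    if min_scale is not None:
        scale = scale.clamp(min=min_scale)
    q = (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

kv = torch.randn(4, 128, dtype=torch.float16)
q, scale = fp8_quantize(kv.float(), min_scale=1.0)
deq = q.float() * scale                 # dequantize for comparison
print((kv.float() - deq).abs().max())   # round-trip quantization error
```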
## Model Performance Comparison
The parenthesized percentages give the FP8 score as a fraction of the baseline target (verified in the sketch after the table).

| Metric               | Baseline Accuracy Target | FP8 Quantized Accuracy (recovery) |
|----------------------|--------------------------|-----------------------------------|
| **GSM8K (Math)**     | 73.66                    | 73.18 (99.34%)                    |
| **Open Orca (Chat)** |                          |                                   |
| - Rouge1             | 45.5989                  | 45.4362 (99.64%)                  |
| - Rouge2             | 23.3526                  | 23.168 (99.21%)                   |
| - RougeL             | 30.4608                  | 30.2922 (99.45%)                  |
| **MBXP (Code)**      | 60.16                    | 60.08 (99.87%)                    |
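
The recovery percentages can be reproduced directly from the table; the 99% bar is the usual MLPerf closed-division accuracy threshold:

```python
# Quick arithmetic check of the recovery percentages in the table:
# each FP8 score divided by its baseline target, against the 99% bar
# used by the MLPerf closed division.
results = {
    "GSM8K":  (73.66, 73.18),
    "Rouge1": (45.5989, 45.4362),
    "Rouge2": (23.3526, 23.168),
    "RougeL": (30.4608, 30.2922),
    "MBXP":   (60.16, 60.08),
}
for name, (baseline, fp8) in results.items():
    recovery = 100 * fp8 / baseline
    print(f"{name}: {recovery:.2f}%  (>= 99%: {recovery >= 99.0})")
```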
## License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.