---
license: apache-2.0
metrics:
- accuracy
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
---
# Quark Team FP8 Mixtral-8x7B Model Overview

## Model Information For MLPerf
- **Model Name**: Mixtral-8x7B
- **Version**: MLPerf v5.1
- **Commit**: Closed Division Commit
- **Supported Hardware Microarchitecture**: AMD MI300/MI325
- **ROCm**: 6.4.1
- **Operating System(s)**: Linux
- **Transformers**: 4.46.3
- **Quark**: [0.9](https://quark.docs.amd.com/latest/install.html)

## Calibration Dataset
This model was built from the mistralai Mixtral-8x7B-Instruct-v0.1 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for FP8 quantization.
The calibration dataset consists of **1024 mixed datasets** provided by [mlcommons/inference](https://github.com/mlcommons/inference/tree/master/language/mixtral-8x7b#get-dataset), which includes:
- **325 GSM8k samples**
- **325 MBXP samples**
- **374 OpenOrca samples**

## Quantized Tensors
The following tensors are quantized in each decoder:
- **Expert MLP Inputs and Weights** (excluding the router)
- **Linear QKV Inputs and Weights**
- **KV Cache Entries**
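
Per-tensor FP8 quantization maps a tensor's absolute maximum onto the largest finite E4M3 value (448). The following NumPy sketch of the scale computation and round trip is illustrative only: it is not Quark's kernel, and it skips the final rounding to representable E4M3 values.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format

def fp8_quantize(x: np.ndarray):
    """Per-tensor symmetric quantization into the FP8 E4M3 range."""
    scale = np.abs(x).max() / FP8_E4M3_MAX
    # Scale into the FP8 range and clamp; a real kernel would also
    # round each element to the nearest representable E4M3 value here.
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

def fp8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.array([0.1, -2.5, 7.0, -0.03], dtype=np.float32)
q, scale = fp8_quantize(x)
x_hat = fp8_dequantize(q, scale)
# Round trip is exact here only because E4M3 rounding was skipped.
```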

## Ignored Layers
The following layers are ignored during quantization:
- `*.gate`
- `*.o_proj`
- `lm_head`
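
The entries above are glob-style patterns matched against module names. A minimal sketch of how such exclusion matching can work (Quark's actual matcher may differ), using Python's `fnmatch` and Mixtral-style layer names:

```python
from fnmatch import fnmatchcase

# Exclusion patterns as passed via --exclude_layers below.
EXCLUDE_PATTERNS = ["*.gate", "*.o_proj", "lm_head"]

def is_excluded(layer_name: str) -> bool:
    """Return True if the layer matches any exclusion pattern."""
    return any(fnmatchcase(layer_name, p) for p in EXCLUDE_PATTERNS)

# Example names following the Mixtral module naming scheme.
assert is_excluded("model.layers.0.block_sparse_moe.gate")
assert is_excluded("model.layers.3.self_attn.o_proj")
assert is_excluded("lm_head")
assert not is_excluded("model.layers.0.self_attn.q_proj")
```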

## Algorithms
The AutoSmoothQuant algorithm is applied during weight-activation quantization to improve quantized-model accuracy.
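
SmoothQuant-style methods migrate quantization difficulty from activations to weights with a per-channel scale `s`: the product `X @ W` is unchanged when `X` is divided by `s` and `W` is multiplied by it, but activation outliers shrink. The toy NumPy illustration below fixes the smoothing exponent at 0.5; AutoSmoothQuant additionally searches for the scales automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 2] *= 50.0            # inject an activation outlier channel
W = rng.normal(size=(8, 8))

# Per-input-channel smoothing scale (SmoothQuant-style, alpha = 0.5).
alpha = 0.5
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** alpha

X_smooth = X / s           # activations become easier to quantize
W_smooth = W * s[:, None]  # weights absorb the scale

# The matmul output is mathematically unchanged...
assert np.allclose(X @ W, X_smooth @ W_smooth)
# ...while the activation outlier channel has been tamed.
assert np.abs(X_smooth).max() < np.abs(X).max()
```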

## Quantization Scripts
```sh
cd examples/torch/language_modeling/llm_ptq/
MODEL_DIR="mistralai/Mixtral-8x7B-Instruct-v0.1"
DATASET="./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
OUTPUT_DIR="amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3"

python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
                          --output_dir "${OUTPUT_DIR}" \
                          --dataset "${DATASET}" \
                          --data_type float16 \
                          --multi_gpu \
                          --quant_scheme w_fp8_a_fp8 \
                          --kv_cache_dtype fp8 \
                          --num_calib_data 1024 \
                          --seq_len 1024 \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --custom_mode fp8 \
                          --quant_algo autosmoothquant \
                          --exclude_layers "lm_head" "*.gate" "*.o_proj"
```
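
The `--min_kv_scale 1.0` flag above floors the KV-cache quantization scale. A sketch of the idea; the helper name and exact clamping logic here are illustrative, not Quark's implementation:

```python
def kv_cache_scale(amax: float, fp8_max: float = 448.0,
                   min_scale: float = 1.0) -> float:
    """Per-tensor KV-cache scale, floored at --min_kv_scale.

    Hypothetical helper: the computed scale (amax / fp8_max) is
    clamped from below so very small calibration ranges cannot
    produce a scale under min_scale.
    """
    return max(amax / fp8_max, min_scale)

assert kv_cache_scale(100.0) == 1.0    # small amax -> floor applies
assert kv_cache_scale(896.0) == 2.0    # large amax -> computed scale wins
```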

# Model Performance Comparison

| Metric               | Baseline Accuracy Target (%) | FP8 Quantized Accuracy (%) |
|----------------------|------------------------------|----------------------------|
| **GSM8K (Math)**     | 73.66                        | 73.18 (99.34%)             |
| **Open Orca (Chat)** |                              |                            |
| - Rouge1             | 45.5989                      | 45.4362 (99.64%)           |
| - Rouge2             | 23.3526                      | 23.168 (99.21%)            |
| - RougeL             | 30.4608                      | 30.2922 (99.45%)           |
| **MBXP (Code)**      | 60.16                        | 60.08 (99.87%)             |
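
The percentages in parentheses express the quantized score relative to the baseline target, e.g. 73.18 / 73.66 ≈ 99.35% for GSM8K. A quick check (scores copied from the table above):

```python
# (baseline target, FP8 quantized) scores from the table above.
scores = {
    "GSM8K":  (73.66, 73.18),
    "Rouge1": (45.5989, 45.4362),
    "Rouge2": (23.3526, 23.168),
    "RougeL": (30.4608, 30.2922),
    "MBXP":   (60.16, 60.08),
}

for name, (baseline, fp8) in scores.items():
    print(f"{name}: {100.0 * fp8 / baseline:.2f}% of baseline")

# MLPerf's closed division generally requires ~99% of the
# reference accuracy; every metric here clears that bar.
assert all(100.0 * q / b >= 99.0 for b, q in scores.values())
```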

# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.