|
--- |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- mistralai/Mixtral-8x7B-Instruct-v0.1 |
|
--- |
|
# Quark Team FP8 Mixtral-8x7B Model Overview |
|
|
|
## Model Information For MLPerf |
|
- **Model Name**: Mixtral-8x7B
|
- **Version**: MLPerf v5.1 |
|
- **Commit**: Closed Division Commit
|
- **Supported Hardware Microarchitecture**: AMD MI300/MI325 |
|
- **ROCm**: 6.4.1 |
|
- **Operating System(s)**: Linux |
|
- **Transformers**: 4.46.3 |
|
- **Quark**: [0.9](https://quark.docs.amd.com/latest/install.html)
|
|
|
## Calibration Dataset |
|
This model was built from the mistralai Mixtral-8x7B-Instruct-v0.1 model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for FP8 quantization.
|
The calibration dataset consists of **1024 samples** drawn from the mixed calibration set provided by [mlcommons/inference](https://github.com/mlcommons/inference/tree/master/language/mixtral-8x7b#get-dataset); an inspection sketch follows the list. The samples break down as:
|
- **325 GSM8k samples** |
|
- **325 MBXP samples** |
|
- **374 OpenOrca samples**
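
As a quick way to verify these counts, here is a minimal inspection sketch, assuming the downloaded `.pkl` deserializes to a pandas DataFrame (the per-row source-label column name is an assumption; check the actual schema):

```python
# Sketch: inspect the MLPerf calibration pickle and verify the sample counts.
# Assumes the file deserializes to a pandas DataFrame; the column carrying the
# source-dataset label is hypothetical here -- print the schema to find it.
import pandas as pd

df = pd.read_pickle(
    "./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
)
print(len(df))               # expected: 1024 calibration samples
print(df.columns.tolist())   # discover the actual schema
# If a per-row source label exists (e.g. a "dataset" column), the split
# should match the list above:
# print(df["dataset"].value_counts())
```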
|
|
|
## Quantized Tensors |
|
The following tensors are FP8-quantized in each decoder layer (see the per-tensor FP8 sketch after this list):
|
- **Expert MLP Inputs and Weights** (excluding the router) |
|
- **Linear QKV Inputs and Weights**
|
- **KV Cache Entries** |
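
For illustration, here is a minimal per-tensor FP8 (E4M3) quantize/dequantize sketch in PyTorch. The per-tensor scaling scheme shown is an assumption for illustration only and may differ from Quark's actual configuration:

```python
# Minimal per-tensor FP8 (E4M3) quantize/dequantize sketch in PyTorch.
# The scaling scheme here is illustrative; Quark's actual per-tensor or
# per-channel configuration may differ.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_fp8(w.float())
w_hat = dequantize_fp8(q, s)
print((w.float() - w_hat.float()).abs().max())  # per-element quantization error
```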
|
|
|
## Ignored Layers |
|
The following layers are ignored during quantization; the glob patterns match module names, as sketched after this list:
|
- `*.gate` |
|
- `*.o_proj` |
|
- `lm_head` |
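
A minimal sketch of how these glob patterns resolve against Hugging Face Mixtral module names, using Python's `fnmatch`:

```python
# Sketch: matching exclusion patterns against typical Mixtral module names.
from fnmatch import fnmatch

patterns = ["lm_head", "*.gate", "*.o_proj"]
modules = [
    "lm_head",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.block_sparse_moe.gate",          # MoE router
    "model.layers.0.block_sparse_moe.experts.0.w1",  # expert MLP
]

for name in modules:
    excluded = any(fnmatch(name, p) for p in patterns)
    print(f"{name}: {'excluded' if excluded else 'quantized'}")
```

Note that `*.gate` matches the MoE router (`block_sparse_moe.gate`), consistent with the router exclusion noted in the quantized-tensors list above.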
|
|
|
## Algorithms |
|
The AutoSmoothQuant algorithm is applied during weight-activation quantization to improve the accuracy of the quantized model.
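
For intuition, AutoSmoothQuant builds on the SmoothQuant transformation: a linear layer Y = XW is rewritten as Y = (X · diag(s)^-1)(diag(s) · W), which is mathematically equivalent but scales activation outliers down before quantization; the "Auto" variant selects the scales automatically. A minimal NumPy sketch of that equivalence, using the standard SmoothQuant smoothing-factor heuristic for illustration:

```python
# Sketch of the SmoothQuant-style transformation underlying AutoSmoothQuant:
# Y = X @ W == (X / s) @ (s[:, None] * W), with s chosen per input channel
# to migrate quantization difficulty from activations to weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))           # activations (tokens x channels)
X[:, 3] *= 50.0                        # inject an outlier channel
W = rng.normal(size=(16, 32))          # weights (in_channels x out_channels)

alpha = 0.5                            # migration-strength hyperparameter
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

Y_ref = X @ W
Y_smooth = (X / s) @ (s[:, None] * W)
print(np.allclose(Y_ref, Y_smooth))    # True: mathematically equivalent
print(np.abs(X / s).max(axis=0).max()) # outlier channel is tamed
```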
|
|
|
## Quantization Scripts |
|
```bash
# Quantize Mixtral-8x7B to FP8 with AutoSmoothQuant and export in HF format.
cd examples/torch/language_modeling/llm_ptq/

MODEL_DIR="mistralai/Mixtral-8x7B-Instruct-v0.1"
DATASET="./mlperf_data/mixtral_8x7b%2F2024.06.06_mixtral_15k_calibration_v4.pkl"
OUTPUT_DIR="amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3"

python3 quantize_quark.py --model_dir "${MODEL_DIR}" \
                          --output_dir "${OUTPUT_DIR}" \
                          --dataset "${DATASET}" \
                          --data_type float16 \
                          --multi_gpu \
                          --quant_scheme w_fp8_a_fp8 \
                          --kv_cache_dtype fp8 \
                          --num_calib_data 1024 \
                          --seq_len 1024 \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --custom_mode fp8 \
                          --quant_algo autosmoothquant \
                          --exclude_layers "lm_head" "*.gate" "*.o_proj"
```
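
After the export completes, a quick sanity check is to inspect the emitted Hugging Face config. The sketch below assumes the `hf_format` export records its settings under a `quantization_config` key in `config.json`; the exact key layout may vary across Quark versions:

```python
# Sketch: verify the exported checkpoint records the quantization settings.
# Assumes hf_format export writes a "quantization_config" entry into
# config.json; field names may differ across Quark versions.
import json
from pathlib import Path

output_dir = Path("amd/Mixtral-8x7B-Instruct-v0.1_FP8_MLPerf_V3")
config = json.loads((output_dir / "config.json").read_text())
print(json.dumps(config.get("quantization_config", {}), indent=2))
```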
|
|
|
# Model Performance Comparison |
|
|
|
| Metric                | Baseline Accuracy Target | FP8 Quant Accuracy (% of baseline) |
|-----------------------|--------------------------|------------------------------------|
| **GSM8K (Math)**      | 73.66                    | 73.18 (99.34%)                     |
| **Open Orca (Chat)**  |                          |                                    |
| - Rouge1              | 45.5989                  | 45.4362 (99.64%)                   |
| - Rouge2              | 23.3526                  | 23.168 (99.21%)                    |
| - RougeL              | 30.4608                  | 30.2922 (99.45%)                   |
| **MBXP (Code)**       | 60.16                    | 60.08 (99.87%)                     |
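
The recovery percentages in parentheses are simply the quantized score divided by the baseline target; a short sketch to reproduce them (values match the table up to rounding):

```python
# Reproduce the recovery percentages: quantized / baseline * 100.
scores = {
    "GSM8K":  (73.66, 73.18),
    "Rouge1": (45.5989, 45.4362),
    "Rouge2": (23.3526, 23.168),
    "RougeL": (30.4608, 30.2922),
    "MBXP":   (60.16, 60.08),
}
for name, (baseline, fp8) in scores.items():
    print(f"{name}: {100 * fp8 / baseline:.2f}%")
```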
|
|
|
# License |
|
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
|
|
|
|