This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the NVFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.5, cutting disk size and GPU memory requirements by approximately 72%.
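As a quick check of that figure, going from 16 to 4.5 bits per parameter corresponds to a reduction of

$$1 - \frac{4.5}{16} \approx 0.72,$$

i.e. roughly 72%.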
MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations:

- `transformers`, with these features: `main` (Documentation).
- `vLLM`, with these features: `FP-Quant` and the `transformers` integration.

This model was evaluated on a subset of the OpenLLM v1 benchmarks and the Platinum bench. Model outputs were generated with the vLLM engine.
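For illustration, here is a minimal generation sketch with the vLLM engine. It assumes a vLLM build with FP-Quant/QuTLASS support that can load this checkpoint directly from its config; no quantization-specific flags are shown, which may not hold for every setup.

```python
from vllm import LLM, SamplingParams

# Load the NVFP4-quantized checkpoint (model ID taken from the results table below).
llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp")

# Greedy decoding for reproducible, benchmark-style outputs.
sampling = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Explain NVFP4 quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```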
OpenLLM v1 results
| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp | 0.6917 | 0.8089 | 0.7850 | 0.7545 | 0.7600 | 96.29 |
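The Recovery column appears to be the quantized model's average score expressed as a percentage of the baseline average:

$$100 \times \frac{0.7600}{0.7893} \approx 96.29.$$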
Platinum bench results
Below we report recoveries on individual tasks as well as the average recovery.
Recovery by Task
| Task | Recovery (%) |
|---|---|
| SingleOp | 100.00 |
| SingleQ | 98.99 |
| MultiArith | 99.41 |
| SVAMP | 97.54 |
| GSM8K | 96.64 |
| MMLU-Math | 92.43 |
| BBH-LogicalDeduction-3Obj | 87.34 |
| BBH-ObjectCounting | 98.80 |
| BBH-Navigate | 92.00 |
| TabFact | 86.92 |
| HotpotQA | 103.18 |
| SQuAD | 101.54 |
| DROP | 103.77 |
| Winograd-WSC | 89.47 |
| Average | 96.29 |
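The reported average appears to be the unweighted mean of the fourteen per-task recoveries listed above:

$$\frac{100.00 + 98.99 + \dots + 89.47}{14} \approx 96.29.$$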