# Llama-3.1-8B-Instruct-MR-GPTQ-nvfp

## Model Overview

This model was obtained by quantizing the weights of meta-llama/Llama-3.1-8B-Instruct to the NVFP4 data type. This optimization reduces the effective number of bits per parameter from 16 to roughly 4.5, cutting disk size and GPU memory requirements by approximately 72%.
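As a quick sanity check on those numbers, the sketch below redoes the arithmetic in Python. The 8.03B parameter count is an assumption based on the Llama-3.1-8B architecture, and only weight storage is counted (activations and KV cache are ignored):

```python
# Back-of-the-envelope weight-memory estimate; illustrative only.
params = 8.03e9      # approx. parameter count of Llama-3.1-8B (assumed)
bf16_bits = 16       # bits per parameter before quantization
nvfp4_bits = 4.5     # effective bits per parameter after NVFP4 (incl. scales)

to_gb = lambda bits: params * bits / 8 / 1e9
print(f"BF16 weights:  {to_gb(bf16_bits):.1f} GB")         # ~16.1 GB
print(f"NVFP4 weights: {to_gb(nvfp4_bits):.1f} GB")        # ~4.5 GB
print(f"Reduction:     {1 - nvfp4_bits / bf16_bits:.1%}")  # ~71.9%
```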

## Usage

MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations (minimal loading sketches follow the list):

- transformers, with these features:
  - Available in main (Documentation).
  - RTN on-the-fly quantization.
  - Pseudo-quantization QAT.
- vLLM, with these features:
  - Available in this PR.
  - Compatible with real-quantization models produced by FP-Quant and by the transformers integration.
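The snippets below are minimal loading sketches, not official examples from this card. The transformers sketch assumes a transformers build that includes the FP-Quant/QuTLASS integration and an NVIDIA GPU supported by the QuTLASS NVFP4 kernels; since the checkpoint is prequantized, its quantization config is picked up automatically:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # QuTLASS NVFP4 kernels require a supported NVIDIA GPU
)

messages = [{"role": "user", "content": "Give a two-sentence summary of NVFP4."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

With a vLLM build that includes the PR linked above, the same checkpoint should load through the standard API (again a sketch under those assumptions, not a verified recipe):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp")
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain MR-GPTQ in one paragraph."], params)
print(outputs[0].outputs[0].text)
```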

## Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on the Platinum bench suite. Model outputs were generated with the vLLM engine.
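The card does not state the exact evaluation command. A plausible sketch using lm-evaluation-harness with its vLLM backend is shown below; the task names, settings, and the use of lm-eval at all are assumptions, not the recipe behind the numbers that follow:

```python
import lm_eval

# Hypothetical invocation; tasks/settings are assumptions, not the exact
# configuration used to produce the tables below.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp,"
        "dtype=bfloat16,gpu_memory_utilization=0.9"
    ),
    tasks=["gsm8k", "hellaswag", "winogrande"],
)
print(results["results"])
```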

### OpenLLM v1 results

| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-nvfp | 0.6917 | 0.8089 | 0.7850 | 0.7545 | 0.7600 | 96.29 |
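Recovery here is the quantized model's average score expressed as a percentage of the baseline average. The sketch below reproduces the 96.29% figure from the per-task scores in the table:

```python
# Values copied from the table above.
baseline  = [0.7276, 0.8506, 0.8001, 0.7790]  # MMLU-CoT, GSM8k, Hellaswag, Winogrande
quantized = [0.6917, 0.8089, 0.7850, 0.7545]

base_avg  = sum(baseline) / len(baseline)     # 0.7893
quant_avg = sum(quantized) / len(quantized)   # 0.7600
print(f"Recovery: {100 * quant_avg / base_avg:.2f}%")  # -> 96.29%
```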

### Platinum bench results

Below we report recoveries on individual tasks as well as the average recovery.

#### Recovery by Task

| Task | Recovery (%) |
|---|---|
| SingleOp | 100.00 |
| SingleQ | 98.99 |
| MultiArith | 99.41 |
| SVAMP | 97.54 |
| GSM8K | 96.64 |
| MMLU-Math | 92.43 |
| BBH-LogicalDeduction-3Obj | 87.34 |
| BBH-ObjectCounting | 98.80 |
| BBH-Navigate | 92.00 |
| TabFact | 86.92 |
| HotpotQA | 103.18 |
| SQuAD | 101.54 |
| DROP | 103.77 |
| Winograd-WSC | 89.47 |
| **Average** | **96.29** |