---
library_name: gguf
tags:
- llama
- quantized
- gptq
- evopress
model_type: llama
base_model: meta-llama/Llama-3.1-8B-Instruct
---

# Llama-3.1-8B-Instruct GGUF DASLab Quantization

This repository contains quantized versions of Llama 3.1 8B Instruct, produced with **GPTQ quantization** and **GPTQ+EvoPress optimization** from the [DASLab GGUF Toolkit](https://github.com/IST-DASLab/gguf-toolkit).

## Models

- **GPTQ Uniform**: GPTQ quantization with the same bitwidth for every layer, at 2-6 bit precision
- **GPTQ+EvoPress**: Non-uniform per-layer quantization discovered via evolutionary search
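
To make the difference between the two variants concrete, the sketch below computes the parameter-weighted average bitwidth of a per-layer assignment. The layer sizes and bit choices are made-up illustration values, not the configuration of any model in this repository, and `average_bitwidth` is a hypothetical helper, not part of the toolkit.

```python
def average_bitwidth(layer_params, layer_bits):
    """Parameter-weighted average bits per weight for a per-layer assignment.

    layer_params: number of weights in each layer (illustrative values).
    layer_bits:   bitwidth assigned to each layer.
    """
    if len(layer_params) != len(layer_bits):
        raise ValueError("layer_params and layer_bits must have the same length")
    total_params = sum(layer_params)
    if total_params <= 0:
        raise ValueError("total parameter count must be positive")
    total_bits = sum(p * b for p, b in zip(layer_params, layer_bits))
    return total_bits / total_params


# Hypothetical three-layer model: sizes are invented for illustration.
params = [100, 100, 200]

uniform = average_bitwidth(params, [3, 3, 3])      # every layer at 3 bits
non_uniform = average_bitwidth(params, [2, 4, 3])  # bits shifted between layers

print(uniform, non_uniform)  # both average 3.0 bits per weight
```

Both assignments use the same overall budget; a non-uniform configuration only changes where the bits are spent.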

## Performance

Our GPTQ-based quantization methods achieve **superior quality-compression tradeoffs** compared to standard quantization:

- **Better perplexity** at equivalent bitwidths vs. naive quantization approaches
- **Error-correcting updates** during calibration for improved accuracy
- **Optimized configurations** that allocate bits based on layer sensitivity (EvoPress)

## Usage

The models are compatible with llama.cpp and other GGUF-supporting inference engines. No special setup is required.
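
As a minimal sketch of running one of the files with llama.cpp, the helper below assembles a `llama-cli` invocation. It assumes `llama-cli` is installed and on your `PATH`; the file name `model.gguf` is a placeholder for whichever GGUF file you downloaded from this repository, and `build_llama_cli_command` is a hypothetical helper name.

```python
import shlex


def build_llama_cli_command(model_path, prompt, n_predict=128):
    """Build the argument list for a llama.cpp `llama-cli` run.

    model_path: path to a downloaded .gguf file (placeholder in this sketch).
    prompt:     text prompt passed with -p.
    n_predict:  number of tokens to generate, passed with -n.
    """
    return [
        "llama-cli",
        "-m", str(model_path),
        "-p", prompt,
        "-n", str(n_predict),
    ]


# "model.gguf" is a placeholder, not an actual file name from this repository.
cmd = build_llama_cli_command("model.gguf", "Explain quantization briefly.", 64)

# Print a shell-safe version of the command. To execute it from Python,
# pass `cmd` to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```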

**Full documentation, evaluation results, and toolkit source**: https://github.com/IST-DASLab/gguf-toolkit

---