DESUCLUB/Llama-3.1-8B-Instruct-quantized.w8a8

This is a custom W8A16 quantized version of meta-llama/Llama-3.1-8B-Instruct. Note that despite the .w8a8 suffix in the repository name, activations are kept at 16-bit precision; only the weights are quantized to 8 bits.

Quantization Details

  • Method: Custom W8A16 (8-bit weights, 16-bit activations)
  • Weight precision: INT8
  • Scale precision: BF16
  • Quantization: Symmetric per-channel (see the sketch after this list)
  • Zero points: None (symmetric)
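
As a rough illustration, symmetric per-channel quantization with BF16 scales can be computed as below. This is a minimal sketch, not the exact code used to produce this checkpoint; the function name quantize_w8a16 and the [out_features, in_features] weight layout are assumptions.

import torch

def quantize_w8a16(weight: torch.Tensor):
    # One scale per output channel; symmetric, so no zero point is stored.
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)  # guard against all-zero channels
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    # Weights are stored as INT8, scales as BF16 (matching this model card).
    return q, scale.to(torch.bfloat16)

Dequantization is simply q.to(torch.bfloat16) * scale, which recovers each channel of the original weight up to rounding error.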

Model Structure

The quantized model contains:

  • .weight: INT8 quantized weights
  • .weight_scale: BF16 scale parameters (trainable)
  • Standard embedding and normalization layers in the original precision (see the illustrative module below)
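
A hypothetical module showing how these tensors could be consumed at inference time. This is a sketch under the assumptions above; the class name W8A16Linear is illustrative and not necessarily what the custom loading code defines.

import torch
import torch.nn as nn
import torch.nn.functional as F

class W8A16Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Matches the stored tensors: INT8 weights, trainable BF16 scales.
        self.register_buffer("weight", torch.zeros(out_features, in_features, dtype=torch.int8))
        self.weight_scale = nn.Parameter(torch.ones(out_features, 1, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W8A16: dequantize weights per output channel, keep activations in BF16.
        w = self.weight.to(torch.bfloat16) * self.weight_scale
        return F.linear(x, w)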

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: loading requires the custom W8A16 quantization code so that the
# INT8 .weight / BF16 .weight_scale tensors are mapped onto the right modules.
model = AutoModelForCausalLM.from_pretrained("DESUCLUB/Llama-3.1-8B-Instruct-quantized.w8a8")
tokenizer = AutoTokenizer.from_pretrained("DESUCLUB/Llama-3.1-8B-Instruct-quantized.w8a8")
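
Once loaded, generation follows the standard transformers API. A short example (the prompt and generation parameters are illustrative):

messages = [{"role": "user", "content": "Summarize W8A16 quantization in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))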