Gemma-3-1B-IT BitsAndBytesConfig NF4 Quantized

This model is a quantized version of google/gemma-3-1b-it-qat-int4-unquantized using BitsAndBytesConfig with NF4 quantization.

Model Details

  • Base Model: google/gemma-3-1b-it-qat-int4-unquantized
  • Quantization: BitsAndBytesConfig NF4 (4-bit)
  • Quantization Type: NF4 with double quantization
  • Compute Dtype: bfloat16
  • Storage Dtype: uint8

Quantization Configuration

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.uint8
)
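
For reference, a checkpoint like this one can be produced by loading the base model with the config above and saving the quantized weights (recent transformers and bitsandbytes releases support serializing 4-bit weights). This is a minimal sketch; the output directory name is only a placeholder:

# Sketch: quantize the base model with bnb_config (defined above) and save the result.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-3-1b-it-qat-int4-unquantized"

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The output directory is an example path, not the actual one used for this repo
model.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")
tokenizer.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")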

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the quantized model (the BitsAndBytes quantization config stored in the checkpoint is applied automatically)
model = AutoModelForCausalLM.from_pretrained(
    "WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4")

# Generate text (move inputs to the same device as the model)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
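
Since the base model is instruction-tuned, prompts are normally wrapped in the Gemma chat format. A minimal sketch using the tokenizer's built-in chat template (the prompt text is just an example):

# Sketch: chat-style generation via the tokenizer's chat template
messages = [
    {"role": "user", "content": "Explain NF4 quantization in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))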

Benefits

  • Reduced Memory Usage: weight storage shrinks to roughly a quarter of the BF16 footprint (the exact figure can be checked as shown below)
  • Lower Memory Pressure: smaller weights reduce memory traffic during generation, which helps on memory-bound hardware
  • Maintained Quality: NF4 with double quantization typically preserves output quality well for a 4-bit format
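
A quick way to confirm the memory savings on the loaded model, using the standard transformers helper (exact values vary by environment):

# Sketch: report the in-memory size of the quantized model
footprint_bytes = model.get_memory_footprint()
print(f"Model memory footprint: {footprint_bytes / 1024**2:.0f} MiB")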

Hardware Requirements

  • GPU Memory: roughly 1-2 GB of VRAM in practice; the BF16 weights alone are about 2 GB, while the NF4 weights are well under 1 GB (a quick capacity check is sketched below)
  • CUDA Compatible: bitsandbytes 4-bit kernels primarily target CUDA-capable GPUs
  • CPU Fallback: limited; 4-bit bitsandbytes execution on CPU is only available in newer multi-backend builds and is considerably slower
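
A minimal check of available GPU memory before loading (device index 0 is assumed):

# Sketch: verify a CUDA GPU is present and report its total memory
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total VRAM")
else:
    print("No CUDA GPU detected; bitsandbytes 4-bit inference may not be available.")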

Quantization Details

This model uses BitsAndBytesConfig for 4-bit quantization; the sketch after the list shows how to verify this on the loaded model:

  • NF4 (Normal Float 4) quantization for optimal quality/size trade-off
  • Double quantization for additional compression
  • Mixed precision with bfloat16 compute dtype
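
To confirm how the checkpoint is quantized after loading, the linear layers can be inspected; they should appear as bitsandbytes Linear4bit modules holding NF4 weights. A sketch, assuming the model from the Usage section is already loaded (attribute layout may differ slightly across bitsandbytes versions):

# Sketch: list the layers that were replaced by 4-bit bitsandbytes modules
import bitsandbytes as bnb

for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        quant_state = getattr(module.weight, "quant_state", None)
        quant_type = quant_state.quant_type if quant_state is not None else "unknown"
        print(f"{name}: Linear4bit ({quant_type})")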

License

This model inherits its license from the base model, which is released under the Gemma Terms of Use.
