Gemma-3-1B-IT BitsAndBytesConfig NF4 Quantized

This model is a quantized version of google/gemma-3-1b-it-qat-int4-unquantized using BitsAndBytesConfig with NF4 quantization.

Model Details

  • Base Model: google/gemma-3-1b-it-qat-int4-unquantized
  • Quantization: BitsAndBytesConfig NF4 (4-bit)
  • Quantization Type: NF4 with double quantization
  • Compute Dtype: bfloat16
  • Storage Dtype: uint8

Quantization Configuration

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.uint8
)
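
For reference, a checkpoint like this one can be produced by loading the base model with the config above and saving the quantized weights (recent transformers and bitsandbytes releases support serializing 4-bit weights). This is a minimal sketch; the output directory name is only a placeholder:

# Sketch: quantize the base model with bnb_config (defined above) and save the result.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "google/gemma-3-1b-it-qat-int4-unquantized"

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# The output directory is an example path, not the actual one used for this repo
model.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")
tokenizer.save_pretrained("gemma-3-1b-it-qat-int4-bnb-nf4")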

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the quantized model (the BitsAndBytes quantization config stored in the checkpoint is applied automatically)
model = AutoModelForCausalLM.from_pretrained(
    "WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4")

# Generate text (move inputs to the same device as the model)
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
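
Since the base model is instruction-tuned, prompts are normally wrapped in the Gemma chat format. A minimal sketch using the tokenizer's built-in chat template (the prompt text is just an example):

# Sketch: chat-style generation via the tokenizer's chat template
messages = [
    {"role": "user", "content": "Explain NF4 quantization in one sentence."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=100)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))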

Benefits

  • Reduced Memory Usage: weight storage shrinks to roughly a quarter of the BF16 footprint (the exact figure can be checked as shown below)
  • Lower Memory Pressure: smaller weights reduce memory traffic during generation, which helps on memory-bound hardware
  • Maintained Quality: NF4 with double quantization typically preserves output quality well for a 4-bit format
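
A quick way to confirm the memory savings on the loaded model, using the standard transformers helper (exact values vary by environment):

# Sketch: report the in-memory size of the quantized model
footprint_bytes = model.get_memory_footprint()
print(f"Model memory footprint: {footprint_bytes / 1024**2:.0f} MiB")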

Hardware Requirements

  • GPU Memory: roughly 1-2 GB of VRAM in practice; the BF16 weights alone are about 2 GB, while the NF4 weights are well under 1 GB (a quick capacity check is sketched below)
  • CUDA Compatible: bitsandbytes 4-bit kernels primarily target CUDA-capable GPUs
  • CPU Fallback: limited; 4-bit bitsandbytes execution on CPU is only available in newer multi-backend builds and is considerably slower
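
A minimal check of available GPU memory before loading (device index 0 is assumed):

# Sketch: verify a CUDA GPU is present and report its total memory
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total VRAM")
else:
    print("No CUDA GPU detected; bitsandbytes 4-bit inference may not be available.")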

Quantization Details

This model uses BitsAndBytesConfig for 4-bit quantization; the sketch after the list shows how to verify this on the loaded model:

  • NF4 (Normal Float 4) quantization for optimal quality/size trade-off
  • Double quantization for additional compression
  • Mixed precision with bfloat16 compute dtype
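
To confirm how the checkpoint is quantized after loading, the linear layers can be inspected; they should appear as bitsandbytes Linear4bit modules holding NF4 weights. A sketch, assuming the model from the Usage section is already loaded (attribute layout may differ slightly across bitsandbytes versions):

# Sketch: list the layers that were replaced by 4-bit bitsandbytes modules
import bitsandbytes as bnb

for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        quant_state = getattr(module.weight, "quant_state", None)
        quant_type = quant_state.quant_type if quant_state is not None else "unknown"
        print(f"{name}: Linear4bit ({quant_type})")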

License

This model inherits its license from the base model, which is released under the Gemma Terms of Use.
