GLM-4.5-Air-REAP-82B-A12B AWQ 4-bit
This is an AWQ 4-bit quantized version of cerebras/GLM-4.5-Air-REAP-82B-A12B.
Model Details
- Base Model: cerebras/GLM-4.5-Air-REAP-82B-A12B
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Precision: 4-bit integer, symmetric
- Group Size: 32
- Architecture: Mixture of Experts (MoE)
- Total Parameters: 82B
- Active Parameters: 12B
- Quantization Library: llm-compressor
4x RTX 5060 Ti (16GB) Quickstart
No reasoning:
```bash
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Reasoning (same command, plus --reasoning-parser glm45):
```bash
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
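Once the container is up, any OpenAI-compatible client can talk to it. A minimal Python sketch, assuming `pip install openai` and the `-p 10.1.1.10:80:80` mapping above; adjust `base_url` to match your own host and port:

```python
# Query the vLLM server started above through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://10.1.1.10:80/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5-Air",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize what AWQ quantization does."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```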
Quantization Configuration
This model was quantized using the following configuration (a minimal recipe sketch follows the list):
- 4-bit integer symmetric quantization
- Group size: 32
- MSE observer
- Duo scaling: enabled
- Custom smoothing mappings for MoE architecture
- Calibration dataset: HuggingFaceH4/ultrachat_200k (256 samples)
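For reference, a stripped-down llm-compressor recipe in this spirit might look like the sketch below. It is illustrative only, not the exact script used: the group size of 32, the MSE observer, duo scaling, and the custom MoE smoothing mappings are set through `AWQModifier` options whose names vary across llm-compressor releases.

```python
# Illustrative AWQ one-shot run with llm-compressor (not the exact script used).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "cerebras/GLM-4.5-Air-REAP-82B-A12B"

recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16",  # 4-bit symmetric weights, 16-bit activations
        # The full ignore list additionally covers norms, shared experts,
        # router gates, and the first layer (see "Ignored Layers" below).
        ignore=["lm_head", "model.embed_tokens"],
    ),
]

oneshot(
    model=MODEL_ID,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir="GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit",
)
```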
Ignored Layers
The following layers were not quantized to preserve model quality:
- Language model head (lm_head)
- Embedding layer (model.embed_tokens)
- Input layer norms
- Post-attention layer norms
- Final norm layer
- Shared experts in the MoE blocks
- First transformer layer (kept at full precision)
- MoE router gates (mlp.gate)
Quantized Layers
The following layers were quantized to 4-bit (a snippet for inspecting this from the published checkpoint follows the list):
- gate_proj (MLP/expert gate projection)
- up_proj (MLP/expert up projection)
- down_proj (MLP/expert down projection)
- q_proj (query projection)
- k_proj (key projection)
- v_proj (value projection)
- o_proj (output projection)
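You can verify the ignore list and quantization scheme directly from the checkpoint's `config.json`, which llm-compressor populates with a `quantization_config` block. A small sketch; the key names follow the compressed-tensors format and may differ slightly between versions:

```python
# Read the quantization_config written into config.json by llm-compressor.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit", "config.json"
)
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("ignore", []))         # modules kept in full precision
print(qcfg.get("config_groups", {}))  # bit width, symmetry, group size per target group
```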
Usage
This model is optimized for inference with vLLM.
Installation
```bash
pip install vllm
```
Basic Usage
```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompt = "Hello, I am an AI assistant. How can I help you today?"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
Advanced Usage with Chat Template
```python
from vllm import LLM, SamplingParams

model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# The model's chat template is applied automatically
outputs = model.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```
Performance
Model Size
- Original (FP16): ~160GB
- Quantized (AWQ 4-bit): ~40GB
- Size Reduction: ~75% (see the arithmetic sketch below)
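These figures follow from simple back-of-envelope arithmetic; the published checkpoint is slightly larger than the raw 4-bit estimate because the ignored layers stay in 16-bit and each weight group stores its own scale:

```python
# Back-of-envelope size estimate: 82B parameters at 16 bits vs. ~4 bits.
total_params = 82e9
fp16_gb = total_params * 2 / 1e9    # 2 bytes per parameter   -> ~164 GB
awq_gb = total_params * 0.5 / 1e9   # 0.5 bytes per parameter -> ~41 GB
print(f"FP16: ~{fp16_gb:.0f} GB, AWQ 4-bit: ~{awq_gb:.0f} GB "
      f"({1 - awq_gb / fp16_gb:.0%} smaller)")
```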
Expected Performance
- Memory Usage: ~50GB VRAM (with overhead)
- Inference Speed: 2-4x faster than FP16
- Quality: Minimal degradation (<2% perplexity increase)
Hardware Requirements
Minimum:
- 1x GPU with 48GB VRAM (e.g., RTX 6000 Ada, A6000, L40S)
Recommended:
- 2x GPUs with 48GB+ VRAM each for optimal throughput (see the tensor-parallel sketch below)
- Or 1x GPU with 80GB+ VRAM (e.g., H100, A100 80GB)
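With the offline `LLM` API, multi-GPU sharding is enabled via `tensor_parallel_size`. A minimal sketch assuming two 48 GB cards; the context length and memory fraction shown here are illustrative, not tuned values:

```python
from vllm import LLM

# Shard the quantized model across two GPUs; adjust to your GPU count.
model = LLM(
    "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit",
    tensor_parallel_size=2,
    max_model_len=32768,           # keep the KV cache within remaining VRAM
    gpu_memory_utilization=0.92,
)
```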
Quantization Details
This model was quantized on 2x RTX 6000 Blackwell GPUs (96GB each) using:
- Calibration Samples: 256
- Max Sequence Length: 2048
- Calibration Time: ~30-60 minutes
- Observer: MSE (Mean Squared Error)
- Duo Scaling: Enabled for better accuracy
Limitations
- Quantization Loss: While AWQ preserves model quality well, there may be minor degradation in edge cases
- vLLM Required: This model requires vLLM for optimal inference
- CUDA Required: GPU inference only (no CPU support)
- MoE Specifics: Some MoE routing behaviors may differ slightly from the original model
Use Cases
This quantized model is suitable for:
- Production deployments with limited VRAM
- Research and experimentation
- Fine-tuning (with LoRA/QLoRA)
- Batch inference workloads
- Edge deployment scenarios
Citation
If you use this model, please cite the original GLM-4.5 paper and the llm-compressor library:
```bibtex
@software{llm-compressor,
  title  = {LLM Compressor},
  author = {vLLM Team},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}
```
License
This model inherits the Apache 2.0 license from the base model cerebras/GLM-4.5-Air-REAP-82B-A12B.
Acknowledgments
- Original Model: Cerebras AI
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Library: llm-compressor by vLLM team
- Hardware: 2x RTX 6000 Blackwell GPUs
Contact
For issues or questions about this quantized model, please open an issue on the model repository.
For questions about the original model, see cerebras/GLM-4.5-Air-REAP-82B-A12B.