GLM-4.5-Air-REAP-82B-A12B AWQ 4-bit
This is an AWQ 4-bit quantized version of cerebras/GLM-4.5-Air-REAP-82B-A12B.
Model Details
- Base Model: cerebras/GLM-4.5-Air-REAP-82B-A12B
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Precision: 4-bit integer, symmetric
- Group Size: 32
- Architecture: Mixture of Experts (MoE)
- Total Parameters: 82B
- Active Parameters: 12B
- Quantization Library: llm-compressor
4x RTX 5060 Ti (16GB) Quickstart
No reasoning:
```bash
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
Reasoning (same command, plus --reasoning-parser glm45):
```bash
sudo docker run --runtime nvidia --gpus all --detach \
  --name "vLLM" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 10.1.1.10:80:80 \
  --ipc=host \
  -e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
  -e "VLLM_TARGET_DEVICE=cuda" \
  -e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
  -e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
  -e "VLLM_USE_FLASHINFER_SAMPLER=1" \
  -e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  vllm/vllm-openai:latest \
  --model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
  --max-model-len 131072 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --swap-space 0 \
  --max-num-seqs 9 \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --host 0.0.0.0 \
  --port 80 \
  --served-model-name "GLM-4.5-Air" \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
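Once the container is up, any OpenAI-compatible client can talk to it. A minimal Python sketch, assuming `pip install openai` and the `-p 10.1.1.10:80:80` mapping above; adjust `base_url` to match your own host and port:

```python
# Query the vLLM server started above through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://10.1.1.10:80/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5-Air",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize what AWQ quantization does."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```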
Quantization Configuration
This model was quantized using the following configuration (a minimal recipe sketch follows the list):
- 4-bit integer symmetric quantization
- Group size: 32
- MSE observer
- Duo scaling: enabled
- Custom smoothing mappings for MoE architecture
- Calibration dataset: HuggingFaceH4/ultrachat_200k (256 samples)
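For reference, a stripped-down llm-compressor recipe in this spirit might look like the sketch below. It is illustrative only, not the exact script used: the group size of 32, the MSE observer, duo scaling, and the custom MoE smoothing mappings are set through `AWQModifier` options whose names vary across llm-compressor releases.

```python
# Illustrative AWQ one-shot run with llm-compressor (not the exact script used).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "cerebras/GLM-4.5-Air-REAP-82B-A12B"

recipe = [
    AWQModifier(
        targets=["Linear"],
        scheme="W4A16",  # 4-bit symmetric weights, 16-bit activations
        # The full ignore list additionally covers norms, shared experts,
        # router gates, and the first layer (see "Ignored Layers" below).
        ignore=["lm_head", "model.embed_tokens"],
    ),
]

oneshot(
    model=MODEL_ID,
    dataset="HuggingFaceH4/ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir="GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit",
)
```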
Ignored Layers
The following layers were not quantized to preserve model quality:
- Language model head (lm_head)
- Embedding layer (model.embed_tokens)
- Input layer norms
- Post-attention layer norms
- Final norm layer
- Shared experts in the MoE blocks
- First transformer layer (kept at full precision)
- MoE router gates (mlp.gate)
Quantized Layers
The following layers were quantized to 4-bit (a snippet for inspecting this from the published checkpoint follows the list):
- gate_proj (MLP/expert gate projection)
- up_proj (MLP/expert up projection)
- down_proj (MLP/expert down projection)
- q_proj (query projection)
- k_proj (key projection)
- v_proj (value projection)
- o_proj (output projection)
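You can verify the ignore list and quantization scheme directly from the checkpoint's `config.json`, which llm-compressor populates with a `quantization_config` block. A small sketch; the key names follow the compressed-tensors format and may differ slightly between versions:

```python
# Read the quantization_config written into config.json by llm-compressor.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit", "config.json"
)
with open(path) as f:
    qcfg = json.load(f)["quantization_config"]

print(qcfg.get("ignore", []))         # modules kept in full precision
print(qcfg.get("config_groups", {}))  # bit width, symmetry, group size per target group
```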
Usage
This model is optimized for inference with vLLM.
Installation
```bash
pip install vllm
```
Basic Usage
```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompt = "Hello, I am an AI assistant. How can I help you today?"
outputs = model.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
Advanced Usage with Chat Template
```python
from vllm import LLM, SamplingParams

model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# The model's chat template is applied automatically
outputs = model.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```
Performance
Model Size
- Original (FP16): ~160GB
- Quantized (AWQ 4-bit): ~40GB
- Size Reduction: ~75% (see the arithmetic sketch below)
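These figures follow from simple back-of-envelope arithmetic; the published checkpoint is slightly larger than the raw 4-bit estimate because the ignored layers stay in 16-bit and each weight group stores its own scale:

```python
# Back-of-envelope size estimate: 82B parameters at 16 bits vs. ~4 bits.
total_params = 82e9
fp16_gb = total_params * 2 / 1e9    # 2 bytes per parameter   -> ~164 GB
awq_gb = total_params * 0.5 / 1e9   # 0.5 bytes per parameter -> ~41 GB
print(f"FP16: ~{fp16_gb:.0f} GB, AWQ 4-bit: ~{awq_gb:.0f} GB "
      f"({1 - awq_gb / fp16_gb:.0%} smaller)")
```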
Expected Performance
- Memory Usage: ~50GB VRAM (with overhead)
- Inference Speed: 2-4x faster than FP16
- Quality: Minimal degradation (<2% perplexity increase)
Hardware Requirements
Minimum:
- 1x GPU with 48GB VRAM (e.g., RTX 6000 Ada, A6000, L40S)
Recommended:
- 2x GPUs with 48GB+ VRAM each for optimal throughput (see the tensor-parallel sketch below)
- Or 1x GPU with 80GB+ VRAM (e.g., H100, A100 80GB)
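With the offline `LLM` API, multi-GPU sharding is enabled via `tensor_parallel_size`. A minimal sketch assuming two 48 GB cards; the context length and memory fraction shown here are illustrative, not tuned values:

```python
from vllm import LLM

# Shard the quantized model across two GPUs; adjust to your GPU count.
model = LLM(
    "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit",
    tensor_parallel_size=2,
    max_model_len=32768,           # keep the KV cache within remaining VRAM
    gpu_memory_utilization=0.92,
)
```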
Quantization Details
This model was quantized on 2x RTX 6000 Blackwell GPUs (96GB each) using:
- Calibration Samples: 256
- Max Sequence Length: 2048
- Calibration Time: ~30-60 minutes
- Observer: MSE (Mean Squared Error)
- Duo Scaling: Enabled for better accuracy
Limitations
- Quantization Loss: While AWQ preserves model quality well, there may be minor degradation in edge cases
- vLLM Required: This model requires vLLM for optimal inference
- CUDA Required: GPU inference only (no CPU support)
- MoE Specifics: Some MoE routing behaviors may differ slightly from the original model
Use Cases
This quantized model is suitable for:
- Production deployments with limited VRAM
- Research and experimentation
- Fine-tuning (with LoRA/QLoRA)
- Batch inference workloads
- Edge deployment scenarios
Citation
If you use this model, please cite the original GLM-4.5 paper and the llm-compressor library:
```bibtex
@software{llm-compressor,
  title  = {LLM Compressor},
  author = {vLLM Team},
  url    = {https://github.com/vllm-project/llm-compressor},
  year   = {2024}
}
```
License
This model inherits the Apache 2.0 license from the base model cerebras/GLM-4.5-Air-REAP-82B-A12B.
Acknowledgments
- Original Model: Cerebras AI
- Quantization Method: AWQ (Activation-aware Weight Quantization)
- Quantization Library: llm-compressor by vLLM team
- Hardware: 2x RTX 6000 Blackwell GPUs
Contact
For issues or questions about this quantized model, please open an issue on the model repository.
For questions about the original model, see cerebras/GLM-4.5-Air-REAP-82B-A12B.