GLM-4.5-Air-REAP-82B-A12B AWQ 4-bit

This is an AWQ 4-bit quantized version of cerebras/GLM-4.5-Air-REAP-82B-A12B.

Model Details

  • Base Model: cerebras/GLM-4.5-Air-REAP-82B-A12B
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Quantization Precision: 4-bit integer symmetric
  • Group Size: 32
  • Architecture: Mixture of Experts (MoE)
  • Total Parameters: 82B
  • Active Parameters: 12B
  • Quantization Library: llm-compressor

4× RTX 5060 Ti (16 GB) Quickstart

No Reasoning

sudo docker run --runtime nvidia --gpus all --detach \
--name "vLLM" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 10.1.1.10:80:80 \
--ipc=host \
-e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
-e "VLLM_TARGET_DEVICE=cuda" \
-e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
-e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
-e "VLLM_USE_FLASHINFER_SAMPLER=1" \
-e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
vllm/vllm-openai:latest \
--model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
--max-model-len 131072 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--swap-space 0 \
--max-num-seqs 9 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--host 0.0.0.0 \
--port 80 \
--served-model-name "GLM-4.5-Air" \
--enable-prefix-caching \
--enable-chunked-prefill
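
Once the container is running, it exposes an OpenAI-compatible API. The example below is a minimal client sketch using the official openai Python package; it assumes the server is reachable at the host address from the -p mapping above (10.1.1.10, port 80), so adjust the base URL to your network.

from openai import OpenAI

# Point the client at the vLLM server started above. The server was launched
# without --api-key, so any placeholder value works here.
client = OpenAI(base_url="http://10.1.1.10/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5-Air",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Summarize the benefits of 4-bit AWQ quantization."},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)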

Reasoning

sudo docker run --runtime nvidia --gpus all --detach \
--name "vLLM" \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 10.1.1.10:80:80 \
--ipc=host \
-e "CUDA_VISIBLE_DEVICES=0,1,2,3" \
-e "VLLM_TARGET_DEVICE=cuda" \
-e "CUDA_DEVICE_ORDER=PCI_BUS_ID" \
-e "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
-e "VLLM_USE_FLASHINFER_SAMPLER=1" \
-e "VLLM_ATTENTION_BACKEND=FLASHINFER" \
vllm/vllm-openai:latest \
--model "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit" \
--max-model-len 131072 \
--dtype bfloat16 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.92 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--swap-space 0 \
--max-num-seqs 9 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--host 0.0.0.0 \
--port 80 \
--served-model-name "GLM-4.5-Air" \
--enable-prefix-caching \
--enable-chunked-prefill
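
With --reasoning-parser glm45 enabled, vLLM returns the model's thinking separately from the final answer. Below is a sketch of reading both fields with the same client as above; the reasoning_content attribute follows vLLM's reasoning-output convention, and its availability depends on your vLLM version.

from openai import OpenAI

client = OpenAI(base_url="http://10.1.1.10/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5-Air",
    messages=[{"role": "user", "content": "What is 17 * 24? Think it through."}],
    max_tokens=512,
)

message = response.choices[0].message
# With the reasoning parser active, the chain of thought comes back in a separate field.
print("Reasoning:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)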

Quantization Configuration

This model was quantized using the following configuration (an approximate llm-compressor sketch of this recipe appears after the layer lists below):

  • 4-bit integer symmetric quantization
  • Group size: 32
  • MSE observer
  • Duo scaling: enabled
  • Custom smoothing mappings for the MoE architecture
  • Calibration dataset: HuggingFaceH4/ultrachat_200k (256 samples)

Ignored Layers

The following layers were not quantized to preserve model quality:

  • Language model head (lm_head)
  • Embedding layer (model.embed_tokens)
  • Input layer norms
  • Post-attention layer norms
  • Final norm layer
  • Shared experts in MoE
  • First layer (kept in full precision)
  • MLP gates

Quantized Layers

The following layers were quantized to 4-bit:

  • gate_proj
  • up_proj
  • down_proj
  • k_proj (key projection)
  • q_proj (query projection)
  • v_proj (value projection)
  • o_proj (output projection)
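
For reference, the recipe described above corresponds roughly to an llm-compressor one-shot run like the sketch below. This is an approximation, not the exact script used: the AWQModifier arguments follow the library's published AWQ examples, the ignore patterns are assumptions about GLM-4.5's module names, and the group-size-32 / MSE-observer / duo-scaling settings are only noted in comments because their exact argument names vary between llm-compressor versions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "cerebras/GLM-4.5-Air-REAP-82B-A12B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 256 calibration samples from ultrachat_200k (chat-template preprocessing and
# tokenization omitted here for brevity).
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:256]")

recipe = [
    AWQModifier(
        targets=["Linear"],  # attention projections and expert gate/up/down projections
        scheme="W4A16",      # 4-bit weights, 16-bit activations; the actual run used
                             # group size 32, an MSE observer, and duo scaling
        ignore=[
            "lm_head",
            "re:.*embed_tokens.*",
            "re:.*shared_experts.*",      # shared experts kept in full precision
            "re:.*mlp\\.gate$",           # MoE router gates
            "re:model\\.layers\\.0\\..*", # first layer kept in full precision
        ],
    ),
]

oneshot(
    model=model,
    dataset=dataset,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained("GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit", save_compressed=True)
tokenizer.save_pretrained("GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")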

Usage

This model is optimized for inference with vLLM.

Installation

pip install vllm

Basic Usage

from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate text
prompt = "Hello, I am an AI assistant. How can I help you today?"
outputs = model.generate([prompt], sampling_params)

print(outputs[0].outputs[0].text)

Advanced Usage with Chat Template

from vllm import LLM, SamplingParams

model = LLM("MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]

# The model will automatically apply the chat template
outputs = model.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
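
On a multi-GPU host such as the 4× 16 GB quickstart setup, the offline LLM class accepts engine arguments analogous to the server flags. The sketch below assumes the keyword names mirror vLLM's CLI options; they may differ between versions.

from vllm import LLM, SamplingParams

# Keyword arguments mirror the server flags from the quickstart
# (tensor parallelism, expert parallelism, context length, FP8 KV cache).
model = LLM(
    "MidnightPhreaker/GLM-4.5-Air-REAP-82B-A12B-AWQ-4bit",
    tensor_parallel_size=4,
    enable_expert_parallel=True,
    max_model_len=131072,
    kv_cache_dtype="fp8",
    gpu_memory_utilization=0.92,
)

outputs = model.generate(
    ["Explain mixture-of-experts routing in two sentences."],
    SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256),
)
print(outputs[0].outputs[0].text)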

Performance

Model Size

  • Original (FP16): ~160GB
  • Quantized (AWQ 4-bit): ~40GB
  • Size Reduction: ~75%
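
These figures are simple back-of-envelope arithmetic over the 82B total parameters, ignoring the layers left unquantized and the per-group scales (which add a few GB):

params = 82e9
fp16_gb = params * 2 / 1e9    # 2 bytes per BF16/FP16 weight -> ~164 GB
int4_gb = params * 0.5 / 1e9  # 0.5 bytes per 4-bit weight   -> ~41 GB
print(fp16_gb, int4_gb, 1 - int4_gb / fp16_gb)  # ~164, ~41, ~0.75 (75% smaller)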

Expected Performance

  • Memory Usage: ~50GB VRAM (with overhead)
  • Inference Speed: typically 2-4x faster than FP16 (hardware- and batch-size-dependent)
  • Quality: Minimal degradation (<2% perplexity increase)

Hardware Requirements

Minimum:

  • 1x GPU with 48GB VRAM (e.g., RTX 6000 Ada, A6000, L40S)

Recommended:

  • 2x GPUs with 48GB+ VRAM each for optimal throughput
  • Or 1x GPU with 80GB+ VRAM (e.g., H100, A100 80GB)

Quantization Details

This model was quantized on 2x RTX 6000 Blackwell GPUs (96GB each) using:

  • Calibration Samples: 256
  • Max Sequence Length: 2048
  • Calibration Time: ~30-60 minutes
  • Observer: MSE (Mean Squared Error)
  • Duo Scaling: Enabled for better accuracy

Limitations

  1. Quantization Loss: While AWQ preserves model quality well, there may be minor degradation in edge cases
  2. vLLM Required: This model requires vLLM for optimal inference
  3. CUDA Required: GPU inference only (no CPU support)
  4. MoE Specifics: Some MoE routing behaviors may differ slightly from the original model

Use Cases

This quantized model is suitable for:

  • Production deployments with limited VRAM
  • Research and experimentation
  • Fine-tuning (with LoRA/QLoRA)
  • Batch inference workloads
  • Edge deployment scenarios

Citation

If you use this model, please cite the original GLM-4.5 paper and the llm-compressor library:

@software{llm-compressor,
  title = {LLM Compressor},
  author = {vLLM Team},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}

License

This model inherits the Apache 2.0 license from the base model cerebras/GLM-4.5-Air-REAP-82B-A12B.

Acknowledgments

  • Original Model: Cerebras AI
  • Quantization Method: AWQ (Activation-aware Weight Quantization)
  • Quantization Library: llm-compressor by vLLM team
  • Hardware: 2x RTX 6000 Blackwell GPUs

Contact

For issues or questions about this quantized model, please open an issue on the model repository.

For questions about the original model, see cerebras/GLM-4.5-Air-REAP-82B-A12B.
