[AutoRound](https://github.com/intel/auto-round) has been integrated into vLLM, allowing you to run AutoRound-formatted models directly in the upcoming release.
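Serving such a model should be as simple as pointing vLLM at the checkpoint. A minimal sketch, assuming the illustrative model id below is an AutoRound INT4 checkpoint on the Hugging Face Hub:

```python
from vllm import LLM, SamplingParams

# The model id is illustrative; substitute any AutoRound-formatted checkpoint.
llm = LLM(model="OPEA/Qwen2.5-7B-Instruct-int4-sym-inc")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["What is AutoRound?"], params)
print(outputs[0].outputs[0].text)
```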
Besides, we strongly recommend using AutoRound to generate AWQ INT4 models, as AutoAWQ is no longer maintained and manually adding support for new models there is not trivial, since it requires custom layer mappings; a sketch of the AutoRound workflow is below.
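A minimal sketch of quantizing a model with AutoRound and exporting it in AWQ format (the base model id is illustrative; check the AutoRound repo for the exact options supported by your version):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tune the rounding with AutoRound, then export AWQ INT4 weights directly,
# avoiding AutoAWQ's per-model custom layer mappings.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./Llama-3.1-8B-Instruct-int4-awq", format="auto_awq")
```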
[AutoRound](https://github.com/intel/auto-round) has been integrated into Transformers, allowing you to run AutoRound-formatted models directly in the upcoming release; a loading sketch is below. Additionally, we are actively working on supporting the GGUF double-quant format (e.g., q4_k_s), so stay tuned!
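Once that release lands, loading should go through the standard Transformers API with auto-round installed; a minimal sketch, again with an illustrative checkpoint name:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Requires `pip install auto-round`; the checkpoint name is illustrative.
model_id = "OPEA/Qwen2.5-7B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is AutoRound?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```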
Check out the [DeepSeek-R1 INT2 model](https://huggingface.co/OPEA/DeepSeek-R1-int2-mixed-sym-inc). This 200GB DeepSeek-R1 model shows only about a 2% drop on MMLU, though inference is currently quite slow due to a kernel issue.
AutoRound has demonstrated strong results even at 2-bit precision for VLMs such as Qwen2-VL-72B. Check it out here: [OPEA/Qwen2-VL-72B-Instruct-int2-sym-inc](https://huggingface.co/OPEA/Qwen2-VL-72B-Instruct-int2-sym-inc).
This week, OPEA Space released several new INT4 models, including nvidia/Llama-3.1-Nemotron-70B-Instruct-HF, allenai/OLMo-2-1124-13B-Instruct, THUDM/glm-4v-9b, AIDC-AI/Marco-o1, and several others. Let us know which models you'd like prioritized for quantization, and we'll do our best to make it happen!