jerryzh168 and nielsr (HF Staff) committed
Commit 3eafa77 (verified) · 1 parent: 1f33f1e

Update model card: Add TorchAO paper, code, documentation links and correct license (#3)


- Update model card: Add TorchAO paper, code, documentation links and correct license (ab2360d3ff1a26e91c4c81074b8180dd5f36bd85)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+21, -8)
README.md CHANGED
@@ -1,5 +1,11 @@
 ---
+base_model:
+- microsoft/Phi-4-mini-instruct
+language:
+- multilingual
 library_name: transformers
+license: bsd-3-clause
+pipeline_tag: text-generation
 tags:
 - torchao
 - phi
@@ -9,15 +15,20 @@ tags:
 - math
 - chat
 - conversational
-license: mit
-language:
-- multilingual
-base_model:
-- microsoft/Phi-4-mini-instruct
-pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 1.12x-1.2x speedup on A100 GPUs.
+This repository hosts the **Phi4-mini-instruct** model quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), using int4 weight-only quantization and the [hqq](https://mobiusml.github.io/hqq_blog/) algorithm. The model can be used directly or served with [vLLM](https://docs.vllm.ai/en/latest/) for significant VRAM reduction and speedup on A100 GPUs.
+
+## Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
+The model's quantization is powered by **TorchAO**, a framework presented in the paper [TorchAO: PyTorch-Native Training-to-Serving Model Optimization](https://huggingface.co/papers/2507.16099).
+
+**Abstract:** We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely used, backend-agnostic low-precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.
+
+## Resources
+* **Official TorchAO GitHub Repository:** [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
+* **TorchAO Documentation:** [https://docs.pytorch.org/ao/stable/index.html](https://docs.pytorch.org/ao/stable/index.html)
+
+---
 
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:
@@ -49,7 +60,9 @@ if __name__ == '__main__':
     # that contain the prompt, generated text, and other information.
     outputs = llm.generate(prompts, sampling_params)
     # Print the outputs.
-    print("\nGenerated Outputs:\n" + "-" * 60)
+    print("
+Generated Outputs:
+" + "-" * 60)
     for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
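
The final hunk shows only the tail of the model card's vLLM example. For context, here is a minimal, self-contained sketch of the same offline-inference pattern. It assumes recent vllm and torchao builds, and the model id is a stand-in for this repository's actual id, not something confirmed by this page:

```python
# A hedged sketch of the offline vLLM pattern the model card's example uses;
# the model id is a placeholder for this repository's id.
from vllm import LLM, SamplingParams

if __name__ == '__main__':
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # vLLM reads the torchao quantization config stored with the weights.
    llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")  # hypothetical id

    # Generate texts from the prompts. The output is a list of RequestOutput
    # objects that contain the prompt, generated text, and other information.
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    print("\nGenerated Outputs:\n" + "-" * 60)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
```

Note the single-line `print("\nGenerated Outputs:\n" + "-" * 60)`: the newlines belong inside the string literal as `\n` escapes, since splitting an ordinary double-quoted string across physical lines is a syntax error in Python.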
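
The card also says the model can be used directly. A hedged sketch of what direct loading with transformers typically looks like for a torchao-quantized checkpoint, under the assumption that the checkpoint carries its serialized quantization config; the model id is again a stand-in:

```python
# A hedged sketch, not taken from this commit: loading a torchao int4
# weight-only checkpoint directly with transformers. Assumes recent
# `transformers` and `torchao` installs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"  # hypothetical id

# The quantization config is serialized with the checkpoint, so no extra
# quantization arguments are needed at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is 15 * 17?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```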