---
language: en
license: apache-2.0
tags:
- nvfp4
- blackwell
- modelopt
- quantized
- gpt-oss
- ptq
- nvidia
- huggingface
model-index:
- name: GPT-OSS 120B NVFP4
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wikitext-103 (v1)
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: null
base_model:
- openai/gpt-oss-120b
---

> ⚠️ **Work in progress:** I'm still having trouble getting this checkpoint running and am not yet certain the quantization is correct. Consider skipping it until this warning is removed.

# GPT-OSS 120B — NVFP4 (Blackwell FP4) Quantized

Quantized end-to-end using NVIDIA ModelOpt 0.37.0.

This repository contains a **post-training quantized (PTQ)** version of the open-source GPT-OSS 120B model, converted into **NVFP4**, the 4-bit floating-point format introduced for NVIDIA Blackwell-series GPUs (RTX 5090 / RTX PRO 6000 / B200 / DGX Spark).

---

## 📦 Model Summary

| Property | Value |
|-----------|-------|
| **Base Model** | [GPT-OSS 120B](https://huggingface.co/openai/gpt-oss-120b) |
| **Quantization Algorithm** | `NVFP4` |
| **Quantization Method** | Post-Training Quantization (PTQ) via NVIDIA ModelOpt v0.37.0 |
| **Group Size** | 16 |
| **Excluded Modules** | `lm_head` (kept BF16 for generation stability) |
| **KV Cache Quantization** | None (not applied in this export) |
| **Calibration Dataset** | Wikitext-103 (v1) — 1024 samples, 512 tokens each |
| **Export Format** | Hugging Face Transformers checkpoint with `.safetensors` |
| **License** | Apache 2.0 (inherited from the original model) |

---

## 🧠 What Is NVFP4?

**NVFP4** is NVIDIA's 4-bit floating-point tensor format introduced with the **Blackwell** architecture. NVIDIA reports roughly 2× the throughput of FP8 and up to 4× weight-memory savings versus BF16, with higher dynamic range than traditional INT4 thanks to per-block floating-point scales.

This model uses the official **ModelOpt NVFP4 kernel path**, so all weights and scales are intended to be compatible with:

- **TensorRT-LLM 0.12+**
- **NVIDIA Transformer Engine 3.0+**
- Future **Hugging Face Transformers FP4** runtimes

---

## 🧪 Quantization Process

Calibration was performed with forward passes only (no retraining). Weights were quantized with per-block scaling (group size 16) using round-to-nearest NVFP4 encoding. `lm_head` was left unquantized to avoid generation degradation.

---

## ⚙️ File Structure

```
config.json
generation_config.json
hf_quant_config.json
model-00001-of-00015.safetensors
...
model-00015-of-00015.safetensors
```

All tensors are stored in `.safetensors` format.
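For anyone who wants to reproduce or adapt the conversion, the checkpoint was produced with ModelOpt's PTQ-then-export flow described above. The sketch below is illustrative only: the dataset loading, sample selection, and the use of `mtq.NVFP4_DEFAULT_CFG` and `export_hf_checkpoint` are assumptions about a typical ModelOpt 0.37 workflow, not the exact script behind this repo.

```python
# Hypothetical reproduction sketch (not the exact script used for this checkpoint).
# Assumes: nvidia-modelopt, transformers, datasets installed; paths/names are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "openai/gpt-oss-120b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 1024 calibration samples of up to 512 tokens from Wikitext-103, as described above.
calib = load_dataset("wikitext", "wikitext-103-v1", split="train")
texts = [t for t in calib["text"] if t.strip()][:1024]

def forward_loop(m):
    # Forward passes only: ModelOpt observes activations to pick NVFP4 scales.
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
            m(ids.to(m.device))

# NVFP4 weight quantization with group size 16; ModelOpt's default config
# typically leaves lm_head unquantized, matching this repo's exclusion list.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style checkpoint (safetensors shards + hf_quant_config.json).
export_hf_checkpoint(model, export_dir="./gpt-oss-120b-nvfp4")
```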
The exported `hf_quant_config.json` confirms these settings:

```json
{
  "producer": {"name": "modelopt", "version": "0.37.0"},
  "quantization": {"quant_algo": "NVFP4", "group_size": 16, "exclude_modules": ["lm_head"]}
}
```

---

## 🧩 Usage Examples

### With Hugging Face Transformers (coming soon)

When FP4 support lands in Transformers v5 (the snippet below is illustrative; the final API may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Firworks/gpt-oss-120b-nvfp4",
    torch_dtype="fp4",  # speculative future alias for NVFP4
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

prompt = "Explain why NVFP4 quantization matters:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

### With TensorRT-LLM 0.12+

```bash
# Flag names are illustrative; consult the TensorRT-LLM docs for your version.
trtllm-build --checkpoint_dir ./gpt-oss-120b-nvfp4 --dtype fp4
trtllm-bench --model ./build
```

---

## ⚡ Performance & Memory Footprint

| Precision | Disk Size | Speed (Blackwell FP4 core) | Notes |
| ------------------- | --------- | -------------------------- | ------------------------ |
| **BF16 (original)** | ~180 GB | 1× baseline | Reference |
| **NVFP4 (this repo)** | ~70 GB | ≈ 4–6× faster | Native NVFP4 tensor path |

---

## 📜 License & Attribution

* **Base Model:** [GPT-OSS 120B](https://huggingface.co/openai/gpt-oss-120b) © 2025 OpenAI, licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
* **Quantized Variant:** Produced by [Firworks](https://huggingface.co/Firworks) using NVIDIA ModelOpt 0.37.0. This derived work is also released under **Apache 2.0**.

```
Copyright 2025 Firworks

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
```

---

## 🧭 Notes

* Quantization was performed entirely offline using CPU-fallback PTQ.
* Intended primarily for experimentation, validation, and early NVFP4 benchmarks.
* Inference requires **Blackwell-generation GPUs** or newer with FP4 support.

---

### 🏁 Citation

If you use this work, please cite:

> Firworks (2025). *GPT-OSS 120B NVFP4: Blackwell FP4 Quantized Model via ModelOpt PTQ.* Hugging Face Hub.

---
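### 🧰 Quick hardware check

As noted above, the native NVFP4 tensor path needs a Blackwell-class GPU. A minimal sanity check is sketched below; it assumes PyTorch with CUDA, and the compute-capability thresholds are based on publicly reported values for Blackwell parts, which may not cover every SKU.

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA device found; native NVFP4 inference is not possible here.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # Blackwell data-center parts (B200/GB200) report compute capability 10.x;
    # consumer/workstation parts (RTX 5090, RTX PRO 6000) report 12.x.
    if major >= 10:
        print(f"{name} (sm_{major}{minor}) should expose the FP4 tensor-core path.")
    else:
        print(f"{name} (sm_{major}{minor}) predates Blackwell; no native NVFP4 support.")
```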