GPT-OSS 120B – NVFP4 (Blackwell FP4) Quantized

Quantized end-to-end using NVIDIA ModelOpt 0.37.0.

This repository contains a post-training quantized (PTQ) version of the open-source GPT-OSS 120B model, converted into NVFP4, the new 4-bit floating-point format introduced for NVIDIA Blackwell-series GPUs (RTX 5090 / Pro 6000 / B200 / DGX Spark).


📦 Model Summary

  • Base Model: GPT-OSS 120B
  • Quantization Algorithm: NVFP4
  • Quantization Method: Post-Training Quantization (PTQ) via NVIDIA ModelOpt v0.37.0
  • Group Size: 16
  • Excluded Modules: lm_head (kept in BF16 for generation stability)
  • KV Cache Quantization: None (not yet supported in ModelOpt PTQ)
  • Calibration Dataset: Wikitext-103 (v1), 1024 samples of 512 tokens each
  • Export Format: Hugging Face Transformers checkpoint with .safetensors weights
  • License: Apache 2.0 (inherited from the original model)

🧠 What Is NVFP4?

NVFP4 is NVIDIA's new 4-bit floating-point tensor format introduced with the Blackwell architecture.
It offers roughly 2× the throughput of FP8 and up to 4× memory savings versus BF16, with higher dynamic range than traditional INT4.
This model uses the official ModelOpt NVFP4 kernel path, so all weights and scales are compatible with:

  • TensorRT-LLM 0.12+
  • NVIDIA Transformer Engine 3.0+
  • Future Hugging Face Transformers FP4 runtimes.
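
For intuition on the memory savings, NVFP4 stores each value as a 4-bit E2M1 float and shares one FP8 (E4M3) scale across every 16-element group. A rough back-of-the-envelope estimate (assuming roughly 120B quantized parameters and ignoring the BF16 lm_head and second-level per-tensor scales) lands close to the disk footprint reported later in this card:

params = 120e9               # approximate quantized parameter count (assumption)
weight_bits = 4              # E2M1 payload per value
scale_bits = 8 / 16          # one FP8 (E4M3) scale shared by each group of 16 values
total_gb = params * (weight_bits + scale_bits) / 8 / 1e9
print(f"~{total_gb:.0f} GB of quantized weights plus scales")   # ≈ 68 GB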

🧪 Quantization Process

Calibration was performed with forward passes only (no retraining). Weights were quantized in groups of 16 using round-to-nearest NVFP4 encoding, and lm_head was left unquantized to avoid degrading generation quality.
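
A minimal sketch of this flow, assuming the standard ModelOpt PTQ API (the calibration loop, dataset handling, and model identifiers below are illustrative, not the exact script that produced this checkpoint):

import copy
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "gpt-oss/gpt-oss-120b", torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained("gpt-oss/gpt-oss-120b")

# 1024 non-empty Wikitext-103 samples, truncated to 512 tokens each (see summary above)
ds = load_dataset("wikitext", "wikitext-103-v1", split="train")
calib_texts = [t for t in ds["text"] if t.strip()][:1024]

def forward_loop(m):
    # Calibration: forward passes only, no retraining
    with torch.no_grad():
        for text in calib_texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=512).input_ids.to(m.device)
            m(ids)

cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)         # NVFP4 weights, group size 16
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}  # keep lm_head in BF16

mtq.quantize(model, cfg, forward_loop)
export_hf_checkpoint(model, export_dir="./gpt-oss-120b-nvfp4")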


βš™οΈ File Structure

config.json
generation_config.json
hf_quant_config.json
model-00001-of-00015.safetensors
...
model-00015-of-00015.safetensors

All tensors are stored in .safetensors format. The hf_quant_config.json confirms:

{
  "producer": {"name": "modelopt", "version": "0.37.0"},
  "quantization": {"quant_algo": "NVFP4", "group_size": 16, "exclude_modules": ["lm_head"]}
}
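
As a quick sanity check after downloading, the file can be read to verify the algorithm, group size, and exclusions (path assumed relative to the checkpoint directory):

import json

with open("hf_quant_config.json") as f:
    qcfg = json.load(f)["quantization"]

assert qcfg["quant_algo"] == "NVFP4"
assert qcfg["group_size"] == 16
print("Excluded from quantization:", qcfg["exclude_modules"])   # ['lm_head']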

🧩 Usage Examples

With Hugging Face Transformers (coming soon)

When FP4 support lands in Transformers v5:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Firworks/gpt-oss-120b-nvfp4",
    torch_dtype="fp4",          # future alias for NVFP4
    device_map="auto"
)
tok = AutoTokenizer.from_pretrained("gpt-oss/gpt-oss-120b")

prompt = "Explain why NVFP4 quantization matters:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

With TensorRT-LLM 0.12+

trtllm-build --checkpoint-dir ./gpt-oss-120b-nvfp4 --dtype fp4
trtllm-bench --model ./build
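
Recent TensorRT-LLM releases also ship a high-level Python LLM API; a hedged sketch of serving this checkpoint with it (argument names can vary between versions, and FP4 engines require Blackwell-class hardware):

from tensorrt_llm import LLM, SamplingParams

# Point the LLM API at a local download of this repository
llm = LLM(model="./gpt-oss-120b-nvfp4")
params = SamplingParams(max_tokens=128)

outputs = llm.generate(["Explain why NVFP4 quantization matters:"], params)
print(outputs[0].outputs[0].text)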

⚡ Performance & Memory Footprint

  • BF16 (original): ~180 GB on disk, 1× baseline speed (reference)
  • NVFP4 (this model): ~70 GB on disk, ≈4–6× faster on Blackwell FP4 tensor cores (native NVFP4 tensor path)

📜 License & Attribution

  • Base Model: GPT-OSS 120B © 2024 The GPT-OSS Contributors, licensed under Apache 2.0.
  • Quantized Variant: Produced by Firworks using NVIDIA ModelOpt 0.37.0. This derived work is also released under Apache 2.0.

Copyright 2025 Firworks

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0

🧭 Notes

  • Quantization performed entirely offline using CPU-fallback PTQ.
  • Intended primarily for experimentation, validation, and early NVFP4 benchmarks.
  • Inference requires Blackwell-generation GPUs or newer with FP4 support.
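
A quick runtime check for FP4-capable hardware (assuming Blackwell-generation parts report CUDA compute capability 10.0 or higher):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP4-capable (Blackwell or newer):", major >= 10)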

🏁 Citation

If you use this work, please cite:

Firworks (2025). GPT-OSS 120B NVFP4: Blackwell FP4 Quantized Model via ModelOpt PTQ. Hugging Face Hub.


