GPT-OSS 120B β NVFP4 (Blackwell FP4) Quantized
Quantized end-to-end using NVIDIA ModelOpt 0.37.0.
This repository contains a post-training quantized (PTQ) version of the open-source GPT-OSS 120B model, converted into NVFP4, the new 4-bit floating-point format introduced for NVIDIA Blackwell-series GPUs (RTX 5090 / Pro 6000 / B200 / DGX Spark).
π¦ Model Summary
Property | Value |
---|---|
Base Model | GPT-OSS 120B |
Quantization Algorithm | NVFP4 |
Quantization Method | Post-Training Quantization (PTQ) via NVIDIA ModelOpt v0.37.0 |
Group Size | 16 |
Excluded Modules | lm_head (kept BF16 for generation stability) |
KV Cache Quantization | None (not yet supported in ModelOpt PTQ) |
Calibration Dataset | Wikitext-103 (V1) β 1024 samples, 512 tokens each |
Export Format | Hugging Face Transformers checkpoint with .safetensors |
License | Apache 2.0 (inherited from original model) |
π§ What Is NVFP4?
NVFP4 is NVIDIAβs new 4-bit floating-point tensor format introduced with the Blackwell architecture.
It offers ~2Γ the throughput of FP8 and up to 4Γ memory savings versus BF16, with higher dynamic range than traditional INT4.
This model uses the official ModelOpt NVFP4 kernel path, so all weights and scales are compatible with:
- TensorRT-LLM 0.12+
- NVIDIA Transformer Engine 3.0+
- Future Hugging Face Transformers FP4 runtimes.
π§ͺ Quantization Process
Calibration was performed with forward passes only (no retraining).
Weights were quantized per-channel, group size 16, using round-to-nearest NVFP4 encoding.
lm_head
was left unquantized to avoid generation degradation.
βοΈ File Structure
config.json
generation_config.json
hf_quant_config.json
model-00001-of-00015.safetensors
...
model-00015-of-00015.safetensors
All tensors are stored in .safetensors
format.
The hf_quant_config.json
confirms:
{
"producer": {"name": "modelopt", "version": "0.37.0"},
"quantization": {"quant_algo": "NVFP4", "group_size": 16, "exclude_modules": ["lm_head"]}
}
π§© Usage Examples
With Hugging Face Transformers (coming soon)
When FP4 support lands in Transformers v5:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"Firworks/gpt-oss-120b-nvfp4",
torch_dtype="fp4", # future alias for NVFP4
device_map="auto"
)
tok = AutoTokenizer.from_pretrained("gpt-oss/gpt-oss-120b")
prompt = "Explain why NVFP4 quantization matters:"
out = model.generate(**tok(prompt, return_tensors="pt"))
print(tok.decode(out[0], skip_special_tokens=True))
With TensorRT-LLM 0.12+
trtllm-build --checkpoint-dir ./gpt-oss-120b-nvfp4 --dtype fp4
trtllm-bench --model ./build
β‘ Performance & Memory Footprint
Precision | Disk Size | Speed (Blackwell FP4 Core) | Notes |
---|---|---|---|
BF16 (original) | ~180 GB | 1Γ baseline | Reference |
NVFP4 (this) | ~70 GB | β 4β6Γ faster | Native NVFP4 tensor path |
π License & Attribution
- Base Model: GPT-OSS 120B Β© 2024 The GPT-OSS Contributors, licensed under Apache 2.0.
- Quantized Variant: Produced by Firworks using NVIDIA ModelOpt 0.37.0. This derived work is also released under Apache 2.0.
Copyright 2025 Firworks
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
π§ Notes
- Quantization performed entirely offline using CPU-fallback PTQ.
- Intended primarily for experimentation, validation, and early NVFP4 benchmarks.
- Inference requires Blackwell-generation GPUs or newer with FP4 support.
π Citation
If you use this work, please cite:
Firworks (2025). GPT-OSS 120B NVFP4: Blackwell FP4 Quantized Model via ModelOpt PTQ. Hugging Face Hub.
- Downloads last month
- -
Model tree for Firworks/gpt-oss-120b-nvfp4
Base model
openai/gpt-oss-120b