GPT-OSS 120B – NVFP4 (Blackwell FP4) Quantized

Quantized end-to-end using NVIDIA ModelOpt 0.37.0.

This repository contains a post-training quantized (PTQ) version of the open-source GPT-OSS 120B model, converted into NVFP4, the new 4-bit floating-point format introduced for NVIDIA Blackwell-series GPUs (RTX 5090 / Pro 6000 / B200 / DGX Spark).


📦 Model Summary

  • Base Model: GPT-OSS 120B
  • Quantization Algorithm: NVFP4
  • Quantization Method: Post-Training Quantization (PTQ) via NVIDIA ModelOpt v0.37.0
  • Group Size: 16
  • Excluded Modules: lm_head (kept in BF16 for generation stability)
  • KV Cache Quantization: None (not yet supported in ModelOpt PTQ)
  • Calibration Dataset: Wikitext-103 (v1), 1024 samples of 512 tokens each
  • Export Format: Hugging Face Transformers checkpoint with .safetensors weights
  • License: Apache 2.0 (inherited from the original model)

🧠 What Is NVFP4?

NVFP4 is NVIDIA's new 4-bit floating-point tensor format introduced with the Blackwell architecture.
It offers roughly 2× the throughput of FP8 and up to 4× memory savings versus BF16, with higher dynamic range than traditional INT4.
This model uses the official ModelOpt NVFP4 kernel path, so all weights and scales are compatible with:

  • TensorRT-LLM 0.12+
  • NVIDIA Transformer Engine 3.0+
  • Future Hugging Face Transformers FP4 runtimes.
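
For intuition on the memory savings, NVFP4 stores each value as a 4-bit E2M1 float and shares one FP8 (E4M3) scale across every 16-element group. A rough back-of-the-envelope estimate (assuming roughly 120B quantized parameters and ignoring the BF16 lm_head and second-level per-tensor scales) lands close to the disk footprint reported later in this card:

params = 120e9               # approximate quantized parameter count (assumption)
weight_bits = 4              # E2M1 payload per value
scale_bits = 8 / 16          # one FP8 (E4M3) scale shared by each group of 16 values
total_gb = params * (weight_bits + scale_bits) / 8 / 1e9
print(f"~{total_gb:.0f} GB of quantized weights plus scales")   # ≈ 68 GB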

🧪 Quantization Process

Calibration was performed with forward passes only (no retraining). Weights were quantized in groups of 16 using round-to-nearest NVFP4 encoding, and lm_head was left unquantized to avoid degrading generation quality.
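
A minimal sketch of this flow, assuming the standard ModelOpt PTQ API (the calibration loop, dataset handling, and model identifiers below are illustrative, not the exact script that produced this checkpoint):

import copy
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "gpt-oss/gpt-oss-120b", torch_dtype=torch.bfloat16, device_map="auto"
)
tok = AutoTokenizer.from_pretrained("gpt-oss/gpt-oss-120b")

# 1024 non-empty Wikitext-103 samples, truncated to 512 tokens each (see summary above)
ds = load_dataset("wikitext", "wikitext-103-v1", split="train")
calib_texts = [t for t in ds["text"] if t.strip()][:1024]

def forward_loop(m):
    # Calibration: forward passes only, no retraining
    with torch.no_grad():
        for text in calib_texts:
            ids = tok(text, return_tensors="pt",
                      truncation=True, max_length=512).input_ids.to(m.device)
            m(ids)

cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)         # NVFP4 weights, group size 16
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}  # keep lm_head in BF16

mtq.quantize(model, cfg, forward_loop)
export_hf_checkpoint(model, export_dir="./gpt-oss-120b-nvfp4")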


βš™οΈ File Structure

config.json
generation_config.json
hf_quant_config.json
model-00001-of-00015.safetensors
...
model-00015-of-00015.safetensors

All tensors are stored in .safetensors format. The hf_quant_config.json confirms:

{
  "producer": {"name": "modelopt", "version": "0.37.0"},
  "quantization": {"quant_algo": "NVFP4", "group_size": 16, "exclude_modules": ["lm_head"]}
}
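
As a quick sanity check after downloading, the file can be read to verify the algorithm, group size, and exclusions (path assumed relative to the checkpoint directory):

import json

with open("hf_quant_config.json") as f:
    qcfg = json.load(f)["quantization"]

assert qcfg["quant_algo"] == "NVFP4"
assert qcfg["group_size"] == 16
print("Excluded from quantization:", qcfg["exclude_modules"])   # ['lm_head']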

🧩 Usage Examples

With Hugging Face Transformers (coming soon)

When FP4 support lands in Transformers v5:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Firworks/gpt-oss-120b-nvfp4",
    torch_dtype="fp4",          # future alias for NVFP4
    device_map="auto"
)
tok = AutoTokenizer.from_pretrained("gpt-oss/gpt-oss-120b")

prompt = "Explain why NVFP4 quantization matters:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

With TensorRT-LLM 0.12+

trtllm-build --checkpoint-dir ./gpt-oss-120b-nvfp4 --dtype fp4
trtllm-bench --model ./build
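
Recent TensorRT-LLM releases also ship a high-level Python LLM API; a hedged sketch of serving this checkpoint with it (argument names can vary between versions, and FP4 engines require Blackwell-class hardware):

from tensorrt_llm import LLM, SamplingParams

# Point the LLM API at a local download of this repository
llm = LLM(model="./gpt-oss-120b-nvfp4")
params = SamplingParams(max_tokens=128)

outputs = llm.generate(["Explain why NVFP4 quantization matters:"], params)
print(outputs[0].outputs[0].text)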

⚡ Performance & Memory Footprint

  • BF16 (original): ~180 GB on disk, 1× baseline speed (reference)
  • NVFP4 (this model): ~70 GB on disk, ≈4–6× faster on Blackwell FP4 tensor cores (native NVFP4 tensor path)

📜 License & Attribution

  • Base Model: GPT-OSS 120B © 2024 The GPT-OSS Contributors, licensed under Apache 2.0.
  • Quantized Variant: Produced by Firworks using NVIDIA ModelOpt 0.37.0. This derived work is also released under Apache 2.0.

Copyright 2025 Firworks

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0

🧭 Notes

  • Quantization performed entirely offline using CPU-fallback PTQ.
  • Intended primarily for experimentation, validation, and early NVFP4 benchmarks.
  • Inference requires Blackwell-generation GPUs or newer with FP4 support.
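
A quick runtime check for FP4-capable hardware (assuming Blackwell-generation parts report CUDA compute capability 10.0 or higher):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("FP4-capable (Blackwell or newer):", major >= 10)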

🏁 Citation

If you use this work, please cite:

Firworks (2025). GPT-OSS 120B NVFP4: Blackwell FP4 Quantized Model via ModelOpt PTQ. Hugging Face Hub.


