---
language: en
license: apache-2.0
tags:
- nvfp4
- blackwell
- modelopt
- quantized
- gpt-oss
- ptq
- nvidia
- huggingface
model-index:
- name: GPT-OSS 120B NVFP4
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wikitext-103 (v1)
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: null
base_model:
- openai/gpt-oss-120b
---

> ⚠️ **Work in progress:** I'm still having trouble getting this checkpoint running and am not yet certain the quantization is correct. Consider skipping it until this warning is removed.

# GPT-OSS 120B — NVFP4 (Blackwell FP4) Quantized

Quantized end-to-end using NVIDIA ModelOpt 0.37.0.

This repository contains a **post-training quantized (PTQ)** version of the open-source GPT-OSS 120B model, converted into **NVFP4**, the 4-bit floating-point format introduced for NVIDIA Blackwell-series GPUs (RTX 5090 / RTX PRO 6000 / B200 / DGX Spark).

---

## 📦 Model Summary

| Property | Value |
|-----------|-------|
| **Base Model** | [GPT-OSS 120B](https://huggingface.co/openai/gpt-oss-120b) |
| **Quantization Algorithm** | `NVFP4` |
| **Quantization Method** | Post-Training Quantization (PTQ) via NVIDIA ModelOpt v0.37.0 |
| **Group Size** | 16 |
| **Excluded Modules** | `lm_head` (kept BF16 for generation stability) |
| **KV Cache Quantization** | None (not applied in this export) |
| **Calibration Dataset** | Wikitext-103 (v1) — 1024 samples, 512 tokens each |
| **Export Format** | Hugging Face Transformers checkpoint with `.safetensors` |
| **License** | Apache 2.0 (inherited from the original model) |

---

## 🧠 What Is NVFP4?

**NVFP4** is NVIDIA's 4-bit floating-point tensor format introduced with the **Blackwell** architecture. NVIDIA reports roughly 2× the throughput of FP8 and up to 4× weight-memory savings versus BF16, with higher dynamic range than traditional INT4 thanks to per-block floating-point scales.

This model uses the official **ModelOpt NVFP4 kernel path**, so all weights and scales are intended to be compatible with:

- **TensorRT-LLM 0.12+**
- **NVIDIA Transformer Engine 3.0+**
- Future **Hugging Face Transformers FP4** runtimes

---

## 🧪 Quantization Process

Calibration was performed with forward passes only (no retraining). Weights were quantized with per-block scaling (group size 16) using round-to-nearest NVFP4 encoding. `lm_head` was left unquantized to avoid generation degradation.

---

## ⚙️ File Structure

```
config.json
generation_config.json
hf_quant_config.json
model-00001-of-00015.safetensors
...
model-00015-of-00015.safetensors
```

All tensors are stored in `.safetensors` format.
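For anyone who wants to reproduce or adapt the conversion, the checkpoint was produced with ModelOpt's PTQ-then-export flow described above. The sketch below is illustrative only: the dataset loading, sample selection, and the use of `mtq.NVFP4_DEFAULT_CFG` and `export_hf_checkpoint` are assumptions about a typical ModelOpt 0.37 workflow, not the exact script behind this repo.

```python
# Hypothetical reproduction sketch (not the exact script used for this checkpoint).
# Assumes: nvidia-modelopt, transformers, datasets installed; paths/names are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "openai/gpt-oss-120b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 1024 calibration samples of up to 512 tokens from Wikitext-103, as described above.
calib = load_dataset("wikitext", "wikitext-103-v1", split="train")
texts = [t for t in calib["text"] if t.strip()][:1024]

def forward_loop(m):
    # Forward passes only: ModelOpt observes activations to pick NVFP4 scales.
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
            m(ids.to(m.device))

# NVFP4 weight quantization with group size 16; ModelOpt's default config
# typically leaves lm_head unquantized, matching this repo's exclusion list.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style checkpoint (safetensors shards + hf_quant_config.json).
export_hf_checkpoint(model, export_dir="./gpt-oss-120b-nvfp4")
```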
The exported `hf_quant_config.json` confirms these settings:

```json
{
  "producer": {"name": "modelopt", "version": "0.37.0"},
  "quantization": {"quant_algo": "NVFP4", "group_size": 16, "exclude_modules": ["lm_head"]}
}
```

---

## 🧩 Usage Examples

### With Hugging Face Transformers (coming soon)

When FP4 support lands in Transformers v5 (the snippet below is illustrative; the final API may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Firworks/gpt-oss-120b-nvfp4",
    torch_dtype="fp4",  # speculative future alias for NVFP4
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

prompt = "Explain why NVFP4 quantization matters:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```

### With TensorRT-LLM 0.12+

```bash
# Flag names are illustrative; consult the TensorRT-LLM docs for your version.
trtllm-build --checkpoint_dir ./gpt-oss-120b-nvfp4 --dtype fp4
trtllm-bench --model ./build
```

---

## ⚡ Performance & Memory Footprint

| Precision | Disk Size | Speed (Blackwell FP4 core) | Notes |
| ------------------- | --------- | -------------------------- | ------------------------ |
| **BF16 (original)** | ~180 GB | 1× baseline | Reference |
| **NVFP4 (this repo)** | ~70 GB | ≈ 4–6× faster | Native NVFP4 tensor path |

---

## 📜 License & Attribution

* **Base Model:** [GPT-OSS 120B](https://huggingface.co/openai/gpt-oss-120b) © 2025 OpenAI, licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
* **Quantized Variant:** Produced by [Firworks](https://huggingface.co/Firworks) using NVIDIA ModelOpt 0.37.0. This derived work is also released under **Apache 2.0**.

```
Copyright 2025 Firworks

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
```

---

## 🧭 Notes

* Quantization was performed entirely offline using CPU-fallback PTQ.
* Intended primarily for experimentation, validation, and early NVFP4 benchmarks.
* Inference requires **Blackwell-generation GPUs** or newer with FP4 support.

---

### 🏁 Citation

If you use this work, please cite:

> Firworks (2025). *GPT-OSS 120B NVFP4: Blackwell FP4 Quantized Model via ModelOpt PTQ.* Hugging Face Hub.

---
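### 🧰 Quick hardware check

As noted above, the native NVFP4 tensor path needs a Blackwell-class GPU. A minimal sanity check is sketched below; it assumes PyTorch with CUDA, and the compute-capability thresholds are based on publicly reported values for Blackwell parts, which may not cover every SKU.

```python
import torch

if not torch.cuda.is_available():
    print("No CUDA device found; native NVFP4 inference is not possible here.")
else:
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    # Blackwell data-center parts (B200/GB200) report compute capability 10.x;
    # consumer/workstation parts (RTX 5090, RTX PRO 6000) report 12.x.
    if major >= 10:
        print(f"{name} (sm_{major}{minor}) should expose the FP4 tensor-core path.")
    else:
        print(f"{name} (sm_{major}{minor}) predates Blackwell; no native NVFP4 support.")
```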