---
pipeline_tag: text-generation
base_model:
- deepseek-ai/DeepSeek-R1-0528
license: mit
library_name: Model Optimizer
tags:
- nvidia
- ModelOpt
- DeepSeekR1
- quantized
- FP4
---
# Model Overview
## Description:
The NVIDIA DeepSeek-R1-0528-FP4 model is the quantized version of DeepSeek AI's DeepSeek-R1-0528 model, an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528). The NVIDIA DeepSeek R1 FP4 model is quantized with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).
This model is ready for commercial/non-commercial use.
## Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA [(DeepSeek R1) Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528).
### License/Terms of Use:
[MIT](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md)
## Model Architecture:
**Architecture Type:** Transformer
**Network Architecture:** DeepSeek R1
## Input:
**Input Type(s):** Text
**Input Format(s):** String
**Input Parameters:** 1D (One Dimensional): Sequences
**Other Properties Related to Input:** DeepSeek recommends adhering to the following configuration when using the DeepSeek-R1 series models, including for benchmarking, to achieve the expected performance (a short sketch follows this list):
- Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
- Avoid adding a system prompt; all instructions should be contained within the user prompt.
- For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
- When evaluating model performance, it is recommended to conduct multiple tests and average the results.
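
The sketch below shows one way to apply these recommendations with the TensorRT-LLM LLM API used later in this card. It is illustrative only: the `SamplingParams` fields and the example prompt are assumptions, not part of this model card.

```python
from tensorrt_llm import SamplingParams

# Illustrative only: sampling settings that follow the recommendations above.
# Field names are assumed from the TensorRT-LLM SamplingParams interface.
sampling_params = SamplingParams(
    temperature=0.6,   # recommended range is 0.5-0.7
    max_tokens=4096,   # leave room for the model's step-by-step reasoning
)

# No system prompt; all instructions go in the user prompt, including the
# suggested directive for mathematical problems.
math_prompt = (
    "What is 7 * 13 + 5? "
    "Please reason step by step, and put your final answer within \\boxed{}."
)
```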
## Output:
**Output Type(s):** Text
**Output Format:** String
**Output Parameters:** 1D (One Dimensional): Sequences
## Software Integration:
**Supported Runtime Engine(s):**
* TensorRT-LLM
**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Blackwell
**Preferred Operating System(s):**
* Linux
## Model Version(s):
* The model is quantized with nvidia-modelopt **v0.31.0**
## Training Dataset:
* Data Collection Method by dataset: Hybrid: Human, Automated
* Labeling Method by dataset: Hybrid: Human, Automated
## Testing Dataset:
* Data Collection Method by dataset: Hybrid: Human, Automated
* Labeling Method by dataset: Hybrid: Human, Automated
## Evaluation Dataset:
* Data Collection Method by dataset: Hybrid: Human, Automated
* Labeling Method by dataset: Hybrid: Human, Automated
## Calibration Datasets:
* Calibration Dataset: [cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)
  * Data collection method: Automated
  * Labeling method: Undisclosed
## Inference:
**Engine:** TensorRT-LLM
**Test Hardware:** B200
## Post Training Quantization
This model was obtained by quantizing the weights and activations of DeepSeek R1 to the FP4 data type, ready for inference with TensorRT-LLM. Only the weights and activations of the linear operators within the transformer blocks are quantized. This optimization reduces the number of bits per parameter from 8 to 4, cutting the disk size and GPU memory requirements by approximately 1.6x.
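
For reference, a generic post-training quantization flow with TensorRT Model Optimizer looks roughly like the sketch below. This is not the exact recipe used to produce this checkpoint; the `NVFP4_DEFAULT_CFG` config name is taken from the ModelOpt quantization API, and `model` and `calib_dataloader` are placeholders you would supply yourself.

```python
import modelopt.torch.quantization as mtq

# Illustrative sketch, not the exact recipe used for this checkpoint.
# `model` is a loaded causal LM and `calib_dataloader` yields tokenized
# cnn_dailymail samples (the calibration dataset listed above).

def forward_loop(model):
    # Run a handful of calibration batches so activation ranges can be collected.
    for batch in calib_dataloader:
        model(**batch)

# Quantize the linear layers' weights and activations to FP4 using the
# default NVFP4 configuration, calibrating with the loop above.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```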
## Usage
### Deploy with TensorRT-LLM
To deploy the quantized FP4 checkpoint with the [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) LLM API, follow the sample code below (you need 8x B200 GPUs and TensorRT-LLM built from source against the latest main branch):
* LLM API sample usage:
```python
from tensorrt_llm import SamplingParams
from tensorrt_llm._torch import LLM


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)

    # Tensor-parallel across 8 GPUs with attention data parallelism enabled.
    llm = LLM(model="nvidia/DeepSeek-R1-0528-FP4", tensor_parallel_size=8, enable_attention_dp=True)

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


# The entry point of the program needs to be guarded because worker processes are spawned.
if __name__ == '__main__':
    main()
```
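
Save the snippet to a file and launch it with `python` rather than pasting it into an interactive session; as noted in the code, the `if __name__ == '__main__':` guard is required because the LLM API spawns worker processes for tensor parallelism.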
### Evaluation
The accuracy benchmark results are presented in the table below:
| Precision | MMLU Pro | GPQA Diamond | LiveCodeBench | SCICODE | MATH-500 | AIME 2024 |
|-----------|----------|--------------|---------------|---------|----------|-----------|
| FP8 (AA Ref) | 85 | 81 | 77 | 40 | 98 | 89 |
| FP4 | 84.2 | 80.0 | 76.3 | 40.1 | 98.1 | 91.3 |