Behemoth-123B-NVFP4

An NVFP4-quantized version of TheDrummer/Behemoth-123B-v2.2, optimized for NVIDIA DGX and Hopper+ architectures.

Quantization Details

  • Format: NVFP4 (4-bit floating point)
  • Quantized using: NVIDIA TensorRT Model Optimizer 0.35.0 (see the sketch after this list)
  • Hardware: 2× NVIDIA H200 SXM (188GB each)
  • Original size: 245GB (BF16) → 66GB (NVFP4)
  • Compatible with: vLLM v0.10+, NVIDIA NGC containers
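
The quantization script isn't part of this card; below is a minimal sketch of what an NVFP4 post-training quantization flow with TensorRT Model Optimizer typically looks like. The config name (NVFP4_DEFAULT_CFG), the calibration data, and the export call are assumptions, not the exact recipe used for this checkpoint.

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "TheDrummer/Behemoth-123B-v2.2"
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Placeholder calibration prompts; a real run uses a few hundred representative samples.
calib_texts = ["The quick brown fox jumps over the lazy dog."]

def forward_loop(m):
    # Feed calibration data through the model so quantizer scales can be collected.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Apply NVFP4 (4-bit floating point) quantization.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style checkpoint that vLLM can load with --quantization modelopt_fp4.
export_hf_checkpoint(model, export_dir="Behemoth-123B-NVFP4")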

Usage

vllm serve tbhot3ww/Behemoth-123B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
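
The server exposes an OpenAI-compatible API (port 8000 by default); a minimal client sketch, with an illustrative prompt:

from openai import OpenAI

# vLLM ignores the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tbhot3ww/Behemoth-123B-NVFP4",
    messages=[{"role": "user", "content": "Summarize the plot of Moby-Dick in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)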

Performance Benchmarks (NVIDIA GB10, 128GB unified memory)

Sequences   Throughput (tok/s)   Per-Seq (tok/s)   KV Cache Used   Notes
12          32.4                 2.7               1.6%            No queuing
64          166.4                2.6               9.5%            Linear scaling
128         307.1                2.4               21.3%           Sweet spot
256         485.8                1.9               45.4%           Pre-queue limit
512         665.5                1.3               88.9%           Near capacity
768         768 peak / 424 avg   0.6               100%            Queued batching; 6m2s total

All tests generated 200 output tokens per sequence.
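
The benchmark harness isn't included here, but the numbers are consistent with aggregate throughput computed as (sequences × 200 tokens) / wall time; e.g. the 768-sequence run gives 768 × 200 / 362 s ≈ 424 tok/s average. A rough reproduction sketch against the server above (not the original script):

import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    # Each sequence requests 200 tokens, matching the table above.
    resp = await client.completions.create(
        model="tbhot3ww/Behemoth-123B-NVFP4",
        prompt="Once upon a time",
        max_tokens=200,
    )
    # Count actual generated tokens in case a sequence stops early at EOS.
    return resp.usage.completion_tokens

async def bench(concurrency: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{concurrency} seqs: {sum(tokens) / elapsed:.1f} tok/s aggregate")

asyncio.run(bench(128))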

Original Model

See the base model card, TheDrummer/Behemoth-123B-v2.2, for architecture details, training data, and usage guidelines.

