# Behemoth-123B-NVFP4

An NVFP4-quantized version of TheDrummer/Behemoth-123B-v2.2, optimized for NVIDIA DGX/Hopper+ architectures.
## Quantization Details
- Format: NVFP4 (4-bit floating point)
- Quantized using: NVIDIA TensorRT Model Optimizer 0.35.0
- Hardware: 2× NVIDIA H200 SXM (188GB each)
- Original size: 245GB (BF16) → 66GB (NVFP4)
- Compatible with: vLLM v0.10+, NVIDIA NGC containers
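
For reference, here is a minimal sketch of how an NVFP4 checkpoint like this one is typically produced with TensorRT Model Optimizer. It is an illustrative reconstruction, not the exact script used for this model: `calibration_prompts` is a placeholder dataset, and the config/export names (`NVFP4_DEFAULT_CFG`, `export_hf_checkpoint`) should be verified against the modelopt 0.35 documentation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

BASE = "TheDrummer/Behemoth-123B-v2.2"

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Placeholder calibration set; a few hundred representative prompts are typical.
calibration_prompts = ["..."]

def forward_loop(m):
    # Run calibration data through the model so modelopt can collect
    # the activation statistics used to choose NVFP4 scales.
    for prompt in calibration_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize weights/activations to NVFP4, then export an HF-layout
# checkpoint that vLLM can load with --quantization modelopt_fp4.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="Behemoth-123B-NVFP4")
```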
## Usage
```bash
vllm serve tbhot3ww/Behemoth-123B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```
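
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch, assuming vLLM's default endpoint (`http://localhost:8000/v1`) and the `openai` Python package; the prompt is illustrative:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="tbhot3ww/Behemoth-123B-NVFP4",
    messages=[{"role": "user", "content": "Write a short scene set in a tavern."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```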
## Performance Benchmarks (NVIDIA GB10, 128GB unified memory)
| Sequences | Throughput (tok/s) | Per-Seq (tok/s) | KV Cache Usage | Notes |
|---|---|---|---|---|
| 12 | 32.4 | 2.7 | 1.6% | No queuing |
| 64 | 166.4 | 2.6 | 9.5% | Linear scaling |
| 128 | 307.1 | 2.4 | 21.3% | Sweet spot |
| 256 | 485.8 | 1.9 | 45.4% | Pre-queue limit |
| 512 | 665.5 | 1.3 | 88.9% | Near capacity |
| 768 | 768 peak / 424 avg | 0.6 | 100% | Queued batching; 6m2s total |
All tests generate 200 tokens per sequence. (As a sanity check on the last row: 768 × 200 = 153,600 tokens over 6m2s works out to the ~424 tok/s average shown above.)
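
For anyone wanting to reproduce numbers in this ballpark, here is a rough benchmark sketch: it fires N concurrent 200-token completions at the local server and reports aggregate throughput. The endpoint, prompt, and request shape are assumptions, not the exact harness used for the table above.

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    # One 200-token completion, matching the per-sequence length above.
    resp = await client.completions.create(
        model="tbhot3ww/Behemoth-123B-NVFP4",
        prompt="Once upon a time",
        max_tokens=200,
    )
    return resp.usage.completion_tokens

async def main(n_sequences: int = 128) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_sequences)))
    elapsed = time.perf_counter() - start
    print(f"{n_sequences} seqs: {sum(counts) / elapsed:.1f} tok/s aggregate")

asyncio.run(main())
```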
## Original Model
See the [base model card](https://huggingface.co/TheDrummer/Behemoth-123B-v2.2) for architecture details, training data, and usage guidelines.