# Behemoth-123B-NVFP4

An NVFP4-quantized version of TheDrummer/Behemoth-123B-v2.2, optimized for NVIDIA DGX/Hopper+ architectures.
## Quantization Details
- Format: NVFP4 (4-bit floating point)
- Quantized using: NVIDIA TensorRT Model Optimizer 0.35.0
- Hardware: 2× NVIDIA H200 SXM (188GB each)
- Original size: 245GB (BF16) → 66GB (NVFP4)
- Compatible with: vLLM v0.10+, NVIDIA NGC containers
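
For reference, here is a minimal sketch of how an NVFP4 checkpoint like this one is typically produced with TensorRT Model Optimizer. It is an illustrative reconstruction, not the exact script used for this model: `calibration_prompts` is a placeholder dataset, and the config/export names (`NVFP4_DEFAULT_CFG`, `export_hf_checkpoint`) should be verified against the modelopt 0.35 documentation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

BASE = "TheDrummer/Behemoth-123B-v2.2"

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Placeholder calibration set; a few hundred representative prompts are typical.
calibration_prompts = ["..."]

def forward_loop(m):
    # Run calibration data through the model so modelopt can collect
    # the activation statistics used to choose NVFP4 scales.
    for prompt in calibration_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize weights/activations to NVFP4, then export an HF-layout
# checkpoint that vLLM can load with --quantization modelopt_fp4.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="Behemoth-123B-NVFP4")
```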
## Usage
```bash
vllm serve tbhot3ww/Behemoth-123B-NVFP4 \
  --quantization modelopt_fp4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
```
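
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch, assuming vLLM's default endpoint (`http://localhost:8000/v1`) and the `openai` Python package; the prompt is illustrative:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="tbhot3ww/Behemoth-123B-NVFP4",
    messages=[{"role": "user", "content": "Write a short scene set in a tavern."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```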
## Performance Benchmarks (NVIDIA GB10, 128GB unified memory)
| Sequences | Throughput (tok/s) | Per-Seq (tok/s) | KV Cache Usage | Notes |
|---|---|---|---|---|
| 12 | 32.4 | 2.7 | 1.6% | No queuing |
| 64 | 166.4 | 2.6 | 9.5% | Linear scaling |
| 128 | 307.1 | 2.4 | 21.3% | Sweet spot |
| 256 | 485.8 | 1.9 | 45.4% | Pre-queue limit |
| 512 | 665.5 | 1.3 | 88.9% | Near capacity |
| 768 | 768 peak / 424 avg | 0.6 | 100% | Queued batching; 6m2s total |
All tests generate 200 tokens per sequence. (As a sanity check on the last row: 768 × 200 = 153,600 tokens over 6m2s works out to the ~424 tok/s average shown above.)
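
For anyone wanting to reproduce numbers in this ballpark, here is a rough benchmark sketch: it fires N concurrent 200-token completions at the local server and reports aggregate throughput. The endpoint, prompt, and request shape are assumptions, not the exact harness used for the table above.

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    # One 200-token completion, matching the per-sequence length above.
    resp = await client.completions.create(
        model="tbhot3ww/Behemoth-123B-NVFP4",
        prompt="Once upon a time",
        max_tokens=200,
    )
    return resp.usage.completion_tokens

async def main(n_sequences: int = 128) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_sequences)))
    elapsed = time.perf_counter() - start
    print(f"{n_sequences} seqs: {sum(counts) / elapsed:.1f} tok/s aggregate")

asyncio.run(main())
```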
## Original Model
See the [base model card](https://huggingface.co/TheDrummer/Behemoth-123B-v2.2) for architecture details, training data, and usage guidelines.