---
license: mit
base_model:
- ByteDance-Seed/Seed-OSS-36B-Instruct
datasets:
- HuggingFaceH4/ultrachat_200k
pipeline_tag: text-generation
tags:
- vllm
- llmcompressor
- text-generation-inference
---

# Seed-OSS-36B-Instruct FP8 quantization (including KV-cache)

This repo contains Seed-OSS-36B-Instruct quantized to FP8, with an FP8 KV-cache.

Original model:
- [ByteDance-Seed/Seed-OSS-36B-Instruct](https://huggingface.co/ByteDance-Seed/Seed-OSS-36B-Instruct)

⚠️⚠️⚠️ This is currently a debugging upload. When served with an FP8 KV-cache (`--kv-cache-dtype 'fp8'`), it triggers a vLLM assertion because some scaling factors are not > 0.0.

## 📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000; the following script is suitable for that configuration.

```bash
export MODEL="mratsim/Seed-OSS-36B-Instruct-FP8-KV8"
vllm serve "${MODEL}" \
      --served-model-name seed-oss-36b \
      --tensor-parallel-size 2 \
      --kv-cache-dtype 'fp8' \
      --gpu-memory-utilization 0.85
```
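Once the server is up, it exposes vLLM's OpenAI-compatible API (port 8000 by default). Below is a minimal client sketch; the prompt and `max_tokens` value are placeholders, and the model name matches the `--served-model-name` alias configured above.

```python
# Minimal client sketch against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="seed-oss-36b",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Summarize FP8 quantization in one line."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```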
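## 🛠️ Quantization recipe (sketch)

The `llmcompressor` tag and the `ultrachat_200k` dataset reference suggest this checkpoint was produced with a standard llm-compressor FP8 + FP8-KV-cache flow. The sketch below shows what such a recipe typically looks like; the calibration sample count, sequence length, and exact scheme are assumptions, not the confirmed settings used for this upload.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "ByteDance-Seed/Seed-OSS-36B-Instruct"
SAVE_DIR = "Seed-OSS-36B-Instruct-FP8-KV8"

# Calibration budget -- these numbers are guesses, not the author's settings.
NUM_SAMPLES = 512
MAX_LEN = 4096

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Render chat transcripts to text, then tokenize for calibration.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(
    lambda ex: tokenizer(
        tokenizer.apply_chat_template(ex["messages"], tokenize=False),
        max_length=MAX_LEN,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# Static FP8 for Linear weights/activations plus an FP8 KV-cache scheme;
# lm_head stays in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["lm_head"],
    kv_cache_scheme={
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

Note that the KV-cache scales are computed statically during calibration here; a KV-cache scale that ends up at 0.0 (the assertion mentioned above) typically points to attention layers that never received calibration activations.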