Can I run this locally?

#4
by nvriese - opened

Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

When GGUFs come we can run it locally, just like Kimi K2.

> Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I was about to say... lol

Not sure why you're joking; once GGUFs land for llama.cpp and possibly ik_llama.cpp (I may take a crack at quantizing this one), you can likely run it on similar hardware as Kimi-K2, just slower given the larger number of active parameters.

@nvriese

> actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I would not recommend running it at full bf16 as it is designed to operate at fp8. At fp8 that's 8 bits per weight × 50B active parameters = 50GB of active weights per token. Assuming you run it at, say, a 4bpw quant on a ~512GB RAM + 48GB VRAM rig or something, you could do hybrid inference.
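To sanity-check that arithmetic, here's a minimal sketch (the 50B active-parameter figure comes from the question above; the bpw values are just illustrative):

```python
# Active-weight footprint per generated token for a MoE model.
# 50B active parameters is taken from the question above.
ACTIVE_PARAMS = 50e9

def active_weight_gb(bits_per_weight: float) -> float:
    """GB streamed per token = params * bits / 8 bits-per-byte / 1e9."""
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

for bpw in (16, 8, 4):
    print(f"{bpw:>2} bpw -> {active_weight_gb(bpw):.0f} GB active weights")
# 16 bpw -> 100 GB, 8 bpw -> 50 GB, 4 bpw -> 25 GB
```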

I haven't done the math yet, but if you quantize attn/shexp/the first N dense layers (assuming it's a similar arch to DeepSeek / Kimi-K2) at 6-8bpw and smash the routed experts down to like 2-4bpw, I'm guessing the active weight size drops to maybe ~20GB.
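Working that guess through with made-up numbers (a sketch only: the split between attn/shexp/dense weights and routed experts below is an assumption, since the real ratios depend on the actual architecture):

```python
# Hypothetical split of the 50B active parameters; the dense-ish vs.
# routed-expert proportions here are guesses, not confirmed figures.
DENSE_LIKE = 10e9  # attn + shared experts + first N dense layers (assumed)
ROUTED = 40e9      # routed expert weights activated per token (assumed)

def gb(params: float, bpw: float) -> float:
    return params * bpw / 8 / 1e9

# Dense-ish tensors at 6bpw, routed experts smashed to ~2.5bpw:
total = gb(DENSE_LIKE, 6.0) + gb(ROUTED, 2.5)
print(f"~{total:.0f} GB active weights per token")  # ~20 GB
```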

So it seems like it might be a bit out of reach of a single 24GB VRAM GPU rig with much usable context length anyway. Guessing even with good DDR5-6400MT/s or faster RAM, if your rig can hit ~512GB/s theoretical bandwidth in a single NUMA node, you might get like 100 tok/sec PP and 10 tok/sec TG on a good day lol.. just spitballing...
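That TG guess follows from a simple bandwidth-bound model: each generated token has to stream the active weights from memory once, so tokens/sec is capped at bandwidth ÷ active-weight size (a rough sketch that ignores KV-cache traffic, NUMA penalties, and compute overhead):

```python
# Bandwidth-bound upper bound on token generation (TG) speed.
BANDWIDTH_GBPS = 512   # assumed single-NUMA-node RAM bandwidth, GB/s
ACTIVE_WEIGHT_GB = 20  # from the mixed-quant estimate above

tg_ceiling = BANDWIDTH_GBPS / ACTIVE_WEIGHT_GB
print(f"~{tg_ceiling:.0f} tok/sec theoretical TG ceiling")  # ~26 tok/sec
# Real rigs land well below the ceiling, so ~10 tok/sec is plausible.
```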

Please take a crack at it, Ubergarm. Your GGUFs are really solid!! Would love to test these if you get them going. The biggest I can fit is Q3, but that should be good enough to get a solid idea of this model. Even Q2 Kimi K2 is surprisingly good, and on the polyglot benchmark it's within 1 point of Q4, I believe.

@Fernanda24

No promises, but I'm making some progress on Ling-1T-GGUF with a new PR on ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/837#issuecomment-3413794264 . I hope Hugging Face lets me upload the files given the recent changes to public storage limits!
