Can I run this locally?

#4
by nvriese - opened

Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

When GGUFs come we can run it locally, just like Kimi K2.

> Joking obviously, actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I was about to say... lol

Not sure why you're joking; once GGUFs land for llama.cpp and possibly ik_llama.cpp (I may take a crack at quantizing this one), you can likely run it on similar hardware as Kimi-K2, just slower given the larger number of active parameters.

@nvriese

> actual question: what's the VRAM needed for inference at BF16 precision with 50B active parameters?

I would not recommend running it at full bf16 as it is designed to operate at fp8. At fp8 that's 8 bits per weight × 50B active parameters = 50GB of active weights per token. Assuming you run it at, say, a 4bpw quant on a ~512GB RAM + 48GB VRAM rig or something, you could do hybrid inference.
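To sanity-check that arithmetic, here's a minimal sketch (the 50B active-parameter figure comes from the question above; the bpw values are just illustrative):

```python
# Active-weight footprint per generated token for a MoE model.
# 50B active parameters is taken from the question above.
ACTIVE_PARAMS = 50e9

def active_weight_gb(bits_per_weight: float) -> float:
    """GB streamed per token = params * bits / 8 bits-per-byte / 1e9."""
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

for bpw in (16, 8, 4):
    print(f"{bpw:>2} bpw -> {active_weight_gb(bpw):.0f} GB active weights")
# 16 bpw -> 100 GB, 8 bpw -> 50 GB, 4 bpw -> 25 GB
```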

I haven't done the math yet, but if you quantize attn/shexp/the first N dense layers (assuming it's a similar arch to DeepSeek / Kimi-K2) at 6-8bpw and smash the routed experts down to like 2-4bpw, I'm guessing the active weight size drops to maybe ~20GB.
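Working that guess through with made-up numbers (a sketch only: the split between attn/shexp/dense weights and routed experts below is an assumption, since the real ratios depend on the actual architecture):

```python
# Hypothetical split of the 50B active parameters; the dense-ish vs.
# routed-expert proportions here are guesses, not confirmed figures.
DENSE_LIKE = 10e9  # attn + shared experts + first N dense layers (assumed)
ROUTED = 40e9      # routed expert weights activated per token (assumed)

def gb(params: float, bpw: float) -> float:
    return params * bpw / 8 / 1e9

# Dense-ish tensors at 6bpw, routed experts smashed to ~2.5bpw:
total = gb(DENSE_LIKE, 6.0) + gb(ROUTED, 2.5)
print(f"~{total:.0f} GB active weights per token")  # ~20 GB
```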

So it seems like it might be a bit out of reach of a single 24GB VRAM GPU rig with much usable context length anyway. Guessing even with good DDR5-6400MT/s or faster RAM, if your rig can hit ~512GB/s theoretical bandwidth in a single NUMA node, you might get like 100 tok/sec PP and 10 tok/sec TG on a good day lol.. just spitballing...
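That TG guess follows from a simple bandwidth-bound model: each generated token has to stream the active weights from memory once, so tokens/sec is capped at bandwidth ÷ active-weight size (a rough sketch that ignores KV-cache traffic, NUMA penalties, and compute overhead):

```python
# Bandwidth-bound upper bound on token generation (TG) speed.
BANDWIDTH_GBPS = 512   # assumed single-NUMA-node RAM bandwidth, GB/s
ACTIVE_WEIGHT_GB = 20  # from the mixed-quant estimate above

tg_ceiling = BANDWIDTH_GBPS / ACTIVE_WEIGHT_GB
print(f"~{tg_ceiling:.0f} tok/sec theoretical TG ceiling")  # ~26 tok/sec
# Real rigs land well below the ceiling, so ~10 tok/sec is plausible.
```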

Please take a crack at it, Ubergarm. Your GGUFs are really solid!! Would love to test these if you get them going. The biggest I can fit is Q3, but that should be good enough to get a solid idea of this model. Even Q2 Kimi K2 is surprisingly good, and on the polyglot benchmark it's within 1 point of Q4, I believe.

@Fernanda24

No promises, but I'm making some progress on Ling-1T-GGUF with a new PR on ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/837#issuecomment-3413794264 . I hope Hugging Face lets me upload the files given the recent changes to public storage limits!
