iq2_ks is one of the smartest models that I have ever tested.

#4
by gopi87 - opened

It performed well in my basic coding test, and its replies were also very similar to some of the biggest closed coding models. Below is how I run the model; NUMA is 0 in my case.

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server --model "/home/gopi/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf" --ctx-size 12144 -fa -amb 512 -fmoe --n-gpu-layers 95 --override-tensor exps=CPU -b 200 -ub 200 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 -ser 8,1 --min-p 0.01 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080

@gopi87

Hey, thanks so much for testing this stuff out and giving a report! Glad to hear it, as I just uploaded a number of even "smarter" versions for folks with enough RAM+VRAM!

Regarding your command:

  • No need for -amb 512 as this is not an MLA model; it doesn't hurt anything, it's just not needed here.
  • -ser 8,1: isn't 8 experts already the default for this model? I'd have to double check; you might be able to leave that off.
  • -ub 200: what? Have you tried -ub 4096 -b 4096 --no-mmap for sometimes 1.5-3x PP gains? (Leave off -rtr if you try that.)
  • If you have enough VRAM you can offload more layers. It is different from DeepSeek/Kimi, as there are more ffn layers than just exps, so look at the model card example and get some more layers onto the GPU for faster TG (see the example command below).
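
Putting those suggestions together, a tweaked version of your command could look something like the sketch below. The -ot "blk\.(0|1|2|3)\.ffn.*=CUDA0" pattern and which layer indices fit are guesses that depend on your free VRAM, so check the model card example for the exact recipe:

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model "/home/gopi/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf" \
    --ctx-size 12144 -fa -fmoe \
    --n-gpu-layers 95 \
    -ot "blk\.(0|1|2|3)\.ffn.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 --no-mmap \
    --parallel 1 --threads 52 --threads-batch 52 \
    --temp 0.7 --top-p 0.8 --min-p 0.01 \
    --host 127.0.0.1 --port 8080

Note that -amb 512, -ser 8,1, and --run-time-repack are dropped per the points above, and the CUDA0 override comes before exps=CPU so those layers' ffn tensors land on the GPU.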

Cheers!

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  15.428 |   265.48 | 142.048 |     7.21 |
| 4096 | 1024 |  4096 |  17.450 |   234.73 | 151.993 |     6.74 |
| 4096 | 1024 |  8192 |  20.683 |   198.03 | 162.897 |     6.29 |
| 4096 | 1024 | 12288 |  23.344 |   175.46 | 173.802 |     5.89 |
| 4096 | 1024 | 16384 |  25.789 |   158.82 | 185.189 |     5.53 |
| 4096 | 1024 | 20480 |  27.956 |   146.51 | 197.602 |     5.18 |
| 4096 | 1024 | 24576 |  31.103 |   131.69 | 213.177 |     4.80 |
| 4096 | 1024 | 28672 |  30.260 |   135.36 | 228.467 |     4.48 |

Consumer DDR4 system.
-rtr and --no-mmap don't seem to make much of a difference for me.
PP was sub-100 t/s with mainline llama.cpp.
I can't say much about quality yet; I've only been downloading, compiling, and sweep benchmarking today.
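
For reference, that table is the output format of ik_llama.cpp's llama-sweep-bench tool; a run along these lines produces it. The flags below are an assumption about this setup (context size, threads, and offload pattern are placeholders), not the exact invocation used:

./build/bin/llama-sweep-bench \
    --model /path/to/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf \
    -fa -fmoe \
    -c 32768 \
    -ub 4096 -b 4096 \
    --n-gpu-layers 95 \
    -ot exps=CPU \
    --threads 16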

Thanks again Ubergarm!

For my coding tasks and general use it was a really great one; I am testing other quants too.

Above 4k -ub/-b and with --no-mmap, quality is not good.

Correct, I recommend not going above -ub 4096 -b 4096, but those values are good for fast PP.

The --no-mmap just keeps all the weights in THP (transparent huge pages) sometimes, but it usually doesn't affect speed, especially on more consumer rigs. Some servers may benefit from it, but that would likely require flushing caches so the model loads less fragmented. This is mostly speculation lol...
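
If you want to check whether the weights actually ended up in huge pages on Linux, the standard procfs/sysfs counters show it (nothing ik_llama.cpp specific; the pgrep pattern just assumes the server process is named llama-server):

# current THP policy on the box
cat /sys/kernel/mm/transparent_hugepage/enabled
# system-wide anonymous memory currently backed by huge pages
grep AnonHugePages /proc/meminfo
# per-process view while the server is running
grep AnonHugePages /proc/$(pgrep -f llama-server | head -n1)/smaps_rollup

With --no-mmap the weights sit in anonymous memory, so AnonHugePages is the counter that should grow; with mmap they are file-backed and THP typically won't apply to them.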

But you guys both have good speeds, and glad the model is working for ya!

I haven't had a chance to try your quants yet, but I wanted to share my appreciation for ubergarm's timely and important work.

Thank you, good being of light :)
