iq2_ks is one of the smartest models that I have ever tested.

#4
by gopi87 - opened

It performed well in my basic coding test, and its replies were also very similar to some of the biggest closed coding models. Below is how I run the model; NUMA is 0 in my case.

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server --model "/home/gopi/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf" --ctx-size 12144 -fa -amb 512 -fmoe --n-gpu-layers 95 --override-tensor exps=CPU -b 200 -ub 200 --parallel 1 --threads 52 --threads-batch 52 --temp 0.7 -ser 8,1 --min-p 0.01 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080

@gopi87

Hey, thanks so much for testing this stuff out and giving a report! Glad to hear it, as I just uploaded a number of even "smarter" versions for folks with enough RAM+VRAM!

Regarding your command:

  • No need for -amb 512 as this is not an MLA model; it doesn't hurt anything, it's just not needed here.
  • -ser 8,1: isn't 8 experts already the default for this model? I'd have to double check; you might be able to leave that off.
  • -ub 200: what? Have you tried -ub 4096 -b 4096 --no-mmap for sometimes 1.5-3x PP gains? (Leave off -rtr if you try that.)
  • If you have enough VRAM you can offload more layers. It is different from DeepSeek/Kimi, as there are more ffn layers than just exps, so look at the model card example and get some more layers onto the GPU for faster TG (see the example command below).
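
Putting those suggestions together, a tweaked version of your command could look something like the sketch below. The -ot "blk\.(0|1|2|3)\.ffn.*=CUDA0" pattern and which layer indices fit are guesses that depend on your free VRAM, so check the model card example for the exact recipe:

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model "/home/gopi/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf" \
    --ctx-size 12144 -fa -fmoe \
    --n-gpu-layers 95 \
    -ot "blk\.(0|1|2|3)\.ffn.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 --no-mmap \
    --parallel 1 --threads 52 --threads-batch 52 \
    --temp 0.7 --top-p 0.8 --min-p 0.01 \
    --host 127.0.0.1 --port 8080

Note that -amb 512, -ser 8,1, and --run-time-repack are dropped per the points above, and the CUDA0 override comes before exps=CPU so those layers' ffn tensors land on the GPU.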

Cheers!

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  15.428 |   265.48 | 142.048 |     7.21 |
| 4096 | 1024 |  4096 |  17.450 |   234.73 | 151.993 |     6.74 |
| 4096 | 1024 |  8192 |  20.683 |   198.03 | 162.897 |     6.29 |
| 4096 | 1024 | 12288 |  23.344 |   175.46 | 173.802 |     5.89 |
| 4096 | 1024 | 16384 |  25.789 |   158.82 | 185.189 |     5.53 |
| 4096 | 1024 | 20480 |  27.956 |   146.51 | 197.602 |     5.18 |
| 4096 | 1024 | 24576 |  31.103 |   131.69 | 213.177 |     4.80 |
| 4096 | 1024 | 28672 |  30.260 |   135.36 | 228.467 |     4.48 |

Consumer DDR4 system.
-rtr and --no-mmap don't seem to make much of a difference for me.
PP was sub-100 t/s with mainline llama.cpp.
I can't say much about quality yet; I've only been downloading, compiling, and sweep benchmarking today.
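
For reference, that table is the output format of ik_llama.cpp's llama-sweep-bench tool; a run along these lines produces it. The flags below are an assumption about this setup (context size, threads, and offload pattern are placeholders), not the exact invocation used:

./build/bin/llama-sweep-bench \
    --model /path/to/Qwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf \
    -fa -fmoe \
    -c 32768 \
    -ub 4096 -b 4096 \
    --n-gpu-layers 95 \
    -ot exps=CPU \
    --threads 16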

Thanks again Ubergarm!

For my coding tasks and general use it was a really great one; I am testing other quants too.

Above 4k -ub/-b and with --no-mmap, quality is not good.

Correct, I recommend not going above -ub 4096 -b 4096, but those values are good for fast PP.

The --no-mmap just keeps all the weights in THP (transparent huge pages) sometimes, but it usually doesn't affect speed, especially on more consumer rigs. Some servers may benefit from it, but that would likely require flushing caches so the model loads less fragmented. This is mostly speculation lol...
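
If you want to check whether the weights actually ended up in huge pages on Linux, the standard procfs/sysfs counters show it (nothing ik_llama.cpp specific; the pgrep pattern just assumes the server process is named llama-server):

# current THP policy on the box
cat /sys/kernel/mm/transparent_hugepage/enabled
# system-wide anonymous memory currently backed by huge pages
grep AnonHugePages /proc/meminfo
# per-process view while the server is running
grep AnonHugePages /proc/$(pgrep -f llama-server | head -n1)/smaps_rollup

With --no-mmap the weights sit in anonymous memory, so AnonHugePages is the counter that should grow; with mmap they are file-backed and THP typically won't apply to them.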

But you guys both have good speeds, and glad the model is working for ya!

I haven't had a chance to try your quants yet, but I wanted to share my appreciation for ubergarm's timely and important work.

Thank you, good being of light :)
