GGUF not working in Ollama

#3
by 32jochen - opened

When I try to run the model in Ollama, the following error occurs. I rebuilt the GGUF myself, but the same error also appears when running the provided GGUF directly:

Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: loaded meta data with 31 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/>
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 0: general.architecture str = qwen3vlmoe
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 1: general.type str = model
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 2: general.name str = Model
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 3: general.size_label str = 128x1.8B
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 4: general.license str = apache-2.0
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 5: general.tags arr[str,1] = ["image-text-to-text"]
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 6: qwen3vlmoe.block_count u32 = 48
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 7: qwen3vlmoe.context_length u32 = 262144
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 8: qwen3vlmoe.embedding_length u32 = 2048
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 9: qwen3vlmoe.feed_forward_length u32 = 6144
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 10: qwen3vlmoe.attention.head_count u32 = 32
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 11: qwen3vlmoe.attention.head_count_kv u32 = 4
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 12: qwen3vlmoe.rope.freq_base f32 = 5000000.000000
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 13: qwen3vlmoe.attention.layer_norm_rms_epsilon f32 = 0.000001
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 14: qwen3vlmoe.expert_used_count u32 = 8
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 15: qwen3vlmoe.attention.key_length u32 = 128
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 16: qwen3vlmoe.attention.value_length u32 = 128
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 17: general.file_type u32 = 7
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 18: qwen3vlmoe.expert_count u32 = 128
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 19: qwen3vlmoe.expert_feed_forward_length u32 = 768
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 20: general.quantization_version u32 = 2
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", ">
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1>
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n",>
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151643
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 151643
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - kv 30: tokenizer.chat_template str = {%- set image_count = n>
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - type f32: 241 tensors
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_loader: - type q8_0: 338 tensors
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: print_info: file format = GGUF V3 (latest)
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: print_info: file type = Q8_0
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: print_info: file size = 30.25 GiB (8.51 BPW)
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3vlmoe'
Oct 10 12:44:46 nico-ThinkPad-P16-Gen-2 ollama[2274]: llama_model_load_from_file_impl: failed to load model

Is this a bug or do I have to modify something in Ollama?
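The loader is rejecting the `general.architecture` string in the file's metadata, so one way to confirm what a GGUF actually declares (without any runtime involved) is to read the header directly. Below is a minimal sketch of a reader for the GGUF v2/v3 header layout, using only the standard library; it assumes `general.architecture` is the first metadata key, which is the convention followed by the llama.cpp conversion scripts:

```python
import struct

# GGUF metadata value type code for a string (per the GGUF spec)
GGUF_TYPE_STRING = 8

def read_gguf_architecture(path):
    """Read general.architecture from a GGUF v2/v3 file header.

    Assumes general.architecture is the first metadata key-value pair,
    as written by the llama.cpp conversion scripts.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))
        if version < 2:
            raise ValueError(f"unsupported GGUF version {version}")
        # v2+: tensor count and metadata KV count are little-endian uint64
        _tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
        if kv_count == 0:
            raise ValueError("no metadata in file")
        # First KV pair: key is a uint64 length-prefixed UTF-8 string
        key_len, = struct.unpack("<Q", f.read(8))
        key = f.read(key_len).decode("utf-8")
        value_type, = struct.unpack("<I", f.read(4))
        if key != "general.architecture" or value_type != GGUF_TYPE_STRING:
            raise ValueError(f"unexpected first metadata key {key!r}")
        val_len, = struct.unpack("<Q", f.read(8))
        return f.read(val_len).decode("utf-8")
```

If this prints `qwen3vlmoe` for your file, the GGUF itself is fine and the error simply means the runtime's architecture table doesn't include it yet.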

I encountered this error with both Ollama and llama.cpp.

Not working with Ollama:

error loading model: error loading model architecture: unknown model architecture: 'qwen3vlmoe'

Not working with LM Studio either.

The models won't work in Ollama or LM Studio because support for this architecture lives on a separate branch that hasn't been merged into llama.cpp main.
Once that branch is merged, Ollama pulls in the updated runtime, and LM Studio updates its engine, they'll work.
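Once the architecture does land in llama.cpp main, a quick way to verify support ahead of an Ollama or LM Studio release is to build llama.cpp from source and load the GGUF directly. A sketch using the standard CMake build steps from the llama.cpp README (the model path is a placeholder for your own file):

```shell
# Build llama.cpp from the current main branch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Try to load the GGUF; an unsupported architecture fails at load time
# with the same "unknown model architecture" error seen in the log above
./build/bin/llama-cli -m /path/to/model-Q8_0.gguf -p "Hello" -n 32
```

If the freshly built llama-cli loads the model but Ollama still errors, the remaining gap is on the Ollama side (its bundled runtime hasn't picked up the change yet).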