YaRN context extension does not seem to work in the CUDA build; the context stays at 32k

by houxiaowei - opened Sep 26

.\llama-server.exe -m ....\Ling-mini-2.0-Q4_K_M.gguf -c 133072 -fa 1 -a Ling-mini-2.0 --jinja --rope-scaling yarn --yarn-orig-ctx 32768
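
For orientation, the extension factor this command asks for can be worked out from its two numbers. The snippet below is a minimal sketch; the function and variable names are hypothetical and only the two values come from the command line. llama.cpp also has a `--rope-scale` option for passing such a factor explicitly. Whether the missing factor is the cause of the behaviour reported here is an assumption, not something this post confirms.

```python
# Values copied from the command line in this post.
REQUESTED_CTX = 133072  # -c
ORIG_CTX = 32768        # --yarn-orig-ctx


def extension_factor(requested_ctx: int, orig_ctx: int) -> float:
    """Ratio between the requested context and the original training context."""
    if orig_ctx <= 0:
        raise ValueError("orig_ctx must be positive")
    return requested_ctx / orig_ctx


factor = extension_factor(REQUESTED_CTX, ORIG_CTX)
print(f"requested / original context = {factor:.3f}")
```

The result is slightly above 4, so a 4x extension (131072 tokens) would fall a little short of the requested 133072.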
PS D:\model\llama-ling\llama-b6570-bin-win-cuda-12.4-x64> .\llama-server.exe -m ....\Ling-mini-2.0-Q4_K_M.gguf -c 133072 -fa 1 -a Ling-mini-2.0 --jinja --rope-scaling yarn --yarn-orig-ctx 32768
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080 with Max-Q Design, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from D:\model\llama-ling\llama-b6570-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from D:\model\llama-ling\llama-b6570-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\model\llama-ling\llama-b6570-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
build: 6570 (58fb8dfc) with clang version 19.1.5 for x86_64-pc-windows-msvc
system info: n_threads = 6, n_threads_batch = 6, total_threads = 12

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 11
main: loading model
srv load_model: loading model '....\Ling-mini-2.0-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2080 with Max-Q Design) (0000:01:00.0) - 15270 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 278 tensors from ....\Ling-mini-2.0-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bailingmoe2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Ling Mini 2.0
llama_model_loader: - kv 3: general.version str = 2.0
llama_model_loader: - kv 4: general.basename str = Ling
llama_model_loader: - kv 5: general.size_label str = mini
llama_model_loader: - kv 6: general.license str = MIT License
llama_model_loader: - kv 7: bailingmoe2.block_count u32 = 20
llama_model_loader: - kv 8: bailingmoe2.context_length u32 = 32768
llama_model_loader: - kv 9: bailingmoe2.embedding_length u32 = 2048
llama_model_loader: - kv 10: bailingmoe2.feed_forward_length u32 = 5120
llama_model_loader: - kv 11: bailingmoe2.attention.head_count u32 = 16
llama_model_loader: - kv 12: bailingmoe2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: bailingmoe2.rope.freq_base f32 = 600000.000000
llama_model_loader: - kv 14: bailingmoe2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: bailingmoe2.expert_used_count u32 = 8
llama_model_loader: - kv 16: bailingmoe2.attention.key_length u32 = 128
llama_model_loader: - kv 17: bailingmoe2.attention.value_length u32 = 128
llama_model_loader: - kv 18: bailingmoe2.rope.dimension_count u32 = 64
llama_model_loader: - kv 19: bailingmoe2.rope.scaling.type str = none
llama_model_loader: - kv 20: bailingmoe2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 21: bailingmoe2.vocab_size u32 = 157184
llama_model_loader: - kv 22: bailingmoe2.expert_feed_forward_length u32 = 512
llama_model_loader: - kv 23: bailingmoe2.expert_shared_feed_forward_length u32 = 512
llama_model_loader: - kv 24: bailingmoe2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 25: bailingmoe2.expert_count u32 = 256
llama_model_loader: - kv 26: bailingmoe2.expert_shared_count u32 = 1
llama_model_loader: - kv 27: bailingmoe2.expert_group_count u32 = 8
llama_model_loader: - kv 28: bailingmoe2.expert_group_used_count u32 = 4
llama_model_loader: - kv 29: bailingmoe2.expert_weights_norm bool = true
llama_model_loader: - kv 30: bailingmoe2.expert_gating_func u32 = 2
llama_model_loader: - kv 31: bailingmoe2.nextn_predict_layers u32 = 0
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = bailingmoe2
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,157184] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,157184] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,156635] = ["ĠĠ", "Ġ t", "i n", "Ġ a", "h e...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 156891
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 156895
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 156892
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 42: tokenizer.chat_template str = {% set thinking_option = 'off' %}\n{{-...
llama_model_loader: - kv 43: general.quantization_version u32 = 2
llama_model_loader: - kv 44: general.file_type u32 = 15
llama_model_loader: - type f32: 119 tensors
llama_model_loader: - type q4_K: 119 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 20 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 9.22 GiB (4.87 BPW)
load: printing all EOG tokens:
load: - 156892 ('<|endoftext|>')
load: - 156895 ('<|role_end|>')
load: special tokens cache size = 262
load: token to piece cache size = 1.0010 MB
print_info: arch = bailingmoe2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 2048
print_info: n_layer = 20
print_info: n_head = 16
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 5120
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = none
print_info: freq_base_train = 600000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 16B.A1B
print_info: model params = 16.26 B
print_info: general.name = Ling Mini 2.0
print_info: n_layer_dense_lead = 1
print_info: n_ff_exp = 512
print_info: n_ff_shexp = 512
print_info: n_expert_shared = 1
print_info: n_expert_groups = 8
print_info: n_group_exp = 4
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: nextn_predict_layers = 0
print_info: vocab type = BPE
print_info: n_vocab = 157184
print_info: n_merges = 156635
print_info: BOS token = 156891 '<|startoftext|>'
print_info: EOS token = 156895 '<|role_end|>'
print_info: EOT token = 156892 '<|endoftext|>'
print_info: PAD token = 156892 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 156892 '<|endoftext|>'
print_info: EOG token = 156895 '<|role_end|>'
print_info: max token length = 154
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 20 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 21/21 layers to GPU
load_tensors: CPU_Mapped model buffer size = 172.69 MiB
load_tensors: CUDA0 model buffer size = 9273.53 MiB
.............................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 133072
llama_context: n_ctx_per_seq = 133072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = false
llama_context: freq_base = 600000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (133072) > n_ctx_train (32768) -- possible training context overflow
llama_context: CUDA_Host output buffer size = 0.60 MiB
llama_kv_cache: CUDA0 KV buffer size = 5200.00 MiB
llama_kv_cache: size = 5200.00 MiB (133120 cells, 20 layers, 1/1 seqs), K (f16): 2600.00 MiB, V (f16): 2600.00 MiB
llama_context: CUDA0 compute buffer size = 398.01 MiB
llama_context: CUDA_Host compute buffer size = 264.01 MiB
llama_context: graph nodes = 1313
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|role_end|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 133120
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 133120
srv init: Enable thinking? 0
main: model loaded
main: chat template, chat_template: {% set thinking_option = 'off' %}
{{- 'SYSTEM' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n' }}
{%- endif %}
{%- if tools %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within XML tags:\n" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n\n\nFor each function call, return a json object with function name and arguments within XML tags:\n\n{"name": , "arguments": }\n\n" }}
{%- endif %}
{{- 'detailed thinking ' + thinking_option + '<|role_end|>' }}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('') and message.content.endswith('')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if message.role == "user" %}
{{- 'HUMAN' + message.content + '<|role_end|>' }}
{%- elif message.role == "system" and not loop.first %}
{{- 'SYSTEM' + message.content + '<|role_end|>' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '' in content %}
{%- set reasoning_content = content.split('')[0].rstrip('\n').split('')[-1].lstrip('\n') %}
{%- set content = content.split('')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if reasoning_content %}
{{- 'ASSISTANT' + '\n\n' + reasoning_content.strip('\n') + '\n\n\n' + content.lstrip('\n') }}
{%- else %}
{{- 'ASSISTANT' + content }}
{%- endif %}
{%- else %}
{{- 'ASSISTANT' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n' }}
{%- endfor %}
{%- endif %}
{{- '<|role_end|>' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- 'OBSERVATION' }}
{%- endif %}
{{- '\n\n' }}
{{- content }}
{{- '\n' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|role_end|>' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- 'ASSISTANT' }}
{%- endif %}, example_format: 'SYSTEMYou are a helpful assistant
detailed thinking off<|role_end|>HUMANHello<|role_end|>ASSISTANTHi there<|role_end|>HUMANHow are you?<|role_end|>ASSISTANT'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv operator(): operator(): cleaning up before exit...
Received second interrupt, terminating immediately.
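
A few numbers in the log above can be cross-checked with a short script. This is a rough sketch: the helper `field` is hypothetical, the excerpt is copied from the log, and it assumes that `freq_scale` in the `llama_context` lines is the scaling actually used at runtime, so that a value of 1 would mean positions are not being compressed. It also recomputes the f16 KV cache size from the cell count, layer count and `n_embd_k_gqa`.

```python
import re

# Excerpt copied from the log in this post.
LOG = """\
print_info: n_ctx_train = 32768
print_info: n_embd_k_gqa = 512
llama_context: n_ctx = 133072
llama_context: freq_scale = 1
llama_kv_cache: size = 5200.00 MiB (133120 cells, 20 layers, 1/1 seqs), K (f16): 2600.00 MiB, V (f16): 2600.00 MiB
"""


def field(log: str, key: str) -> str:
    """Return the first token after '<prefix>: <key> =' in the log."""
    pattern = rf"^\w+:\s+{re.escape(key)}\s*=\s*(\S+)"
    match = re.search(pattern, log, flags=re.MULTILINE)
    if match is None:
        raise KeyError(key)
    return match.group(1)


n_ctx_train = int(field(LOG, "n_ctx_train"))
n_ctx = int(field(LOG, "n_ctx"))
freq_scale = float(field(LOG, "freq_scale"))
n_embd_k_gqa = int(field(LOG, "n_embd_k_gqa"))

# Scale that would map the requested context onto the training context
# (assumption: scaled RoPE would report roughly this value instead of 1).
expected_scale = n_ctx_train / n_ctx

# KV cache geometry from the llama_kv_cache line.
cells, layers = map(int, re.search(r"\((\d+) cells, (\d+) layers", LOG).groups())
BYTES_F16 = 2
k_mib = cells * layers * n_embd_k_gqa * BYTES_F16 / 2**20

print(f"freq_scale in log: {freq_scale}")
print(f"scale implied by n_ctx_train / n_ctx: {expected_scale:.4f}")
print(f"K cache (f16): {k_mib:.2f} MiB")
```

Under that assumption, the logged `freq_scale = 1` together with the "possible training context overflow" warning suggests that the full 133072-token window was allocated but no positional scaling was applied.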
