model_q4f16.onnx running issue
#4, opened by uday610
Hi,

I was able to run model.onnx, model_fp16.onnx, and model_q4.onnx using the sample ONNX code (with a minor change for model_fp16.onnx). Both model.onnx and its quantized version model_q4.onnx run correctly with the sample ONNX Runtime code provided in the model card. For model_fp16.onnx, I needed to change the past_cache_values dtype to np.float16:
for i in range(num_hidden_layers):
    if layer_types[i] == 'full_attention':
        for kv in ('key', 'value'):
            past_cache_values[f'past_key_values.{i}.{kv}'] = np.zeros(
                [batch_size, num_key_value_heads, 0, head_dim], dtype=np.float16
            )
    elif layer_types[i] == 'conv':
        past_cache_values[f'past_conv.{i}'] = np.zeros(
            [batch_size, hidden_size, conv_L_cache], dtype=np.float16
        )
    else:
        raise ValueError(f"Unsupported layer type: {layer_types[i]}")
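
As an aside, rather than hardcoding np.float16, one way to keep a single cache-initialization path across the fp32 and fp16 variants is to read the expected element type of each cache input from the session's own input metadata. Below is a minimal sketch of that idea; the model path and the dtype-lookup helper are my own assumptions, while variables like num_hidden_layers, layer_types, batch_size, etc. come from the model-card sample code above:

import numpy as np
import onnxruntime as ort

# Assumption: the fp16 variant sits next to the script under this name.
session = ort.InferenceSession("model_fp16.onnx")

# Map ONNX type strings reported by get_inputs() to NumPy dtypes.
onnx_to_np = {"tensor(float)": np.float32, "tensor(float16)": np.float16}

# Hypothetical lookup: expected dtype per cache input, taken from the model itself.
cache_dtypes = {
    inp.name: onnx_to_np[inp.type]
    for inp in session.get_inputs()
    if inp.name.startswith(("past_key_values.", "past_conv."))
}

past_cache_values = {}
for i in range(num_hidden_layers):
    if layer_types[i] == 'full_attention':
        for kv in ('key', 'value'):
            name = f'past_key_values.{i}.{kv}'
            past_cache_values[name] = np.zeros(
                [batch_size, num_key_value_heads, 0, head_dim],
                dtype=cache_dtypes[name],
            )
    elif layer_types[i] == 'conv':
        name = f'past_conv.{i}'
        past_cache_values[name] = np.zeros(
            [batch_size, hidden_size, conv_L_cache],
            dtype=cache_dtypes[name],
        )
    else:
        raise ValueError(f"Unsupported layer type: {layer_types[i]}")
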
However, model_q4f16.onnx always produces random output, and this happens across all three models (350M, 700M, 1.2B) I tried. Could you please confirm whether model_q4f16.onnx requires any special adjustments?

Thanks,