model_q4f16.onnx running issue

#4
by uday610

Hi,

I was able to run model.onnx, model_fp16.onnx, and model_q4.onnx with the sample ONNX Runtime code from the model card (with a minor change for model_fp16.onnx).

model.onnx and its quantized version model_q4.onnx run correctly as-is. For model_fp16.onnx, I had to change the past_cache_values dtype to np.float16:

# Same cache-initialization loop as the sample code, but with the dtype
# switched to np.float16; the config values (num_hidden_layers, layer_types,
# etc.) and the past_cache_values dict are set up as in the model card sample.
import numpy as np

for i in range(num_hidden_layers):
    if layer_types[i] == 'full_attention':
        # Empty fp16 KV cache for attention layers.
        for kv in ('key', 'value'):
            past_cache_values[f'past_key_values.{i}.{kv}'] = np.zeros(
                [batch_size, num_key_value_heads, 0, head_dim], dtype=np.float16
            )
    elif layer_types[i] == 'conv':
        # fp16 convolution cache for conv layers.
        past_cache_values[f'past_conv.{i}'] = np.zeros(
            [batch_size, hidden_size, conv_L_cache], dtype=np.float16
        )
    else:
        raise ValueError(f"Unsupported layer type: {layer_types[i]}")
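
In case it helps, here is a small sketch of how the expected cache dtype could be read from the session itself instead of hardcoded. The filename is just an assumed local path, and the past_key_values. / past_conv. prefixes are the cache input names used by these exports.

import numpy as np
import onnxruntime as ort

# Assumed local path; point this at whichever variant you are testing.
session = ort.InferenceSession("model_q4f16.onnx")

# List what the graph actually declares for its cache inputs.
cache_inputs = [
    inp for inp in session.get_inputs()
    if inp.name.startswith(("past_key_values.", "past_conv."))
]
for inp in cache_inputs:
    print(inp.name, inp.type, inp.shape)

# Derive the numpy dtype from the first cache input instead of hardcoding it;
# cache_dtype can then replace np.float16 in the loop above.
cache_dtype = np.float16 if cache_inputs[0].type == "tensor(float16)" else np.float32

That way the same initialization loop works for whichever variant is loaded.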

However, model_q4f16.onnx always produces random output, and it does so for all three model sizes I tried (350M, 700M, and 1.2B).

Could you please confirm whether model_q4f16.onnx requires any special adjustments?

Thanks,
