Failed to reproduce evaluation result on AIME24

#18
by cppowboy - opened

I tried to evaluate this model on AIME24 using the lighteval evaluation tool, but the result I get is only approximately 81 (I have already modified the lighteval code to score pass@1 averaged over 16 samples). Here is my evaluation script.

# Model path
MODELS="/local_path_to_model/DeepSeek-R1-0528-Qwen3-8B/"

# Model args template
MODEL_ARGS_TEMPLATE="model_name=%s,trust_remote_code=True,dtype=bfloat16,max_num_batched_tokens=65536,max_model_length=65536,gpu_memory_utilization=0.95,data_parallel_size=1,tensor_parallel_size=4,generation_parameters={max_new_tokens:65000,temperature:0.6,top_p:0.95}"
MODEL_ARGS=$(printf "$MODEL_ARGS_TEMPLATE" "$MODELS")
CUDA_VISIBLE_DEVICES="0,1,2,3" \
    lighteval vllm "$MODEL_ARGS"  "lighteval|aime24|0|0" \
    --use-chat-template \
    --save-details \
    --output-dir "evaluation_results"
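For reference, the pass@1-over-16-samples scoring mentioned above can be sketched with the standard unbiased pass@k estimator; with k=1 it reduces to the fraction of correct samples. This is a minimal illustration of the metric, not the actual lighteval patch (the function name and usage here are my own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 over 16 samples is just the mean correctness rate per problem
print([pass_at_k(16, c, 1) for c in (0, 8, 16)])  # [0.0, 0.5, 1.0]
```

The final AIME24 score is then the average of these per-problem values over all 30 problems.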

Other info

vllm 0.9.1
transformers 4.51.3
torch 2.7.0

Are there any suggestions for reproducing the reported evaluation result?

cppowboy changed discussion status to closed
