Failed to reproduce evaluation result on AIME24

#18
by cppowboy - opened

I tried to evaluate this model on AIME24 using the lighteval evaluation tool, but the result I get is only approximately 81 (I have already modified the lighteval code to score pass@1 averaged over 16 samples). Here is my evaluation script.

# Model path
MODELS="/local_path_to_model/DeepSeek-R1-0528-Qwen3-8B/"

# Model args template
MODEL_ARGS_TEMPLATE="model_name=%s,trust_remote_code=True,dtype=bfloat16,max_num_batched_tokens=65536,max_model_length=65536,gpu_memory_utilization=0.95,data_parallel_size=1,tensor_parallel_size=4,generation_parameters={max_new_tokens:65000,temperature:0.6,top_p:0.95}"
MODEL_ARGS=$(printf "$MODEL_ARGS_TEMPLATE" "$MODELS")
CUDA_VISIBLE_DEVICES="0,1,2,3" \
    lighteval vllm "$MODEL_ARGS"  "lighteval|aime24|0|0" \
    --use-chat-template \
    --save-details \
    --output-dir "evaluation_results"
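For reference, the pass@1-over-16-samples scoring mentioned above can be sketched with the standard unbiased pass@k estimator; with k=1 it reduces to the fraction of correct samples. This is a minimal illustration of the metric, not the actual lighteval patch (the function name and usage here are my own):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 over 16 samples is just the mean correctness rate per problem
print([pass_at_k(16, c, 1) for c in (0, 8, 16)])  # [0.0, 0.5, 1.0]
```

The final AIME24 score is then the average of these per-problem values over all 30 problems.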

Other info

vllm 0.9.1
transformers 4.51.3
torch 2.7.0

Are there any suggestions for reproducing the reported evaluation result?

cppowboy changed discussion status to closed
