Failed to reproduce evaluation result on AIME24
#18
opened by cppowboy
I tried to evaluate this model on AIME24 using the lighteval evaluation tool. The result I get is approximately 81 on AIME24 (I had already modified the lighteval code to report pass@1 averaged over 16 samples). Here is my evaluation script:
# Model path
MODEL="/local_path_to_model/DeepSeek-R1-0528-Qwen3-8B/"
# Model argument template
MODEL_ARGS_TEMPLATE="model_name=%s,trust_remote_code=True,dtype=bfloat16,max_num_batched_tokens=65536,max_model_length=65536,gpu_memory_utilization=0.95,data_parallel_size=1,tensor_parallel_size=4,generation_parameters={max_new_tokens:65000,temperature:0.6,top_p:0.95}"
MODEL_ARGS=$(printf "$MODEL_ARGS_TEMPLATE" "$MODEL")
CUDA_VISIBLE_DEVICES="0,1,2,3" \
lighteval vllm "$MODEL_ARGS" "lighteval|aime24|0|0" \
--use-chat-template \
--save-details \
--output-dir "evaluation_results"
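For reference, the `pass@1_16_samples` metric mentioned above presumably averages per-sample correctness over 16 generations, which is the k=1 case of the standard unbiased pass@k estimator. A minimal sketch of that estimator (an assumption about the metric's definition, not lighteval's actual code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    sampled completions is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 this reduces to mean correctness c/n,
# i.e. pass@1 over 16 samples = (# correct samples) / 16.
print(pass_at_k(16, 12, 1))  # 0.75
```

With temperature 0.6 sampling and only 30 AIME problems, this estimate still carries noticeable run-to-run variance, which may account for part of a gap from a reported score.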
Other info:
vllm 0.9.1
transformers 4.51.3
torch 2.7.0
Are there any suggestions to reproduce the same evaluation result?
cppowboy changed discussion status to closed