---
license: mit
language:
- en
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
tags:
- GRPO
- DAPO
- GCPO
- RL
- RLVR
---

**[GCPO: When Contrast Fails, Go Gold](https://arxiv.org/abs/2510.07790)**

**Read the paper on arXiv:** 👉 https://arxiv.org/abs/2510.07790

**GitHub:** https://github.com/AchoWu/GCPO

**GCPO (Group Contrastive Policy Optimization)** is a reinforcement learning algorithm designed to enhance the reasoning capabilities of language models, especially in scenarios where the model fails to generate correct responses. Unlike previous methods such as GRPO, which rely solely on the model's own rollouts, GCPO introduces **Golden Answers (GAs)**, external reference answers that guide the model's updates when all sampled responses are incorrect.

This approach ensures:

- ✅ **Full sample utilization**: no training data is wasted
- 🧠 **Knowledge transfer**: small models learn reasoning strategies from larger models
- 🚀 **Faster convergence** and **better generalization**

---

## 🛠️ Model Use

### ✅ Use with Hugging Face Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Ach0/GCPO-R1-1.5B"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

question = """
Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.

Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""

messages = [
    {"role": "user", "content": question}
]

# Render the conversation with the model's chat template.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192)

# The decoded text contains the prompt followed by the model's response.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### ✅ Use with vLLM (fast inference)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "Ach0/GCPO-R1-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    max_tokens=8192,
)

question = """
Solve the following math problem efficiently and clearly. The last line of your response should be of the following format: 'Therefore, the final answer is: $\\boxed{{ANSWER}}$. I hope it is correct' (without quotes) where ANSWER is just the final number or expression that solves the problem. Think step by step before answering.

Point $B$ is on $\\overline{AC}$ with $AB = 9$ and $BC = 21.$ Point $D$ is not on $\\overline{AC}$ so that $AD = CD,$ and $AD$ and $BD$ are integers. Let $s$ be the sum of all possible perimeters of $\\triangle ACD$. Find $s.$
"""

messages = [
    {"role": "user", "content": question}
]

# Render the conversation with the model's chat template.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([prompt], sampling_params)

# vLLM returns only the generated continuation, not the prompt.
print(outputs[0].outputs[0].text)
```

---

## 📊 GCPO Improves Reasoning Performance

GCPO consistently outperforms DAPO.

![image](https://cdn-uploads.huggingface.co/production/uploads/64d0a05d2f1f9578a0405b9d/8Nucelu3mFCWYF7D3QGnq.png)

---
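
Both usage examples above prompt the model to end its response with a `\boxed{...}` expression. The following is a minimal sketch of how that final answer might be read out of the generated text. `extract_boxed_answer` is a hypothetical helper written for illustration; it is not part of the GCPO repository, Transformers, or vLLM, and it assumes the last `\boxed{...}` in the text is the final answer.

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Hypothetical helper for illustration only. Braces are matched by depth
    so nested LaTeX such as \\boxed{\\frac{1}{2}} is returned whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)  # assumption: the last box holds the final answer
    if start == -1:
        return None

    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced, e.g. the generation was truncated


# Illustrative string only; this is not real model output.
sample = "Therefore, the final answer is: $\\boxed{42}$. I hope it is correct"
print(extract_boxed_answer(sample))
```

With vLLM, the helper could be applied to `outputs[0].outputs[0].text`. With Transformers, the decoded string also includes the prompt, which itself contains a `\boxed{...}` template, so taking the last occurrence matters there; a response that never produces a box would otherwise pick up the template text and should be checked separately.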