momo

wzc991222

wzc991222's activity

liked a model 1 day ago

openbmb/MiniCPM4-8B

Text Generation • Updated 1 day ago • 507 • 74

liked a model 10 days ago

deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

Text Generation • Updated 10 days ago • 153k • 697

New activity in deepseek-ai/DeepSeek-R1-0528 10 days ago

Summer or Winter?

👀 🚀 71

#1 opened 10 days ago by

andromeda0302

liked a model 10 days ago

deepseek-ai/DeepSeek-R1-0528

Text Generation • Updated 10 days ago • 82.1k • 1.84k

upvoted a paper 20 days ago

Qwen3 Technical Report

Paper • 2505.09388 • Published 25 days ago • 184

upvoted 2 papers 21 days ago

Parallel Scaling Law for Language Models

Paper • 2505.10475 • Published 24 days ago • 80

Attention Is All You Need

Paper • 1706.03762 • Published Jun 12, 2017 • 64

upvoted a paper 24 days ago

Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures

Paper • 2505.09343 • Published 25 days ago • 64

liked a dataset 27 days ago

BoJack/MMAR

Viewer • Updated 1 day ago • 1k • 636 • 2

reacted to Kseniase's post with 👍 28 days ago

Post

4981

11 Alignment and Optimization Algorithms for LLMs

When we need to align models' behavior with the desired objectives, we rely on specialized algorithms that support helpfulness, accuracy, reasoning, safety, and alignment with user preferences. Much of a model’s usefulness comes from post-training optimization methods.

Here are the main optimization algorithms (both classic and new) in one place:

1. PPO (Proximal Policy Optimization) -> Proximal Policy Optimization Algorithms (1707.06347)
Clips the probability ratio between the new and old policies so that a single update cannot move the policy too far from the old one, which keeps training stable.

2. DPO (Direct Preference Optimization) -> Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2305.18290)
It's a non-RL method in which the LM itself acts as an implicit reward model. It uses a simple loss to boost the preferred answer’s probability over the less preferred one.

3. GRPO (Group Relative Policy Optimization) -> DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2402.03300)
An RL method that samples a group of model outputs for the same input and updates the policy based on how each output's reward compares with the rest of the group. It doesn't need a separate critic model.
Its latest application is Flow-GRPO, which adds online RL to flow matching models -> Flow-GRPO: Training Flow Matching Models via Online RL (2505.05470)

4. DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) -> DAPO: An Open-Source LLM Reinforcement Learning System at Scale (2503.14476)
Decouples the lower and upper clipping bounds for flexibility and introduces four key techniques: clip-higher (to maintain exploration), dynamic sampling (to ensure gradient updates), token-level loss (to balance learning across long outputs), and overlong reward shaping (to handle long, truncated answers).

5. Supervised Fine-Tuning (SFT) -> Training language models to follow instructions with human feedback (2203.02155)
Often the first post-pretraining step. A model is fine-tuned on a dataset of high-quality human-written input-output pairs to directly teach desired behaviors
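
To make the objectives above a bit more concrete, here is a minimal per-sample sketch in plain Python. It is an illustration under stated assumptions, not code from any of the papers: the function names are hypothetical, the default values (`eps=0.2`, `eps_high=0.28`, `beta=0.1`) are illustrative, and a real implementation would work on batched tensors with automatic differentiation. SFT is left out because it is ordinary next-token cross-entropy training.

```python
import math
from statistics import mean, pstdev


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample (a quantity to maximize).

    ratio is pi_new(a|s) / pi_old(a|s); eps is an illustrative default.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def dapo_clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Same surrogate with decoupled bounds, as in DAPO's clip-higher idea.

    A looser upper bound leaves more room to raise low-probability tokens.
    The two epsilon values here are assumptions chosen for illustration.
    """
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    All four inputs are sequence log-probabilities, under the trained
    policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def grpo_advantages(rewards, tiny=1e-8):
    """Group-relative advantages: standardize rewards within one group.

    The group mean plays the role of the baseline, so no critic is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + tiny) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to the same prompt, two of them rewarded.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
    # A ratio above the bound is clipped, so the incentive stops growing.
    print(ppo_clipped_term(1.5, 1.0), dapo_clipped_term(1.5, 1.0))
    # The loss drops once the policy prefers the chosen answer more than
    # the reference model does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The only difference between the PPO and DAPO helpers is the separate upper bound, which is the "decoupled clip" part; dynamic sampling, token-level loss, and overlong reward shaping are not shown here.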

More in the comments 👇

If you liked it, also subscribe to the Turing Post: https://www.turingpost.com/subscribe