When we need to align models' behavior with the desired objectives, we rely on specialized algorithms that support helpfulness, accuracy, reasoning, safety, and alignment with user preferences. Much of a model’s usefulness comes from post-training optimization methods.
Here are the main optimization algorithms (both classic and new) in one place:
1. PPO (Proximal Policy Optimization) -> Proximal Policy Optimization Algorithms (1707.06347) Clips the probability ratio to prevent the new policy from diverging too far from the old one. It helps keep everything stable