Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Abstract
Meta-Weighted Adaptive Preference Optimization (MetaAPO) dynamically balances online and offline data to align large language models with human preferences, outperforming existing methods and reducing annotation costs.
Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
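Schematically, the meta-weighted objective described in the abstract can be pictured as follows. This is an illustrative sketch rather than the paper's exact formulation: it assumes a DPO-style pairwise base loss and writes the meta-learner's sample-wise weight as $w_\phi$.

```latex
% Illustrative only: assumes a DPO-style pairwise loss; w_phi is the
% meta-learner's sample-wise weight over offline (D_off) and online (D_on) pairs.
\mathcal{L}_{\text{MetaAPO}}(\theta;\phi)
  = \sum_{(x,\,y^{+},\,y^{-}) \in \mathcal{D}_{\text{off}} \cup \mathcal{D}_{\text{on}}}
      w_{\phi}(x, y^{+}, y^{-})\,
      \ell_{\text{pref}}\bigl(\pi_{\theta}; x, y^{+}, y^{-}\bigr),
\qquad
\ell_{\text{pref}}
  = -\log \sigma\!\Bigl(
      \beta \log \frac{\pi_{\theta}(y^{+}\mid x)}{\pi_{\text{ref}}(y^{+}\mid x)}
    - \beta \log \frac{\pi_{\theta}(y^{-}\mid x)}{\pi_{\text{ref}}(y^{-}\mid x)}
    \Bigr).
```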
Community
Problem:
Large language models (LLMs) are aligned with human preferences using methods like RLHF and DPO. Offline datasets are efficient but suffer from distribution mismatch with the evolving model; online data better matches the model distribution but is limited by the model’s capability and current alignment state, which can reduce diversity and quality. Existing hybrid methods fail to adaptively balance the two.
Method (MetaAPO):
The authors propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), which tightly couples data generation with training.
- A meta-learner acts as an “alignment gap estimator,” predicting when online sampling is likely beneficial.
- It assigns sample-wise weights to offline vs. online data, guiding targeted online generation and balancing their influence during optimization.
- This adaptively focuses training on data that closes alignment gaps while avoiding redundant or noisy samples (a schematic sketch follows this list).
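The loop below is a minimal, hypothetical sketch of this coupling, not the paper's exact algorithm: the interface names (`gap_estimator`, `generate_online_pair`, `pref_loss`), the fixed gap threshold, and the complementary offline/online weighting are all assumptions introduced for illustration.

```python
# Hypothetical MetaAPO-style training step: the meta-learner scores each offline
# pair; a high estimated alignment gap triggers targeted on-policy generation,
# and the same score becomes a sample-wise weight on the preference loss.
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


@dataclass
class MetaAPOSketch:
    gap_estimator: Callable[[Pair], float]       # meta-learner: pair -> weight in [0, 1]
    generate_online_pair: Callable[[str], Pair]  # sample and annotate an on-policy pair
    pref_loss: Callable[[Pair, float], float]    # weighted preference loss (e.g. DPO-style)
    gap_threshold: float = 0.5                   # assumed trigger for online sampling

    def training_step(self, offline_batch: List[Pair]) -> float:
        """Return the mean weighted loss over offline pairs plus any online pairs."""
        total, n_pairs = 0.0, 0
        for pair in offline_batch:
            w = self.gap_estimator(pair)  # estimated benefit of on-policy data for this prompt
            if w > self.gap_threshold:    # generate online data only where it should help
                online_pair = self.generate_online_pair(pair[0])
                total += self.pref_loss(online_pair, w)
                n_pairs += 1
            total += self.pref_loss(pair, 1.0 - w)  # offline pair gets the complementary weight
            n_pairs += 1
        return total / max(n_pairs, 1)


if __name__ == "__main__":
    # Stub components: random gap scores and a stand-in loss, just to show the data flow.
    demo = MetaAPOSketch(
        gap_estimator=lambda p: random.random(),
        generate_online_pair=lambda prompt: (prompt, "online chosen", "online rejected"),
        pref_loss=lambda p, w: w * 1.0,
    )
    print(demo.training_step([("Explain RLHF.", "good answer", "poor answer")]))
```

The complementary weighting (`w` for the online pair, `1 - w` for the offline one) is just one plausible way to "balance quality and distribution"; in the paper these weights come from the learned meta-learner rather than a hand-set rule.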
Results:
- MetaAPO consistently outperforms offline, online, and hybrid baselines (DPO, PPO, SELM, etc.) on AlpacaEval 2, Arena-Hard, and MT-Bench.
- Requires only 58% of the online annotations, cutting annotation costs by 42%.
- Ablation studies confirm that adaptive sampling, meta-weighting, and the learnable meta-learner are all critical to performance.
Takeaway:
MetaAPO bridges the gap between offline and online preference alignment by adaptively integrating both via a meta-learner. It delivers stronger alignment with lower cost, demonstrating a scalable path forward for efficient and effective LLM preference optimization.