arxiv:2509.23371

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Published on Sep 27
· Submitted by junmingyang on Sep 30

Abstract

Meta-Weighted Adaptive Preference Optimization (MetaAPO) dynamically balances online and offline data to align large language models with human preferences, outperforming existing methods and reducing annotation costs.

AI-generated summary

Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard, and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
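
The summary above does not reproduce the paper's objective. Read literally, "sample-wise meta-weights" balancing online and offline data suggest a weighted hybrid of preference losses; the display below is only a hedged sketch under that assumption, where the DPO-style per-pair loss ℓ_DPO, the meta-learner weight w_φ(x) ∈ (0, 1), the offline dataset D_off, and the policy π_θ are assumed notation, not the paper's own.

$$
\mathcal{L}(\theta) =
\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{off}}}\big[(1 - w_\phi(x))\, \ell_{\mathrm{DPO}}(x, y_w, y_l; \theta)\big]
+ \mathbb{E}_{(x, y_w, y_l) \sim \pi_\theta}\big[w_\phi(x)\, \ell_{\mathrm{DPO}}(x, y_w, y_l; \theta)\big],
$$

$$
\ell_{\mathrm{DPO}}(x, y_w, y_l; \theta) = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right).
$$

Under this reading, a larger w_φ(x) shifts a prompt's contribution toward freshly sampled on-policy pairs, while a smaller one keeps it anchored to the offline pair.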

Community

Comment from the paper author and submitter:

Problem:
Large language models (LLMs) are aligned with human preferences using methods like RLHF and DPO. Offline datasets are efficient but suffer from distribution mismatch with the evolving model; online data better matches the model distribution but is limited by the model’s capability and current alignment state, which can reduce diversity and quality. Existing hybrid methods fail to adaptively balance the two.

Method (MetaAPO):
The authors propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), which tightly couples data generation with training.

  • A meta-learner acts as an “alignment gap estimator,” predicting when online sampling is likely beneficial.
  • It assigns sample-wise weights to offline vs. online data, guiding targeted online generation and balancing their influence during optimization.
  • This adaptively focuses training on data that closes alignment gaps while avoiding redundant or noisy samples (an illustrative sketch of the weighting step follows this list).
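
The page does not include the paper's implementation, so the snippet below is only a minimal PyTorch sketch of the weighting step: a small "alignment gap estimator" produces a per-sample weight that interpolates a DPO-style loss over offline and online preference pairs. The `AlignmentGapEstimator` module, its input features (the two per-sample losses), and the interpolation form are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only (not the paper's exact architecture or objective):
# how sample-wise meta-weights could mix offline and online DPO-style losses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentGapEstimator(nn.Module):
    """Tiny meta-learner: maps per-sample features to a weight in (0, 1)."""

    def __init__(self, n_features: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # w close to 1 means "trust the online sample more for this prompt".
        return torch.sigmoid(self.net(feats)).squeeze(-1)


def dpo_loss(logp_c, logp_r, ref_c, ref_r, beta=0.1):
    """Per-sample DPO loss from policy/reference log-probs of chosen/rejected responses."""
    margin = beta * ((logp_c - ref_c) - (logp_r - ref_r))
    return -F.logsigmoid(margin)


# Toy batch: in practice these log-probs come from the policy and a frozen
# reference model, for offline pairs and freshly sampled online pairs.
torch.manual_seed(0)
B = 8
off = {k: torch.randn(B) for k in ("logp_c", "logp_r", "ref_c", "ref_r")}
on = {k: torch.randn(B) for k in ("logp_c", "logp_r", "ref_c", "ref_r")}

loss_off = dpo_loss(**off)
loss_on = dpo_loss(**on)

# Assumed feature choice for illustration: the two per-sample losses.
meta = AlignmentGapEstimator()
w = meta(torch.stack([loss_off.detach(), loss_on.detach()], dim=-1))

# Sample-wise interpolation of the online and offline terms.
weighted_loss = (w * loss_on + (1.0 - w) * loss_off).mean()
print(f"weighted loss: {weighted_loss.item():.4f}  mean weight: {w.mean().item():.3f}")
# In a full training loop, this scalar would drive the policy update, and the
# meta-learner would be trained with its own meta-objective (not shown here).
```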

Results:

  • MetaAPO consistently outperforms offline, online, and hybrid baselines (DPO, PPO, SELM, etc.) on AlpacaEval 2, Arena-Hard, and MT-Bench.
  • Uses only 58% of the online annotations, cutting online annotation costs by 42%.
  • Ablation studies confirm that adaptive sampling, meta-weighting, and the learnable meta-learner are all critical to performance.

Takeaway:
MetaAPO bridges the gap between offline and online preference alignment by adaptively integrating both via a meta-learner. It delivers stronger alignment with lower cost, demonstrating a scalable path forward for efficient and effective LLM preference optimization.

