arxiv:2510.07624

From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation

Published on Oct 8
· Submitted by Abdelhakim Benechehab on Oct 14
Abstract

A bilevel optimization framework is used to align generative models with high-quality datasets in the absence of explicit reward signals, with applications in classification and model-based reinforcement learning.

AI-generated summary

Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting, compared to Reinforcement Learning techniques such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner-level problem. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po.
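
To make the bilevel structure concrete, here is a minimal, hypothetical sketch (not the authors' implementation; see the linked repository for that). It assumes a toy tabular setting with a softmax policy over `K` discrete actions and a learnable per-action reward vector: the inner level takes one differentiable policy-gradient step on the learned reward, and the outer level updates the reward so that the resulting policy assigns high likelihood to a small "high-quality" dataset.

```python
# Hypothetical sketch of the bilevel idea: outer variable = reward, inner variable = policy.
# Toy setting assumed for illustration only; not the paper's actual algorithm or code.
import torch

K = 5                                        # number of discrete actions (assumed)
torch.manual_seed(0)

# "High-quality" dataset: samples concentrated on actions 0 and 1 (toy data).
data = torch.tensor([0, 0, 1, 0, 1, 1, 0, 0])

theta = torch.zeros(K)                       # policy logits (inner variable)
r_phi = torch.zeros(K, requires_grad=True)   # learnable reward (outer variable)
outer_opt = torch.optim.Adam([r_phi], lr=0.1)
inner_lr = 1.0

for step in range(200):
    # ---- Inner level: one policy-gradient step on the learned reward ----
    # Expected reward J(theta) = sum_a pi_theta(a) * r_phi(a); in this tabular
    # setting the gradient is taken exactly instead of with sampled estimates.
    theta_ = theta.clone().requires_grad_(True)
    pi = torch.softmax(theta_, dim=0)
    J = (pi * r_phi).sum()
    grad_theta = torch.autograd.grad(J, theta_, create_graph=True)[0]
    theta_new = theta_ + inner_lr * grad_theta   # differentiable w.r.t. r_phi

    # ---- Outer level: maximize dataset log-likelihood under the updated policy ----
    log_pi_new = torch.log_softmax(theta_new, dim=0)
    nll = -log_pi_new[data].mean()
    outer_opt.zero_grad()
    nll.backward()                # gradients reach r_phi through the inner step
    outer_opt.step()

    # Commit the inner update (simple alternating scheme, one of several choices).
    theta = theta_new.detach()

print("learned rewards:", r_phi.detach())
print("final policy   :", torch.softmax(theta, dim=0))
```

The exact-gradient inner step above is only for readability; in practice the inner level would use sampled policy-gradient estimators such as REINFORCE, with the learned reward standing in for the missing explicit reward signal.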

Community

Paper author · Paper submitter

📢📢 New preprint and code alert!!

How can we leverage Policy Gradient methods (e.g. REINFORCE, GRPO) without having access to explicit reward signals?

💡 In our new work, “From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation”, we address this question by bridging the gap between Maximum Likelihood Estimation and Policy Gradient methods.

📜 Preprint: https://arxiv.org/abs/2510.07624
🖥️ Code: https://github.com/abenechehab/nll_to_po
