Vigneshwaran's Collections: RLHF
• ORPO: Monolithic Preference Optimization without Reference Model (arXiv:2403.07691)
• sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
• Teaching Large Language Models to Reason with Reinforcement Learning (arXiv:2403.04642)
• Best Practices and Lessons Learned on Synthetic Data for Language Models (arXiv:2404.07503)
• Rho-1: Not All Tokens Are What You Need (arXiv:2404.07965)
• Learn Your Reference Model for Real Good Alignment (arXiv:2404.09656)
• Dataset Reset Policy Optimization for RLHF (arXiv:2404.08495)
• Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks (arXiv:2404.14723)
• RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
• OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143)
• Mixtures of Experts Unlock Parameter Scaling for Deep RL (arXiv:2402.08609)
• Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms (arXiv:2406.02900)
• Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (arXiv:2402.14740)
• HelpSteer2: Open-source dataset for training top-performing reward models (arXiv:2406.08673)
• Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback (arXiv:2406.09279)
• Understanding the performance gap between online and offline alignment algorithms (arXiv:2405.08448)
• Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (arXiv:2312.09390)
• Theoretical guarantees on the best-of-n alignment policy (arXiv:2401.01879)
• Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint (arXiv:2312.11456)
• RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (arXiv:2304.06767)
• Self-Play Preference Optimization for Language Model Alignment (arXiv:2405.00675)
• Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs (arXiv:2406.10216)
• Scaling Laws for Reward Model Overoptimization (arXiv:2210.10760)
• AgentInstruct: Toward Generative Teaching with Agentic Flows (arXiv:2407.03502)
• Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment (arXiv:2405.17931)
• Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning (arXiv:2405.00451)
• Foundations of Reinforcement Learning and Interactive Decision Making (arXiv:2312.16730)
• Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (arXiv:2408.07199)
• Disentangling Length from Quality in Direct Preference Optimization (arXiv:2403.19159)
• Imitating Language via Scalable Inverse Reinforcement Learning (arXiv:2409.01369)
• Contrastive Preference Learning: Learning from Human Feedback without RL (arXiv:2310.13639)
• D2PO: Discriminator-Guided DPO with Response Evaluation Models (arXiv:2405.01511)
• Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment (arXiv:2408.06266)
• Training Language Models to Self-Correct via Reinforcement Learning (arXiv:2409.12917)
• The Perfect Blend: Redefining RLHF with Mixture of Judges (arXiv:2409.20370)
• HelpSteer2-Preference: Complementing Ratings with Preferences (arXiv:2410.01257)
• A Critical Evaluation of AI Feedback for Aligning Large Language Models (arXiv:2402.12366)
• Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning (arXiv:2410.08146)
• RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (arXiv:2410.02089)
• SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF (arXiv:2411.01798)
• OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning (arXiv:2412.16849)
• rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking (arXiv:2501.04519)
• Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs (arXiv:2501.18585)
• SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution (arXiv:2502.18449)
• START: Self-taught Reasoner with Tools (arXiv:2503.04625)
• LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities (arXiv:2504.16078)
• All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (arXiv:2503.01067)
• Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers (arXiv:2505.04842)
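
A large fraction of the entries above extend, analyze, or critique the Direct Preference Optimization (DPO) objective (sDPO, D2PO, the overoptimization and length-bias studies, and others). For readers new to the area, here is a minimal PyTorch sketch of the pairwise DPO loss that these works build on; the function name, argument names, and the beta default are illustrative assumptions, not code from any listed paper.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Illustrative sketch. Each input is a batch of summed token
    # log-probabilities for the chosen or rejected response under the
    # trained policy or the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes up the margin between the chosen and rejected
    # policy-to-reference log-ratios, scaled by beta.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

Several listed papers change exactly one piece of this recipe: ORPO removes the reference model, sDPO feeds the preference data in stages, and the overoptimization papers study what happens as this objective is pushed too far against a learned reward.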