AI & ML interests
Workflow of Reinforcement Learning from Human Feedback (RLHF). Blog: https://rlhflow.github.io/

Collections
- Training & test sets and finetuned models
- The online-DPO-R1 project.
- Datasets and models for process reward modeling.
- Open-source datasets collected and processed into a standard format.
- The mixture of preference datasets used for reward modeling.
- Reward models trained by maximum likelihood estimation of the Bradley-Terry model; the objective is written out just after this list.
- Materials for training pairwise preference models; see the scoring sketch after the model list below.
- Reward models trained with the RLHFlow codebase (https://github.com/RLHFlow/RLHF-Reward-Modeling/).
- Datasets, code, and models for online RLHF (i.e., iterative DPO); see the loss sketch after the paper entry below.
- A series of SFT models trained on RLHFlow's high-quality SFT dataset for research purposes.
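
The maximum-likelihood training mentioned above has a standard closed-form objective: for a prompt x with chosen response y_w and rejected response y_l, the Bradley-Terry model and its negative log-likelihood are

```latex
% Bradley-Terry model: probability that the chosen response y_w beats the
% rejected response y_l for prompt x, under a parametric reward r_theta.
\[
  P(y_w \succ y_l \mid x) = \sigma\!\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
% Maximum likelihood estimation over a preference dataset D amounts to
% minimizing the negative log-likelihood:
\[
  \mathcal{L}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big) \Big]
\]
```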

Models
- RLHFlow/ArmoRM-Llama3-8B-v0.1 (Text Classification • 8B • Updated • 11.1k • 182)
- RLHFlow/pair-preference-model-LLaMA3-8B (Text Generation • 8B • Updated • 8 • 38)
- sfairXC/FsfairX-LLaMA3-RM-v0.1 (Text Classification • 8B • Updated • 1.19k • 60)
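
The Bradley-Terry-style reward models above assign a single scalar score to a whole conversation. A minimal scoring sketch, assuming the model loads through transformers' AutoModelForSequenceClassification with a one-logit head and that the tokenizer ships a chat template; check the individual model cards for the officially recommended usage.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A minimal sketch, assuming the reward model exposes the standard
# sequence-classification interface with a single scalar logit; see the
# model card for the recommended usage.
model_name = "sfairXC/FsfairX-LLaMA3-RM-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
# Render the conversation with the model's chat template, then score it.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(input_ids).logits[0][0].item()  # higher = more preferred
print(f"reward: {reward:.3f}")
```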
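
The pairwise preference model in the list compares two candidate responses directly instead of scoring each in isolation: the LM is prompted with both candidates and the preference is read off the next-token probabilities of "A" versus "B". A sketch of that idea; the inline prompt here is a simplified stand-in, not the actual template from the model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of pairwise preference scoring: prompt a causal LM with both
# candidates and compare next-token probabilities of "A" vs "B". The prompt
# below is a hypothetical stand-in; use the template from the
# RLHFlow/pair-preference-model-LLaMA3-8B model card in practice.
model_name = "RLHFlow/pair-preference-model-LLaMA3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "[CONTEXT] What is 2 + 2?\n"
    "[RESPONSE A] 2 + 2 = 4.\n"
    "[RESPONSE B] 2 + 2 = 5.\n"
    "Which response is better? Answer A or B: "
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

token_a = tokenizer.convert_tokens_to_ids("A")
token_b = tokenizer.convert_tokens_to_ids("B")
prob_a = torch.softmax(logits[[token_a, token_b]], dim=-1)[0].item()
print(f"P(A preferred) = {prob_a:.3f}")
```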

Papers
RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863) • Published • 71
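
The iterative DPO pipeline referenced in the collections and in this paper alternates between collecting fresh preference pairs from the current policy and minimizing the DPO objective on them. A minimal sketch of the per-pair loss (standard DPO, Rafailov et al., 2023); the sequence log-probabilities in the usage example are toy values.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) under the reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x) under the reference
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    In iterative (online) DPO, new preference pairs are collected from the
    current policy each round and this loss is minimized again on them.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry likelihood of the observed preference under the implicit
    # reward r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up sequence log-probabilities for a batch of 2 pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -8.5]),
    policy_rejected_logps=torch.tensor([-14.0, -9.0]),
    ref_chosen_logps=torch.tensor([-12.5, -8.8]),
    ref_rejected_logps=torch.tensor([-13.5, -9.1]),
)
print(loss.item())
```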