
A Plain-Language Introduction to RLHF Training Algorithms


Reinforcement learning is used more and more in LLM training. This article walks through several common training algorithms and uses everyday analogies to make the related concepts easier to grasp.

The algorithms covered are: PPO, DRO, DPO, β-DPO, sDPO, RSO, IPO, GPO, KTO, ORPO, SimPO, R-DPO, RLOO, and GRPO.

 

PPO (Proximal Policy Optimization)

background:

In reinforcement learning, a policy update can change the policy too much, making training unstable. PPO addresses this by limiting the size of each policy update, keeping the training process stable.

example:

You go to a fruit stall to pick apples. The stall owner has many apples. You want to find the sweetest ones, but you don't want to taste every apple each time, nor do you want to dull your palate by tasting too many.

Reward signal: You have a sweetness meter that roughly measures how sweet an apple is. Its reading is the "reward signal" that helps you judge which apple is sweeter.

Stay stable: Instead of trying all the apples at once, you taste a few first and gradually adjust your selection strategy based on the meter readings. For example, you taste 5 apples, find that 3 of them are sweeter, and keep choosing among those 3 instead of swapping out a whole batch at once.

Separately trained reward signal: You also calibrate the sweetness meter on its own to make sure it measures sweetness accurately, for example by first tasting a few apples whose sweetness you already know.
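
To make the "stay stable" idea concrete, here is a minimal sketch of PPO's clipped surrogate objective in PyTorch; the log-probabilities and advantages are toy numbers, not real model outputs.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # ratio between the new and old policy probabilities for the same actions
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # taking the minimum keeps updates small; negate because optimizers minimize
    return -torch.minimum(unclipped, clipped).mean()

# toy numbers (assumptions): log-probs of three sampled actions and their advantages
logp_old = torch.tensor([-1.2, -0.7, -2.0])
logp_new = torch.tensor([-1.0, -0.9, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.2])
print(ppo_clip_loss(logp_new, logp_old, advantages))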

paper:

Proximal Policy Optimization Algorithms

  

DRO (Direct Reward Optimization)

background:

In some reinforcement learning tasks, optimizing against the reward signal directly can be simpler and more effective than going through a full policy-optimization loop, especially when the reward signal is very clear. DRO optimizes the reward signal directly, avoiding a complex policy-update process.

example:

You go to the fruit stall to pick apples, and the stall owner hands you a sweetness meter so you can select apples by its readings.

Direct reward: You measure the sweetness of each apple with the meter and pick the sweetest ones by the readings. For example, you measure 10 apples, find the 3 with the highest sweetness, and keep choosing among those 3, using the meter reading directly as the reward signal for selecting the sweetest apple.
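
Concretely, one way to write a direct-reward objective in the spirit of DRO is a squared-error regression that pulls β·log(π/π_ref) toward the centered reward r(x, y) − V(x); the exact formulation in the paper differs in detail, and the numbers below are toy assumptions.

import torch

def dro_loss(reward, value, logp, ref_logp, beta=0.1):
    # fit beta * log(pi / pi_ref) to the centered reward r(x, y) - V(x)
    residual = reward - value - beta * (logp - ref_logp)
    return 0.5 * (residual ** 2).mean()

# toy numbers (assumptions): sweetness-meter style rewards, a per-prompt baseline V(x),
# and sequence log-probs under the current and reference policies
reward   = torch.tensor([0.9, 0.2, 0.6])
value    = torch.tensor([0.5, 0.5, 0.5])
logp     = torch.tensor([-1.0, -2.0, -1.5])
ref_logp = torch.tensor([-1.2, -1.8, -1.4])
print(dro_loss(reward, value, logp, ref_logp))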

paper:

Offline regularised reinforcement learning for large language models alignment

  

DPO (Direct Preference Optimization)

background:

In some cases, directly comparing which sample is preferred works better than relying on a reward signal, especially when the reward signal is inaccurate or hard to obtain. DPO optimizes the selection strategy by comparing sample preferences directly.

example:

This time you are again picking apples at the fruit stall, but instead of using a reward signal like the sweetness meter, you compare the apples' taste directly.

Positive and negative samples: You take two apples from the stall owner: one you grabbed at random (the negative example) and one the owner recommended (the positive example), say because the owner claims it is particularly sweet.

Comparative optimization: You compare the two apples directly and taste which one is sweeter. For example, you find that the recommended apple really is sweeter than the one you grabbed at random.

Direct optimization: Through this direct comparison you learn how to choose sweet apples, without having to calibrate a sweetness meter separately.
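
A minimal sketch of the DPO loss in PyTorch; the sequence log-probabilities are toy numbers standing in for real model outputs.

import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    # implicit reward margin between the chosen (w) and rejected (l) responses
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin).mean()

# toy sequence log-probs for two (chosen, rejected) pairs -- values are assumptions
logp_w, logp_l = torch.tensor([-3.0, -2.5]), torch.tensor([-3.5, -2.4])
ref_w,  ref_l  = torch.tensor([-3.2, -2.6]), torch.tensor([-3.3, -2.5])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))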

paper:

Direct preference optimization: Your language model is secretly a reward model

  

β-DPO (Direct Preference Optimization with Dynamic β)

background:

In Direct Preference Optimization (DPO), performance is very sensitive to the choice of the hyperparameter β and to the quality of the preference data. A static β behaves inconsistently across data of different quality, making optimization unstable. β-DPO addresses this by adjusting β dynamically according to data quality, improving the robustness and adaptability of the model.

example:

You go to the fruit stall to pick apples, and the stall owner gives you two: one you grabbed at random (the negative example) and one the owner recommended (the positive example). You want to learn how to pick sweeter apples by comparing them.

Dynamically adjusting β: You notice that the perceived sweetness gap varies a lot from pair to pair: some pairs differ sharply, others barely at all. You decide to adjust your "sweetness preference parameter" β dynamically based on each pair's gap. If the two apples differ a lot in sweetness, you lower β and update your preferences more aggressively; if the gap is small, you raise β to keep your current preferences and avoid over-adjusting.

Data filtering: While comparing, you notice that some pairs differ so little in sweetness that the difference is almost negligible. Such pairs can mislead your learning because they provide no useful signal, so you filter them out and learn only from pairs with a clear sweetness gap.

Batch adjustment: You realize that tuning β from a single pair at a time makes learning unstable, so you adjust in batches: take a set of pairs (say 5) from the stall owner, compute one β from the average sweetness gap of that set, and use it for the whole set. This reduces the influence of any single apple and makes learning more stable.
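
A rough sketch of the batch-level mechanics described above: filter out near-zero-gap pairs, then set one β for the batch from the average gap. The linear calibration rule, its sign, and the thresholds here are assumptions; see the paper for the exact formulation.

import torch
import torch.nn.functional as F

def beta_dpo_batch(logp_w, logp_l, ref_w, ref_l, beta0=0.1, alpha=0.5, m0=0.5, keep_frac=0.8):
    # per-pair implicit reward gap (how much the pair "differs in sweetness")
    gap = (logp_w - ref_w) - (logp_l - ref_l)
    # data filtering: keep only the pairs with the largest-magnitude gaps
    k = max(1, int(keep_frac * gap.numel()))
    keep = torch.topk(gap.abs(), k).indices
    # batch-level calibration of beta from the mean gap (assumed linear rule)
    beta = beta0 * (1 + alpha * (gap[keep].mean() - m0))
    return -F.logsigmoid(beta * gap[keep]).mean(), beta.item()

# toy sequence log-probs for four preference pairs (assumptions)
logp_w = torch.tensor([-3.0, -2.5, -4.0, -1.0]); ref_w = torch.tensor([-3.2, -2.6, -4.0, -1.2])
logp_l = torch.tensor([-3.5, -2.4, -4.1, -2.0]); ref_l = torch.tensor([-3.3, -2.5, -4.0, -1.1])
print(beta_dpo_batch(logp_w, logp_l, ref_w, ref_l))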

paper:

β-DPO: Direct preference optimization with dynamic β

  

sDPO (stepwise Direct Preference Optimization)

background:

In complex settings, comparing all samples at once can make choices difficult or destabilize the policy. sDPO splits the comparisons into steps and optimizes the selection strategy gradually, so that each decision is based on the best information available at that point.

example:

This time, you go to the fruit stand to pick apples, and you want to optimize your choices through gradual comparisons.

Gradual comparison: You first take two apples from the stall owner and compare their sweetness. Then you take two more and compare again.

Gradual optimization: After each comparison you remember which apple was sweeter and adjust your selection strategy a little. For example, you compare A and B and find A sweeter, then compare C and D and find C sweeter, and finally compare A and C to decide which one is best.

Step by step: By comparing gradually, this approach refines your selection strategy without the confusion of comparing too many apples at once.
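
A sketch of the stepwise schedule only: the preference data is split into chunks, and after each chunk the freshly trained policy becomes the reference model for the next step. train_with_dpo is a hypothetical placeholder, not a real API.

import copy

def train_with_dpo(policy, reference, chunk):
    # placeholder (assumption): one DPO training run on `chunk` with `reference` frozen
    return policy

policy = {"name": "sft_model"}                                # toy stand-in for the SFT policy
reference = copy.deepcopy(policy)                             # step 1 uses the SFT model as reference
chunks = [["pairs_step1"], ["pairs_step2"], ["pairs_step3"]]  # preference data split into steps

for chunk in chunks:
    policy = train_with_dpo(policy, reference, chunk)
    # key sDPO step: the policy just trained becomes the reference for the next chunk
    reference = copy.deepcopy(policy)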

paper:

sDPO: Don't use your data all at once

 

RSO (Rejection Sampling Optimization)

background:

When sample quality is uneven, rejecting low-quality samples outright can improve training efficiency and model performance. RSO rejects low-quality samples so that training is based only on high-quality ones.

example:

This time, you go to the fruit stall to pick apples, and the stall owner will give you a big basket of apples, and the quality of the apples inside is uneven.

Preliminary filtering: You pick a few apples and taste them first. For example, you grab 10 apples from the basket, taste them, and find 3 that are clearly not sweet; you reject those 3 outright and put them back without further consideration.

Optimized strategy: You then keep choosing among the remaining apples and taste which ones are sweeter. For example, of the remaining 7 you taste 3 more, find 2 that are sweeter, and keep choosing between those 2. You also remember what you rejected and try to avoid those unsweet kinds of apples next time.
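
A minimal sketch of the statistical rejection-sampling step: candidates sampled from the SFT model are accepted roughly in proportion to exp(r/β), and RSO then builds preference pairs from the accepted samples for a DPO/hinge-style loss. The candidates, reward scores, and β below are toy assumptions.

import math
import random

def rejection_sample(candidates, rewards, beta=0.5, seed=0):
    # keep candidates roughly in proportion to pi_ref(y) * exp(r(y) / beta)
    rng = random.Random(seed)
    r_max = max(rewards)
    kept = []
    for y, r in zip(candidates, rewards):
        # standard rejection-sampling acceptance test against the reward-weighted target
        if rng.random() < math.exp((r - r_max) / beta):
            kept.append((y, r))
    return kept

# toy candidates sampled from the SFT model and their reward-model scores (assumptions)
candidates = ["apple_%d" % i for i in range(10)]
rewards = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.95, 0.5, 0.6]
print(rejection_sample(candidates, rewards))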

paper:

Statistical rejection sampling improves preference optimization

 

IPO (Identity Preference Optimization)

background:

In some cases, a sample's identity information (such as origin or brand) strongly affects which one should be chosen. IPO takes this identity information into account when optimizing the selection strategy, so that the result matches specific preferences.

example:

You go to the fruit stall to pick apples, and the stall owner tells you the apples come from different regions, such as Shandong, Shaanxi, and Liaoning.

Identity tags: You first sort the apples by origin: one pile from Shandong, one from Shaanxi, one from Liaoning. You suspect apples from different regions taste different, so you label each pile separately.

Preference learning: You taste an apple from each origin. Say the Shandong apples taste sweeter, the Shaanxi ones are moderately sweet and sour, and the Liaoning ones taste a bit worse. Based on this preference, you then favor the origins you like: more Shandong apples, fewer Liaoning ones.
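
On the loss side, IPO replaces DPO's log-sigmoid with a squared regression of the log-ratio margin toward a fixed target 1/(2τ), which limits overfitting to the preference labels. A minimal sketch with toy log-probabilities follows.

import torch

def ipo_loss(logp_w, logp_l, ref_w, ref_l, tau=0.1):
    # regress the log-ratio margin toward the fixed target 1 / (2 * tau)
    h = (logp_w - ref_w) - (logp_l - ref_l)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

# toy sequence log-probs for two (chosen, rejected) pairs -- values are assumptions
logp_w, logp_l = torch.tensor([-3.0, -2.5]), torch.tensor([-3.5, -2.4])
ref_w,  ref_l  = torch.tensor([-3.2, -2.6]), torch.tensor([-3.3, -2.5])
print(ipo_loss(logp_w, logp_l, ref_w, ref_l))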

paper:

A general theoretical paradigm to understand learning from human preferences

  

GPO (Generalized Preference Optimization)

background:

A single preference indicator (such as sweetness) is sometimes not enough to describe how good a sample is. GPO optimizes the selection strategy by weighing multiple factors together (such as sweetness, size, and color), so the result better matches actual needs.

example:

You go to the fruit stall to pick apples, and this time you want to weigh several factors, not only sweetness but also taste, size, color, and so on.

Comprehensive evaluation: You taste a few apples and also look at their size, color, and so on. For example, of 5 apples you taste, 2 are very sweet, but one of them is small and unattractive while the other is large and brightly colored. Weighing everything, you decide the large, brightly colored apple is better.

Preference adjustment: You then adjust your selection strategy to this combined preference. Next time you will favor apples that are large, brightly colored, and tasty, rather than chasing sweetness alone.
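
At the loss level, GPO treats many offline preference losses as one template: a convex function applied to the scaled log-ratio margin, where different choices of that function recover DPO-, SLiC-, and IPO-style losses (up to scaling). The margins below are toy values.

import torch
import torch.nn.functional as F

def gpo_loss(margin, f):
    # one template: apply a convex decreasing function f to the scaled log-ratio margin
    return f(margin).mean()

losses = {
    "dpo  (logistic loss)": lambda t: -F.logsigmoid(t),
    "slic (hinge loss)":    lambda t: torch.clamp(1.0 - t, min=0.0),
    "ipo  (squared loss)":  lambda t: (t - 0.5) ** 2,
}
margin = 0.1 * torch.tensor([0.4, -0.2, 1.1])   # beta * log-ratio margins (toy values)
for name, f in losses.items():
    print(name, gpo_loss(margin, f).item())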

paper:

Generalized preference optimization: A unified approach to offline alignment

 

KTO (Kahneman-Tversky Optimization)

background:

Human decision-making is affected by psychological factors such as loss aversion. KTO models this factor when optimizing the selection strategy, so the results better match how people actually make decisions.

example:

You go to the fruit stall to pick apples, and the stall owner offers you two deals.

Option 1: You taste an apple first; if it is sweet, the owner gives you an extra one; if it is not, nothing happens.

Option 2: You taste an apple first; if it is not sweet, you have to give one back; if it is sweet, nothing happens.

Loss aversion: You lean toward Option 1, because the possibility of losing an apple weighs on you more than an equal chance of gaining one. This reflects the loss aversion at the heart of Kahneman-Tversky optimization: when picking apples you try hard to avoid choices that may turn out badly, such as apples with dull color or blemished skin.
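
A rough sketch of a KTO-style value function: each single example is labeled desirable or undesirable, gains and losses are measured against a reference point z0 (a KL estimate in the paper, a constant here), and the undesirable side is weighted more heavily to mimic loss aversion. The weights and numbers are assumptions.

import torch

def kto_loss(logp, ref_logp, desirable, z0=0.0, beta=0.1, lam_d=1.0, lam_u=1.33):
    r = beta * (logp - ref_logp)                       # implicit reward of each example
    v = torch.where(desirable,
                    lam_d * torch.sigmoid(r - z0),     # desirable: value of gains above z0
                    lam_u * torch.sigmoid(z0 - r))     # undesirable: losses weighted by lam_u
    lam = torch.where(desirable, torch.tensor(lam_d), torch.tensor(lam_u))
    return (lam - v).mean()                            # shortfall from the maximum value

# toy numbers (assumptions): one desirable and one undesirable single-response example
logp      = torch.tensor([-2.0, -3.5])
ref_logp  = torch.tensor([-2.2, -3.0])
desirable = torch.tensor([True, False])
print(kto_loss(logp, ref_logp, desirable))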

paper:

KTO: Model alignment as prospect theoretic optimization

  

ORPO (Odds Ratio Preference Optimization)

background:

Directly comparing preferences between samples is sometimes not precise enough. ORPO computes the odds ratio between samples and uses it to optimize the selection strategy, making the result more accurate.

example:

You go to the fruit stall to pick apples, and the stall owner gives you two apples to compare their taste.

Ratio calculation: You taste both apples and find one sweet and the other less so. You compute a taste ratio: for example, the sweet apple scores 8, the less sweet one scores 4, so the ratio is 2. This ratio tells you which apple is better and by how much.

Preference update: You then update your selection preferences based on this ratio; next time you encounter similar apples, you favor the ones with higher taste scores.
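
A minimal sketch of an ORPO-style objective: the supervised loss on the chosen response plus a penalty on the log odds ratio between chosen and rejected responses, computed from length-averaged log-probabilities. The toy values and the weight lam are assumptions.

import torch
import torch.nn.functional as F

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    def log_odds(lp):
        # log(p / (1 - p)) where p = exp(length-averaged log-prob)
        return lp - torch.log1p(-torch.exp(lp))
    sft_term = -avg_logp_w                                   # NLL of the chosen response (SFT part)
    or_term = -F.logsigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return (sft_term + lam * or_term).mean()

# toy per-token-averaged log-probs of chosen (w) and rejected (l) responses (assumptions)
avg_logp_w = torch.tensor([-0.8, -1.1])
avg_logp_l = torch.tensor([-1.5, -1.3])
print(orpo_loss(avg_logp_w, avg_logp_l))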

paper:

ORPO: Monolithic preference optimization without reference model

  

SimPO (Simple Preference Optimization)

background:

For simple tasks, a direct preference comparison between samples can be enough. SimPO optimizes the selection strategy through simple preference comparisons and suits straightforward decision tasks.

example:

You go to the fruit stall to pick apples, and the stall owner gives you two apples and asks you to compare them.

Simple comparison: You taste the two apples directly and compare which is sweeter. If the apple on the left turns out sweeter than the one on the right, you remember that the left one is better.

Preference learning: You learn your apple-picking preference from this simple comparison; next time you encounter similar apples, you favor the kind that was on the left.
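
A minimal sketch of the SimPO loss: reference-free, length-normalized implicit rewards and a target margin gamma. The log-probabilities, lengths, and hyperparameters below are toy assumptions.

import torch
import torch.nn.functional as F

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    # reference-free, length-normalized implicit rewards with a target margin gamma
    r_w = beta * logp_w / len_w
    r_l = beta * logp_l / len_l
    return -F.logsigmoid(r_w - r_l - gamma).mean()

# toy total sequence log-probs and lengths for two preference pairs (assumptions)
logp_w, len_w = torch.tensor([-40.0, -55.0]), torch.tensor([50.0, 70.0])
logp_l, len_l = torch.tensor([-60.0, -52.0]), torch.tensor([55.0, 60.0])
print(simpo_loss(logp_w, logp_l, len_w, len_l))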

paper:

SimPO: Simple preference optimization with a reference-free reward

 

R-DPO (Regularized DPO)

background:

Optimizing preferences directly can make the model overly complex or cause overfitting. R-DPO adds regularization constraints when optimizing the selection strategy, keeping the model simple and able to generalize.

example:

You go to the fruit stall to pick apples, and the stall owner gives you two apples to compare, but this time you consider not only their taste but also other factors, such as size and color.

Regularization constraints: When comparing the taste of the two apples, you also take their size, color, and other factors into account. For example, one apple tastes great but is too small; the other tastes slightly worse but is large and brightly colored. Weighing these factors, you lean toward the large, brightly colored apple.

Preference optimization: Through this regularization constraint you optimize your selection preferences to account for more than taste alone, so the apples you pick better match your overall requirements.
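
One common way to write a length-regularized DPO objective in the spirit of R-DPO: the usual DPO margin minus a penalty proportional to the length difference, so a response cannot win simply by being longer. The exact regularizer in the paper may differ in detail, and the numbers and alpha below are toy assumptions.

import torch
import torch.nn.functional as F

def rdpo_loss(logp_w, logp_l, ref_w, ref_l, len_w, len_l, beta=0.1, alpha=0.01):
    # DPO margin minus a length penalty on the chosen-minus-rejected length gap
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l)) - alpha * (len_w - len_l)
    return -F.logsigmoid(margin).mean()

# toy log-probs and response lengths (assumptions); alpha is tuned per dataset in practice
logp_w, ref_w, len_w = torch.tensor([-30.0]), torch.tensor([-32.0]), torch.tensor([80.0])
logp_l, ref_l, len_l = torch.tensor([-35.0]), torch.tensor([-34.0]), torch.tensor([40.0])
print(rdpo_loss(logp_w, logp_l, ref_w, ref_l, len_w, len_l))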

paper:

Disentangling length from quality in direct preference optimization

  

RLOO (REINFORCE Leave-One-Out)

background:

When samples influence one another, the selection strategy can become unstable. RLOO uses a leave-one-out comparison, judging each sample against the others, so that every decision step has a solid baseline.

example:

You go to the fruit stall to pick apples, and the stall owner will give you a basket of apples to choose from.

Leave-one-out comparison: You take one apple out of the basket and taste it; suppose it is sweet. Then you split the remaining apples into several groups, taste one apple from each group, and check which groups contain apples sweeter than the one you set aside.

Policy update: By repeatedly setting one apple aside like this, you keep updating your selection strategy; next time you prioritize apples from the groups that beat the apple you set aside.
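
A minimal sketch of the leave-one-out baseline: with k completions sampled for one prompt, each completion's advantage is its reward minus the mean reward of the other k - 1, and the REINFORCE gradient is weighted by that advantage. The rewards and log-probabilities are toy numbers.

import torch

def rloo_advantages(rewards):
    # each sample's baseline is the mean reward of the *other* k - 1 samples
    k = rewards.numel()
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline

# toy rewards for 4 completions sampled from the same prompt (assumptions)
rewards = torch.tensor([0.9, 0.2, 0.6, 0.7])
logp    = torch.tensor([-12.0, -9.0, -11.0, -10.0])   # their sequence log-probs
adv = rloo_advantages(rewards)
loss = -(adv.detach() * logp).mean()                  # REINFORCE with the leave-one-out baseline
print(adv, loss)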

paper:

Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs

 

GRPO (Group Relative Policy Optimization)

background:

Evaluating every sample individually can be inefficient. GRPO compares samples in groups, so that each decision is based on relative advantage within the group.

example:

This time you are again picking apples at the fruit stand, but instead of scoring each apple individually, you improve your choices through group comparisons.

Group comparison: You take 5 apples from the stall owner and split them into two groups of 2, leaving one apple aside. Call them Group A and Group B.

Relative evaluation: You taste the apples in each group and compare which group is sweeter. You do not need an exact sweetness score for each apple; you only need to know which apples are sweeter than the others. For example, you find the Group A apples sweeter than the Group B apples.

Eliminating negative examples: You rule out the unsweet apples (the negative examples). Since the Group B apples are not sweet, you stop considering them and focus only on Group A.

Policy update: You tell yourself to prioritize Group A apples next time because they are sweeter. If you notice that the Group A apples are brighter in color and have smoother skin, you remember these traits and favor such apples on your next visit.
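
A minimal sketch of the group-relative advantage and the clipped surrogate built on it; the per-prompt group rewards and log-probabilities are toy numbers, and the KL penalty to the reference model used in practice is omitted.

import torch

def grpo_advantages(rewards, eps=1e-6):
    # standardize rewards within the group of samples drawn for one prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    # PPO-style clipped surrogate with group-relative advantages
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -surrogate.mean()

# toy rewards and log-probs for a group of 4 samples from one prompt (assumptions)
rewards  = torch.tensor([1.0, 0.0, 0.5, 0.8])
logp_old = torch.tensor([-10.0, -9.0, -11.0, -10.5])
logp_new = torch.tensor([-9.8, -9.2, -10.9, -10.6])
print(grpo_loss(logp_new, logp_old, rewards))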

paper:

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models