
Demystifying Prompt Series 34: A Different Approach to RLHF Training: Step by Step & Blue by Blue


In the previous chapters we discussed sample construction optimization and training strategy optimization for RLHF. In this chapter we look at two different RL training scenarios: process-based (step-level) reward training, and using a weak Teacher to supervise a strong Student.

## Step by Step: PRM & ORM

  • Solving math word problems with process- and outcome-based feedback
  • PRM: Let's Verify Step by Step
  • /openai/prm800k

### Data labeling

Obtaining the labeled samples needed for process supervision is costly, since every step of a solution has to be labeled as correct or not. The paper chose a 3-way labeling scheme: POSITIVE for correct and reasonable reasoning, NEGATIVE for incorrect steps or logical errors, and NEUTRAL for steps that are ambiguous or potentially misleading.

To maximize the value of this expensive annotation process, the generated solution samples need a standardized format (easy to split into individual solution steps), and the samples cannot all be easy negatives or easy positives. In other words, we need to solve the inference-format and sample-screening problems.

To ensure a stable inference format, the paper trains the Generator to separate solution steps with '\n'. To avoid leaking label information through this fine-tuning step, the paper uses few-shot prompting to construct correctly formatted solutions, then filters out the ones with correct answers and fine-tunes the Generator only on solutions that are wrongly answered but correctly formatted. This ensures the fine-tuning injects only the inference format, not additional mathematical knowledge or reasoning information.
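
To make that filtering concrete, here is a minimal sketch of the step; the `extract_final_answer` helper and the sample fields are illustrative assumptions, not the paper's actual parsing code:

```python
# Sketch: keep only few-shot generated solutions that are well formatted
# (steps separated by '\n') but whose final answer is WRONG, so fine-tuning
# the Generator teaches format rather than math knowledge.

def extract_final_answer(solution: str) -> str:
    # Assumption: the last line contains "Answer: <value>".
    last = solution.strip().split("\n")[-1]
    return last.split("Answer:")[-1].strip()

def is_well_formatted(solution: str, min_steps: int = 2) -> bool:
    steps = [s for s in solution.strip().split("\n") if s.strip()]
    return len(steps) >= min_steps

def build_format_finetune_set(samples):
    """samples: list of dicts {"question", "solution", "gold_answer"} (assumed layout)."""
    keep = []
    for s in samples:
        if not is_well_formatted(s["solution"]):
            continue
        if extract_final_answer(s["solution"]) == s["gold_answer"]:
            continue  # drop correct answers to avoid leaking math knowledge
        keep.append(s)
    return keep
```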

In the sample screening step, the paper uses the current best PRM to select answers that score high but are actually wrong ("convincing wrong answers"). These make harder process-supervision samples that carry more signal: for each such answer the PRM must have mis-judged at least one step of the solution, and these solutions are the ones sent for manual labeling.

Since PRM scoring is used to filter the samples that train the PRM itself, the paper naturally adopts iterated training: first build a batch of samples and train a PRM, use the newly trained PRM to score the N answers per question, select the Top-K convincing wrong answers for manual labeling, then retrain the PRM. This process is iterated 10 times in total. The final result is the PRM800K dataset of 800K labeled solution steps, covering 75K answers sampled for 12K questions.
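
A hedged sketch of one selection round in this iterated loop; `prm_score` and `is_correct` are assumed stand-ins for the trained PRM and the final-answer checker:

```python
def select_convincing_wrong_answers(question, answers, prm_score, is_correct, k=5):
    """Pick the top-k answers the current PRM rates highly but that are wrong.

    prm_score(question, answer) -> float  (higher = PRM believes it more)
    is_correct(question, answer) -> bool  (final-answer check)
    These convincing wrong answers go to human step-level labeling, since the
    PRM must have mis-judged at least one step in each of them.
    """
    wrong = [a for a in answers if not is_correct(question, a)]
    wrong.sort(key=lambda a: prm_score(question, a), reverse=True)
    return wrong[:k]
```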

The training samples for the ORM are much simpler: only the final answer of each solution is labeled. However, since the PRM's biased filtering above yields mostly wrong-answer samples, the ORM's samples are regenerated by uniformly sampling the Generator on the same questions, so ORM and PRM are not trained on the same answer samples.

### Training and inference

In the training phase, ORM has a binary positive/negative objective: predict whether the final answer is correct. The PRM's objective is to predict positive/neutral/negative for each solution step; the paper does not model any dependency between steps and simply trains each solution step independently as a classification sample, so the PRM is just as much a classification task as the ORM. The paper also mentions that because the pre-training LM objective and the classification objective are so different, a PRM trained with a low learning rate is more stable, and only 2 epochs are trained regardless of model size.
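
As an illustration of the step-level objective, here is a minimal PyTorch-style sketch that treats each labeled step as an independent 3-class sample; the encoder, batch layout, and label mapping are assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Each solution step is an independent 3-class sample:
# 0 = positive, 1 = neutral, 2 = negative (mapping is an assumption).
NUM_CLASSES = 3

class StepClassifierHead(nn.Module):
    """Classification head over an LM's representation of a solution step."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, NUM_CLASSES)

    def forward(self, step_reprs: torch.Tensor) -> torch.Tensor:
        # step_reprs: (batch_of_steps, hidden_size)
        return self.proj(step_reprs)

# Training fragment: low learning rate and ~2 epochs, as the paper suggests.
head = StepClassifierHead(hidden_size=768)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

step_reprs = torch.randn(16, 768)             # stand-in for LM step embeddings
step_labels = torch.randint(0, NUM_CLASSES, (16,))
logits = head(step_reprs)
loss = loss_fn(logits, step_labels)           # each step treated independently
loss.backward()
optimizer.step()
```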

Although this assumption that each solution step is conditionally independent feels somewhat strong, labeling the steps as dependent would push annotation costs up by another order of magnitude, which is not very realistic.

In the inference phase, the paper gives two scoring schemes for the PRM. The first uses the PRM to compute the probability that each reasoning step is correct, then takes the product of the step probabilities as a single score per answer, which is used to rank multiple answers to the same question. The second predicts the first wrong step, which makes the PRM and ORM directly comparable: for correct answers both predict that every step is right, and for wrong answers both predict that some step is wrong, except the PRM additionally gives the exact location of the error.
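
A small sketch of the first scoring scheme, scoring an answer by the product of per-step correctness probabilities; `step_correct_prob` is an assumed stand-in for the trained PRM:

```python
import math

def score_answer(steps, step_correct_prob):
    """Score a multi-step solution as the product of P(step is correct).

    steps: list of step strings
    step_correct_prob(step, prefix) -> float in (0, 1], e.g. P(positive) from the PRM
    Accumulating in log-space avoids underflow for long solutions.
    """
    log_score = 0.0
    prefix = []
    for step in steps:
        p = step_correct_prob(step, prefix)
        log_score += math.log(max(p, 1e-12))
        prefix.append(step)
    return math.exp(log_score)
```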

### Results

For results, the paper uses Best-of-N with majority voting as the baseline and compares the accuracy of answers selected by PRM and ORM. As the number of sampled answers N increases, the relative advantage of PRM over both ORM and majority voting becomes more and more significant.
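
For intuition, a sketch of the two selection strategies being compared; the answer sampling and scoring are assumed to happen elsewhere:

```python
from collections import Counter

def best_of_n(answers, scores):
    """Pick the answer whose PRM (or ORM) score is highest."""
    best_idx = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_idx]

def majority_vote(final_answers):
    """Pick the most frequent final answer among the N sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]
```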

Since the ORM and PRM training sets above are not the same, this is not a strictly controlled comparison; the paper later runs more comparable ablation experiments, which we will not repeat here.

Beyond the headline results, PRM has several alignment advantages over ORM:

  • Credit Assignment: for complex problems, the PRM gives the exact location of the error, making further iterative fixes easier, so the marginal value of the PRM's reward score is higher.
  • Safer: the PRM aligns against the CoT process, which is more consistent than aligning only on the outcome (whose process may contain errors); personally I suspect the probability of reward hacking is lower because the alignment granularity is finer.
  • Negative Alignment Tax: the paper finds that the PRM does not seem to lose effectiveness from alignment, and even gains some.

## Green surpasses blue: weak-to-strong

  • WEAK-TO-STRONG GENERALIZATION: ELICITING STRONG CAPABILITIES WITH WEAK SUPERVISION
  • /openai/weak-to-strong


Weak-to-strong is the answer delivered by the OpenAI alignment team at the end of 2023. The paper is essentially a simplified study of the super-alignment problem: as large models become more and more capable, even surpassing humans, is human supervision still effective at guiding model behavior and ensuring safety and instruction compliance? And how should such weak supervision be carried out?

Super-alignment is therefore essentially a problem of the weak supervising the strong, and the simplification in the paper is to treat humans supervising super-human models as analogous to a weak model supervising a strong model, the so-called "Weak-to-Strong Generalization".

The idea of the paper is actually very similar to the weakly supervised, semi-supervised, and noisy-label learning that was popular a few years ago: train a weak model on the task labels, use the trained weak model to label data, then train a strong model on those model-generated labels, and see whether the strong model can outperform the weak one. Classic weak/semi-supervision mainly aims to improve a model's generalization to unseen samples, whereas the weak-to-strong setting OpenAI studies here is more about whether the strong model's own capability can be elicited by weak supervision.

The paper can be divided into two parts: testing weak-to-strong generalization with ordinary fine-tuning, and exploring how to improve weak-to-strong generalization. Let's talk about each below.

### Experiments

First, the paper chooses three task types to test the generalization effect:

  • NLP classification tasks: 22 NLP classification tasks including NLI, categorization, CR, and SA. Both small and large models tend to do reasonably well on these tasks; results improve with model size, but not dramatically.
  • Chess Puzzles: predict the next best move in chess challenges. This task shows a clearer scale effect: small models can hardly do it, and performance keeps improving once models pass a certain size.
  • ChatGPT Reward Model: predict the pair-wise human-preferred model response. No current models do this task particularly well; large and small models are all mediocre.

Next, three models are trained on each of the above datasets:

  • Weak supervisor: train a small model on the ground-truth data to get the Teacher model
  • Weak-to-strong: use the weak Teacher's predictions on a held-out dataset as labels, and train a larger, stronger model on these weakly supervised labels
  • Strong ceiling: train the strong model directly on the ground-truth task samples to get the strong model's capability ceiling

Theoretically the three models above should satisfy weak supervisor < weak-to-strong < strong ceiling, and the paper measures the generalization of weak-to-strong supervision by the fraction of the performance gap that weakly supervised training helps the strong model recover, i.e. Performance Gap Recovered (PGR).
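
Concretely, PGR is the fraction of the weak-to-ceiling performance gap that the weakly supervised Student recovers:

\[\text{PGR} = \frac{\text{weak-to-strong} - \text{weak supervisor}}{\text{strong ceiling} - \text{weak supervisor}} \]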


Below are the results of the direct fine-tuning experiments. For each of the three tasks, the paper reports task accuracy for different sizes of strong Student (horizontal axis) and weak Teacher (colors), along with the corresponding PGR used to measure weak-to-strong generalization.

  • NLP tasks: even the smallest Teacher training a Student many times its size recovers more than 20% of the ability gap, and PGR increases with both Teacher and Student size
  • Chess Puzzles: when the Teacher is small, the Student learns almost nothing. As the Teacher grows, PGR increases significantly, but as the Student grows both task accuracy and PGR decrease; there may be inverse scaling
  • ChatGPT Reward: the Student's task accuracy improves as the Teacher grows, but PGR almost never exceeds 20%


The difference in generalization across the three tasks is related to the difficulty of the tasks themselves and to how strongly they correlate with model size, as mentioned earlier. From a noise-learning perspective: the NLP tasks depend little on model size and the label noise is small; the Chess Puzzle task depends strongly on both absolute model size and the Student-Teacher gap, so the Teacher's label noise and the Student-Teacher prediction consistency vary with scale; and the reward task is one on which all models are mediocre, with performance having little to do with model size.

Overall, direct fine-tuning reliably produces some capability generalization (PGR > 0), but the effect is limited. The paper therefore goes on to discuss whether weak-to-strong generalization can be improved by changing the training scheme.

### Improvements

#### Option 1: Bootstrapping

Use a progressive training scheme: first use a small model to align a slightly larger model, then use that slightly larger model to align an even larger one, and keep iterating. This approach may suit the Chess Puzzle task above, which shows inverse scaling: the larger the Student is relative to the Teacher, the worse weak-to-strong generalization becomes. By growing the Student step by step, the gap between Teacher and Student is kept from becoming too large, as the sketch below illustrates.
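
A rough sketch of this bootstrapping loop, where `predict_labels` and `finetune_on_labels` are assumed helpers around whatever training stack is in use:

```python
def bootstrap_weak_to_strong(models_small_to_large, heldout_data,
                             finetune_on_labels, predict_labels):
    """Progressively align larger models with pseudo-labels from smaller ones.

    models_small_to_large: e.g. [teacher, mid1, mid2, student], ordered by size
    finetune_on_labels(model, inputs, labels) -> trained model
    predict_labels(model, inputs) -> labels
    Each round, the previously aligned model labels held-out data for the next
    larger model, keeping the Teacher/Student gap small at every step.
    """
    current = models_small_to_large[0]
    for next_model in models_small_to_large[1:]:
        pseudo_labels = predict_labels(current, heldout_data)
        current = finetune_on_labels(next_model, heldout_data, pseudo_labels)
    return current
```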

Comparing against the inverse-scaling PGR curves above, bootstrapping recovers almost the same PGR for the same Teacher model, meaning a small model can help much larger models recover the same proportion (and hence a larger absolute amount) of ability. The paper uniformly uses 3 iterations, i.e. two intermediate-size models are trained.


#### Option 2: Regularization

If we want the large model to learn only the insight the small model has gained about the task, rather than simply imitating the small model, we can add a regularization term. The paper borrows the minimum-entropy principle from semi-supervised learning, whose loss is close to that of Pseudo Labeling; unfamiliar readers can see "Small Sample Sharpener 3: Semi-supervised Minimum Entropy Regularization".

That is, a confidence term based on the Student's own hardened predictions is added to the original cross entropy against the Teacher's labels (the first term in the formula below), where f(x) is the strong model being trained and \(\hat{f}_t(x)\) is its own hard prediction thresholded at t, a dynamic threshold equal to the median predicted probability within the batch. This way, even without fitting the Teacher, the strong model can reduce the loss by becoming more confident in its own predictions ("Be confident! You're right").

\[L_{\text{conf}}(f) = (1-\alpha)\cdot \text{CE}(f(x), f_w(x)) + \alpha\cdot \text{CE}(f(x), \hat{f}_t(x)) \]

The above loss can also be rewritten as the Bootstrap Loss from the noise-robust loss function literature; unfamiliar readers can see "Talking about loss functions 1: noise-robust loss functions". That is, the label the Student learns from is a mixture of the Teacher's predicted label and the Student's own predicted label. The logic is the same: if you are confident in your prediction for this problem, then keep being confident!

\[L_{\text{conf}}(f) = \text{CE}\big(f(x),\ (1-\alpha)\cdot f_w(x) + \alpha\cdot \hat{f}_t(x)\big) \]
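
A hedged PyTorch sketch of this confidence-augmented cross entropy for a binary task; the batch-median threshold and hard self-labels follow the formula above, but the paper's actual implementation details may differ:

```python
import torch
import torch.nn.functional as F

def aux_conf_loss(student_logits, teacher_probs, alpha=0.5):
    """L_conf = (1 - alpha) * CE(f(x), f_w(x)) + alpha * CE(f(x), hat{f}_t(x)).

    student_logits: (batch, 2) raw logits of the strong Student f(x)
    teacher_probs:  (batch, 2) soft labels from the weak Teacher f_w(x)
    hat{f}_t(x) is the Student's own hard prediction, thresholded at t,
    the batch median of its positive-class probability.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)
    student_probs = log_probs.exp()

    # Cross entropy against the Teacher's soft labels.
    ce_teacher = -(teacher_probs * log_probs).sum(dim=-1).mean()

    # Hard self-labels with a dynamic threshold t = batch median.
    pos = student_probs[:, 1]
    t = pos.median()
    self_labels = (pos > t).long()
    ce_self = F.cross_entropy(student_logits, self_labels)

    return (1 - alpha) * ce_teacher + alpha * ce_self

# Usage sketch with random stand-in data.
logits = torch.randn(8, 2, requires_grad=True)
teacher = torch.softmax(torch.randn(8, 2), dim=-1)
loss = aux_conf_loss(logits, teacher, alpha=0.75)
loss.backward()
```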

Adding this regularization term significantly improves weak-to-strong generalization on the NLP tasks when the gap between Student and Teacher is large; even the smallest Teacher recovers nearly 80% of the large model's performance. This suggests that lowering the probability that the Student blindly imitates the Teacher is a very effective learning strategy.


### Why Generalization

Finally the paper discusses why weak-to-strong generalization exists and in which scenarios. This is a big question that the paper cannot exhaustively cover, so it focuses on imitation behavior and on whether the Student model is itself well suited to the task. Here is a brief summary of the main conclusions.

  1. Imitation
    The paper measures whether the large model blindly fits the Teacher via the degree of overfitting and via the prediction agreement between Student and Teacher. It proposes that a suitable regularization term and an early-stopping mechanism can reduce imitation and improve generalization.

  2. Saliency
    The paper proposes that generalization is better when the strong model has already acquired a good representation of the task through pre-training. Personally this feels a bit like the DAPT/TAPT (domain/task adaptive pretraining) idea; unfamiliar readers can see "Continue Pretraining!". In terms of the spatial distribution of text representations: when the task's text is already close to linearly separable in the model's high-dimensional representation space, with clearer and smoother boundaries, the model generalizes to the task more easily.

For a fuller compendium of papers related to large models, covering fine-tuning, pre-training data and frameworks, and AIGC applications, head over to GitHub >> DecryPrompt