Reward models
This is the fifth article in the "LLM-as-judge" series; you can follow the full series here:
- Basic concepts
- Choosing an LLM judge model
- Designing your own evaluation prompt
- Evaluating your evaluation results
- Reward models
- Tips and tricks
What is a reward model?
A reward model learns from human-annotated paired comparison data and predicts a score, with the optimization objective of aligning with human preferences.
Once trained, a reward model can act as a proxy for human evaluation, for example as a reward function used to improve other models.
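Below is a minimal sketch of scoring an answer with such a reward model, assuming a sequence-classification reward model with a single scalar output head loaded from the Hugging Face Hub; the model name is a placeholder, not a real checkpoint.

```python
# Minimal sketch: use a reward model as a proxy for human evaluation.
# "my-org/my-reward-model" is a placeholder; substitute any sequence-classification
# reward model (one scalar output head) from the Hugging Face Hub.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "my-org/my-reward-model"  # placeholder, not a real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def reward_score(prompt: str, answer: str) -> float:
    """Return the scalar reward the model assigns to `answer` for `prompt`."""
    # Many reward models expect a chat-formatted input; if the tokenizer ships a
    # chat template, prefer tokenizer.apply_chat_template over this text pair.
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits[0, 0].item()

print(reward_score("What is the capital of France?", "Paris is the capital of France."))
```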
Pairwise comparison ratings
The most common type of reward model is the Bradley-Terry model, which outputs a scalar score per answer and is trained so that, for two answers to the same prompt with scores $s_1$ and $s_2$:

$$P(\text{answer}_1 \succ \text{answer}_2) = \frac{e^{s_1}}{e^{s_1} + e^{s_2}} = \sigma(s_1 - s_2)$$
The training data for such a reward model only requires pairwise comparisons of answers, which is easier to collect than absolute scores. As a consequence, a trained model can only rank multiple answers to the same prompt; its scores are not comparable across prompts.
Other models extend this approach to directly predict the probability that one answer is better than the other (e.g. reward models based on LLaMA3).
In this way, the model can (in theory) judge how large the gap between two answers is, but still only for answers to the same prompt; probabilities for answers to different prompts are not comparable. In addition, with longer answers it may run into context-length and memory limitations.
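As a concrete illustration of the Bradley-Terry formulation above (the scores below are made-up numbers, not from a real model):

```python
# Given the reward model's scores for two answers to the *same* prompt, the
# probability that answer 1 beats answer 2 is the sigmoid of the score difference.
import torch
import torch.nn.functional as F

score_1 = torch.tensor(2.3)  # reward score for answer 1 (hypothetical)
score_2 = torch.tensor(1.1)  # reward score for answer 2 (hypothetical)

p_1_beats_2 = torch.sigmoid(score_1 - score_2)
print(f"P(answer 1 > answer 2) = {p_1_beats_2.item():.3f}")

# The pairwise training loss pushes the chosen answer's score above the
# rejected answer's score (negative log-likelihood of the comparison).
loss = -F.logsigmoid(score_1 - score_2)
print(f"pairwise loss = {loss.item():.3f}")
```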
Absolute scores
Some reward models (such as SteerLM) instead output an absolute score. This type of model is more convenient to use, since it can score an answer directly without constructing pairs, but its training data is harder to collect, because absolute scores are a relatively less stable measurement of human preference.
More recently, stronger models have been proposed that output both absolute and relative scores, such as HelpSteer2-Preference and ArmoRM.
How to evaluate with a reward model
Given a dataset of prompts, have the LLM generate an answer for each prompt, then ask the reward model to score those answers.
If the reward model outputs absolute scores, you can average the scores over all answers to get a final score.
In practice, however, most reward models output relative scores, and averaging them can be skewed by outliers (a few very good or very bad answers), because scores for different prompts may be on different scales (some prompts are simpler or harder than others).
Instead, we generally use one of the following (a small sketch follows the list):
- Win rate: take a set of reference answers and compute the percentage of model answers that score higher than the corresponding reference answer; this gives a result that is robust to differences in score scale across prompts.
- Win probability: take a set of reference answers and compute the average probability that the model answer is better than the reference answer; this provides a finer-grained, smoother signal.
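A small sketch of both aggregates, assuming reward-model scores have already been computed for the model's answers and for a set of reference answers on the same prompts (all numbers below are made up):

```python
import math

model_scores = [1.8, -0.2, 3.1, 0.5]     # hypothetical scores of the evaluated model's answers
reference_scores = [1.1, 0.4, 2.7, 0.9]  # hypothetical scores of the reference answers

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Win rate: fraction of prompts where the model's answer outscores the reference.
win_rate = sum(m > r for m, r in zip(model_scores, reference_scores)) / len(model_scores)

# Win probability: average Bradley-Terry probability that the model's answer is
# preferred over the reference, a smoother signal than the 0/1 win rate.
win_probability = sum(sigmoid(m - r) for m, r in zip(model_scores, reference_scores)) / len(model_scores)

print(f"win rate = {win_rate:.2f}, win probability = {win_probability:.2f}")
```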
Pros and cons of reward models
Advantages:
- Very fast: a reward model only needs to produce a single score (unlike an LLM judge, which has to generate long text), so its inference speed is comparable to that of a small model.
- Deterministic: the forward pass is identical for the same input, so the final score stays consistent.
- Insensitive to position bias: most reward models process only one answer at a time, so they are rarely affected by ordering. Even reward models that process paired answers are barely affected by position bias, as long as their training data is balanced with respect to answer order.
- No prompt engineering required: by design, a reward model has a single task, outputting a score for one or two answers, learned from preference data.
Disadvantages:
- Requires task-specific fine-tuning: even though a reward model inherits many capabilities from its base model after fine-tuning, it may still perform poorly on tasks outside the scope of its training set, and this fine-tuning step can be relatively expensive.
- Less reliable for evaluating models trained with reinforcement learning against that same reward model (or with direct alignment algorithms on datasets similar to the reward model's training set): the language model may have overfit to the reward model's preferences.
Tips and tricks for evaluating with reward models
- The RewardBench leaderboard ranks reward models; you can find many high-performing models there.
- The Nemotron paper shares practical experience with using reward models.
- For reward models that score a single prompt/answer pair, scores of previously evaluated models can be cached, so conclusions about a new model can be drawn quickly (a caching sketch follows this list).
- This paper tracks win rate or win probability during training, which helps detect model degradation and select the best checkpoint.
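A minimal sketch of the caching idea from the list above; the cache file name and structure are illustrative assumptions, and `score_fn` stands for any prompt/answer scoring function such as the one sketched earlier:

```python
import json
from pathlib import Path

CACHE_PATH = Path("reward_scores_cache.json")  # hypothetical cache location

def load_cache() -> dict:
    # Previously computed scores, keyed by model name.
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def save_cache(cache: dict) -> None:
    CACHE_PATH.write_text(json.dumps(cache, indent=2))

def cached_scores(model_name: str, prompts: list, answers: list, score_fn) -> list:
    """Return reward scores for `model_name`'s answers, computing them only on a cache miss."""
    cache = load_cache()
    if model_name not in cache:
        cache[model_name] = [score_fn(p, a) for p, a in zip(prompts, answers)]
        save_cache(cache)
    return cache[model_name]
```

Scores for reference models are computed once and reused, so only the new model's answers need to be passed through the reward model.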
Original English:/huggingface/evaluation-guidebook/refs/heads/main/translations/zh/contents/model-as-a-judge/
Original author: clefourrier
Translator: SuSung-boy
Reviewed: adeenayakup