Let LLM Be the Judge | Choose an LLM Assessment Model

Basic concepts

This is the first article in the Let LLM Be the Judge series. Stay tuned for the rest of the series:

Basic concepts

Choosing an LLM judge model

Designing your own evaluation prompt

Evaluating your judge's results

Reward models

Tips and tricks

What is a judge model?

Judge models are neural networks used to evaluate the outputs of other neural networks. Most often, they are used to evaluate the quality of generated text.

Judge models range from small, specialized classifiers (for example, a spam classifier) to LLMs, which may be large and general-purpose or small and specialized. When using an LLM as a judge, you give it a prompt that explains how to score the model's output (for example: Please rate the fluency of the sentence from 0 to 5, with 0 being completely incomprehensible, …).
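As a rough illustration of the LLM + prompt setup described above, here is a minimal sketch. The prompt wording, the `Score: <number>` reply format, and the `call_llm` function are all assumptions made for this example; `call_llm` stands for whatever function sends a prompt to your LLM and returns its text reply.

```python
import re

# Hypothetical prompt template: the wording and the 0-5 scale are illustrative only.
FLUENCY_PROMPT = (
    "Please rate the fluency of the sentence below from 0 to 5, "
    "with 0 being completely incomprehensible and 5 being perfectly fluent.\n"
    "Sentence: {sentence}\n"
    "Answer with 'Score: <number>'."
)


def build_prompt(sentence):
    """Fill the template with the text to be judged."""
    return FLUENCY_PROMPT.format(sentence=sentence)


def parse_score(reply, low=0, high=5):
    """Return the integer score found in the judge's reply, or None if absent or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge_fluency(sentence, call_llm):
    """call_llm is any function mapping a prompt string to a reply string (an assumption)."""
    return parse_score(call_llm(build_prompt(sentence)))


def fake_llm(prompt):
    """Stand-in for a real LLM call, so the sketch runs without network access."""
    return "The sentence reads naturally. Score: 4"


print(judge_fluency("The cat sat on the mat.", fake_llm))
```

Returning None for unparseable or out-of-range replies keeps malformed judge outputs visible instead of silently turning them into scores.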

Using models as judges makes it possible to score text on complex and nuanced properties.
For example, an exact match between a prediction and a reference can only test whether a model predicted the correct fact or number. Assessing more open-ended, empirical capabilities (such as fluency, poetic quality, or faithfulness to an input) requires more sophisticated evaluators.

This is where judge models come in.
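To make the limitation of exact matching concrete, here is a minimal sketch. The normalization step (lowercasing and collapsing whitespace) is an assumption chosen for this example; real benchmarks define their own rules.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an illustrative normalization choice)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """True only when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


# A bare answer matches the reference...
print(exact_match(" paris ", "Paris"))
# ...but a correct answer phrased as a full sentence does not.
print(exact_match("The capital of France is Paris.", "Paris"))
```

The second call fails even though the answer is factually right, and the metric says nothing at all about properties such as fluency.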

Judge models are generally used for three main tasks:

Scoring generated text: Evaluate a given property of a text (such as fluency, harmfulness, coherence, or persuasiveness) using predefined scoring criteria and a predefined range.
Pairwise comparison: Compare two model outputs and select the one that performs better on a given property.
Computing text similarity: Evaluate how closely a model output matches a reference text.
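The pairwise comparison task above can be sketched as follows. The prompt template, the `Winner: ...` reply format, and `call_llm` (any function that sends a prompt to an LLM and returns its reply) are assumptions for illustration, not a fixed convention.

```python
import re

# Hypothetical template for comparing two outputs on one property.
PAIRWISE_PROMPT = (
    "You are comparing two answers on the following property: {attribute}.\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n"
    "Reply with 'Winner: A', 'Winner: B' or 'Winner: tie'."
)


def parse_winner(reply):
    """Return 'A', 'B' or 'tie' from the judge's reply, or None if no verdict is found."""
    match = re.search(r"Winner:\s*(A|B|tie)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    verdict = match.group(1)
    return "tie" if verdict.lower() == "tie" else verdict.upper()


def compare(answer_a, answer_b, attribute, call_llm):
    """call_llm is any function mapping a prompt string to a reply string (an assumption)."""
    prompt = PAIRWISE_PROMPT.format(
        attribute=attribute, answer_a=answer_a, answer_b=answer_b
    )
    return parse_winner(call_llm(prompt))


def fake_llm(prompt):
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return "Answer B is easier to read. Winner: B"


print(compare("Me go store.", "I went to the store.", "fluency", fake_llm))
```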

Note: This article currently focuses on the LLM + prompt approach to evaluation. However, it is worth understanding how simple classifier judge models work, since this approach performs reliably in many use cases. Some promising new approaches have also emerged recently, such as using reward models as judges (introduced in this report, and briefly covered in this guide's article on reward models).

Advantages and disadvantages of LLM judge models:

Advantages:

Objectivity: Compared with humans, LLM judges make empirical judgments automatically and more objectively.
Scalability and reproducibility: LLM judges can be run on very large amounts of data, and their results can be reproduced.
Lower cost: Using a judge model is cheaper than paying human annotators, because no new model needs to be trained. An existing high-quality LLM and a prompt are enough to carry out the evaluation.
Alignment with human judgment: LLM judge results correlate with human judgment to some extent.
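One common way to check how well a judge aligns with human judgment is to correlate its scores with human scores on the same samples. The sketch below computes a Pearson correlation in plain Python. The scores are made-up illustrative numbers, not real results, and Pearson is only one possible choice of agreement measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up scores for five samples (illustrative only).
human_scores = [1, 2, 3, 4, 5]
judge_scores = [1, 3, 2, 5, 4]

print(pearson(human_scores, judge_scores))  # approximately 0.8
```

A value close to 1 suggests the judge ranks samples much as the human annotators do; a value near 0 suggests the judge's scores carry little information about human preferences.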

Disadvantages:

LLM judges appear objective, but they carry hidden biases that are harder to detect than human biases, because we cannot proactively look for them (see the Tips and tricks article of this series). In addition, human bias can be mitigated by designing specific, statistically robust survey questions (something sociology has studied for about a century), whereas methods for mitigating LLM bias are far less mature. Using LLMs to evaluate LLMs may also create an "echo chamber" effect that subtly reinforces the models' inherent biases.
While LLM judges scale well, they also generate large amounts of data that must be examined carefully. For example, a judge can produce reasoning traces or inferences about the data, but these outputs require further analysis on your side.
LLM judges are generally cheap, but for some specific tasks, getting higher-quality evaluations may require hiring expert human annotators, which raises the cost accordingly.

How to get started?

If you want to try setting up your own LLM judge, I recommend reading the LLM judge guide (⭐) by Aymeric Roucher!
Some tools to try: distilabel, a library for generating and iterating on datasets with LLMs. It comes with a tutorial applying the methods of the Ultrafeedback paper, as well as a tutorial implementing the Arena Hard benchmark.

Original English text: /huggingface/evaluation-guidebook/blob/main/translations/zh/contents/model-as-a-judge/

Original author: clefourrier

Translator: SuSung-boy

Reviewer: adeenayakup