
Let LLM judge | Evaluate your evaluation results



This is the third article in the "Let LLM judge" series. The full series:

  • Basic concepts
  • Select the LLM evaluation model
  • Design your own evaluation prompt
  • Evaluate your evaluation results
  • Related content on reward models
  • Tips and Tricks

Before using an LLM judge in production or at scale, you need to evaluate how well it performs on the target task and make sure its ratings are consistent with the expected task performance.

Note: If the evaluation model's output is a binary classification, evaluation is relatively simple, because many interpretable classification metrics are available (such as precision, recall, and accuracy). But if the output is a score within a range, evaluation is harder, because correlation metrics between the model's output and the reference answers are difficult to map precisely onto the scores.

After selecting the LLM evaluation model and designing the prompt, you still need to:

1. Select a baseline

You need to compare the evaluation results of the selected model against a baseline. The baseline can take many forms, such as manual annotation results, gold-standard answers, results from other well-performing evaluation models, or outputs of the same model under other prompts.

The number of test cases does not need to be large (50 is enough), but they must be highly representative (covering edge cases, for example), well differentiated, and high quality.
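As a minimal sketch of what such a baseline set could look like (the field names and values here are illustrative assumptions, not a prescribed schema):

```python
# A minimal sketch of a baseline test set. Field names and values are
# illustrative assumptions, not a prescribed schema.
test_cases = [
    {
        "input": "Summarize the article in one sentence.",
        "model_output": "The article argues that ...",
        "human_label": 1,  # baseline: manual annotation (1 = good, 0 = bad)
    },
    {
        "input": "",  # edge case: empty input
        "model_output": "I need more context to answer.",
        "human_label": 0,
    },
    # ... around 50 representative, differentiated, high-quality cases
]
```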

2. Select evaluation metrics

Evaluation metrics quantify the gap between the evaluation results and the reference standard.

Generally speaking, if you are comparing binary classifications or pairwise comparisons, the metrics are very easy to compute: precision and recall (for binary classification) and accuracy (for pairwise comparison) are commonly used, and all of them are easy to understand and explain.
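For instance, a sketch of computing these classification metrics with scikit-learn (the label lists below are hypothetical; for pairwise comparison, accuracy is simply the fraction of pairs on which the judge agrees with the baseline):

```python
# Sketch: comparing binary LLM-judge verdicts against baseline labels.
# Assumes scikit-learn is installed; the label lists are hypothetical.
from sklearn.metrics import accuracy_score, precision_score, recall_score

human_labels = [1, 0, 1, 1, 0, 1]  # baseline, e.g. manual annotation
judge_labels = [1, 0, 1, 0, 0, 1]  # the LLM judge's binary verdicts

print("accuracy :", accuracy_score(human_labels, judge_labels))
print("precision:", precision_score(human_labels, judge_labels))
print("recall   :", recall_score(human_labels, judge_labels))
```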

If you are comparing model scores with human scores, the metrics are harder to compute. For a deeper treatment, you can read this blog.
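A common starting point is a correlation coefficient between the two sets of scores, such as the Pearson coefficient mentioned in step 3 below. A sketch using SciPy (the score lists are hypothetical):

```python
# Sketch: correlating LLM-judge scores with human scores.
# Assumes SciPy is installed; the score lists are hypothetical.
from scipy.stats import pearsonr, spearmanr, kendalltau

human_scores = [4, 2, 5, 3, 1, 4]  # human ratings, e.g. on a 1-5 scale
judge_scores = [5, 2, 4, 3, 2, 4]  # LLM-judge ratings on the same scale

pearson_r, _ = pearsonr(human_scores, judge_scores)    # linear correlation
spearman_r, _ = spearmanr(human_scores, judge_scores)  # rank correlation
kendall_t, _ = kendalltau(human_scores, judge_scores)  # rank agreement

print(f"Pearson : {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
print(f"Kendall : {kendall_t:.3f}")
```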

In general, if you are not sure how to choose the right evaluation metric or evaluation model, you can refer to the chart in this blog ⭐.

3. Evaluate your evaluation results

In this step, you simply run the evaluation model with the test prompt over the samples. Once you have the evaluation results, you compute the score using the metrics selected in the previous step.

You need to set a threshold for deciding whether the result is acceptable, and the threshold depends on the difficulty of your task. For example, the accuracy threshold for pairwise-comparison tasks can be set between 80% and 95%. For the correlation metrics of score-ranking tasks, a Pearson correlation coefficient of 0.8 is often used in the literature, although some papers argue that 0.3 is already enough to indicate good correlation with human evaluation. So the standards are not fixed: adjust them flexibly according to your task!
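A sketch of such a threshold check (the 0.8 value follows the Pearson threshold cited above; the score lists are hypothetical):

```python
# Sketch: accepting or rejecting the judge against a task-dependent
# threshold. The 0.8 Pearson threshold follows the value cited above;
# the score lists are hypothetical.
from scipy.stats import pearsonr

human_scores = [4, 2, 5, 3, 1, 4]
judge_scores = [5, 2, 4, 3, 2, 4]

PEARSON_THRESHOLD = 0.8  # loosen or tighten depending on task difficulty

r, _ = pearsonr(human_scores, judge_scores)
if r >= PEARSON_THRESHOLD:
    print(f"Evaluator accepted (r = {r:.2f})")
else:
    print(f"Evaluator rejected (r = {r:.2f}); revise the prompt or model")
```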


> Original English text: /huggingface/evaluation-guidebook/refs/heads/main/translations/zh/contents/model-as-a-judge/

Original author: clefourrier

Translator: SuSung-boy

Reviewer: adeenayakup