
Let LLM judge | Design your own evaluation prompt



This is the third article in the Let LLM judge series. The full series:

  • Basic concepts
  • Select the LLM evaluation model
  • Design your own evaluation prompt
  • Evaluate your evaluation results
  • Related content of reward model
  • Tips and tricks

General prompt design suggestions

The general prompt-design principles I have distilled from common practice are as follows (a sketch putting them together follows the list):

  • Make the task description clear:
    • Your task is to do X.
    • You will be provided with Y.
  • Spell out the evaluation criteria and, if needed, the scoring details:
    • You should evaluate property Z on a scale of 1 - 5, where 1 means ...
    • You should evaluate if property Z is present in the sample Y. Property Z is present if ...
  • Add some "reasoning" evaluation steps:
    • To judge this task, you must first make sure to read sample Y carefully to identify ..., then ...
  • Specify the output format (adding specific fields can improve consistency):
    • Your answer should be provided in JSON, with the following format: {"Score": your score, "Reasoning": the reasoning which led you to this score}
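As a concrete illustration, here is a minimal Python sketch that assembles a judge prompt following the four principles above. The fluency criterion, the 1 - 5 scale wording, and the JSON field names are illustrative choices, not templates from the original guidebook:

```python
# A minimal judge-prompt sketch: clear task, detailed criteria,
# a reasoning step, and a fixed JSON output format.

JUDGE_PROMPT = """\
Your task is to evaluate the fluency of a model answer.
You will be provided with a user question and the model's answer.

You should evaluate fluency on a scale of 1 - 5, where:
1 means the answer is unreadable,
3 means the answer is understandable but awkward,
5 means the answer reads naturally throughout.

To judge this task, first read the question carefully to understand what
was asked, then read the answer and identify any awkward or broken
passages before deciding on a score.

Your answer should be provided in JSON, with the following format:
{{"Score": your score, "Reasoning": the reasoning which led you to this score}}

Question: {question}
Answer: {answer}
"""

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template with one sample to evaluate."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

print(build_judge_prompt("What is an LLM?", "A large language model is ..."))
```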

For prompt-writing inspiration, refer to the MixEval or MTBench prompt templates.

Other key points:

  • Pairwise comparison (rather than outputting a raw score) better reflects human preferences and is usually more robust (see the sketch after this list).
  • If the task really requires a numeric score, use integers and explain the meaning of each value in detail, or add additive instructions to the prompt, e.g. "provide 1 point if the answer has this characteristic, 1 additional point if ...", and so on.
  • Use a dedicated scoring prompt for each capability you evaluate; you will get better and more robust results.
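Here is a minimal sketch of pairwise comparison, assuming a generic `complete()` callable (a hypothetical stand-in for whatever client sends a prompt to your judge model and returns its text). Running each pair twice with the answer order swapped is a common way to counter position bias:

```python
# Pairwise-comparison sketch. `complete` is an assumed callable:
# prompt in, judge-model text out.

PAIRWISE_PROMPT = """\
Your task is to judge which of two answers to a question is better.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one letter: "A" or "B".
"""

def judge_pair(question: str, first: str, second: str, complete) -> str:
    """One judge call; `first` is shown as Answer A, `second` as Answer B."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, answer_a=first, answer_b=second)
    return complete(prompt).strip()

def compare(question: str, ans1: str, ans2: str, complete) -> str:
    # Ask twice with the answers swapped; only trust a consistent verdict.
    v1 = judge_pair(question, ans1, ans2, complete)  # ans1 shown as "A"
    v2 = judge_pair(question, ans2, ans1, complete)  # ans1 shown as "B"
    if v1 == "A" and v2 == "B":
        return "ans1"
    if v1 == "B" and v2 == "A":
        return "ans2"
    return "tie"  # inconsistent verdicts: treat as a tie
```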

Improve assessment accuracy

Evaluation accuracy can be improved with the following techniques (possibly at increased cost):

  • Few-shot examples: provide a small number of examples to help the model understand and reason, at the cost of a longer context.
  • Reference answers: provide reference content to improve the accuracy of the model's output.
  • Chain of thought (CoT): require the model to give its reasoning process before scoring, which improves accuracy (see this post).
  • Multi-turn analysis: better at detecting factual errors.
  • Jury mechanism: aggregating the results of multiple evaluation models beats a single model's results (see the sketch after this list).
    • Using multiple small models in place of one large model can significantly reduce costs.
    • You can also run one model several times at different temperature settings.
  • The community has found, somewhat unexpectedly, that adding an incentive to the prompt (for example: "the correct answer will get a kitten") can improve answer correctness. The effect varies by scenario, so adjust it to your needs.
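To illustrate the jury mechanism, here is a sketch that aggregates scores from several judge calls. The `score_with()` helper and the juror model names are hypothetical placeholders for your own judge-calling code:

```python
import statistics

# Jury-mechanism sketch. `score_with(model, prompt, temperature=...)` is an
# assumed helper returning one integer score from a single judge call.

def jury_score(prompt: str, score_with) -> float:
    jurors = [
        ("small-judge-1", 0.0),  # several small models ...
        ("small-judge-2", 0.0),
        ("small-judge-3", 0.0),
        ("small-judge-1", 0.7),  # ... or the same model at another temperature
    ]
    scores = [score_with(model, prompt, temperature=t) for model, t in jurors]
    # The median is robust to a single juror going off the rails;
    # a mean or a majority vote works as well.
    return statistics.median(scores)
```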

Note: to reduce model bias, you can draw on questionnaire design from sociology and write the prompt for your usage scenario. If you want to use a model in place of human evaluation, you can design analogous quality indicators: for example, computing labeler consistency (see the sketch below), or applying sound questionnaire methodology to reduce bias.
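As one example of such an indicator, here is a sketch of Cohen's kappa, a standard measure of inter-labeler agreement that can equally be computed between a judge model and a human annotator; the formula is standard, not something specific to the guidebook:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Agreement between two labelers, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each labeler's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([1, 1, 0, 1], [1, 0, 0, 1]))  # 0.5
```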

In practice, however, most people do not need a fully reproducible, high-quality, unbiased evaluation; a quick and slightly rough prompt can meet the need. (This is acceptable as long as you understand the consequences.)


Original English:/huggingface/evaluation-guidebook/refs/heads/main/translations/zh/contents/model-as-a-judge/

Original author: clefourrier

Translator: SuSung-boy

Reviewer: adeenayakup