
Letting an LLM Be the Judge | Tips and Tricks

2025-04-09 15:57:01
  • Lack of internal consistency: the same evaluation prompt can yield different results across runs (unless the temperature parameter is set to 0).
    • Mitigation: following the self-consistency prompting setup, run the judge several times and keep the majority result.
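A minimal sketch of this self-consistency mitigation, assuming a hypothetical `judge_fn` callable that wraps your LLM API call (sampled at temperature > 0):

```python
from collections import Counter

def self_consistent_verdict(judge_fn, prompt, n_samples=5):
    """Query the judge n_samples times and keep the verdict
    that appears most often across the samples."""
    votes = [judge_fn(prompt) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]
```

Odd sample counts avoid exact ties for binary verdicts; for graded scores, a median over samples is a common alternative.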
  • Self-preference: an LLM judge tends to prefer outputs that match its own style, so answers with similar patterns receive higher scores.
    • Mitigation: use a jury of several different judge models.
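The jury mechanism can be sketched as a majority vote over heterogeneous judges; `judges` here is an assumed list of callables, each wrapping a different model:

```python
from collections import Counter

def jury_verdict(judges, prompt):
    """Collect one verdict per judge model and return the majority vote
    plus the full tally, so no single model's self-preference dominates."""
    tally = Counter(judge(prompt) for judge in judges)
    verdict, _ = tally.most_common(1)[0]
    return verdict, dict(tally)
```

Reporting the tally alongside the verdict makes close votes visible, which is useful when deciding whether to escalate a comparison to human review.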
  • Insensitivity to input perturbations: judges are poor at recognizing perturbed input and struggle to produce a consistent range of ratings (see this link for more experimental results). For example, when texts are corrupted with the same amount of noise, the judge's quality scores fail to reflect the degree of noise.
    • Mitigation measures:
      • Require the model to output a detailed reasoning process before the score
      • Add consistent scoring criteria to the prompt
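Both mitigations can be combined in the judge prompt itself. A sketch with an illustrative rubric (the criteria and field labels are assumptions, not any model's required format):

```python
# Illustrative rubric; tailor the criteria to your own task.
RUBRIC = (
    "Score 1: incoherent, off-topic, or factually wrong.\n"
    "Score 3: mostly correct but with noticeable errors or omissions.\n"
    "Score 5: accurate, complete, and clearly written."
)

def build_judge_prompt(question, answer):
    """Ask for detailed reasoning first, then the score, under a fixed rubric."""
    return (
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        f"Scoring criteria:\n{RUBRIC}\n\n"
        "First explain your reasoning step by step. "
        "Then, on a final line, output 'Score: <1-5>'."
    )
```

Placing the score on a fixed final line also makes the verdict easy to parse programmatically.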
  • Position bias: judges tend to prefer answers in particular positions. For example, in pairwise comparisons, Claude and GPT-3.5 often favor a fixed position across repeated trials, such as the first or the second answer.
    • Mitigation measures:
      • Randomly shuffle the positions of the answers
      • Compute the log-probabilities of all options and normalize
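A stricter variant of position shuffling is to judge the pair in both orders and count only verdicts that survive the swap. A sketch, where `judge_fn(first, second)` is a hypothetical pairwise call returning "first" or "second":

```python
def position_debiased_compare(judge_fn, answer_a, answer_b):
    """Judge the pair in both orders; a win counts only if it is
    consistent under the swap, otherwise declare a tie."""
    v1 = judge_fn(answer_a, answer_b)  # A shown first
    v2 = judge_fn(answer_b, answer_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"
```

A judge that blindly favors one position produces only ties under this scheme, which surfaces the bias instead of hiding it.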
  • Length preference (length bias): judges tend to prefer longer answers.
    • Mitigation: account for length differences between the answers.
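One simple way to account for length is to dock points in proportion to how far an answer exceeds a target length. The constants below are illustrative, not tuned, and whitespace word counting stands in for a real tokenizer:

```python
def length_adjusted_score(raw_score, answer, target_len=100, penalty_per_100=0.1):
    """Subtract a small penalty proportional to how far the answer's
    word count exceeds a target length; answers at or under the
    target are left unchanged."""
    n_words = len(answer.split())
    excess = max(0, n_words - target_len)
    return raw_score - penalty_per_100 * (excess / 100)
```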
  • Difficult to align with human judgments
    • Across evaluations, whether human assessment can serve as a good baseline is itself still debated. For example, in specialized domains (such as medicine, law, or mathematics), annotators who lack expertise may produce results as poor as using an LLM directly.
  • Format bias: if the prompt format given to the judge differs greatly from the format of its training data, the evaluation results may be inaccurate. For example, a pairwise-comparison model may have been trained on data that includes a reference answer; if no reference answer is provided at evaluation time, or the one provided is mis-formatted, the results are unreliable.
    • Mitigation: carefully follow the prompt format of the judge's training set (e.g., the format used for instruction fine-tuning).
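For instance, if the judge was fine-tuned on pairwise prompts that include a reference answer, the evaluation prompt should reproduce that layout exactly. The field names below are hypothetical placeholders, not any specific model's real template:

```python
def build_pairwise_prompt(question, reference, answer_a, answer_b):
    """Mirror an assumed training-time layout: question, reference
    answer, then the two candidates, in that fixed order."""
    return (
        f"[Question]\n{question}\n\n"
        f"[Reference Answer]\n{reference}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        "Which answer is better, A or B?"
    )
```

Centralizing prompt construction in one function makes it easy to keep every evaluation call consistent with the training-time format.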