
Paper Interpretation: "From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge"

  1. Published: 2024
  2. Journal/Conference: arXiv (preprint)
  3. Institution: Arizona State University
  4. Author(s): Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
  5. Link to paper

What is LLM-as-a-judge


Definition: LLM-as-a-judge refers to using the advanced text-comprehension and generation capabilities of a large language model (LLM) to evaluate, judge, or make decisions about a given task or problem, much like a judge in a competition.

Mathematical formulation: given a judge LLM \(J\), the judgment process can be expressed as:

\[R = J(C_1, \ldots, C_n) \]

Here, \(C_i\) is the \(i\)-th candidate to be judged, and \(R\) is the judgment result.

  • Input

    • Point-Wise

      When n = 1, the judgment is point-wise: the judge LLM focuses on a single candidate sample

    • Pair / List-Wise

      When n ≥ 2, the judgment is pair-wise (n = 2) or list-wise (n > 2): multiple candidate samples are presented together to the judge LLM for comparison and a comprehensive evaluation

  • Output

    • Score

      Each candidate sample is assigned a continuous or discrete score, \(R=\{C_1:S_1,\ldots,C_n:S_n\}\). Score-based judgment is the most common and widely used protocol

    • Ranking

      In ranking-based judgment, the output is a ranking of the candidate samples, denoted \(R=\{C_i>\cdots>C_j\}\). This protocol is useful when an ordering must be established among the candidates

    • Selection

      In selection-based judgment, the output selects one or more best candidates, denoted \(R=\{C_i,\ldots,C_j\}>\{C_1,\ldots,C_n\}\). This protocol is especially useful in decision-making or content-filtering settings (a minimal sketch of these protocols follows this list)
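
To make the protocols concrete, here is a minimal Python sketch of point-wise scoring, pair-wise selection, and list-wise ranking. The prompt wording and the `call_llm` helper are illustrative assumptions standing in for any chat-completion client, not an API from the paper.

```python
# Minimal sketch of the LLM-as-a-judge input/output protocols described above.
# `call_llm` is a hypothetical placeholder: plug in your own LLM client.
from typing import List

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def pointwise_judge(candidate: str, criteria: str = "overall quality") -> str:
    """n = 1: score a single candidate, R = {C1: S1}."""
    prompt = (
        f"Rate the following response for {criteria} on a 1-10 scale. "
        f"Reply with the score only.\n\nResponse:\n{candidate}"
    )
    return call_llm(prompt)

def pairwise_judge(candidate_a: str, candidate_b: str) -> str:
    """n = 2: selection-based judgment, picking the better of two candidates."""
    prompt = (
        "Which response is better, A or B? Reply with 'A', 'B', or 'tie'.\n\n"
        f"Response A:\n{candidate_a}\n\nResponse B:\n{candidate_b}"
    )
    return call_llm(prompt)

def listwise_judge(candidates: List[str]) -> str:
    """n > 2: ranking-based judgment, R = {Ci > ... > Cj}."""
    numbered = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(candidates))
    prompt = (
        "Rank the following responses from best to worst. "
        "Reply with the indices in order, e.g. '2 > 1 > 3'.\n\n" + numbered
    )
    return call_llm(prompt)
```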

Taxonomy of research in LLM-as-a-judge


Attributes

What to judge? We delve into specific attributes assessed by the judge LLMs, including helpfulness, harmlessness, reliability, relevance, feasibility and overall quality.

Helpfulness

This dimension evaluates whether the response generated by the model is helpful to the user. In other words, whether the model's response effectively solves the user's problem or fulfills the user's need.

  • Relies on large amounts of helpful/unhelpful comparison data, which is then used for comparative training during reinforcement learning

Harmlessness

Harmlessness refers to whether the model's answers avoid negative impacts or undesirable consequences. The model should avoid providing offensive or misleading information, or advice that could lead to dangerous behavior.

  • Using principles to guide the judge LLM in harmlessness assessment for alignment purposes

  • Fine-tuning on safe/unsafe data

  • Small LLMs have been found to be effective safety judges when fine-tuned

  • Rewindable Auto-regressive Inference (RAIN) allows the LLM to self-evaluate and rewind its generation to ensure AI safety

Reliability

Reliability assesses whether the content generated by the model is accurate and credible. This includes the truthfulness and accuracy of the information as well as the consistency of the model in answering questions.

  • Selecting relevant evidence and providing detailed comments to enhance factual assessment

  • Automated assessment methods using GPT-4 to determine whether model outputs are hallucinated

  • Building new datasets, or equipping the judge LLM with external search APIs, for factuality judgment

Relevance

Relevance examines whether the model's answers are directly related to the question posed by the user. Models should avoid going off topic or providing irrelevant information.

  • Traditional approaches typically rely on keyword matching or semantic similarity

  • Directly replace expensive and time-consuming manual annotations with LLM judgments

  • Relevance judgment is also applied in multimodal settings, RAG, SQL equivalence checking, search, retrieval, and recommendation

Feasibility

Feasibility measures whether the solution or recommendation provided by the model can actually be implemented in the real world.

  • Feasibility assessment using metrics or external tools

  • Using structures such as trees and graphs, the LLM itself is used to select the most appropriate actions

  • Also used in API selection, tool use, and LLM routing.

Overall Quality

Overall quality is a comprehensive metric that typically measures the overall strengths and weaknesses of model-generated content based on performance across multiple specific dimensions.

  • Calculate the average of the scores for specific aspects

  • Direct generation of overall judgments (e.g., summaries, translations)

Methodology

Methodology: How to judge? We explore various tuning and prompting techniques for LLM-as-a-judge systems, including manually-labeled data, synthetic feedback, supervised fine-tuning, preference learning, swapping operation, rule augmentation, multi-agent collaboration, demonstration, multi-turn interaction and comparison acceleration.

Tuning

  • Data Source

    • Manually-labeled

      An intuitive way to train a judge LLM with human-like criteria is to collect manually labeled samples and their corresponding judgments.

    • Synthetic Feedback

      While manually labeled feedback is of high quality and accurately reflects human judgmental preferences, it has limitations in terms of volume and coverage. As a result, some researchers have turned to synthesized feedback as a source of data for adapting judge LLMs.

  • Tuning Techniques

    • Supervised Fine-Tuning

      Supervised fine-tuning (SFT) is the most commonly used approach for helping judge LLMs learn from pairwise or pointwise judgment data (a data-formatting sketch follows this list).

    • Preference Learning

      Preference learning is closely aligned with judgment and evaluation tasks, especially comparative and ranking judgment. In addition to works that directly adopt or augment preference-learning datasets for supervised fine-tuning of judge LLMs, several studies apply preference-learning techniques to enhance LLMs' judging capabilities.
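
As a concrete illustration of the tuning data above, the following Python sketch flattens a pairwise preference record into an SFT instance for a judge LLM. The record fields and the prompt template are illustrative assumptions, not a format prescribed by the survey.

```python
# Minimal sketch: turning pairwise preference records into SFT instances for a
# judge LLM. The record fields and prompt wording are illustrative assumptions.
import json

JUDGE_PROMPT = (
    "You are an impartial judge. Given an instruction and two responses, "
    "decide which response is better.\n\n"
    "Instruction: {instruction}\n\nResponse A: {response_a}\n\n"
    "Response B: {response_b}\n\nAnswer with 'A' or 'B'."
)

def to_sft_instance(record: dict) -> dict:
    """record = {"instruction", "chosen", "rejected"}, human- or synthetically-labeled."""
    prompt = JUDGE_PROMPT.format(
        instruction=record["instruction"],
        response_a=record["chosen"],
        response_b=record["rejected"],
    )
    # The target teaches the judge to pick the preferred (chosen) response.
    return {"input": prompt, "output": "A"}

if __name__ == "__main__":
    demo = {"instruction": "Explain recursion briefly.",
            "chosen": "Recursion is when a function calls itself...",
            "rejected": "Recursion is a type of loop."}
    print(json.dumps(to_sft_instance(demo), indent=2))
```

In practice one would also emit a mirrored instance with the two responses swapped, so the judge does not learn a positional shortcut; this connects to the swapping operation discussed under Prompting below.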

Prompting

  • Swapping Operation

    Previous research has shown that LLM-based judges are sensitive to the position of candidates, and that the quality ranking of candidate responses can be easily manipulated by changing their order in the context. To mitigate this positional bias and create a fairer LLM judging system, swapping operations have been introduced and widely adopted. The technique invokes the judge LLM twice, swapping the order of the two candidates between the calls. If the results are inconsistent after swapping, the outcome is labeled a "tie", indicating that the LLM cannot confidently distinguish the candidates' quality (see the sketch after this list).

  • Rule Augmentation

    Rule-enhanced prompting involves embedding a set of principles, references, and assessment rules directly within the LLM judge's prompt. This approach is commonly used in LLM-based assessments to allow for a more precise and directed comparison of two candidates.

  • Multi-Agent Collaboration

    Obtaining results from individual LLM judges may not be reliable due to various biases inherent in LLMs. To address this limitation, the Peer Ranking (PR) algorithm was introduced, which takes into account the pairwise preferences of each peer LLM for all pairs of answers and produces the final ranking of the model.

  • Demonstration

    Contextual samples, or demonstrations, provide concrete examples for the LLM and have been shown to be a key factor in the success of LLM in-context learning. A number of studies introduce human assessment results as demonstrations for the LLM judge, aiming to guide the LLM in learning evaluation criteria from a few specific in-context samples.

  • Multi-Turn Interaction

    In an assessment, a single response may not provide the LLM judge with enough information to thoroughly and fairly evaluate each candidate's performance. To address this limitation, multiple rounds of interaction are often used to provide a more comprehensive assessment. Typically, the process begins with an initial query or topic, followed by a dynamic interaction between the judge LLM and the candidate model.

  • Comparison Acceleration

    Of the various comparison formats in LLM-as-a-judge (e.g., point-wise and pair-wise), pairwise comparison is the most common method for directly comparing two models or generating pairwise feedback. However, this approach can be very time-consuming when multiple candidates need to be ranked. To reduce the computational overhead, Zhai et al. proposed a ranked pairwise approach in which all candidates are first compared to a blank baseline response; each candidate is then ranked by its performance relative to the baseline.
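
As referenced under Swapping Operation, here is a minimal sketch of the swap-and-compare procedure. `pairwise_judge(a, b)` is assumed to return "A", "B", or "tie" for the candidates as presented (a hypothetical helper, as in the earlier sketch), not the paper's implementation.

```python
# Minimal sketch of the swapping operation: query the judge twice with the
# candidate order reversed and report a tie if the two verdicts disagree.

def swapped_judgment(pairwise_judge, cand_1: str, cand_2: str) -> str:
    first = pairwise_judge(cand_1, cand_2)   # cand_1 shown as "A"
    second = pairwise_judge(cand_2, cand_1)  # order swapped: cand_2 shown as "A"

    # Map both verdicts back to the underlying candidates.
    winner_first = {"A": "cand_1", "B": "cand_2"}.get(first, "tie")
    winner_second = {"A": "cand_2", "B": "cand_1"}.get(second, "tie")

    # Consistent verdicts are kept; inconsistent ones are labeled a tie,
    # signaling the judge cannot confidently separate the candidates.
    return winner_first if winner_first == winner_second else "tie"
```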

Application

Application: Where to judge? We investigate the applications in which LLM-as-a-judge has been employed, including evaluation, alignment, retrieval and reasoning.

Evaluation

Traditional evaluation in NLP relies on predefined criteria, usually in the form of metrics, to assess the quality of machine-generated text. Prominent metrics such as BLEU, ROUGE, and BERTScore have been widely used in the field. However, metric-based assessment overemphasizes lexical overlap and similarity, which falls short when many valid responses exist and more subtle semantic properties must be considered. To address these limitations, LLMs can act as judges capable of human-like qualitative assessment, rather than simply measuring quantitatively how well machine-generated output matches the ground truth.

  • Open-ended Generation Tasks

    Open-ended generation refers to tasks in which the generated content should be safe, accurate, and relevant to the context, even though there is no single "right" answer. These tasks include dialogue response generation, summarization, story generation, and creative writing.

  • Reasoning Tasks

    The reasoning ability of LLMs can be assessed by their intermediate thought processes and final answers to specific reasoning tasks. Recently, LLM-as-a-judge has been used to assess the logical progression, depth, and coherence of a model's intermediate reasoning paths. For mathematical reasoning tasks, Xia et al. introduced an automated assessment framework using judge LLM, specifically designed to assess the quality of reasoning steps during problem solving.

  • Emerging Tasks

    With the rapid development of LLM capabilities, machines are increasingly being used for tasks previously considered uniquely human, especially in context-specific domains. One prominent task is social intelligence, in which models are presented with complex social scenarios that require an understanding of cultural values, ethical principles, and potential social implications. Another line of research is the evaluation of large multimodal models (LMMs) and large vision-language models (LVLMs). Recently, LVLMs have increasingly been customized as judges to assess emerging tasks such as code comprehension, legal literacy, game development, marine science, healthcare conversations, debate judging, retrieval-augmented generation, and more.

Alignment

For large language models, the language-modeling objective pre-trains the model parameters through word prediction and lacks consideration of human values or preferences. To avoid such unexpected behaviors, human alignment has been proposed to align the behavior of large language models with human expectations. However, unlike initial pre-training and adaptation tuning (e.g., instruction tuning), alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment may impair the general abilities of large language models to some extent, referred to as the alignment tax in the literature.
Alignment tuning is an important technique for aligning LLMs with human preferences and values. A key component of this process is the collection of high-quality pairwise feedback from humans, which is essential for reward modeling or direct preference learning.

  • Larger Models as Judges

    One intuitive idea for employing LLMs as judges in alignment tuning is to use feedback from larger, more powerful LLMs to guide smaller, less capable models. Bai et al. first proposed using AI feedback to build harmless AI assistants; they used synthetic preference data, based on a pre-trained language model's preferences, to train reward models.

  • Self-Judging

    Another line of work aims at self-improvement using preference signals from the same LLM. Yuan et al. first introduced the concept of self-rewarding LLMs, in which pairwise data is constructed by letting the LLM itself act as the judge.

Retrieval

The role of LLM-as-a-judge in retrieval includes both traditional document ranking and more dynamic, context-adaptive retrieval-augmented generation (RAG) methods. In traditional retrieval, LLMs improve ranking accuracy through advanced prompting techniques that allow them to rank documents by relevance with minimal labeled data. Complementing this, the RAG framework leverages the LLM's ability to generate content grounded in retrieved information to support complex or evolving applications where knowledge integration is critical.

  • Traditional Retrieval

    Recent studies have explored the role of LLMs as document-ranking judges in information retrieval, aiming to improve ranking accuracy and reduce reliance on large amounts of training data. For example, Sun et al. explored the potential of generative LLMs such as GPT-4 for relevance ranking in information retrieval. They proposed a permutation-based approach that instructs the LLM to output an ordered permutation of the passages by relevance, improving ranking accuracy (a prompt sketch follows this list).

  • Retrieval-Augmented Generation (RAG)

    Recent developments in retrieval-augmented generation (RAG) explore the ability of LLMs to self-assess and self-improve without annotated datasets or parameter tuning. Li and Qiu introduce the Memory-of-Thought (MoT) framework, a two-phase self-reflective model that autonomously enhances LLM reasoning. In the first phase, the model generates high-confidence reasoning on unlabeled datasets and stores it in memory. At test time, the model recalls these memories by judging the relevance of each to the question at hand and selects the most relevant as demonstrations.
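
As referenced under Traditional Retrieval, the sketch below shows a generic listwise relevance-ranking prompt in the spirit of the permutation idea. The prompt wording and the injected `call_llm` callable are illustrative assumptions, not Sun et al.'s exact template.

```python
# Minimal sketch of listwise relevance ranking with a judge LLM. The prompt
# wording and the `call_llm` placeholder are illustrative assumptions.
from typing import Callable, List

def rank_passages(call_llm: Callable[[str], str],
                  query: str, passages: List[str]) -> List[int]:
    numbered = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        f"Query: {query}\n\nPassages:\n{numbered}\n\n"
        "Rank the passages from most to least relevant to the query. "
        "Answer with the passage numbers only, e.g. '3 > 1 > 2'."
    )
    reply = call_llm(prompt)
    # Parse the returned permutation, ignoring anything that is not a valid index.
    order = [int(tok) - 1 for tok in reply.replace(">", " ").split() if tok.isdigit()]
    return [i for i in order if 0 <= i < len(passages)]
```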

Reasoning

  • Reasoning Path Selection

    Wei et al. introduced the concept of Chain-of-Thought (CoT) prompting to encourage models to generate step-by-step reasoning. While other, more complex cognitive structures have since been proposed to improve LLM reasoning, a key challenge is choosing a reliable reasoning path or trajectory for the LLM to follow. To address this issue, LLM-as-a-judge has been adopted in many works.

  • Reasoning with External Tools

    Yao et al. (2023b) first proposed using LLMs in an interleaved manner to generate reasoning traces and task-specific actions. Reasoning traces help the model judge and update action plans, while actions enable it to interact with external sources. Further, Auto-GPT (Yang et al., 2023) was introduced to deliver more accurate information by using LLM-as-a-judge for tool use. Equipped with a range of complex external tools, LLMs become more versatile and capable, improving planning performance by judging and reasoning about which tools to use. Sha et al. (2023) explore LLMs' potential in reasoning and judging, employing them as decision-making components for complex autonomous driving scenarios that require human commonsense understanding. Zhou et al. (2024b) utilize a self-discovery process in which LLMs perform judgment on the given query and select the most feasible reasoning structure for the subsequent inference stage.

Benchmark: Judging LLM-as-a-judge


(1) General Performance

Benchmarks focusing on overall performance are designed to assess the overall competence of the LLM in a variety of tasks. These benchmarks typically measure consistency, accuracy, and relevance to human judgment.

(2) Bias Quantification

Reducing bias in LLM judgments is critical to ensuring fairness and reliability. Typical benchmarks include EVALBIAS-BENCH (Wang et al., 2024c) and CALM (Ye et al., 2024a), with an explicit focus on quantifying bias, including bias that emerges from comparison and robustness in adversarial conditions. In addition, Shi et al. (2024a) assessed metrics such as positional bias and percent agreement in a question-and-answer task.
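
For intuition, here is a generic sketch of two quantities such benchmarks report: positional consistency (does the verdict survive swapping the candidate order?) and percent agreement with human labels. Exact definitions vary across benchmarks; this is an illustrative computation, not any specific benchmark's code.

```python
# Generic sketch of positional consistency and percent agreement.
from typing import List, Tuple

def positional_consistency(verdict_pairs: List[Tuple[str, str]]) -> float:
    """verdict_pairs[i] = (verdict with original order, verdict with swapped order),
    each verdict naming the winning candidate (e.g. 'cand_1', 'cand_2', 'tie')."""
    consistent = sum(1 for a, b in verdict_pairs if a == b)
    return consistent / len(verdict_pairs)

def percent_agreement(llm_verdicts: List[str], human_verdicts: List[str]) -> float:
    """Fraction of examples where the LLM verdict matches the human label."""
    matches = sum(1 for l, h in zip(llm_verdicts, human_verdicts) if l == h)
    return matches / len(human_verdicts)

if __name__ == "__main__":
    pairs = [("cand_1", "cand_1"), ("cand_2", "cand_1"), ("tie", "tie")]
    print(positional_consistency(pairs))                               # ~0.67
    print(percent_agreement(["cand_1", "tie"], ["cand_1", "cand_2"]))  # 0.5
```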

(3) Challenging Task Performance

Benchmarks designed for difficult tasks push the boundaries of LLM assessment. For example, Arena-Hard Auto (Li et al., 2024j) and JudgeBench (Tan et al., 2024a) select harder problems based on LLM performance in conversational QA and various reasoning tasks, respectively. CALM (Ye et al., 2024a) explores alignment and challenging scenarios, using metrics such as separability, consistency, and hacked accuracy, among others, to evaluate performance on manually identified hard datasets.

(4) Domain-Specific Performance

Domain-specific benchmarks provide task-centered assessments to evaluate the effectiveness of the LLM in specialized settings. Specifically, Raju et al. (2024) measure separability and consistency across tasks using metrics such as Brier scores, providing insight into specific domains such as coding, medicine, finance, law, and mathematics. CodeJudge-Eval (Zhao et al., 2024a) specifically evaluates LLMs' ability to judge code generation through execution-centered metrics such as accuracy and F1 score. This idea has also been adopted by several code summarization and generation evaluation efforts (Wu et al., 2024b; Yang et al., 2024; Tong and Zhang, 2024).

(5) Other Evaluation Dimensions

In addition to general performance and bias quantification, several benchmarks address additional assessment dimensions necessary to evaluate specific aspects of using an LLM as a judge.

Challenges & Future Works

Bias & Vulnerability

The use of LLM as judge essentially frames assessment as a generative task, presenting significant challenges related to bias and vulnerability. These biases typically stem from the model's training data, which often embeds social stereotypes associated with demographic identities such as race, gender, religion, culture, and ideology. When LLMs are deployed for different judgment tasks, such biases can significantly compromise fairness and reliability.

Bias

(1) Order Bias is a prominent issue, with the order of candidates affecting preferences. This bias can distort assessment results, especially in pairwise comparisons, when the quality gap between competing responses is small.

(2) Egocentric Bias arises when the LLM favors outputs generated by the same model, compromising objectivity. This problem is particularly evident when the same model is used to design the assessment metrics, resulting in over-scoring of its own output.

(3) Length Bias is another common challenge, skewing assessments by disproportionately favoring longer or shorter responses.

(4) Other biases further complicate LLM assessment. For example, Misinformation Oversight Bias reflects a tendency to ignore factual errors; Authority Bias prefers statements from sources of perceived authority; Beauty Bias prioritizes visually appealing content over substantive quality; Verbosity Bias shows a preference for longer explanations, often equating redundancy with quality, which can mislead the judging process; and Sentiment Bias distorts assessments based on emotional tone, favoring positively worded responses.

Vulnerability

(1) LLM judges are also highly vulnerable to adversarial manipulation. Techniques such as JudgeDeceiver highlight the risks posed by optimization-based prompt injection attacks, where carefully crafted adversarial sequences can manipulate LLM judgments to favor specific responses.

Existing Solution

To address these biases and vulnerabilities, frameworks such as CALM (Ye et al., 2024a) and BWRS (Gao et al., 2024b) provide systematic approaches to bias quantification and mitigation. Techniques such as multiple evidence calibration (MEC), balanced position calibration (BPC), and human-in-the-loop calibration (HITLC) have been shown to be effective in aligning model judgments with human assessments while reducing positional and other biases (Wang et al., 2023c). In addition, cognitive-bias benchmarks like COBBLER have identified six key biases, including salience bias and the bandwagon effect, that need to be systematically mitigated in LLM assessments (Koo et al., 2023b).
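
Below is a minimal sketch of how MEC and BPC combine, under the assumption of a `score_pair(first, second)` judge call that returns one sampled pair of scores for the two candidates as presented (a hypothetical helper, not Wang et al.'s implementation): sample k evaluations per candidate order and average across both orders.

```python
# Minimal sketch of multiple evidence calibration (MEC) plus balanced position
# calibration (BPC): average sampled scores over both candidate orders.
from typing import Callable, Tuple

def calibrated_scores(score_pair: Callable[[str, str], Tuple[float, float]],
                      cand_1: str, cand_2: str, k: int = 3) -> Tuple[float, float]:
    s1 = s2 = 0.0
    for _ in range(k):
        # Original order: cand_1 presented first.
        a, b = score_pair(cand_1, cand_2)
        s1 += a
        s2 += b
        # Swapped order: cand_2 presented first (balanced position calibration).
        a, b = score_pair(cand_2, cand_1)
        s1 += b
        s2 += a
    # Average over 2*k sampled evaluations (multiple evidence calibration).
    return s1 / (2 * k), s2 / (2 * k)
```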

Future Direction

(1) A promising direction is the integration of the Retrieval-Augmented Generation (RAG) framework into the LLM evaluation process (Chen et al., 2024d). By combining generative and retrieval capabilities, these frameworks can reduce biases, such as self-preference and factual errors, by grounding assessments in external, verifiable data sources.

(2) Another promising avenue is using bias-aware datasets, such as OFFSETBIAS, to systematically address the biases inherent in LLM-as-a-judge systems (Park et al., 2024). Integrating such datasets into the training pipeline allows LLMs to better distinguish surface quality from substantive correctness, enhancing fairness and reliability. Exploring fine-tuned LLMs as scalable judges, as exemplified by the JudgeLM framework, is another interesting direction.

(3) In addition, advancing zero-shot comparative assessment frameworks offers great promise (Liusie et al., 2023). These frameworks can refine pairwise comparison techniques and apply debiasing strategies to improve fairness and reliability across assessment domains without extensive prompt engineering or fine-tuning.

(4) Finally, approaches such as JudgeDeceiver-resistant calibration and adversarial phrase detection strategies need to be explored further to protect the LLM-as-a-judge framework from attacks.

Dynamic & Complex Judgment

Early work on LLM-as-judge typically used a static and straightforward approach that directly prompts the judge LLM for evaluation. More recently, more dynamic and complex judgment pipelines have been proposed to address various limitations and improve the robustness and effectiveness of LLM-as-judge.

Existing Solution

(1) One approach in this direction follows the concept of "LLM-as-an-examiner", in which the system dynamically and interactively generates questions and judgments based on the performance of candidate LLMs. Other work focuses on making judgments based on the outcome of battles and debates between two or more candidate LLMs. These dynamic judgment methods have largely improved the judge LLM's understanding of each candidate and can help prevent data contamination problems in LLM evaluation.

(2) In addition, building complex and sophisticated judgment pipelines or agents is another popular area of research. These approaches typically involve multi-agent collaboration, as well as elaborate planning and memory systems, enabling judge LLMs to handle more complex and diverse judgment scenarios.

Future Direction

(1) One promising direction is to equip LLMs with human-like judgment capabilities. Judge LLMs could be designed to draw on insights from human judgment behavior, such as anchoring and comparison, hindsight and reflection, and meta-judgment.

(2) Another interesting avenue is to use LLMs to develop an adaptive difficulty assessment system, which would adjust question difficulty based on the candidate model's current performance. Such an adaptive, dynamic system could address a significant limitation of LLM evaluation, as static benchmarks typically do not accurately assess LLMs of varying ability.

Self-Judging

LLM-based evaluators, such as GPT-4, are widely used to evaluate outputs but face significant challenges.

(1) In particular, Egocentric Bias arises because models prefer their own responses to those from external systems. This self-preference undermines fairness and creates a "chicken-or-egg" dilemma: a strong evaluator is essential for developing a strong LLM, but advancing the LLM depends on a fair evaluator.

(2) Other issues include Self-Enhancement Bias, where the model overestimates its own output, and Reward Hacking, where over-optimizing for specific signals leads to less general evaluations.

(3) In addition, dependence on Static Reward Models limits adaptability, while Positional and Verbosity Biases distort judgment by favoring response order or length over quality.

(4) The high cost and limited scalability of human annotation further complicate the creation of dynamic and reliable assessment systems.

Future Direction

(1) A promising direction for future research is the development of collaborative assessment frameworks such as Peer Rank and Discussion (PRD). These frameworks use multiple LLMs to collectively assess outputs, employing weighted pairwise judgments and multi-round dialogues to reduce self-enhancement bias and bring assessments closer to human standards.

(2) Another interesting avenue is the adoption of Self-Taught Evaluator frameworks, which generate synthetic preference pairs and reasoning traces to iteratively refine a model's evaluation capabilities. This approach removes the reliance on expensive manual annotations while ensuring that evaluation criteria adapt to evolving tasks and models.

(3) Integrating Self-Rewarding Language Models (SRLM) provides another promising path. By employing iterative mechanisms such as Direct Preference Optimization (DPO), these models continually improve their instruction-following and reward-modeling capabilities, mitigating problems such as reward hacking and overfitting.

(4) Building on SRLM, the Meta-Rewarding mechanism introduces a meta-judge role to assess and refine the quality of judgments. This iterative process addresses biases such as redundancy and positional bias and improves the model's ability to align with and evaluate complex tasks.

(5) Finally, utilizing synthetic data creation to generate contrast responses provides a scalable solution for training evaluators. By iteratively refining the assessment of synthetic preference pairs, models can incrementally improve their robustness and adaptability. Combining these methods with different benchmarks, multifaceted evaluation criteria, and human feedback can ensure that assessments are fair, reliable, and in line with human expectations in all domains.

Human-LLMs Co-judgement

As mentioned earlier, bias and vulnerability in LLM-as-a-judge can be addressed by human involvement in the judgment process for further intervention and proofreading. However, only a few studies have focused on this approach.

Existing Solution

(1) Wang et al. (2023c) introduced human-in-the-loop calibration, which uses balanced position diversity entropy to measure the difficulty of each example and requests human help when necessary.

(2) In the context of relevance judgments, Faggioli et al. (2023) proposed a human-machine collaboration spectrum that categorizes different relevance judgment strategies by how much humans rely on machines.

Future Direction

(1) Data selection (Xie et al., 2023; Albalak et al., 2024) has become an increasingly popular research area for improving the efficiency of LLM training and inference, and it also has the potential to augment LLM evaluation. LLM-as-a-judge can draw insights from data selection, with judge LLMs acting as critical sample selectors that choose a small subset of samples for human annotators to evaluate based on specific criteria (e.g., representativeness or difficulty).

(2) In addition, the development of human-LLM shared judgment can benefit from mature human-in-the-loop solutions in other domains, such as data annotation (Tan et al., 2024b) and active learning (Margatina et al., 2023).