basic concept
this ismanual assessment The first in a series of articles, Foundational Concepts, the full series includes.
- basic concept
- human annotator
- Tips & Hints
What is a manual assessment?
Human evaluation means having a human evaluate how well the model outputs a response.
This paper discusses all a posteriori evaluations, where the model has been trained and given a task for a human to evaluate.
Systematic assessment
There are 3 main ways to systematize manual assessment:
If you have on handNo readily available data sets, but still want to test some of the model's capabilities, a manual assessment can be used: provide a task description and scoring guide (for example:Attempt to interact with the model to force the model to output inappropriate language, i.e., contain offensive, discriminatory, violent, etc. If the model outputs inappropriate language, it scores 0 and vice versa.
), as well as test models that can be interacted with, can then be manually manipulated and scored by the annotator, along with a list of reasons for the scoring.
If you have on handData sets already available (e.g.Collected a set of prompts and ensured that they did not force the model to output inappropriate responses
), you can model the prompt inputs yourself to get the outputs, and then provide the input prompts, output responses, and scoring guides together to the annotator for evaluation (If the model unexpectedly outputs improperly, the score is 0, and vice versa for 1
)。
If you have on handBoth datasets and scoring results, which allows manual annotators to passerror message The methodology (This same approach can be used as an assessment system for the above scenario) to review the assessment. This step is important when testing a new assessment system, but the technical level of measurement belongs to the evaluation of the assessment system and is therefore slightly beyond the scope of this paper.
Notes:
- To evaluate a deployed production model, consider manual A/B testing and feedback.
- AI audits (Model External System Evaluation) is also a manual evaluation approach, but is beyond the scope of this paper.
Informal assessment
There are two other less formal approaches to human-based assessment methods:
Vibes check is a method of manual evaluation using non-public data to test on multiple scenario use cases (e.g., code programming, literature creation, etc.) to capture the overall effect. Evaluation results are often shared on Twitter and Reddit as anecdotal evidence, but they are susceptible to subjective cognitive bias (in other words, people tend to believe only what they believe). Nonetheless, these results can still be used asA good starting point for your own testing。
Arenas is a crowdsourced manual evaluation method for ranking the performance of multiple models.
A well-known example isLMSYS Chatbot Arena EvaluationUsers in the , community talk to multiple models to distinguish the best from the worst and vote on them. The total votes are aggregated into an Elo ranking (the ranking of this multi-model competition) to determine the "best model".
Advantages and disadvantages of manual assessment
Advantage:
- dexterity: Manual assessments are suitable for almost all tasks as long as the assessment is defined clearly enough!
- No data contamination: The manually written question prompt will not cross over with the training set (hopefully).
-
Relevance to human preferences: This one is obvious, after all, it's scored by manual criteria.
Note: When conducting a manual assessment, try to ensure that there is a diversity of annotators to ensure generalization of the results.
Disadvantage:
- first impression bias: Manual annotators tend to base their work onfirst impression to assess the quality of the answer, sometimes ignoring the examination of the facts.
- tone bias: Crowdsourcing annotators are particularly sensitive to tone of voice, and are prone to underestimate factual or logical errors in sentences that are more assertively formulated. For example, if a model says something wrong in a confident tone, it may be difficult for the annotator to detect it, which in turn leads to high ratings for models with more confident output. In contrast, expert annotators are less affected by tone bias.
- Self-preference bias: People sometimesFavoring answers that cater to your own point of view, rather than the factually correct answer.
- Identity background bias: People from different identity backgrounds have different values, which may lead to significant differences in performance when evaluating the model (e.g., in the model outputs ofImproper response assessment in which there is a bias in the understanding of what constitutes an improper expression).
Systematic manual assessment
Advantages of systematic manual assessment (especially paid manual):
- High-quality data: Test sets can be tailored to the evaluation task to further support you in developing (e.g. the need to develop preference models) and evaluating the model.
- data privacy: Paid annotators (especially insiders) are often very focused on data security, and instead the closed-source API models evaluated by LLM have less data privacy because you need to send personal data to external services.
- interpretability: Markers will clearly state the reasons for marking when marking.
Drawbacks:
- Higher costs: Of course you need to pay the annotator. Even to optimize the evaluation guide, you need multiple iterations, which will make it even more expensive.
- Poor scalability:: Unless your evaluation task relies heavily on user feedback, manual evaluation methods really don't scale very well, because you need to redeploy people (and pay for them) every time you do a new round of evaluation.
- Low repeatability: Unless you can guarantee that the same annotators are involved in each assessment and that the scoring criteria are completely clear, the results of the assessment may not be accurately reproduced by different annotators.
Informal manual assessment
Advantage:
- Lower cost: Community members participate voluntarily and pay less in fees.
- Discover Edge Use Cases: With fewer constraints, spontaneous creativity by members may uncover some interesting edge use cases.
- Highly scalable: Assessments scale better and have lower barriers to participation as long as enough community members volunteer to participate.
Disadvantage:
- highly subjective: Due to the community members' ownCultural limitationsIt is also difficult to maintain consistency in scoring, despite consistent standards. However, the "wisdom of the crowd" effect (see Galton's Wiki page) may smooth out this problem in a large number of rating polls.
- Rating preferences are not representative: The over-representation of young Western males in the Internet technology community may lead to a significant imbalance in assessed preferences, which is not consistent with the tastes of what is in fact the general public, and can therefore affect the accuracy of the assessment.
- Easily manipulated: If the crowdsourced annotators you hire aren't screened, it's easy for a third-party organization to manipulate them to cause anomalies in the model's scoring (e.g., high scores), especially if the model's writing style is unique.
Link to original article./huggingface/evaluation-guidebook/blob/main/contents/human-evaluation/
Translators: SuSung-boy, clefourrier, adeenayakup