
Demystifying Prompt Series 39. Optimizing RAG Reranking with LLMs


In previous posts on RAG we discussed the diversity, information density, and quality of recalled information, focusing mainly on the recall, fusion, and coarse-ranking stages. This chapter turns to the reranking (fine-ranking) stage. The main difference between coarse ranking and reranking is the trade-off between efficiency and effectiveness: the coarse-ranking model has lower complexity, and its job is to shrink the recall candidate set substantially while staying as consistent as possible with the reranker's ordering, so that high-quality content is not filtered out before reranking. The reranking model, by contrast, can afford higher complexity and use a more expressive model to fit the final ranking objective as closely as possible. In a RAG task the ultimate goal is that the candidate content can answer the question, and the objective measure of that is the citation rate during inference.

There are several common training objectives for reranking models: listwise for global ordering, pointwise for fitting a direct per-item target (such as CTR) independently for each item, and pairwise for comparative optimization. In RAG's ranking module, several papers have tried these schemes for the ranking objective and the labeling of samples. All of the schemes below can either use the large model directly as the reranker, or use the large model to build fine-tuning samples for training a small model.

PointWise

  • HELM: Holistic Evaluation of Language Models
  • UPR: Improving Passage Retrieval with Zero-Shot Question Generation

First, pointwise: for each piece of recalled content, an independent judgment is made of how well that content answers the query, i.e., the relevance between the query and the content.

The most intuitive solution is to feed the query and content into the large model together and have the model judge whether they are relevant, i.e., the few-shot instruction discrimination scheme used in HELM. With the instruction below, content is sorted by the probabilities the model assigns to the Yes and No tokens.

Given a passage and a query, predict whether the passage includes an answer to the query by producing either 'Yes' or 'No'.

{{few_shot}}

Passage: {{passage}}
Query: {{query}}
Does the passage answer the query?
Answer:
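
Below is a minimal sketch of this pointwise scheme using Hugging Face Transformers: each passage is scored by the probability the model assigns to the Yes token versus the No token after the relevance prompt. The model name and prompt wording here are placeholders rather than HELM's exact setup.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

PROMPT = (
    "Given a passage and a query, predict whether the passage includes "
    "an answer to the query by producing either 'Yes' or 'No'.\n\n"
    "Passage: {passage}\nQuery: {query}\n"
    "Does the passage answer the query?\nAnswer:"
)

@torch.no_grad()
def yes_no_score(query: str, passage: str) -> float:
    """Relevance score = P(Yes) - P(No) over the next token."""
    inputs = tokenizer(PROMPT.format(passage=passage, query=query), return_tensors="pt")
    probs = model(**inputs).logits[0, -1].softmax(dim=-1)    # next-token distribution
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (probs[yes_id] - probs[no_id]).item()

# ranked = sorted(passages, key=lambda p: yes_no_score(query, p), reverse=True)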

If the scheme above computes, for each content candidate, something like the joint probability P(query, content), then, considering that the query is fixed across all candidates, we can equally choose to compute the conditional probability P(content|query).

However, content is often noisy, containing both relevant and irrelevant information, so we flip the conditioning direction via Bayes' rule and approximate the ranking signal with P(query|content): the probability of the query given the content is used to measure the relevance between the two.

The UPR paper uses the large model directly: based on the prompt template below, it computes the average log-probability of decoding each token of the query as an approximation of P(query|content). Because the query tokens can be scored in parallel (teacher forcing rather than autoregressive generation), this scheme is not too slow even though it uses a large model.

Passage: {{passage}}. Please write a question based on this passage.
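
A minimal sketch of UPR-style scoring with a seq2seq model: the per-token cross-entropy of the query given the passage prompt (computed in one forward pass via teacher forcing) is negated to approximate P(query|content). The model name is a placeholder; UPR itself used T5-family models.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")   # placeholder model
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small").eval()

@torch.no_grad()
def upr_score(query: str, passage: str) -> float:
    """Average log-probability of the query tokens conditioned on the passage."""
    prompt = f"Passage: {passage}. Please write a question based on this passage."
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(query, return_tensors="pt").input_ids
    out = model(**enc, labels=labels)
    # out.loss is the mean cross-entropy over the query tokens; negate it so that
    # a higher score means the query is more likely given the passage.
    return -out.loss.item()

# ranked = sorted(passages, key=lambda p: upr_score(query, p), reverse=True)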

Doesn't this look familiar? It follows the same idea as LongLLMLingua, the long-text compression scheme mentioned in the earlier post "Revisiting RAG's Recall Information Density and Quality", except that LongLLMLingua uses the instruction "we can get the answer to this question in the given documents".

Listwise

  • RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models
  • RankGPT: Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
  • /sunnweiwei/RankGPT

RankGPT proposes a permutation-based LLM ranking scheme: multiple pieces of content are placed in the context, and an instruction asks the LLM to output the content identifiers in order of relevance. The prompt template is as follows:

This is RankGPT, an intelligent assistant that can rank passages based on their relevancy to the
query.
The following are {{num}} passages, each indicated by number identifier []. I can rank them based
on their relevance to query: {{query}}
[1] {{passage_1}}
[2] {{passage_2}}
(more passages) ...
The search query is: {{query}}
I will rank the {{num}} passages above based on their relevance to the search query. The passages
will be listed in descending order using identifiers, and the most relevant passages should be listed
first, and the output format should be [] > [], e.g., [1] > [2].
The ranking results of the {{num}} passages (only identifiers) is:

The "permutation" part accounts for the LLM's limited context length: the N candidates are divided into overlapping windows, and the model sorts only one window at a time, sliding from the tail of the list toward the head (a sketch follows the list of drawbacks below). Listwise ranking works well, but it also has a number of drawbacks:

  • Limited by the context length at inference time
  • The order of the input content affects the inference result
  • Generating the full ranking output is time-consuming
  • Poor robustness: repeated predictions can produce conflicting orderings
  • Higher demands on model capability
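
Here is a sketch of the sliding-window permutation strategy referenced above: rank overlapping windows from the tail of the candidate list toward the head so relevant items can move up despite the context limit. llm_rank_window is a hypothetical function that fills the RankGPT prompt for one window and parses the "[3] > [1] > ..." response into a permutation of within-window indices.

from typing import Callable, List

def sliding_window_rerank(
    query: str,
    passages: List[str],
    llm_rank_window: Callable[[str, List[str]], List[int]],  # hypothetical LLM call
    window: int = 20,
    step: int = 10,
) -> List[str]:
    passages = list(passages)
    end = len(passages)
    while end > 0:
        start = max(0, end - window)
        # rank one overlapping window and write the new order back in place
        order = llm_rank_window(query, passages[start:end])
        passages[start:end] = [passages[start + i] for i in order]
        if start == 0:
            break
        end -= step            # consecutive windows overlap by (window - step) passages
    return passages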

Pairwise

  • Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

Compared with pointwise, which relies on the model's output probabilities being well-calibrated, and listwise, which relies on the model having strong ranking capabilities, pairwise relaxes the requirements on the model quite a bit.

The paper uses the following prompt to have the model compare two pieces of content and output A or B. Token probabilities are still used here, but unlike pointwise, which uses each content's probability directly for ordering, pairwise compares the contents two at a time and swaps their order on each call, obtaining the token probabilities for both orderings, AB and BA.

Given a query {query}, which of the following two passages is more relevant to the query?
Passage A: {document1}
Passage B: {document2}
Output Passage A or Passage B:

As for how to use the results of these pairwise comparisons, the paper gives three sorting schemes: all pairs, heap sort, and bubble sort.

All pairs scores every piece of content from the pairwise comparison results: if the model gives a consistent judgment A>B in both the (A,B) and (B,A) orderings, A scores one point; if the two results contradict each other, A and B each score 0.5 points (a sketch follows). In effect the paper converts probability scoring into local ranking decisions, reducing the bias caused by the model's predicted probabilities not being well-calibrated. The disadvantage of all pairs is obvious: a full ranking costs O(N^2) requests.
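
A sketch of the all-pairs scoring described above: each pair is compared in both orders, a passage earns a full point only when the two orders agree it wins, and a contradiction splits the point. prp_compare is a hypothetical function that fills the prompt above with (query, Passage A, Passage B) and returns "A" or "B".

from itertools import combinations
from typing import Callable, List

def allpairs_scores(
    query: str,
    passages: List[str],
    prp_compare: Callable[[str, str, str], str],   # hypothetical LLM call -> "A" or "B"
) -> List[float]:
    scores = [0.0] * len(passages)
    for i, j in combinations(range(len(passages)), 2):
        first = prp_compare(query, passages[i], passages[j])    # passages[i] as Passage A
        second = prp_compare(query, passages[j], passages[i])   # order swapped
        if first == "A" and second == "B":          # both orders prefer passages[i]
            scores[i] += 1.0
        elif first == "B" and second == "A":        # both orders prefer passages[j]
            scores[j] += 1.0
        else:                                       # contradictory judgments: split the point
            scores[i] += 0.5
            scores[j] += 0.5
    return scores

# scores = allpairs_scores(query, passages, prp_compare)
# ranked = [passages[i] for i in sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)]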


Heap sort applies the classic sorting algorithm with the pairwise comparator, reducing the complexity to O(N log N). Bubble sort follows bubble-sort logic, and since reranking usually only needs to keep the top-k highest-ranked items, only k bubbling passes of pairwise compare-and-swap are required, giving O(N·k) complexity (see the sketch below).
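
A sketch of the O(N·k) bubble-sort variant under the same assumptions: k bubbling passes, each swapping adjacent passages whenever the pairwise comparison prefers the lower-ranked one. prefer_first is again hypothetical; in the paper's spirit it would aggregate both the AB and BA comparisons before deciding.

from typing import Callable, List

def bubble_topk(
    query: str,
    passages: List[str],
    prefer_first: Callable[[str, str, str], bool],   # True if the first passage wins
    k: int = 10,
) -> List[str]:
    passages = list(passages)
    n = len(passages)
    for top in range(min(k, n)):
        # one pass bubbles the most relevant remaining passage up to position `top`
        for i in range(n - 1, top, -1):
            if prefer_first(query, passages[i], passages[i - 1]):
                passages[i], passages[i - 1] = passages[i - 1], passages[i]
    return passages[:k]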

The paper compares its results against pointwise UPR and listwise RankGPT: apart from GPT-4, a 20B FLAN-UL2 can roughly match the effectiveness of GPT-3.5. Compared with listwise, the pairwise comparison scheme is less sensitive to the order of the input content and demands less of the model, so small models can perform comparably to large ones.


SetWise

  • A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models
  • /ielab/llm-rankers

The last paper introduces setwise, which is effectively a combination of the listwise and pairwise schemes above: it uses listwise-style scoring, borrows pairwise's idea of using heap sort and bubble sort to select the top-k documents, and also reuses pointwise's idea of reading the probability distribution output by the large model. Honestly it feels more like an engineering refinement; briefly:

The earlier listwise scoring asks the large model to sort all the content at once, which not only runs into context-length limits, sensitivity to input order, and slow inference, but is also hard for the model itself, so repeated runs often produce contradictory orderings. It is therefore worth splitting the content into many groups to shorten the input, and changing the output from a full ranking to just the identifier of the most relevant document within the group. This reduces inference latency, and at the same time the logits distribution over the output identifier token can be used to score all the documents within the group (a minimal sketch follows).
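
A minimal sketch of the within-group scoring idea, again with Transformers: ask only for the label of the most relevant passage in a small group, and read the next-token logits of the candidate label tokens, so one call yields a score for every passage in the group. The model name and prompt wording are placeholders, not the paper's exact setup.

import torch
from typing import List
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def setwise_group_scores(query: str, group: List[str]) -> List[float]:
    """Probability mass the model puts on each passage label being 'most relevant'."""
    labels = [chr(ord("A") + i) for i in range(len(group))]          # A, B, C, ...
    listing = "\n".join(f"Passage {l}: {p}" for l, p in zip(labels, group))
    prompt = (
        f"Given the query: {query}\n{listing}\n"
        "Which passage is the most relevant to the query? Answer with the label only.\n"
        "Answer: Passage"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]                           # next-token logits
    label_ids = [tokenizer(f" {l}", add_special_tokens=False).input_ids[0] for l in labels]
    return logits[label_ids].softmax(dim=-1).tolist()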

Then, based on the within-group scoring, bubble sort is applied in the same way. Whereas pairwise bubble sort needs a large-model call to compare the relevance of two documents at every compare-and-swap step, setwise can compare a group of 3-4 documents per call, improving efficiency further, as illustrated in the paper's figure (a sketch follows).
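
And a sketch of the setwise bubble sort: the same top-k bubbling as in the pairwise section, but each step asks the LLM about a small group of neighbouring passages and promotes the group winner, so one call replaces several pairwise calls. llm_pick_best is hypothetical; it could simply be the argmax of setwise_group_scores from the previous sketch.

from typing import Callable, List

def setwise_bubble_topk(
    query: str,
    passages: List[str],
    llm_pick_best: Callable[[str, List[str]], int],   # index of the winner within a group
    k: int = 10,
    group: int = 3,
) -> List[str]:
    passages = list(passages)
    n = len(passages)
    for top in range(min(k, n)):
        i = n - 1
        while i > top:
            lo = max(top, i - group + 1)
            best = lo + llm_pick_best(query, passages[lo:i + 1])
            # promote the group winner to the front of the group (a multi-way bubble step)
            passages.insert(lo, passages.pop(best))
            i = lo
    return passages[:k]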


For effectiveness, the paper uses NDCG@10 as the evaluation metric (the superscripts mark which of the previous methods setwise significantly improves on), and the paper's Figure 1 compares effectiveness and efficiency for Flan-T5 at different model sizes under the different ranking schemes. Pointwise has the lowest latency but also the weakest effectiveness, while pairwise is the most effective but also the slowest. Listwise-likelihood and setwise heap sort look like the more suitable schemes for balancing ranking quality against latency.


For a fuller compendium of papers on large models, fine-tuning and pre-training data and frameworks, and AIGC applications, head over to GitHub >> DecryPrompt

Easter egg

Recently I've been wanting to build my own blog site, so I tried the legendary "one sentence builds a fully functional web page" web-page simulation generator. I think the result turned out pretty well; what do you all think? ~
