In previous posts we discussed recall diversity, recall information quality and density, and calibration as post-processing for RAG. For the up-front judgment of whether to go to retrieval at all, we have so far only covered the self-contradiction and self-rejection schemes. This chapter adds several more pre-retrieval judgment schemes for RAG.
For each scheme we pick one paper and focus on the part related to the retrieval decision. The schemes include fine-tuning the model to make the decision, judging from the confidence of the model's answer, judging from the question's nearest neighbors (KNN), and letting a small proxy model answer first, among others. A classification of all the schemes is summarized at the end of the post.
Model fine-tuning
- SELF-RAG: LEARNING TO RETRIEVE, GENERATE, AND CRITIQUE THROUGH SELF-REFLECTION
- When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively
SELF-RAG is a fine-tuning-based scheme that dynamically decides whether the next text segment needs retrieval augmentation. The paper defines four RAG-related reflection tokens (special tokens): [Retrieve], [IsREL], [IsSUP] and [IsUse].
Here we focus on the [Retrieve] token, which decides whether to perform retrieval augmentation before generating the next sentence.
Since this is a fine-tuning scheme, the core question is how training samples containing [Retrieve] tokens and retrieved content are constructed; the samples have an interleaved format.
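To make the format concrete, a hypothetical interleaved sample could look roughly like this (using this post's notation for the reflection tokens; the surface forms in the paper differ slightly):

```
Input : Explain how a vaccine trains the immune system.
Output: [Retrieve=YES]<paragraph>...retrieved passage about vaccines...</paragraph>
        [IsREL=Relevant] A vaccine exposes the body to a harmless form of a pathogen. [IsSUP=Fully Supported]
        [Retrieve=NO] The immune system then remembers it and responds faster next time. [IsUse=5]
```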
To obtain these interleaved samples, one could label them directly with GPT-4, but the paper considers GPT-4 inference too expensive. Instead, it uses 4K GPT-4-labeled samples to fine-tune a Llama2-7B Critic model, and then uses this 7B Critic to label a larger set of 20K samples for training the Generator. Let's walk through how an interleaved sample is labeled, together with the [Retrieve] labeling prompts: starting from the original input/output pair, the output is processed as follows.
- The Critic first determines from the Input whether retrieval is needed; if it predicts [Retrieve=NO], only [IsUse] is judged on the Output (the GPT-4 labeling prompt for this step is given in the paper).
- If the Critic decides the Input needs retrieval, it outputs [Retrieve=YES], which is inserted at the beginning of the first sentence of the Output, and content is retrieved based on the Input and Output.
- After the retrieved content is obtained, the Output is split into sentences. For each sentence, the Input, the previous reasoning sentences, and the initially retrieved content are used together to decide whether the sentence needs additional retrieval; if it does, [Retrieve=YES] is inserted at the beginning of the sentence, otherwise [Retrieve=NO] (again using a GPT-4 labeling prompt from the paper).
- If [Retrieve=YES], additional content is retrieved using the Input and the previous reasoning sentences, and [IsSUP] and [IsREL] are predicted for each retrieved passage; the passage scoring best on [IsREL=Relevant] and [IsSUP=Fully Supported/Partially Supported] is kept and inserted as a paragraph after [Retrieve].
- Finally, [IsUse] is appended at the end to score the utility of the generated answer. A sketch of the full labeling loop follows.
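Putting these steps together, the labeling pipeline can be sketched roughly as follows (all helper callables here are hypothetical stand-ins for the 7B Critic, the vector retriever, and a sentence splitter, not the paper's code):

```python
def label_sample(input_text, output_text, critic_predict, retrieve, split_sentences):
    """Turn a plain (input, output) pair into an interleaved SELF-RAG training sample."""
    parts = []
    # Step 1: does the instruction need retrieval at all?
    if critic_predict("retrieve", input_text) == "NO":
        parts.append("[Retrieve=NO]" + output_text)
        parts.append(critic_predict("isuse", input_text, output_text))   # [IsUse=...]
        return "".join(parts)

    # Step 2: initial retrieval conditioned on the input and the full output.
    initial_passages = retrieve(input_text + " " + output_text)

    previous = ""
    for sentence in split_sentences(output_text):
        # Step 3: per-sentence retrieval decision, given input, previous reasoning
        # and the initially retrieved passages.
        decision = critic_predict("retrieve", input_text, previous, sentence, initial_passages)
        if decision == "YES":
            # Step 4: retrieve again with input + previous reasoning, keep the passage
            # scored best on [IsREL]/[IsSUP], and insert it after the [Retrieve] token.
            candidates = retrieve(input_text + " " + previous)
            best = max(candidates, key=lambda p: critic_predict("isrel_issup", sentence, p))
            parts.append(f"[Retrieve=YES]<paragraph>{best}</paragraph>{sentence}")
        else:
            parts.append(f"[Retrieve=NO]{sentence}")
        previous += " " + sentence
    # Step 5: utility score for the whole answer.
    parts.append(critic_predict("isuse", input_text, output_text))
    return "".join(parts)
```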
To train the Generator directly on the labeled samples above, the special tokens first need to be added to the model vocabulary, and the retrieved content must be masked out when computing the loss during training. At inference time the model can then decode these four special tokens directly and use them to decide whether to retrieve, whether to use the retrieved content, and so on.
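A minimal sketch of these two details, assuming HuggingFace Transformers and the reflection-token strings used in this post (the paper's exact token set and surface forms differ):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# 1. Add the reflection tokens (and paragraph delimiters) to the vocabulary.
reflection_tokens = ["[Retrieve=YES]", "[Retrieve=NO]",
                     "[IsREL=Relevant]", "[IsREL=Irrelevant]",
                     "[IsSUP=Fully Supported]", "[IsSUP=Partially Supported]",
                     "[IsSUP=No Support]", "[IsUse]",
                     "<paragraph>", "</paragraph>"]
tokenizer.add_special_tokens({"additional_special_tokens": reflection_tokens})
model.resize_token_embeddings(len(tokenizer))

# 2. Mask the retrieved <paragraph>...</paragraph> spans out of the loss.
def build_labels(input_ids: torch.Tensor) -> torch.Tensor:
    start_id = tokenizer.convert_tokens_to_ids("<paragraph>")
    end_id = tokenizer.convert_tokens_to_ids("</paragraph>")
    labels = input_ids.clone()
    inside = False
    for i, tok in enumerate(input_ids.tolist()):
        if tok == start_id:
            inside = True
        if inside:
            labels[i] = -100   # -100 is ignored by the cross-entropy loss
        if tok == end_id:
            inside = False
    return labels
```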
Model Response Confidence
- FLARE: Active Retrieval Augmented Generation
FLARE stands for Forward-Looking Active REtrieval augmentation. After each sentence the model generates, it decides whether the next sentence needs RAG; if so, it generates a retrieval query, retrieves content, and then continues reasoning based on what it has already generated, the user's question, and the newly retrieved content.
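Schematically, the loop looks roughly like this; a minimal sketch with hypothetical helper callables, not the paper's implementation:

```python
def flare_answer(question, generate_sentence, is_confident, build_query, retrieve):
    """FLARE-style loop: tentatively generate the next sentence, and only
    retrieve (then regenerate) when the tentative sentence is low-confidence."""
    answer, passages = "", []
    while True:
        sentence, finished = generate_sentence(question, answer, passages)
        if not is_confident(sentence):
            passages = retrieve(build_query(question, answer, sentence))
            sentence, finished = generate_sentence(question, answer, passages)
        answer += sentence
        if finished:
            return answer
```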
Rather than going into the details of this sentence-granularity dynamic-RAG framework, the focus here is on how each step decides whether to use RAG. The paper tries two schemes:
- Instruct: similar to Toolformer, a prompt instruction (with few-shot examples) makes the model emit a Search(query) call after each sentence of reasoning. The paper finds this prompt-based RAG decision scheme does not work well.
- Confidence: the model first generates each sentence directly, and the confidence of that generation decides whether retrieval is needed; if it is, the sentence is regenerated based on the retrieved content. Confidence here means checking whether the generation probability of every token in the sentence exceeds a threshold: if they all do, the tentative sentence is kept, otherwise the process falls back to retrieval-augmented generation, as sketched below.
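A minimal sketch of this token-probability check (the threshold value here is an arbitrary assumption; the paper treats it as a hyperparameter):

```python
import torch

def sentence_is_confident(logits: torch.Tensor, token_ids: torch.Tensor,
                          theta: float = 0.4) -> bool:
    """Return True if every generated token's probability exceeds the threshold,
    i.e. the tentative sentence can be kept without retrieval."""
    probs = torch.softmax(logits, dim=-1)                          # [seq_len, vocab_size]
    token_probs = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return bool((token_probs > theta).all())
```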
If the confidence-based check decides the model is unsure about the answer and a search is needed, the paper offers two schemes for generating the search query:
- mask sentence: delete from the tentatively generated sentence the tokens whose probability falls below the threshold, and use the remaining tokens as the retrieval query. Masking the low-probability tokens is meant to keep wrong guesses from hurting retrieval quality. For example, if a user asks which countries have experienced a serious economic crisis and the model answers "France has experienced an economic crisis" with "France" below the threshold, then keeping "France" would shift the retrieval focus from the economic crisis to France and return unhelpful content. However, for sentences where many tokens have low probability, masking may leave no usable tokens to retrieve with.
- generate question: besides masking, the paper also offers a query-generation scheme: each low-probability span in the sentence is rewritten into a retrieval question via a large-model instruction (the prompt is given in the paper). A sketch of both options follows.
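Both query-construction options can be sketched on top of the per-token probabilities (again with an assumed threshold):

```python
def masked_query(tokens, probs, theta=0.4):
    """'mask sentence': drop low-probability tokens and use the rest as the query."""
    kept = [t for t, p in zip(tokens, probs) if p >= theta]
    return " ".join(kept) if kept else None   # None -> fall back to question generation

def low_confidence_spans(tokens, probs, theta=0.4):
    """'generate question': collect contiguous low-probability spans that the LLM
    is then prompted to rewrite into retrieval questions."""
    spans, current = [], []
    for t, p in zip(tokens, probs):
        if p < theta:
            current.append(t)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans
```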
The paper compares the dynamic RAG scheme based on next-sentence confidence (FLARE) with retrieval based on a fixed-length window of previous tokens and retrieval based on the previous sentence; FLARE does best, mainly for two reasons:
- A previous sentence reflects the model's intent less well than the model's own tentative next sentence.
- Dynamic RAG uses internal and external knowledge more efficiently than retrieving at fixed intervals.
Small Proxy Model Answer
- Small Models, Big Insights: Leveraging Slim Proxy Models to Decide When and What to Retrieve for LLMs
In this paper a small model, here Llama2-7B, first answers the user question (a heuristic answer). A Judgement Model then makes a combined judgment on the question and this answer, and outputs a label for whether retrieval is needed. If retrieval is required, the standard RAG pipeline is run and the large model (Llama2-70B) produces the final answer.
The input to the Judgement model is the question together with the proxy model's heuristic answer (the exact input format is shown in the paper).
The core is therefore how the Judgement model's training data is constructed. The paper builds samples on top of existing QA datasets (the dataset composition is listed in the paper).
The paper keeps only samples with short ground-truth answers and labels each sample by the match rate between the ground-truth answer and the small model's response: a high match rate (i.e. a good proxy answer) is labeled positive, otherwise negative. That sample set is then used to fine-tune the Judgement model, which here is also Llama2-7B. A labeling sketch follows.
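A sketch of this labeling step, assuming a simple token-overlap match rate and an arbitrary cutoff (the paper's exact matching metric and threshold may differ):

```python
def judgement_label(proxy_answer: str, gold_answer: str, threshold: float = 0.5) -> str:
    """Label a (question, heuristic answer) pair for Judgement-model training:
    high overlap with the short gold answer -> the proxy already knows -> no retrieval."""
    gold_tokens = gold_answer.lower().split()
    answer = proxy_answer.lower()
    match_rate = sum(tok in answer for tok in gold_tokens) / max(len(gold_tokens), 1)
    return "no_retrieval" if match_rate >= threshold else "retrieval"
```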
The paper does not analyze the Judgement Model further, e.g. which kinds of responses are judged as known by the model and which as unknown. Personally I am a little unsure what features a Judgement Model can learn from the model response alone. Still, the idea of using a smaller model as a proxy for pre-inference is worth borrowing; although the larger and smaller models may have different knowledge spaces, my subjective feeling is that the smaller model's knowledge is likely to be roughly a subset of the larger model's, so this should not be a big problem.
Question Nearest-Neighbor Judgment
- SKR-KNN: Self-Knowledge Guided Retrieval Augmentation for Large Language Models
The paper tries several options for deciding whether the model knows the answer to a question, including directly asking the model "Do you know?", asking the same thing with few-shot examples, and training a small binary classifier; in the end the most reliable turns out to be the KNN scheme that discriminates based on the question's nearest neighbors. I saved it for last because it is personally my favorite of these schemes. The implementation consists of two steps.
The first step builds the KNN sample set. The paper uses the TabularQA and CommonsenseQA datasets; each question is answered twice, once by the model on its own, and once with vector retrieval over Wikipedia where the top-k passages are kept as context. Comparing the Exact Match of the two answers against the ground truth then determines whether, for this question, the model actually knows the answer or not. This step is called collecting SELF-KNOWLEDGE; a sketch follows.
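A sketch of this collection step; the answering and Exact Match helpers are placeholders, and the tie-breaking rule is an assumption:

```python
def collect_self_knowledge(qa_pairs, answer_closed_book, answer_with_rag, exact_match):
    """For each (question, gold) pair, label whether the model already 'knows' the
    answer by comparing its closed-book answer with its retrieval-augmented answer."""
    bank = []
    for question, gold in qa_pairs:
        em_direct = exact_match(answer_closed_book(question), gold)
        em_rag = exact_match(answer_with_rag(question), gold)
        # Ties counted as 'known' here; the paper's rule may differ.
        label = "known" if em_direct >= em_rag else "unknown"
        bank.append((question, label))
    return bank
```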
The QA setting here is a simplification of real scenarios: real-world questions are mostly open-ended, with no single correct answer. In that case, collecting the training set means deciding whether the model answers better from its internalized knowledge or with RAG augmentation; I think this could be done with the help of a reward model, or with JudgeLM-style scoring of the two answers.
The second step discriminates new questions against this sample set. The paper simply encodes the new question and the questions in the sample set with a sentence embedding such as SimCSE, retrieves the K most similar questions from the sample set, and then decides whether the new question should go to RAG retrieval based on those K questions' labels (known vs. unknown).
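A minimal sketch of this inference step, assuming precomputed question embeddings (e.g. SimCSE vectors) and a simple majority vote over the k neighbors (k and the voting rule are assumptions):

```python
import numpy as np

def needs_retrieval(query_emb: np.ndarray, bank_embs: np.ndarray,
                    bank_labels: list, k: int = 8) -> bool:
    """Go to RAG if most of the k most similar labeled questions are 'unknown'."""
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top_k = np.argsort(-sims)[:k]
    unknown_votes = sum(bank_labels[i] == "unknown" for i in top_k)
    return unknown_votes > k / 2
```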
The paper only shows that KNN outperforms a BERT classifier, large-model prompting, and so on. Beyond the reported results, the reason I personally like this scheme is that the KNN sample set scales in real time: it can be continuously and incrementally updated from the outcomes of answering online questions, adding to the pool of positive and negative samples. A problem with KNN, however, is that the relevance of some questions cannot be captured by general semantic similarity; for example, question complexity is largely independent of general semantics, which we will come back to in the next chapter.
As an aside, the paper also tried letting the large model itself answer whether it knows the question, using several kinds of prompts. Personally I think the decision of whether to go to RAG should not rest on a single strategy but on a combination of several, and the prompt-based scheme is one of them. It may not be used directly as in the paper, since that makes inference too expensive, but it can be combined with the user's question: if the model can answer directly, answer; if it cannot, or if the other strategies above judge the question as unknown, then go to RAG.
Summary
Finally, let's briefly categorize the pre-retrieval RAG decision schemes covered in these two chapters:
- input-based
- Unsupervised classification: KNN nearest-neighbor judgment based on historical questions
- Supervised classification: fine-tuning the model to determine when retrieval is needed
- Verbose: instruct the model to answer by itself whether the question needs retrieval
- RLHF: align the model to judge by itself and refuse when it does not know
- Based on inputs and outputs (outputs can be complete answers or next sentence reasoning)
- Verbose: let the model answer first, then instruct it to decide whether retrieval is needed based on the question and answer together
- Contradictory: judge that the model may not know based on contradictions within a single model's responses or across multiple models' responses
- Confidence: confidence or entropy of the model's response
- Further detail optimizations
- Decompose: split the original question into sub-perspectives and judge each separately; it can also be split by sentence for dynamic retrieval
- Proxy: for judgments based on inputs and outputs, a small model can be used as a proxy to speed up inference
For a fuller collection of papers on large models, fine-tuning, pre-training data and frameworks, and AIGC applications, head over to Github >> DecryptPrompt