This article is a condensed digest of a published paper, intended for scholarly communication. For any infringement concerns, please contact the account owner so the content can be removed.
Paper: Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models
- Paper address: /abs/2410.01782
- Code:
Innovations
\({\tt Open-RAG}\) has the following features:

- Transforms an arbitrary dense LLM into a parameter-efficient sparse mixture-of-experts (MoE) model that can handle complex reasoning tasks, including single- and multi-hop queries.
- Specifically trains the model to cope with challenging distractors that appear relevant but are actually misleading, while expanding capacity only in the MoE adapters, keeping the model scale unchanged.
- Combines constructive learning, architectural transformation, and reflection-based generation to exploit latent learning, dynamically selecting the relevant experts and effectively integrating external knowledge for response generation that is more accurate and better supported by context, together with an estimate of its usefulness.
- Adopts a hybrid adaptive retrieval method to decide whether retrieval is necessary and to balance the performance gain against inference speed.
Content overview
Retrieval-augmented generation (RAG) can improve the factual accuracy of large language models (LLMs), but existing methods often show limited reasoning ability when using the retrieved evidence, especially when open-source LLMs are employed.
The paper proposes \({\tt Open-RAG}\), which aims to enhance the reasoning ability of RAG built on open-source LLMs. To control the behavior of the open-source LLM and generate responses that are better supported by context, \({\tt Open-RAG}\) adopts the reflection-based generation approach of Self-RAG, augmenting the output vocabulary (the blue portion in the figure above) with four special types of reflection tokens: Retrieval, Relevance, Grounding, and Utility.
\({\tt Open-RAG}\) defines the LLM as a model \(\mathcal{M}_{G}\) that, given an input query \(q\), generates an output sequence of \(m\) tokens \(o = [o_1, o_2, \ldots, o_m]\). The process is as follows:
- During training, the model learns to generate a retrieval token ([RT]/[NoRT]) indicating whether retrieval is required to answer \(q\). During inference, a hybrid adaptive retrieval scheme combines the retrieval token with model confidence to decide whether to retrieve.
- If retrieval is not required, \(\mathcal{M}_{G}\) generates the response using only the LLM's parametric knowledge (i.e., \(o\) is returned as \(y_{pred}\)).
- If retrieval is required, a user-specified frozen retriever \(R\) retrieves the top-\(k\) documents \(S = \{s_t\}_{t=1}^{k}\) from an external knowledge source \(D = \{d_i\}_{i=1}^{N_d}\), in either single- or multi-hop fashion. Each \(s_t\) consists of passages \(\{r_j\}_{j=1}^{N_H}\) with \(r_j \in D\), where \(N_H\) denotes the number of hops.
- For each retrieved content \(s_t\), \(\mathcal{M}_{G}\) generates a relevance token, an output response \(y_t\), a grounding token, and a utility token.
  - The relevance token ([Relevant]/[Irrelevant]) indicates whether \(s_t\) is relevant to \(q\).
  - The grounding token ([Fully Supported]/[Partially Supported]/[No Support]) indicates whether \(y_t\) is supported by \(s_t\).
  - The utility token ([U:1]-[U:5]) quantifies how useful \(y_t\) is for \(q\).
- Each \(s_t\) is processed in parallel, and the final answer \(y_{pred}\) is produced by ranking all the \(y_t\); the ranking is based on a weighted sum of the normalized confidences of the relevance, grounding, and utility tokens of the corresponding predictions (see the sketch after this list).
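Below is a minimal sketch of how this confidence-weighted ranking could be implemented. The token names follow the reflection vocabulary above, but the scalarization of grounding levels and utility ratings and the weights `w_rel`, `w_ground`, `w_util` are illustrative assumptions, not values from the paper.

```python
# Hypothetical scalarization of the grounding levels; the paper's exact
# weighting scheme may differ.
GROUNDING_SCORES = {"[Fully Supported]": 1.0,
                    "[Partially Supported]": 0.5,
                    "[No Support]": 0.0}
UTILITY_TOKENS = ["[U:1]", "[U:2]", "[U:3]", "[U:4]", "[U:5]"]

def answer_score(token_probs, w_rel=1.0, w_ground=1.0, w_util=1.0):
    """Weighted sum of the normalized reflection-token confidences for one
    candidate (s_t, y_t). token_probs maps each reflection token to the
    model's predicted probability; the weights are illustrative."""
    # Relevance: probability mass on [Relevant], normalized against [Irrelevant].
    p_rel = token_probs["[Relevant]"] / (
        token_probs["[Relevant]"] + token_probs["[Irrelevant]"])

    # Grounding: expected support level under the normalized distribution.
    z = sum(token_probs[t] for t in GROUNDING_SCORES)
    p_ground = sum(GROUNDING_SCORES[t] * token_probs[t] / z
                   for t in GROUNDING_SCORES)

    # Utility: expected rating, rescaled from [1, 5] to [0, 1].
    z = sum(token_probs[t] for t in UTILITY_TOKENS)
    rating = sum((i + 1) * token_probs[t] / z
                 for i, t in enumerate(UTILITY_TOKENS))
    p_util = (rating - 1.0) / 4.0

    return w_rel * p_rel + w_ground * p_ground + w_util * p_util

def rank_candidates(candidates):
    """candidates: list of (y_t, token_probs) pairs, one per retrieved s_t;
    returns the highest-scoring response as y_pred."""
    return max(candidates, key=lambda c: answer_score(c[1]))[0]
```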
Open-RAG
Data collection
To enable \({\tt Open-RAG}\) to handle queries that require no retrieval as well as single- and multi-hop queries that do, the training data is constructed from various types of tasks and datasets. Given an input-output pair (\(q\), \(y\)) from an original dataset, the data is augmented into supervision data by adding reflection tokens generated from ground-truth annotations or by a critic LLM \(C\).
If the retrieval token added by \(C\) is [RT], the data is further augmented to create three different new tokens, as follows.
- Use \(R\) to retrieve the top-\(k\) documents \(S\). For each retrieved document \(s_t\), \(C\) evaluates whether \(s_t\) is relevant and returns a relevance token. To handle both single- and multi-hop queries, the data pipeline uses a hop-unification heuristic: if at least one passage \(r_j \in s_t\) is relevant, the relevance token is set to [Relevant]; otherwise [Irrelevant] is used.
- When the prediction is [Relevant], to give \(\mathcal{M}_{G}\) a finer-grained ability to distinguish useful from distracting contexts within \(s_t\), a contrastive data heuristic is devised: (i) for single-hop RAG datasets, \(C\) labels the grounding token directly; (ii) for multi-hop RAG datasets, if every passage \(r_j \in s_t\) is individually predicted to be [Relevant], then [Fully Supported] is added as the grounding token; otherwise [Partially Supported] is used. (Both heuristics are sketched after this list.)
- Regardless of the relevance-token prediction, \(C\) is used to provide a utility score of \(y\) with respect to \(q\).
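The two heuristics can be summarized in a short sketch. Here `critic` stands for the critic LLM \(C\); its `is_relevant` and `grounding_label` methods are hypothetical placeholders for however \(C\) is actually prompted, not a real API.

```python
def label_relevance(passages, query, critic):
    """Hop-unification heuristic: a (possibly multi-hop) retrieval s_t is
    [Relevant] if at least one of its passages r_j is judged relevant."""
    judgments = [critic.is_relevant(r, query) for r in passages]  # hypothetical API
    token = "[Relevant]" if any(judgments) else "[Irrelevant]"
    return token, judgments

def label_grounding(passages, judgments, query, answer, critic, multi_hop):
    """Contrastive heuristic for the grounding token (only used when the
    relevance token is [Relevant])."""
    if not multi_hop:
        # Single-hop: defer directly to the critic's grounding judgment.
        return critic.grounding_label(passages, query, answer)  # hypothetical API
    # Multi-hop: fully supported only if every passage is individually
    # relevant; otherwise the context is only partially supporting, which
    # provides the distractor signal the model is trained to recognize.
    return "[Fully Supported]" if all(judgments) else "[Partially Supported]"
```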
Parameter-efficient MoE fine-tuning
RAG tasks are inherently complex, involving components such as queries with single (single-hop) or multiple (multi-hop) passages. Selectively using different parts of the model according to this complexity enables more adaptive and nuanced reasoning over diverse input contexts.
Therefore, the paper applies sparse upcycling to convert \(\mathcal{M}_{G}\) into an MoE architecture that dynamically learns to activate, on demand, the expert best suited to each query of varying complexity (e.g., single/multi-hop). This selective activation is learned (fine-tuned) from the tailored training data described above, ensuring that the model can distinguish useful information from misleading information.
Sparse MoE
The \({\tt Open-RAG}\) model augments the FFN layers of the dense backbone LLM with parameter-efficient MoE transformation blocks. Each transformation block consists of a set of expert layers \(\mathbf{E} = \{\mathcal{E}_e\}_{e=1}^{N_E}\) and an efficient routing mechanism.

Each expert layer contains a copy of the weights of the original, shared FFN layer, adapted by an adapter module \(\mathcal{A}_{e}\) with parameters \(\theta_e\). To ensure parameter efficiency, the FFN layer inside each expert is kept frozen and only the adapter module \(\mathcal{A}_{e}\) is trained. In this way, only a single FFN replica needs to be stored, and the model size stays unchanged apart from the added parameters of the adapters and the routing module. The remaining layers, such as Norm and Attention, are copied from the dense model. A sketch of one such expert is given below.
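The following PyTorch-style sketch shows one expert, assuming a bottleneck (down/up projection) adapter applied in parallel with the frozen, shared FFN; the activation choice and the parallel placement are assumptions, not details confirmed by this digest.

```python
import torch.nn as nn

class AdapterExpert(nn.Module):
    """One expert E_e: the shared FFN replica (frozen) plus a small trainable
    adapter A_e with parameters theta_e = {W_e^down, W_e^up}."""

    def __init__(self, shared_ffn: nn.Module, d_model: int, d_bottleneck: int):
        super().__init__()
        self.ffn = shared_ffn  # single stored replica, shared by all experts
        for p in self.ffn.parameters():
            p.requires_grad = False  # only the adapter is trained
        self.down = nn.Linear(d_model, d_bottleneck, bias=False)  # W_e^down
        self.up = nn.Linear(d_bottleneck, d_model, bias=False)    # W_e^up
        self.act = nn.SiLU()  # assumed nonlinearity

    def forward(self, x):
        # Frozen FFN output plus the expert-specific adapter correction.
        return self.ffn(x) + self.up(self.act(self.down(x)))
```

Because every `AdapterExpert` holds a reference to the same `shared_ffn` module, instantiating \(N_E\) experts adds only the adapter and router parameters, which is what keeps the upcycled model close to the dense model's size.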
For a given input \(x\), the routing module \(\mathcal{R}\) activates the \(\texttt{Top-}k\) of the \(N_E\) experts based on the normalized output \(x_{in}\) of the attention layer. Letting \(W_{|\cdot|}\) denote the weights of the corresponding modules, the routing module is defined as follows:
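(Reconstruction: the original formula is not reproduced in this digest; the following assumes the standard softmax top-\(k\) router used in sparse upcycling, with \(W_R\) denoting the routing weights.)

\[
\mathcal{R}(x_{in}) = \texttt{Top-}k\big(\operatorname{softmax}(W_{R}\, x_{in})\big)
\]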
The efficiency of \({\tt Open-RAG}\) stems from the setting \(|\theta_e| = |W_{e}^{down}| + |W_{e}^{up}| \ll |\phi_o|\): the dense LLM's parameters \(\phi_o\) remain unchanged during fine-tuning.
Finally, the output \(y\) of the parameter-efficient expert module is expressed as:
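(Reconstruction: this formula is likewise not reproduced; the following assumes a bottleneck adapter \(\mathcal{A}_e\) in parallel with the frozen FFN, as in the sketch above, with \(\sigma\) a nonlinearity and \(\mathcal{R}(x_{in})_e\) the router weight of expert \(e\).)

\[
\mathcal{A}_{e}(x_{in}) = W_{e}^{up}\,\sigma\big(W_{e}^{down}\, x_{in}\big),
\qquad
y = \sum_{e \in \texttt{Top-}k} \mathcal{R}(x_{in})_{e}\,\big(\operatorname{FFN}(x_{in}) + \mathcal{A}_{e}(x_{in})\big)
\]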
Hybrid adaptive retrieval
Since different LLMs carry different parametric knowledge, the paper proposes a hybrid adaptive retrieval method that offers two confidence scores thresholded against a tunable value, retrieving on demand and balancing performance against speed.
During training, \(\mathcal{M}_{G}\) learns to generate the retrieval reflection tokens ([RT] and [NoRT]). At inference time, [NoRT] is appended to the input, i.e., \(\hat{q} = q \oplus \texttt{[NoRT]}\), and the confidence of the output sequence \(o\) is measured under this enforced no-retrieval setting. Two different confidence scores \(f_{|\cdot|}\) are devised: (i) \(f_{minp}\), the minimum of the individual token probabilities, and (ii) \(f_{meanp}\), the geometric mean of the individual token probabilities of the generated sequence.
Retrieval frequency is controlled by an adjustable threshold \(\gamma\): retrieval is performed when \(f_{|\cdot|} < \gamma\).
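A minimal sketch of this decision rule, assuming access to the per-token log-probabilities of the sequence generated under the forced [NoRT] setting; the default \(\gamma\) is only a placeholder, not a value from the paper.

```python
import math

def confidence_scores(token_logprobs):
    """token_logprobs: log p(o_i) for each token of the output sequence o
    generated from q_hat = q ⊕ [NoRT] (forced no-retrieval)."""
    m = len(token_logprobs)
    f_minp = math.exp(min(token_logprobs))       # minimum individual probability
    f_meanp = math.exp(sum(token_logprobs) / m)  # geometric mean of probabilities
    return f_minp, f_meanp

def should_retrieve(token_logprobs, gamma=0.5, score="minp"):
    """Retrieve on demand: trigger retrieval when the chosen confidence falls
    below the adjustable threshold gamma (tuned to trade accuracy against
    inference speed; 0.5 here is illustrative)."""
    f_minp, f_meanp = confidence_scores(token_logprobs)
    f = f_minp if score == "minp" else f_meanp
    return f < gamma
```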
Main experiments
If this article was helpful, please give it a like or a "Looking" tap.
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].