This study addresses domain-class incremental learning (DCIL), a realistic but challenging continual learning scenario in which both the domain distribution and the target categories vary across tasks. To cope with such diverse tasks, pre-trained vision-language models (VLMs) are introduced for their strong generalization ability. However, this raises a new issue: when adapting to new tasks, the knowledge encoded in the pre-trained VLMs may be disturbed, compromising their inherent zero-shot capability. Existing methods tackle this by performing knowledge distillation on the VLMs with an additional reference dataset, which incurs a heavy computational overhead. To solve the problem efficiently, the paper proposes the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, which preserves the pre-trained knowledge of VLMs from the perspective of avoiding information interference. Specifically, a fully residual mechanism is designed to inject newly learned knowledge into a frozen backbone while imposing minimal adverse impact on the pre-trained knowledge. Furthermore, this residual property enables a distribution-aware integration calibration scheme that explicitly controls the knowledge implantation process for test data from unseen distributions. Experiments show that DIKI surpasses current state-of-the-art methods using only 0.86% of the trained parameters and drastically less training time.
Paper: Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

- Paper address: /abs/2407.05342
- Paper code: /lloongx/DIKI
Introduction
Supervised learning trains a network with full access to all data at once, which makes it inflexible to extend to knowledge of new tasks. Continual learning (CL) has emerged as a solution, allowing models to be trained continuously on sequentially arriving data while retaining the information already learned. Traditional CL settings generally consider only newly introduced categories or only changes in domain distribution, referred to as class-incremental learning and domain-incremental learning respectively. However, existing work that considers only one type of increment limits its applicability in complex real-world scenarios.
The paper considers a more challenging domain-class incremental learning (DCIL) setting, in which both the domain data distribution and the categories to be classified may keep changing across tasks, as shown in Fig. 1(a). In this case, techniques based on conventional image encoders are not feasible because of their non-scalable classification head design. Recently, contrastively trained vision-language models (VLMs) such as CLIP have made it possible to address this demanding but practical problem: trained on large-scale image-text pairs, VLMs possess strong zero-shot generalization and can recognize an almost unlimited number of categories, suiting such severely task-varying scenarios.
However, using vision-language models introduces a new challenge for incremental training. Traditional continual learning schemes aim to prevent models from forgetting previously learned knowledge, which is referred to as backward forgetting (forgetting fine-tuned knowledge). Existing research has explored regularization mechanisms, rehearsal buffers, and architectural designs to mitigate backward forgetting, with encouraging results. However, when these approaches are applied to vision-language models, a different form of catastrophic forgetting emerges: the models tend to forget what they learned during the pre-training phase, which hampers their strong zero-shot generalization ability. This problem is called forward forgetting (forgetting pre-trained knowledge), because it occurs when VLMs make "forward" predictions on data from unseen distributions. Fig. 1(a) illustrates both types of forgetting.
The recent work ZSCL attempts to address the forward forgetting problem of CLIP by introducing a large-scale reference dataset for knowledge distillation, combined with a weight integration scheme. However, this approach demands heavy computation and external data, which may be infeasible in practical scenarios. Meanwhile, existing VLM-based parameter-efficient continual learning methods mainly rely on a prompt tuning mechanism, which fails to retain pre-trained knowledge and leads to a drop in zero-shot capability, as shown in Fig. 1(b). The paper attributes this problem to information interference: newly introduced task-specific parameters interfere with the pre-trained knowledge. A schematic of these methods is shown in Fig. 1(c).
To mitigate the forward forgetting problem of VLMs in a computation- and parameter-efficient manner, the paper introduces the Distribution-aware Interference-free Knowledge Integration (DIKI) framework. Specifically, task-specific information is injected into a frozen VLM to efficiently store the learned knowledge of each task.
The contributions of the paper are summarized in three points:
- A parameter-efficient framework, DIKI, is introduced to preserve the pre-trained knowledge of VLMs under the DCIL setting. It resolves the information interference problem and avoids the need for heavy computation and external data.
- To alleviate forward forgetting, DIKI implants new knowledge in a fully residual manner, keeping the pre-trained knowledge undisturbed. Building on this residual property, a distribution-aware integration calibration is further incorporated to improve performance on unseen tasks.
- Comprehensive experiments show that DIKI achieves state-of-the-art performance with only 0.86% of the trained parameters and significantly less training time compared to previous methods.
Preliminaries
- Continual learning protocol
Continual learning aims to learn different tasks sequentially without forgetting what has been learned previously. Consider \(N\) sequential tasks \(\left[ \mathcal{T}^1, \mathcal{T}^2, \cdots, \mathcal{T}^N \right]\). Each task \(\mathcal{T}^i\) contains a dataset \(D^i=\{x^i_j, y^i_j\}_{j=1}^{N^i}\), where \(x^i_j\) is an image, \(y^i_j\) is its one-hot label within the current dataset, and \(N^i\) is the number of image samples. In addition, a set of class names \(C^i=\{c^i_j\}_{j=1}^{N_{c}^i}\) connects the label indices to the category names used by VLMs.
In contrast to previous class-incremental and domain-incremental settings, this study emphasizes a more practical continual learning setting: domain-class incremental learning (DCIL). In this setup, both the domain distribution and the categories to be recognized keep changing across tasks, i.e., \(C^i \neq C^j\) and \(\mathbb{P}(D^i) \neq \mathbb{P}(D^j)\) for \(i \neq j\), where \(\mathbb{P}\) denotes the data distribution of a task's dataset.
- Vision-language models
In the challenging domain-class incremental learning (DCIL) setting, training models based on common image encoders such as ResNets and ViTs is impractical, since the domains and categories change drastically across tasks. Pre-trained vision-language models are therefore introduced for their strong zero-shot transfer ability. CLIP consists of an image encoder \(f\) and a text encoder \(g\), trained to produce tightly aligned features for paired image-text samples. At inference time, \(f\) first encodes the input image \(x\) into a feature vector \(f(x)\). Meanwhile, the candidate class names are filled into a template, e.g., "A {\(c\)} photo", and encoded by \(g\) to form text embeddings \(\{t_j\}_{j=1}^{N_c}\). The model's prediction is determined by the maximum similarity score \(s_j = \Braket{f(x), t_j}\) between the image embedding and all text embeddings, where \(\Braket{\cdot, \cdot}\) denotes cosine similarity.
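To make this zero-shot inference procedure concrete, here is a minimal PyTorch-style sketch (not the paper's code); `image_encoder`, `text_encoder`, and `tokenizer` are assumed stand-ins for a pre-trained VLM such as CLIP's \(f\), \(g\), and its tokenizer:

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    """Zero-shot classification via cosine similarity of CLIP-style embeddings.

    `image_encoder`, `text_encoder`, and `tokenizer` are hypothetical stand-ins
    for a pre-trained VLM; they are not part of the paper's released code.
    """
    # f(x): encode the input image into a feature vector of shape (d,)
    image_feat = image_encoder(image)
    # g(.): encode each templated class name into a text embedding t_j, shape (N_c, d)
    prompts = [f"A {c} photo" for c in class_names]
    text_feats = text_encoder(tokenizer(prompts))

    # s_j = <f(x), t_j>: cosine similarity between the image and every text embedding
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    scores = text_feats @ image_feat

    # The predicted class is the one with the maximum similarity score
    return scores.argmax().item()
```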
- Task-specific prompt learning
A series of studies has explored parameter-efficient fine-tuning for continual learning. A common approach is to learn and store a set of lightweight prompts for each task, forming a "prompt pool" over the continual learning phase, which can be written as \(\mathbf{P}=\{P^i\in \mathbb{R}^{l\times d}\}_{i=1}^{N}\), where \(N\) is the number of tasks, and \(l\) and \(d\) are the prompt length and the feature embedding dimension, respectively.
At inference time, the trained prompts are selected and attached to the pre-trained frozen model to recover the learned knowledge. Let \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) be the feature embedding of a Transformer layer \(h\); the prompt can then be prepended to \(\mathbf{x_e}\) to generate the prompted input \(\mathbf{x_p} = \left[P_s^1; \cdots; P_s^l; \mathbf{x_e}\right]\), where \(\{P_s^i\in \mathbb{R}^{d}\}_{i=1}^l\) are the embedding vectors of the selected prompt \(P_s\) and \(;\) denotes concatenation along the token length dimension. With this implanted knowledge, better image and text feature embeddings are produced and the final classification accuracy is improved.
The prompt selection mentioned above is realized by query-key matching. During continual training, a key feature representation \(\mathbf{I}=\{I^i\}_{i=1}^N\) is learned for each task, either by maximizing cosine similarity or by applying a clustering algorithm. When a test sample \(\mathbf{x}\) arrives, a key lookup \(s = \arg\max_{i\in[1,N]} \Braket{f(\mathbf{x}), I^i}\) is performed: the most relevant key \(I_s\) selects the corresponding prompt \(P_s\), which is attached to the frozen model to carry out inference.
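The selection step can be summarized with a small sketch. Using a mean frozen feature per task as the key follows the description above, but the exact key construction here is an assumption:

```python
import torch
import torch.nn.functional as F

def build_task_key(task_features: torch.Tensor) -> torch.Tensor:
    """I^i: one key per task; here simply the mean frozen feature of that task (assumption)."""
    return task_features.mean(dim=0)

def select_and_prepend(test_feature: torch.Tensor,
                       task_keys: torch.Tensor,      # (N, d) stacked keys I^1..I^N
                       prompt_pool: torch.Tensor,    # (N, l, d) stored prompts P^1..P^N
                       x_e: torch.Tensor) -> torch.Tensor:
    """Query-key matching followed by prepending the selected prompt to x_e (L, d)."""
    # Cosine similarity between the test feature and every task key
    sims = F.cosine_similarity(test_feature.unsqueeze(0), task_keys, dim=-1)  # (N,)
    s = sims.argmax().item()              # index of the most relevant key I_s
    p_s = prompt_pool[s]                  # selected prompt P_s, shape (l, d)
    # x_p = [P_s; x_e]: concatenation along the token length dimension
    return torch.cat([p_s, x_e], dim=0)   # shape (l + L, d)
```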
Methodology
Interference-free Knowledge Integration
- Is prepending the best choice?
Although methods that prepend prompts to the input tokens are widely used for their simplicity of implementation, the paper finds that they face two problems.

- Concatenating the prompts with the input tokens causes them to interact with each other during attention, which disturbs the extraction of pre-trained knowledge. When test samples come from the distribution on which the prompts were learned, the adapted model still achieves relatively satisfactory results. However, once samples from a shifted distribution are encountered, this interference degrades the model's performance and destroys its important zero-shot generalization ability, causing the forward forgetting problem.
- Simply prepending prompts inevitably increases the token length of all Transformer layers, which is undesirable in scenarios where the token length is constrained. It also has limited scalability: a longer prompt context may cause the text encoder to overlook the important category names, resulting in poorly represented text embeddings.
The problems above indicate that prompt tuning methods do not satisfy the "residual property": ideally, the learned parameters should form a residual path parallel to the frozen backbone, supplementing new knowledge without affecting the critical pre-trained knowledge. Therefore, the paper proposes an Interference-free Knowledge Integration (IKI) scheme, which injects newly learned knowledge into the pre-trained VLM with minimal noise.
- IKI mechanism
Instead of training a series of prepended prompt vectors for each task, the paper modifies the self-attention mechanism, following parameter-efficient fine-tuning approaches widely used in natural language processing. Recall the multi-head self-attention performed on the input tokens \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) of a Transformer layer \(h\). For simplicity, the multi-head design is omitted and only the single-head case is considered, which extends naturally to the multi-head scenario. The input tokens are first converted by linear projections into the query \(Q_e\), key \(K_e\), and value \(V_e\) matrices, e.g. \(Q_e = \mathbf{x_e}W_q + b_q\), where \(W\in \mathbb{R}^{d\times d}\) and \(b\in \mathbb{R}^{d}\) are pre-trained parameters. The self-attention computation then produces the output matrix \(O_L = \text{Attn}(Q_e, K_e)\,V_e\) with \(\text{Attn}(Q_e, K_e) = \text{softmax}(Q_e K_e^T)\), where \(\text{softmax}(\mathbf{z})_i = \frac{\exp{(\mathbf{z_i})}}{\sum_j\exp{(\mathbf{z_j})}}\) constrains each row of the attention matrix \(\text{Attn}(Q_e, K_e)\in \mathbb{R}^{L\times L}\) to sum to one.
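For reference, a minimal single-head sketch of this frozen self-attention; the \(\sqrt{d}\) scaling is the standard Transformer convention and an assumption here, since the text above omits it:

```python
import torch
import torch.nn.functional as F

def self_attention(x_e: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Standard single-head self-attention on input tokens x_e of shape (L, d).

    w_q, w_k, w_v are the frozen pre-trained linear projections (e.g. nn.Linear(d, d)).
    """
    q, k, v = w_q(x_e), w_k(x_e), w_v(x_e)                   # Q_e, K_e, V_e: each (L, d)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # Attn(Q_e, K_e): (L, L), rows sum to 1
    return attn @ v                                          # O_L: (L, d)
```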
The normal prompt tuning approach concatenates trainable prompts with the input tokens, expanding \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) to \(\mathbf{x_p}\in \mathbb{R}^{(l+L)\times d}\). The attention computation then yields \(Q_{p}K_{p}^T\in \mathbb{R}^{(l+L)\times (l+L)}\), which is passed to the softmax function. Inside the softmax computation, the attention scores of the input tokens and the prompts interact and influence each other, leading to an unavoidable loss of pre-trained knowledge, as shown in Fig. 2(a).
To solve this problem, the paper computes the self-attention within the input tokens and the cross-attention between the prompts and the input tokens separately, as shown in Fig. 2(b). In other words, only a residual attention branch is trained, keeping the existing attention scores unchanged. With the newly introduced keys \(K_r\) and values \(V_r\), the output of the residual attention branch can be expressed as \(O_r = \text{softmax}(Q_e K_r^T)\,V_r\). Here, the residual output \(O_r\in \mathbb{R}^{L\times d}\) is obtained along a path parallel to the original output \(O_L\) and has no effect on the original attention process. Finally, the knowledge stored in the learned parameters is implanted by adding \(O_r\) to the output. During the continual training phase, only the learnable keys \(K_r\) and values \(V_r\) are updated, instead of the usual prompts \(P\). Note that no query parameters are introduced, so the sequence length stays constant.
Ideally, a residual block should not affect the original branch before it has been trained on any downstream dataset, i.e., at initialization. The widely used practice of initializing prompts with a uniform or normal distribution would inject random noise, which carries no learned knowledge, into the pre-trained VLMs. To avoid this, the value parameters \(V_r\) are initialized to zero, forcing the residual attention addition to be a constant zero at initialization: \(O_r^{\text{init}} = \text{softmax}(Q_e K_r^T)\,V_r^{\text{init}} = \mathbf{0}\). Note that only the values \(V_r^{\text{init}}\) are restricted to zero while \(K_r\) is randomly initialized: if both \(K_r\) and \(V_r\) were initialized as zero matrices, \(K_r\) would be prevented from being updated by gradients, causing \(V_r\) to collapse into vectors with identical values.
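Putting the pieces together, below is a minimal single-head sketch of how IKI could look. This is one reading of the mechanism, not the authors' implementation; the frozen projections stand in for the pre-trained weights, and the \(\sqrt{d}\) scaling is assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IKIAttention(nn.Module):
    """Sketch of Interference-free Knowledge Integration for one attention head.

    The frozen branch reproduces the pre-trained self-attention untouched; only the
    task-specific residual parameters K_r (random init) and V_r (zero init) are trained.
    """
    def __init__(self, dim: int, num_residual_tokens: int):
        super().__init__()
        d, l = dim, num_residual_tokens
        # Frozen pre-trained projections (placeholders here; loaded from the VLM in practice)
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        for layer in (self.w_q, self.w_k, self.w_v):
            for p in layer.parameters():
                p.requires_grad = False
        # Residual branch: K_r randomly initialized, V_r zero-initialized
        self.k_r = nn.Parameter(torch.randn(l, d) * 0.02)
        self.v_r = nn.Parameter(torch.zeros(l, d))

    def forward(self, x_e: torch.Tensor) -> torch.Tensor:
        # x_e: (L, d) input tokens of one Transformer layer
        q, k, v = self.w_q(x_e), self.w_k(x_e), self.w_v(x_e)
        scale = q.shape[-1] ** 0.5
        # Frozen branch: original output O_L, identical to the pre-trained model
        o_l = F.softmax(q @ k.T / scale, dim=-1) @ v
        # Residual branch: cross-attention from the input queries to K_r / V_r.
        # With V_r = 0 at initialization, o_r = 0 and the output equals O_L exactly.
        o_r = F.softmax(q @ self.k_r.T / scale, dim=-1) @ self.v_r
        return o_l + o_r          # sequence length L is unchanged
```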
Since zero initialization is more a design choice than a novel technique, several prior studies have employed it in various tasks. However, those works use zero initialization to ensure stable and progressive training, a concern that does not exist in the DCIL scenario. The paper argues that zero initialization is essential here for the residual attention design: it allows new knowledge to be injected into the pre-trained VLMs with minimal noise.
Distribution-aware Integration Calibration
- Observations
At inference time, the query-key matching mechanism described above is executed to retrieve the learned prompts appropriate for the current test sample. This design targets the traditional continual learning setup, which only considers backward forgetting. However, when confronted with data from unseen domains, this simple matching scheme is still forced to assign some relatively similar learned task to the test samples, despite the significant distribution gap between them.
Thanks to the residual design of IKI, less noise is introduced in such mismatched cases than with previous methods. Nevertheless, as the gap between the training and test distributions grows, some degree of performance degradation is inevitable, which harms the zero-shot ability that VLMs acquired during pre-training.
ZSCL addresses this with distillation: it constructs a reference dataset of 100,000 images sampled from ImageNet and, at every training step, distills the pre-trained knowledge of the original CLIP into the current model, explicitly rehearsing it to avoid forgetting. This can be effective, but it relies on large-scale storage and heavy computational resources, making it impractical in real-world environments.
An intuitive solution is to control the extent to which knowledge is implanted into the model. However, previous prefix-based prompt tuning techniques offer only two options: either append the learned prompts, or leave the original CLIP model entirely unmodified. Thanks to the elegant residual formulation of IKI, it is now possible to control this parallel branch continuously.
- DIKI: calibrate the integration with distribution
To determine the likelihood that a test sample belongs to a learned task, a feature distribution is maintained for each task rather than a single key vector. The paper simply fits a multivariate Gaussian distribution and finds that it works well. Formally, during the training phase of task \(i\), a distribution \(\mathcal{N}^i(\mathbf{\mu}^i, \mathbf{\Sigma}^i)\) is built with \(\mathbf{\mu}^i = \frac{1}{N^i}\sum_{j=1}^{N^i} f(\mathbf{x}^i_j)\) and \(\mathbf{\Sigma}^i = \frac{1}{N^i}\sum_{j=1}^{N^i} \left(f(\mathbf{x}^i_j)-\mathbf{\mu}^i\right)\left(f(\mathbf{x}^i_j)-\mathbf{\mu}^i\right)^T\), where \(f(\mathbf{x}^i_j)\) are the image features extracted by the frozen encoder. With these estimated distributions, the likelihood that a test sample is drawn from each \(\mathcal{N}^i\) can be computed. The logarithm of the probability density serves as the score of input \(\mathbf{x}\) on each learned task, \(S^i = \log \varphi\!\left(f(\mathbf{x});\, \mathbf{\mu}^i, \mathbf{\Sigma}^i\right)\), where \(\varphi\) is the probability density function.
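A compact sketch of this per-task Gaussian modeling and scoring; the small diagonal regularizer is an added assumption for numerical stability, not something stated in the paper:

```python
import torch

def fit_task_gaussian(features: torch.Tensor, eps: float = 1e-4):
    """Estimate (mu^i, Sigma^i) from the frozen image features of task i, shape (N_i, d)."""
    mu = features.mean(dim=0)
    centered = features - mu
    sigma = centered.T @ centered / features.shape[0]
    sigma = sigma + eps * torch.eye(sigma.shape[0])   # regularization (assumption)
    return mu, sigma

def task_score(test_feature: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """S^i: log probability density of the test feature under task i's Gaussian."""
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma)
    return dist.log_prob(test_feature)
```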
Intuitively, a sample with a higher score \(S^i\) is more likely to have been drawn from task \(i\), so the parameters \(K_r^i, V_r^i\) should be introduced for prediction. Moreover, if all \(S^i\) are low, this implies that the input sample \(\mathbf{x}\) may come from some new distribution. Therefore, the maximum score \(\hat{S}=\max_{i\in [1,N]}S^{i}\) is used to weight the residual attention output: \(O = O_L + \mathcal{M}(\hat{S})\cdot O_r\), where \(\mathcal{M}\) is a mapping function that scales the score \(\hat{S}\) to the range \([0,1]\). The paper finds that a simple sigmoid function \(\sigma(x)=\frac{1}{1+e^{-x}}\) works well here. Thanks to this distribution-aware integration calibration, the pre-trained zero-shot capability of VLMs is better preserved, and the forward forgetting problem is further alleviated by assigning lower weights to unfamiliar images.
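The calibration itself then amounts to a scalar gate on the residual branch. The sketch below uses a plain sigmoid as \(\mathcal{M}\); the shift and temperature applied to the raw log-density are hypothetical hyper-parameters, since the paper only requires \(\mathcal{M}\) to map \(\hat{S}\) into \([0,1]\):

```python
import torch

def calibrated_output(o_l: torch.Tensor, o_r: torch.Tensor, scores: torch.Tensor,
                      shift: float = 0.0, temperature: float = 1.0) -> torch.Tensor:
    """O = O_L + M(S_hat) * O_r, with M a sigmoid mapping of the best task score.

    `shift` and `temperature` are hypothetical normalization constants for the raw
    log-density scores; they are not specified in the paper.
    """
    s_hat = scores.max()                                   # S_hat: best score over learned tasks
    weight = torch.sigmoid((s_hat - shift) / temperature)  # M(S_hat) in [0, 1]
    # Unfamiliar inputs get a low weight, falling back to the frozen pre-trained output O_L
    return o_l + weight * o_r
```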
Experiments