This study addresses domain-class incremental learning (DCIL), a realistic but challenging continual learning scenario in which both the domain distribution and the target categories vary across tasks. To cope with such diverse tasks, pre-trained vision-language models (VLMs) are introduced for their strong generalization ability. However, this raises a new issue: when adapting to new tasks, the knowledge encoded in the pre-trained VLMs may be disturbed, compromising their inherent zero-shot capability. Existing methods tackle this by performing knowledge distillation on the VLMs with an additional reference dataset, which incurs a heavy computational overhead. To solve the problem efficiently, the paper proposes the Distribution-aware Interference-free Knowledge Integration (DIKI) framework, which preserves the pre-trained knowledge of VLMs from the perspective of avoiding information interference. Specifically, a fully residual mechanism is designed to inject newly learned knowledge into a frozen backbone while imposing minimal adverse impact on the pre-trained knowledge. Furthermore, this residual property enables a distribution-aware integration calibration scheme that explicitly controls the knowledge implantation process for test data from unseen distributions. Experiments show that DIKI surpasses current state-of-the-art methods using only 0.86% of the trained parameters and drastically less training time.
Paper: Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models

- Paper address: /abs/2407.05342
- Paper code: /lloongx/DIKI
Introduction
Supervised learning trains a network with full access to all data at once, which makes it inflexible to extend to knowledge of new tasks. Continual learning (CL) has emerged as a solution, allowing models to be trained continuously on sequentially arriving data while retaining the information already learned. Traditional CL settings generally consider only newly introduced categories or only changes in domain distribution, referred to as class-incremental learning and domain-incremental learning respectively. However, existing work that considers only one type of increment limits its applicability in complex real-world scenarios.
The paper considers a more challenging domain-class incremental learning (DCIL) setting, in which both the domain data distribution and the categories to be classified may keep changing across tasks, as shown in Fig. 1(a). In this case, techniques based on conventional image encoders are not feasible because of their non-scalable classification head design. Recently, contrastively trained vision-language models (VLMs) such as CLIP have made it possible to address this demanding but practical problem: trained on large-scale image-text pairs, VLMs possess strong zero-shot generalization and can recognize an almost unlimited number of categories, suiting such severely task-varying scenarios.
However, using vision-language models introduces a new challenge for incremental training. Traditional continual learning schemes aim to prevent models from forgetting previously learned knowledge, which is referred to as backward forgetting (forgetting fine-tuned knowledge). Existing research has explored regularization mechanisms, rehearsal buffers, and architectural designs to mitigate backward forgetting, with encouraging results. However, when these approaches are applied to vision-language models, a different form of catastrophic forgetting emerges: the models tend to forget what they learned during the pre-training phase, which hampers their strong zero-shot generalization ability. This problem is called forward forgetting (forgetting pre-trained knowledge), because it occurs when VLMs make "forward" predictions on data from unseen distributions. Fig. 1(a) illustrates both types of forgetting.
The recent work ZSCL attempts to address the forward forgetting problem of CLIP by introducing a large-scale reference dataset for knowledge distillation, combined with a weight integration scheme. However, this approach demands heavy computation and external data, which may be infeasible in practical scenarios. Meanwhile, existing VLM-based parameter-efficient continual learning methods mainly rely on a prompt tuning mechanism, which fails to retain pre-trained knowledge and leads to a drop in zero-shot capability, as shown in Fig. 1(b). The paper attributes this problem to information interference: newly introduced task-specific parameters interfere with the pre-trained knowledge. A schematic of these methods is shown in Fig. 1(c).
To mitigate the forward forgetting problem of VLMs in a computation- and parameter-efficient manner, the paper introduces the Distribution-aware Interference-free Knowledge Integration (DIKI) framework. Specifically, task-specific information is injected into a frozen VLM to efficiently store the learned knowledge of each task.
The contributions of the paper are summarized in three points:
- A parameter-efficient framework, DIKI, is introduced to preserve the pre-trained knowledge of VLMs under the DCIL setting. It resolves the information interference problem and avoids the need for heavy computation and external data.
- To alleviate forward forgetting, DIKI implants new knowledge in a fully residual manner, keeping the pre-trained knowledge undisturbed. Building on this residual property, a distribution-aware integration calibration is further incorporated to improve performance on unseen tasks.
- Comprehensive experiments show that DIKI achieves state-of-the-art performance with only 0.86% of the trained parameters and significantly less training time compared to previous methods.
Preliminaries
- Continual learning protocol
Continual learning aims to learn different tasks sequentially without forgetting what has been learned previously. Consider \(N\) sequential tasks \(\left[ \mathcal{T}^1, \mathcal{T}^2, \cdots, \mathcal{T}^N \right]\). Each task \(\mathcal{T}^i\) contains a dataset \(D^i=\{x^i_j, y^i_j\}_{j=1}^{N^i}\), where \(x^i_j\) is an image, \(y^i_j\) is its one-hot label within the current dataset, and \(N^i\) is the number of image samples. In addition, a set of class names \(C^i=\{c^i_j\}_{j=1}^{N_{c}^i}\) connects the label indices to the category names used by VLMs.
In contrast to previous class-incremental and domain-incremental settings, this study emphasizes a more practical continual learning setting: domain-class incremental learning (DCIL). In this setup, both the domain distribution and the categories to be recognized keep changing across tasks, i.e., \(C^i \neq C^j\) and \(\mathbb{P}(D^i) \neq \mathbb{P}(D^j)\) for \(i \neq j\), where \(\mathbb{P}\) denotes the data distribution of a task's dataset.
- Vision-language models
In the challenging domain-class incremental learning (DCIL) setting, training models based on common image encoders such as ResNets and ViTs is impractical, since the domains and categories change drastically across tasks. Pre-trained vision-language models are therefore introduced for their strong zero-shot transfer ability. CLIP consists of an image encoder \(f\) and a text encoder \(g\), trained to produce tightly aligned features for paired image-text samples. At inference time, \(f\) first encodes the input image \(x\) into a feature vector \(f(x)\). Meanwhile, the candidate class names are filled into a template, e.g., "A {\(c\)} photo", and encoded by \(g\) to form text embeddings \(\{t_j\}_{j=1}^{N_c}\). The model's prediction is determined by the maximum similarity score \(s_j = \Braket{f(x), t_j}\) between the image embedding and all text embeddings, where \(\Braket{\cdot, \cdot}\) denotes cosine similarity.
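To make this zero-shot inference procedure concrete, here is a minimal PyTorch-style sketch (not the paper's code); `image_encoder`, `text_encoder`, and `tokenizer` are assumed stand-ins for a pre-trained VLM such as CLIP's \(f\), \(g\), and its tokenizer:

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    """Zero-shot classification via cosine similarity of CLIP-style embeddings.

    `image_encoder`, `text_encoder`, and `tokenizer` are hypothetical stand-ins
    for a pre-trained VLM; they are not part of the paper's released code.
    """
    # f(x): encode the input image into a feature vector of shape (d,)
    image_feat = image_encoder(image)
    # g(.): encode each templated class name into a text embedding t_j, shape (N_c, d)
    prompts = [f"A {c} photo" for c in class_names]
    text_feats = text_encoder(tokenizer(prompts))

    # s_j = <f(x), t_j>: cosine similarity between the image and every text embedding
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    scores = text_feats @ image_feat

    # The predicted class is the one with the maximum similarity score
    return scores.argmax().item()
```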
- Task-specific prompt learning
A series of studies has explored parameter-efficient fine-tuning for continual learning. A common approach is to learn and store a set of lightweight prompts for each task, forming a "prompt pool" over the continual learning phase, which can be written as \(\mathbf{P}=\{P^i\in \mathbb{R}^{l\times d}\}_{i=1}^{N}\), where \(N\) is the number of tasks, and \(l\) and \(d\) are the prompt length and the feature embedding dimension, respectively.
At inference time, the trained prompts are selected and attached to the pre-trained frozen model to recover the learned knowledge. Let \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) be the feature embedding of a Transformer layer \(h\); the prompt can then be prepended to \(\mathbf{x_e}\) to generate the prompted input \(\mathbf{x_p} = \left[P_s^1; \cdots; P_s^l; \mathbf{x_e}\right]\), where \(\{P_s^i\in \mathbb{R}^{d}\}_{i=1}^l\) are the embedding vectors of the selected prompt \(P_s\) and \(;\) denotes concatenation along the token length dimension. With this implanted knowledge, better image and text feature embeddings are produced and the final classification accuracy is improved.
The prompt selection mentioned above is realized by query-key matching. During continual training, a key feature representation \(\mathbf{I}=\{I^i\}_{i=1}^N\) is learned for each task, either by maximizing cosine similarity or by applying a clustering algorithm. When a test sample \(\mathbf{x}\) arrives, a key lookup \(s = \arg\max_{i\in[1,N]} \Braket{f(\mathbf{x}), I^i}\) is performed: the most relevant key \(I_s\) selects the corresponding prompt \(P_s\), which is attached to the frozen model to carry out inference.
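The selection step can be summarized with a small sketch. Using a mean frozen feature per task as the key follows the description above, but the exact key construction here is an assumption:

```python
import torch
import torch.nn.functional as F

def build_task_key(task_features: torch.Tensor) -> torch.Tensor:
    """I^i: one key per task; here simply the mean frozen feature of that task (assumption)."""
    return task_features.mean(dim=0)

def select_and_prepend(test_feature: torch.Tensor,
                       task_keys: torch.Tensor,      # (N, d) stacked keys I^1..I^N
                       prompt_pool: torch.Tensor,    # (N, l, d) stored prompts P^1..P^N
                       x_e: torch.Tensor) -> torch.Tensor:
    """Query-key matching followed by prepending the selected prompt to x_e (L, d)."""
    # Cosine similarity between the test feature and every task key
    sims = F.cosine_similarity(test_feature.unsqueeze(0), task_keys, dim=-1)  # (N,)
    s = sims.argmax().item()              # index of the most relevant key I_s
    p_s = prompt_pool[s]                  # selected prompt P_s, shape (l, d)
    # x_p = [P_s; x_e]: concatenation along the token length dimension
    return torch.cat([p_s, x_e], dim=0)   # shape (l + L, d)
```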
Methodology
Interference-free Knowledge Integration
- Is prepending the best choice?
Although methods that prepend prompts to the input tokens are widely used for their simplicity of implementation, the paper finds that they face two problems.

- Concatenating the prompts with the input tokens causes them to interact with each other during attention, which disturbs the extraction of pre-trained knowledge. When test samples come from the distribution on which the prompts were learned, the adapted model still achieves relatively satisfactory results. However, once samples from a shifted distribution are encountered, this interference degrades the model's performance and destroys its important zero-shot generalization ability, causing the forward forgetting problem.
- Simply prepending prompts inevitably increases the token length of all Transformer layers, which is undesirable in scenarios where the token length is constrained. It also has limited scalability: a longer prompt context may cause the text encoder to overlook the important category names, resulting in poorly represented text embeddings.
The problems above indicate that prompt tuning methods do not satisfy the "residual property": ideally, the learned parameters should form a residual path parallel to the frozen backbone, supplementing new knowledge without affecting the critical pre-trained knowledge. Therefore, the paper proposes an Interference-free Knowledge Integration (IKI) scheme, which injects newly learned knowledge into the pre-trained VLM with minimal noise.
- IKI mechanism
Instead of training a series of prepended prompt vectors for each task, the paper modifies the self-attention mechanism, following parameter-efficient fine-tuning approaches widely used in natural language processing. Recall the multi-head self-attention performed on the input tokens \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) of a Transformer layer \(h\). For simplicity, the multi-head design is omitted and only the single-head case is considered, which extends naturally to the multi-head scenario. The input tokens are first converted by linear projections into the query \(Q_e\), key \(K_e\), and value \(V_e\) matrices, e.g. \(Q_e = \mathbf{x_e}W_q + b_q\), where \(W\in \mathbb{R}^{d\times d}\) and \(b\in \mathbb{R}^{d}\) are pre-trained parameters. The self-attention computation then produces the output matrix \(O_L = \text{Attn}(Q_e, K_e)\,V_e\) with \(\text{Attn}(Q_e, K_e) = \text{softmax}(Q_e K_e^T)\), where \(\text{softmax}(\mathbf{z})_i = \frac{\exp{(\mathbf{z_i})}}{\sum_j\exp{(\mathbf{z_j})}}\) constrains each row of the attention matrix \(\text{Attn}(Q_e, K_e)\in \mathbb{R}^{L\times L}\) to sum to one.
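For reference, a minimal single-head sketch of this frozen self-attention; the \(\sqrt{d}\) scaling is the standard Transformer convention and an assumption here, since the text above omits it:

```python
import torch
import torch.nn.functional as F

def self_attention(x_e: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Standard single-head self-attention on input tokens x_e of shape (L, d).

    w_q, w_k, w_v are the frozen pre-trained linear projections (e.g. nn.Linear(d, d)).
    """
    q, k, v = w_q(x_e), w_k(x_e), w_v(x_e)                   # Q_e, K_e, V_e: each (L, d)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # Attn(Q_e, K_e): (L, L), rows sum to 1
    return attn @ v                                          # O_L: (L, d)
```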
The normal prompt tuning approach concatenates trainable prompts with the input tokens, expanding \(\mathbf{x_e}\in \mathbb{R}^{L\times d}\) to \(\mathbf{x_p}\in \mathbb{R}^{(l+L)\times d}\). The attention computation then yields \(Q_{p}K_{p}^T\in \mathbb{R}^{(l+L)\times (l+L)}\), which is passed to the softmax function. Inside the softmax computation, the attention scores of the input tokens and the prompts interact and influence each other, leading to an unavoidable loss of pre-trained knowledge, as shown in Fig. 2(a).
To solve this problem, the paper computes the self-attention within the input tokens and the cross-attention between the prompts and the input tokens separately, as shown in Fig. 2(b). In other words, only a residual attention branch is trained, keeping the existing attention scores unchanged. With the newly introduced keys \(K_r\) and values \(V_r\), the output of the residual attention branch can be expressed as \(O_r = \text{softmax}(Q_e K_r^T)\,V_r\). Here, the residual output \(O_r\in \mathbb{R}^{L\times d}\) is obtained along a path parallel to the original output \(O_L\) and has no effect on the original attention process. Finally, the knowledge stored in the learned parameters is implanted by adding \(O_r\) to the output. During the continual training phase, only the learnable keys \(K_r\) and values \(V_r\) are updated, instead of the usual prompts \(P\). Note that no query parameters are introduced, so the sequence length stays constant.
Ideally, a residual block should not affect the original branch before it has been trained on any downstream dataset, i.e., at initialization. The widely used practice of initializing prompts with a uniform or normal distribution would inject random noise, which carries no learned knowledge, into the pre-trained VLMs. To avoid this, the value parameters \(V_r\) are initialized to zero, forcing the residual attention addition to be a constant zero at initialization: \(O_r^{\text{init}} = \text{softmax}(Q_e K_r^T)\,V_r^{\text{init}} = \mathbf{0}\). Note that only the values \(V_r^{\text{init}}\) are restricted to zero while \(K_r\) is randomly initialized: if both \(K_r\) and \(V_r\) were initialized as zero matrices, \(K_r\) would be prevented from being updated by gradients, causing \(V_r\) to collapse into vectors with identical values.
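Putting the pieces together, below is a minimal single-head sketch of how IKI could look. This is one reading of the mechanism, not the authors' implementation; the frozen projections stand in for the pre-trained weights, and the \(\sqrt{d}\) scaling is assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IKIAttention(nn.Module):
    """Sketch of Interference-free Knowledge Integration for one attention head.

    The frozen branch reproduces the pre-trained self-attention untouched; only the
    task-specific residual parameters K_r (random init) and V_r (zero init) are trained.
    """
    def __init__(self, dim: int, num_residual_tokens: int):
        super().__init__()
        d, l = dim, num_residual_tokens
        # Frozen pre-trained projections (placeholders here; loaded from the VLM in practice)
        self.w_q, self.w_k, self.w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        for layer in (self.w_q, self.w_k, self.w_v):
            for p in layer.parameters():
                p.requires_grad = False
        # Residual branch: K_r randomly initialized, V_r zero-initialized
        self.k_r = nn.Parameter(torch.randn(l, d) * 0.02)
        self.v_r = nn.Parameter(torch.zeros(l, d))

    def forward(self, x_e: torch.Tensor) -> torch.Tensor:
        # x_e: (L, d) input tokens of one Transformer layer
        q, k, v = self.w_q(x_e), self.w_k(x_e), self.w_v(x_e)
        scale = q.shape[-1] ** 0.5
        # Frozen branch: original output O_L, identical to the pre-trained model
        o_l = F.softmax(q @ k.T / scale, dim=-1) @ v
        # Residual branch: cross-attention from the input queries to K_r / V_r.
        # With V_r = 0 at initialization, o_r = 0 and the output equals O_L exactly.
        o_r = F.softmax(q @ self.k_r.T / scale, dim=-1) @ self.v_r
        return o_l + o_r          # sequence length L is unchanged
```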
Since zero initialization is more a design choice than a novel technique, several prior studies have employed it in various tasks. However, those works use zero initialization to ensure stable and progressive training, a concern that does not exist in the DCIL scenario. The paper argues that zero initialization is essential here for the residual attention design: it allows new knowledge to be injected into the pre-trained VLMs with minimal noise.
Distribution-aware Integration Calibration
- Observations
At inference time, the query-key matching mechanism described above is executed to retrieve the learned prompts appropriate for the current test sample. This design targets the traditional continual learning setup, which only considers backward forgetting. However, when confronted with data from unseen domains, this simple matching scheme is still forced to assign some relatively similar learned task to the test samples, despite the significant distribution gap between them.
Thanks to the residual design of IKI, less noise is introduced in such mismatched cases than with previous methods. Nevertheless, as the gap between the training and test distributions grows, some degree of performance degradation is inevitable, which harms the zero-shot ability that VLMs acquired during pre-training.
ZSCL addresses this with distillation: it constructs a reference dataset of 100,000 images sampled from ImageNet and, at every training step, distills the pre-trained knowledge of the original CLIP into the current model, explicitly rehearsing it to avoid forgetting. This can be effective, but it relies on large-scale storage and heavy computational resources, making it impractical in real-world environments.
An intuitive solution is to control the extent to which knowledge is implanted into the model. However, previous prefix-based prompt tuning techniques offer only two options: either append the learned prompts, or leave the original CLIP model entirely unmodified. Thanks to the elegant residual formulation of IKI, it is now possible to control this parallel branch continuously.
- DIKI: calibrate the integration with distribution
To determine the likelihood that a test sample belongs to a learned task, a feature distribution is maintained for each task rather than a single key vector. The paper simply fits a multivariate Gaussian distribution and finds that it works well. Formally, during the training phase of task \(i\), a distribution \(\mathcal{N}^i(\mathbf{\mu}^i, \mathbf{\Sigma}^i)\) is built with \(\mathbf{\mu}^i = \frac{1}{N^i}\sum_{j=1}^{N^i} f(\mathbf{x}^i_j)\) and \(\mathbf{\Sigma}^i = \frac{1}{N^i}\sum_{j=1}^{N^i} \left(f(\mathbf{x}^i_j)-\mathbf{\mu}^i\right)\left(f(\mathbf{x}^i_j)-\mathbf{\mu}^i\right)^T\), where \(f(\mathbf{x}^i_j)\) are the image features extracted by the frozen encoder. With these estimated distributions, the likelihood that a test sample is drawn from each \(\mathcal{N}^i\) can be computed. The logarithm of the probability density serves as the score of input \(\mathbf{x}\) on each learned task, \(S^i = \log \varphi\!\left(f(\mathbf{x});\, \mathbf{\mu}^i, \mathbf{\Sigma}^i\right)\), where \(\varphi\) is the probability density function.
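A compact sketch of this per-task Gaussian modeling and scoring; the small diagonal regularizer is an added assumption for numerical stability, not something stated in the paper:

```python
import torch

def fit_task_gaussian(features: torch.Tensor, eps: float = 1e-4):
    """Estimate (mu^i, Sigma^i) from the frozen image features of task i, shape (N_i, d)."""
    mu = features.mean(dim=0)
    centered = features - mu
    sigma = centered.T @ centered / features.shape[0]
    sigma = sigma + eps * torch.eye(sigma.shape[0])   # regularization (assumption)
    return mu, sigma

def task_score(test_feature: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """S^i: log probability density of the test feature under task i's Gaussian."""
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=sigma)
    return dist.log_prob(test_feature)
```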
Intuitively, a sample with a higher score \(S^i\) is more likely to have been drawn from task \(i\), so the parameters \(K_r^i, V_r^i\) should be introduced for prediction. Moreover, if all \(S^i\) are low, this implies that the input sample \(\mathbf{x}\) may come from some new distribution. Therefore, the maximum score \(\hat{S}=\max_{i\in [1,N]}S^{i}\) is used to weight the residual attention output: \(O = O_L + \mathcal{M}(\hat{S})\cdot O_r\), where \(\mathcal{M}\) is a mapping function that scales the score \(\hat{S}\) to the range \([0,1]\). The paper finds that a simple sigmoid function \(\sigma(x)=\frac{1}{1+e^{-x}}\) works well here. Thanks to this distribution-aware integration calibration, the pre-trained zero-shot capability of VLMs is better preserved, and the forward forgetting problem is further alleviated by assigning lower weights to unfamiliar images.
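The calibration itself then amounts to a scalar gate on the residual branch. The sketch below uses a plain sigmoid as \(\mathcal{M}\); the shift and temperature applied to the raw log-density are hypothetical hyper-parameters, since the paper only requires \(\mathcal{M}\) to map \(\hat{S}\) into \([0,1]\):

```python
import torch

def calibrated_output(o_l: torch.Tensor, o_r: torch.Tensor, scores: torch.Tensor,
                      shift: float = 0.0, temperature: float = 1.0) -> torch.Tensor:
    """O = O_L + M(S_hat) * O_r, with M a sigmoid mapping of the best task score.

    `shift` and `temperature` are hypothetical normalization constants for the raw
    log-density scores; they are not specified in the paper.
    """
    s_hat = scores.max()                                   # S_hat: best score over learned tasks
    weight = torch.sigmoid((s_hat - shift) / temperature)  # M(S_hat) in [0, 1]
    # Unfamiliar inputs get a low weight, falling back to the frozen pre-trained output O_L
    return o_l + weight * o_r
```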
Experiments