
AnytimeCL: A New Scheme for a Harder Setting, Supporting Arbitrary Continual Learning Scenarios | ECCV'24


Paper: Anytime Continual Learning for Open Vocabulary Classification

  • Paper address: /abs/2409.08518
  • Paper code: /jessemelpolio/AnytimeCL

Innovations


  • During online training, each batch consists of the new training samples plus class-balanced stored samples.
  • The per-label accuracy of each model is learned online and used to weight the predictions of the original and tuned models.
  • The loss is modified to support "none of the above" (not among the predefined labels) predictions, which also makes open-vocabulary training more stable.
  • Intermediate-layer feature compression reduces the storage of training samples and improves speed with little impact on accuracy.

Content overview


The paper proposes the anytime continual learning for open-vocabulary image classification (AnytimeCL) approach, which aims to break away from the constraints of batch training and rigid models: the system must be able to predict over any set of labels at any time, and to update and improve efficiently whenever one or more training samples are received.

AnytimeCL is based on a dynamic weighting mechanism that combines the predictions of a partially fine-tuned model with the predictions of the original model. When a new training sample arrives, stored samples are used to fill out a class-balanced batch that updates the final Transformer block of the tuned model; the accuracy estimates of the tuned and original models for the given label are then updated; finally, the predictions of the two models are weighted per label according to their expected accuracy.

In addition, the paper proposes an attention-weighted principal component analysis (PCA) method for compressing training features, which reduces storage and computation requirements with little impact on model accuracy.

AnytimeCL


The goal of the paper is to improve an open-vocabulary image classifier on the target task by combining a tuned model with the original model. The tuned model uses the same encoder as the original model but contains a trainable decoder.

For an image \(x\), the probabilities that the tuned model and the original model assign to each candidate label are denoted \(P_t(y|x)\) and \(P_o(y|x)\), respectively. The final probability combines them with online class-wise weighting (OCW):

\[\begin{equation} \label{eq:our_weighting} P(y|x) = \alpha_t(y) P_t(y|x) + \alpha_o(y) P_o(y|x). \end{equation} \]

During training, new samples are encoded as intermediate features (the feature vectors of the image patches plus a CLS token), which can optionally be compressed, and are stored for future reuse.

Model

  • Original model

The original model is a publicly available CLIP ViT model. Based on the cosine similarity between the image embedding \(e_{x}\) (the CLS token) and the text embedding \(e_{y}\), it produces, for an image \(x\) and a given set of candidate text labels \(\mathcal{Y}\), the probability of label \(y\):

\[\begin{equation} \label{eq:class_wise_probability} P_o(y|x) = \frac{\exp(100 \cdot \cos(e_{x}, e_{y}))}{\sum_{y_k\in\mathcal{Y}} \exp(100 \cdot \cos(e_{x}, e_{y_k}))}. \end{equation} \]
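As a rough sketch of this zero-shot scoring (not the authors' code; the function and argument names here are made up), the probability above can be computed from CLIP embeddings as follows:

```python
import torch
import torch.nn.functional as F

def original_model_probs(image_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """P_o(y|x): softmax over 100 * cosine similarity between the image
    embedding (CLS token) and each candidate label embedding.

    image_emb: (D,)   CLIP image embedding
    text_embs: (K, D) CLIP text embeddings for the K candidate labels
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = 100.0 * (text_embs @ image_emb)   # (K,) cosine similarities scaled by 100
    return logits.softmax(dim=-1)              # (K,) probabilities P_o(y|x)
```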

  • Tuned model

The tuned model tunes only the final image Transformer block while keeping the label embeddings fixed. This helps the features stay aligned with the text modality and reduces overfitting to the labels received so far.
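A minimal PyTorch-style sketch of this partial fine-tuning is given below; the `visual.transformer.resblocks` layout is an assumption based on open_clip-style ViT models and may differ in other CLIP implementations:

```python
import torch

def freeze_all_but_last_block(model) -> list:
    """Keep only the final image Transformer block trainable;
    text/label embeddings and all other weights stay frozen."""
    for p in model.parameters():
        p.requires_grad = False
    # Assumed open_clip-style layout: model.visual.transformer.resblocks holds
    # the image Transformer blocks; only the last one is tuned.
    last_block = model.visual.transformer.resblocks[-1]
    for p in last_block.parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Example usage (hyperparameters are illustrative, not from the paper):
# trainable = freeze_all_but_last_block(clip_model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```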

Given a new sample, a batch is constructed from that sample together with stored training samples drawn with class balancing. In addition, a regularization loss is used to improve performance: if the true label is not among the candidates, every candidate label should receive a low score. This is achieved by adding an "other" option to the candidate set; since "other" has no specific appearance, it is modeled only by a learnable bias term. The combined loss for training the tuned model is:

\[\begin{equation} \label{eq:final_loss} \mathcal{L}(x, y, \mathcal{Y}) =\mathcal{L}_{\text{ce}}(x,y,\mathcal{Y} \cup \text{other}) + \beta \mathcal{L}_{\text{ce}}(x,\text{other},(\mathcal{Y} \cup \text{other}) \setminus y). \end{equation} \]
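The loss can be sketched roughly as follows (an illustrative reimplementation, not the authors' code; the default value of `beta` is made up): the "other" class is a single learnable bias appended to the label logits, and the second cross-entropy term pushes the prediction toward "other" once the true label is removed from the candidates.

```python
import torch
import torch.nn.functional as F

def anytimecl_loss(logits: torch.Tensor, target: torch.Tensor,
                   other_bias: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Combined loss with an appended learnable 'other' logit.

    logits:     (B, K) scores of the tuned model over the K candidate labels
    target:     (B,)   indices of the true labels in [0, K)
    other_bias: learnable scalar parameter acting as the 'other' logit
    beta:       weight of the regularization term (illustrative value)
    """
    B, K = logits.shape
    other = other_bias.reshape(1, 1).expand(B, 1)          # 'other' has no appearance model, only a bias
    logits_with_other = torch.cat([logits, other], dim=1)  # (B, K+1); 'other' is class index K

    # Term 1: standard cross-entropy over {candidates} ∪ {other}
    loss_ce = F.cross_entropy(logits_with_other, target)

    # Term 2: remove the true label from the candidates and ask for 'other'
    mask = torch.ones_like(logits_with_other, dtype=torch.bool)
    mask.scatter_(1, target.unsqueeze(1), False)           # drop the true-label column per sample
    reduced = logits_with_other.masked_fill(~mask, float('-inf'))
    other_target = torch.full((B,), K, dtype=torch.long, device=logits.device)
    loss_other = F.cross_entropy(reduced, other_target)

    return loss_ce + beta * loss_other
```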

Online class-wise weighting (OCW)

Before each update, the new training sample is used to update the estimate of how likely the tuned and original models are to label it correctly, so that the model that tends to get a label right receives a higher weight for that label. An exponential moving average (EMA) is applied to estimate these quantities online, which is consistent with the goal of anytime continual learning. With the EMA decay set to \(\eta\) (default \(0.99\)), the estimated accuracy of the tuned model at the current step is:

\[\begin{equation} c_t(y) = \eta \hat{c}_t(y) + (1 - \eta) \mathbb{1}[y_t(x)=y]. \end{equation} \]

Here, \(\hat{c}_t(y)\) is the accuracy estimate for label \(y\) from the previous step, and \(y_t(x)\) denotes the label predicted by the tuned model for \(x\). Since the exponential moving average relies on past values, \(c_t(y)\) can be interpreted as the average accuracy over roughly the previous \(\lfloor \frac{1}{1-\eta} \rfloor\) samples. \(c_o(y)\) is updated in the same way.

After obtaining \(c_t(y)\) and \(c_o(y)\), the weights of the two models are:

\[\begin{equation} \label{eq:final_alpha} \alpha_t(y)= \frac{c_t(y)}{c_t(y) + c_o(y) + \epsilon}, \qquad \alpha_o(y)= 1 - \alpha_t(y). \end{equation} \]

Here, \(\epsilon\) is a very small number (1e-8) used to prevent division by zero. For labels that the tuned model has not seen, \(\alpha_t(y)\) is set to \(0\), so \(\alpha_o(y)=1\).
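A compact sketch of how OCW could be maintained online (an illustrative reimplementation; the class name and dictionary-based bookkeeping are assumptions): the per-label accuracies of both models are tracked with an EMA and converted into the weights above at prediction time.

```python
from collections import defaultdict

class OnlineClassWeighting:
    """Tracks per-label EMA accuracy for the tuned (t) and original (o) models
    and produces the mixing weights alpha_t(y), alpha_o(y)."""

    def __init__(self, eta: float = 0.99, eps: float = 1e-8):
        self.eta, self.eps = eta, eps
        self.c_t = defaultdict(float)   # EMA accuracy of the tuned model per label
        self.c_o = defaultdict(float)   # EMA accuracy of the original model per label
        self.seen = set()               # labels the tuned model has been trained on

    def update(self, true_label, pred_t, pred_o):
        """c(y) <- eta * c(y) + (1 - eta) * 1[prediction == y], for each model."""
        self.seen.add(true_label)
        self.c_t[true_label] = self.eta * self.c_t[true_label] + (1 - self.eta) * float(pred_t == true_label)
        self.c_o[true_label] = self.eta * self.c_o[true_label] + (1 - self.eta) * float(pred_o == true_label)

    def weights(self, label):
        """alpha_t(y) and alpha_o(y); unseen labels fall back entirely to the original model."""
        if label not in self.seen:
            return 0.0, 1.0
        a_t = self.c_t[label] / (self.c_t[label] + self.c_o[label] + self.eps)
        return a_t, 1.0 - a_t

    def combine(self, label, p_t, p_o):
        """P(y|x) = alpha_t(y) * P_t(y|x) + alpha_o(y) * P_o(y|x)."""
        a_t, a_o = self.weights(label)
        return a_t * p_t + a_o * p_o
```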

Storage Efficiency and Privacy

Tuning the model requires storing either each image or the features (tokens) that are fed into the tuned part. Storing images raises privacy concerns and is inefficient in both space and computation, since images must be re-encoded during training. Storing features mitigates some of these problems but still consumes a lot of memory or storage.

Representations learned by well-trained networks are already efficient and therefore hard to compress: trying to compress the feature vectors with a dataset-level VQ-VAE or PCA (principal component analysis) cannot achieve meaningful compression without a significant loss in training performance. However, the features within a single image contain a lot of redundancy. Therefore, PCA vectors are computed for each image, and these vectors are stored together with the coefficients of each feature vector.

Furthermore, not all tokens are equally important for prediction, so the per-image PCA is trained with attention weighting, weighting each token by its attention to the CLS token. Finally, for further compression, the min/max floating-point values of each vector and of its coefficients are stored, and the values are quantized to 8-bit or 16-bit unsigned integers. Storing only five PCA vectors and their coefficients in this way reduces the storage of the 50 768-dimensional tokens per image (\(7\times 7\) patch tokens plus the CLS token) from 153 KB to 5 KB, while the difference in prediction accuracy is less than 1%.
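A rough NumPy sketch of this compression is shown below. It is not the authors' implementation; the stated shapes (50 tokens of dimension 768, 5 components), the use of a stored per-image mean, and the uniform 8-bit quantization scheme are assumptions for illustration.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, cls_attn: np.ndarray, n_comp: int = 5):
    """Per-image attention-weighted PCA + 8-bit quantization.

    tokens:   (50, 768) intermediate features (7x7 patch tokens + CLS token)
    cls_attn: (50,)     attention weights of each token w.r.t. the CLS token
    Returns quantized PCA basis and coefficients (with min/max ranges) and the mean.
    """
    mean = tokens.mean(axis=0, keepdims=True)
    centered = tokens - mean
    # Attention-weighted covariance: tokens the CLS token attends to more
    # contribute more to the principal directions.
    w = cls_attn / (cls_attn.sum() + 1e-8)
    cov = (centered * w[:, None]).T @ centered           # (768, 768)
    _, vecs = np.linalg.eigh(cov)
    basis = vecs[:, -n_comp:].T                          # (5, 768) top components
    coeffs = centered @ basis.T                          # (50, 5) per-token coefficients

    def quantize(a):                                     # store min/max + uint8 codes
        lo, hi = float(a.min()), float(a.max())
        q = np.round((a - lo) / (hi - lo + 1e-12) * 255).astype(np.uint8)
        return q, lo, hi

    return quantize(basis), quantize(coeffs), mean

def decompress(q_basis, q_coeffs, mean):
    def dequantize(q, lo, hi):
        return q.astype(np.float32) / 255 * (hi - lo) + lo
    basis = dequantize(*q_basis)
    coeffs = dequantize(*q_coeffs)
    return coeffs @ basis + mean                         # (50, 768) approximate tokens
```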

Main experiments




If this article is helpful to you, please give it a like or a "Wow" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].

work-life balance.