Cross-modal transfer aims to utilize large pre-trained models for tasks whose data fall outside the modality of the pre-training data. Existing research has had some success in extending classical fine-tuning to cross-modal scenarios, but still lacks an understanding of how modality gaps affect transfer. In this work, a series of experiments on the quality of source representations during transfer reveals a link between larger modality gaps and less knowledge reuse, which implies poor transfer. The paper then formalizes this gap as a misalignment of modality knowledge, expressed through the conditional distribution \(P(Y|X)\), between different modalities. To address this problem, the paper proposes modality knowledge alignment (MoNA), a meta-learning method that learns a target data transformation to reduce the modality knowledge discrepancy prior to transfer. Experiments demonstrate that the approach achieves better reuse of source-modality knowledge in cross-modal transfer, improving over existing fine-tuning methods.

Paper: Learning Modality Knowledge Alignment for Cross-Modality Transfer

- Paper address: /abs/2406.18864
Introduction
Transferring knowledge from past experience to new tasks is a fundamental capability of human intelligence. In the machine learning community, this ability to acquire and reuse knowledge is constantly pursued, with the aim of building AI systems that predict more accurately and learn from data more efficiently. Today, as large foundation models trained on massive amounts of data are widely available, using such pre-trained models as powerful feature extractors for new tasks has become common practice in transfer learning. Naturally, the pre-trained model and the downstream task usually come from the same modality, e.g., a vision Transformer pre-trained on ImageNet and a CIFAR-100 classification task. However, recent research has attempted to extend this to cross-modal transfer, for example using a vision Transformer for audio classification or fine-tuning language models on tabular data.
The motivation for such cross-modal transfer is easy to understand, especially when data for the target modality is scarce. Scientific tasks, such as ECG classification and protein distance prediction, have difficulty collecting large amounts of training data and further require expensive annotation by human experts. In such cases, it is desirable to utilize pre-trained models from other modalities whose data are easier to collect (e.g., vision and language) to aid the target-modality task. However, cross-modal transfer is not as straightforward as intra-modal transfer, due to two challenges: 1) the input and label spaces differ across modalities, and 2) the knowledge required to solve tasks in different modalities may also differ.
Previous research has addressed the first challenge by designing modality-specific embedders and predictors that interface the pre-trained model with the input and output. However, the second challenge has not been well addressed. Some approaches treat large pre-trained models as general-purpose encoders and freeze the pre-trained model during fine-tuning. Other approaches fine-tune both the pre-trained model and the modality-specific components. Both lines of work show empirically that pre-trained models can be transferred to other modalities. However, which knowledge of the source modality is transferred via the pre-trained model, and how this knowledge benefits the target modality, remain unresolved core questions. For example, ORCA observed that training the model from scratch on certain target-modality tasks was even better than plain fine-tuning of the pre-trained model. This suggests that the knowledge contained in the pre-trained model may not improve target performance if it is not transferred appropriately.
In this work, the paper delves into the second challenge of cross-modal transfer. Experiments are first conducted to investigate how target-modality fine-tuning affects the quality of the representations of source-modality data. The paper observes that fine-tuning the pre-trained Swin Transformer on some target-modality tasks helps the Swin encoder extract more discriminative image features, while fine-tuning on other modalities weakens this ability. This empirical observation suggests that there is a facet of knowledge, termed modality semantic knowledge, that differs between modalities to varying degrees and affects the effectiveness of cross-modal transfer.
To clarify this aspect of the difference between modalities, modality semantic knowledge is interpreted as a conditional probability distribution \(P(Y|X)\). The conditional distribution of the source modality is modified according to the task of the target modality so that the two can be compared. Thus, the modality knowledge discrepancy can be formalized as the difference between the conditional distributions of the source and target modalities. When the target conditional distribution is similar to the modified source conditional distribution, the modality semantic knowledge is said to be aligned, and the source discriminant function learned by the pre-trained model can be reused for the target modality. Conversely, when the modality semantic knowledge conflicts, the two modalities may not reinforce each other, which explains the observations in ORCA.
The paper's interpretation provides a new perspective for understanding the effectiveness of the two-stage tuning process proposed by previous works on cross-modal transfer: the first stage can be viewed as implicitly learning a data transformation for the target modality, which makes the conditional distribution of the transformed data better aligned with that of the source. This insight suggests directly learning an appropriate target embedding function before fine-tuning, to minimize the knowledge misalignment. Based on this, the paper proposes a new method, MoNA, which improves cross-modal transfer through two-stage training. In the first stage, MoNA uses meta-learning to learn an optimal target embedder, which, combined with the pre-trained weights, serves as the initialization for full fine-tuning so that source-modality knowledge can be reused as much as possible. In the second stage, starting from the learned target embedder, the traditional fine-tuning approach is followed and all parameters are updated to fit the target task while making maximal use of the source knowledge.
The paper conducts extensive experiments on two cross-modal transfer benchmarks, NAS-Bench-360 and PDEBench, to validate the hypothesis and the effectiveness of the proposed method. Both benchmarks focus on modalities relevant to scientific problems, where the scarcity of training data is particularly acute. MoNA is compared with previous methods, and the experimental results show that it performs well.
Problem Formulation and Analysis
Introduction to basic notations and architecture
Consider knowledge transfer between a source modality \(\mathcal{M}^s\) and a target modality \(\mathcal{M}^t\). Data in the source modality (e.g., visual or linguistic data) are more accessible and less expensive, and large pre-trained models are publicly available. In contrast, the target-modality data are insufficient to pre-train its own large model. The two modalities differ in both input space and label space, i.e., \(\mathcal{X}^s\neq\mathcal{X}^t\), \(\mathcal{Y}^s\neq\mathcal{Y}^t\). Cross-modal transfer aims to utilize the source pre-trained model (parameterized by \(\boldsymbol{\theta}^{\mathcal{S}}\)) to help a target task that has only a small labeled dataset \(\{\boldsymbol{x}_i^{t}, y_i^{t}\}_{i=1}^{n_t}\).
Following previous studies, the model \(g_{\boldsymbol{\theta}}\) consists of an embedder \(e(\cdot;\boldsymbol{\theta}_e)\), a Transformer encoder \(f(\cdot;\boldsymbol{\theta}_f)\), and a predictor \(h(\cdot;\boldsymbol{\theta}_h)\); the parameters of the whole model are denoted \(\boldsymbol{\theta} = \{\boldsymbol{\theta}_e, \boldsymbol{\theta}_f, \boldsymbol{\theta}_h\}\). In particular, the pre-trained Transformer has its own embedder and predictor, so the pre-trained weights of the source model are denoted \(\boldsymbol{\theta}^{\mathcal{S}}_0 = \{\boldsymbol{\theta}_{e_0}^\mathcal{S}, \boldsymbol{\theta}_{f_0}^\mathcal{S}, \boldsymbol{\theta}_{h_0}^\mathcal{S}\}\).

The embedder maps input data to a shared input embedding space \(\hat{\mathcal{X}}\), and the encoder extracts features from the embedded input. The predictor is a linear layer that maps the encoder output to the label space. For the target model \(g_{\boldsymbol{\theta}^\mathcal{T}}\) with \(\boldsymbol{\theta}^\mathcal{T} = \{\boldsymbol{\theta}^\mathcal{T}_e, \boldsymbol{\theta}^\mathcal{T}_f, \boldsymbol{\theta}^\mathcal{T}_h\}\), both the embedder and the predictor are redesigned to fit the input and label spaces of the target task, while \(\boldsymbol{\theta}_{f_0}^\mathcal{S}\) is used to initialize the encoder weights \(\boldsymbol{\theta}^\mathcal{T}_f\).
The flexibility of this architecture enables end-to-end training on the target task: all parameters of the target model are fine-tuned by minimizing the task-specific loss on the given training set:

\[\min_{\boldsymbol{\theta}^\mathcal{T}}\; \frac{1}{n_t}\sum_{i=1}^{n_t} \ell\big(g_{\boldsymbol{\theta}^\mathcal{T}}(\boldsymbol{x}_i^{t}),\, y_i^{t}\big) \qquad (1)\]
where \(\ell\) is the task loss function, e.g., cross-entropy.
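To make this concrete, here is a minimal PyTorch sketch of the architecture and the plain fine-tuning objective above (not the paper's released code; the embedder, pre-trained encoder, and predictor modules as well as `target_loader` are assumed to be supplied by the caller, and the hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class TargetModel(nn.Module):
    """g_theta = predictor(encoder(embedder(x))); the encoder is initialized
    from the source pre-trained Transformer weights theta_f0^S."""
    def __init__(self, embedder: nn.Module, pretrained_encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.embedder = embedder            # theta_e^T, redesigned for the target input space
        self.encoder = pretrained_encoder   # theta_f^T, initialized with theta_f0^S
        self.predictor = predictor          # theta_h^T, redesigned for the target label space

    def forward(self, x):
        return self.predictor(self.encoder(self.embedder(x)))

def finetune(model: TargetModel, target_loader, epochs: int = 50, lr: float = 1e-4):
    """Plain fine-tuning: minimize the task loss of Eq. 1 over all target parameters."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # one possible choice of the task loss l
    for _ in range(epochs):
        for x_t, y_t in target_loader:
            opt.zero_grad()
            criterion(model(x_t), y_t).backward()
            opt.step()
    return model
```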
Learning directly from target supervision in this way encourages the model to learn knowledge that helps discriminate the target data. Since the pre-trained model already contains discriminative knowledge of the source domain, cross-modal transfer naturally expects the knowledge of the source and target domains to be similar in some way, so that the source knowledge can be reused to facilitate target learning. Next, the paper 1) conducts experiments showing that this similarity depends on the modality, and 2) provides an interpretation of modality knowledge and formalizes the knowledge discrepancy.
Detailed Explanation of the Model Architecture

The modality-specific embedders and predictors follow the design in ORCA.
The structure of a modality-specific embedder depends on whether the task is 2D or 1D (a code sketch follows this list).

- For 2D tasks, the embedder consists of a linear projection layer and a LayerNorm operation. For input data of size \(C\times H\times W\), where \(C\), \(H\), and \(W\) denote the number of channels, the height, and the width respectively, the input is first resized to \(C\times 224^2\) and divided into \(N\) patches of size \(C\times 4^2\); the linear projection layer then maps each patch to a token of size \(128\), and the LayerNorm operation is applied to all projected patches. Thus, the embedder can be represented as a function \(e_{2D}: \mathbb R^{N\times 16C}\to\mathbb R^{N\times 128}\).
- For 1D tasks, the embedder consists of a linear projection layer, a LayerNorm operation, and learnable positional embeddings. For input data of size \(C\times L\), where \(C\) and \(L\) denote the number of channels and the sequence length respectively, the input is first divided into \(N\) blocks of size \(C\times \frac{L}{N}\); the linear projection layer then maps each block to a token of size \(768\), the LayerNorm operation is applied to all projected blocks, and finally the positional embeddings are added to the tokens. Thus, the embedder can be represented as a function \(e_{1D}: \mathbb R^{CL}\to\mathbb R^{768N}\).
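As an illustration of the 2D embedder described above, the following is a hedged reimplementation sketch (not the ORCA source code); it resizes the input to \(224^2\), splits it into \(4\times 4\) patches, projects each patch to a 128-dimensional token, and applies LayerNorm:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Embedder2D(nn.Module):
    """Sketch of the 2D embedder: resize -> 4x4 patches -> linear projection -> LayerNorm."""
    def __init__(self, in_channels: int, patch_size: int = 4, embed_dim: int = 128):
        super().__init__()
        self.patch_size = patch_size
        # Each patch holds patch_size^2 * C values (16C for 4x4 patches).
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> resized to (B, C, 224, 224)
        x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
        # Non-overlapping patches: (B, C*p*p, N) with N = (224 / patch_size)^2
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size)
        patches = patches.transpose(1, 2)          # (B, N, C*p*p)
        return self.norm(self.proj(patches))       # (B, N, 128)
```

The 1D embedder is analogous, projecting length-\(\frac{L}{N}\) blocks to 768-dimensional tokens and adding learnable positional embeddings.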
The structure of the modality-specific predictor depends on whether the task is classification or dense prediction (see the sketch after this list).

- For classification tasks, the predictor consists of an average pooling layer and a linear projection layer. The average pooling layer averages the dense feature map of size \(N'\times d\) into a single feature of size \(d\), and the linear projection layer then maps this feature to logits of size \(K\), where \(d\) and \(K\) denote the feature dimension and the number of classes respectively. Thus, the predictor can be represented as a function \(h_{c}: \mathbb R^{N'd} \to \mathbb R^K\).
- For dense prediction tasks, the predictor consists of a linear projection layer, a pixel-shuffle operation, and two adaptive pooling layers. The linear projection layer takes a dense feature map of size \(7^2\times d\) as input and outputs features of size \(7^2 \times 3072\), which are then pixel-shuffled into the shape \(224^2 \times 3\). Next, the two pooling operations are applied sequentially, changing the feature size from \(3 \times 224^2\) to \(K \times 224^2\) and finally to \(K \times H \times W\), matching the spatial dimensions of the input. Thus, the predictor can be represented as a function \(h_{d}:\mathbb R^{49d}\to\mathbb R^{KHW}\).
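A corresponding sketch of the two predictor heads, under the shape assumptions stated above (again an illustrative reimplementation, not the ORCA source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationPredictor(nn.Module):
    """Average-pool N' x d dense features into one d-vector, then project to K logits."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(feats.mean(dim=1))          # (B, N', d) -> (B, K)

class DensePredictor(nn.Module):
    """Project 7x7 x d features to 7x7 x 3072, pixel-shuffle to 3 x 224 x 224,
    then adaptively pool to K channels and the input spatial size (H, W)."""
    def __init__(self, feat_dim: int, num_classes: int, out_hw: tuple):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 3072)
        self.shuffle = nn.PixelShuffle(upscale_factor=32)   # 3072 = 3 * 32^2, 7 * 32 = 224
        self.num_classes = num_classes
        self.out_hw = out_hw

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b = feats.shape[0]
        x = self.proj(feats)                                 # (B, 49, 3072)
        x = x.transpose(1, 2).reshape(b, 3072, 7, 7)
        x = self.shuffle(x)                                  # (B, 3, 224, 224)
        # Channel pooling 3 -> K, then spatial pooling 224^2 -> (H, W).
        x = F.adaptive_avg_pool1d(x.flatten(2).transpose(1, 2), self.num_classes)
        x = x.transpose(1, 2).reshape(b, self.num_classes, 224, 224)
        return F.adaptive_avg_pool2d(x, self.out_hw)         # (B, K, H, W)
```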
Distortion of learned source modality knowledge
The paper seeks a way to quantitatively compare the degree of knowledge reuse in different cross-modal transfer scenarios. The image modality is chosen as the knowledge source, and four target tasks are selected from different modalities: two tasks closely related to images, CIFAR-100 and Spherical (which contains spherically projected images), and two tasks dissimilar to the image modality, NinaPro (hand gestures) and FSD50K (audio clips of sound events). Specifically, the Swin Transformer Base pre-trained on ImageNet-22k is used as the source model, and its properties are examined after fine-tuning on the different tasks.
Since the comparison is made across modalities, there is no common metric for the degree of knowledge reuse during transfer, so the comparison is instead based on the degree of distortion of the source knowledge: if more source knowledge is reused to solve the target task, the distortion is considered smaller, and vice versa. Specifically, the pre-trained source model is used to extract visual representations of CIFAR-10, a surrogate image dataset not seen by the model. The samples of this source dataset are denoted \(\{\boldsymbol{x}^s_i,y^s_i\}\) and their corresponding features \(\{\boldsymbol{f}^s_i = f(e(\boldsymbol{x}^s_i;\boldsymbol{\theta}_{e_0}^\mathcal{S});\boldsymbol{\theta}_{f_0}^\mathcal{S})\}\). The pre-trained model is then fine-tuned on each of the four target tasks using Equation 1. After fine-tuning, the fine-tuned encoder is again used to extract CIFAR-10 representations \(\{\boldsymbol{f}^s_i(\mathcal{M}_t) = f(e(\boldsymbol{x}^s_i;\boldsymbol{\theta}_{e_0}^\mathcal{S});\boldsymbol{\theta}_f^\mathcal{T}, \mathcal{M}_t)\}\).
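As a hedged sketch of this probing setup (loader construction and the fine-tuned checkpoints are assumed to exist; the names below are illustrative):

```python
import torch

@torch.no_grad()
def extract_source_features(source_embedder, encoder, loader, device="cuda"):
    """Extract CIFAR-10 features with the frozen source embedder theta_e0^S and a given
    encoder (either the pre-trained theta_f0^S or an encoder fine-tuned on a target task)."""
    source_embedder.eval().to(device)
    encoder.eval().to(device)
    feats, labels = [], []
    for x_s, y_s in loader:
        f = encoder(source_embedder(x_s.to(device)))
        feats.append(f.flatten(1).cpu())
        labels.append(y_s)
    return torch.cat(feats), torch.cat(labels)

# Illustrative usage: compare features before / after fine-tuning on one target task,
# then visualize each feature set with t-SNE to obtain plots like Figure 2.
# feats_pre,  y = extract_source_features(src_embedder, pretrained_encoder, cifar10_loader)
# feats_post, y = extract_source_features(src_embedder, finetuned_encoder,  cifar10_loader)
```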
Figure 2 shows t-SNE visualizations of the five resulting sets of CIFAR-10 image features. The figure shows that encoders fine-tuned on CIFAR-100 or Spherical maintain or even improve the distinguishability of the CIFAR-10 image samples, while encoders fine-tuned on the NinaPro and FSD50K target modalities fail to extract class-discriminative features for images. Since fine-tuning on a target modality drives the encoder to classify the target data and learn the target discriminant function, this observation suggests that the knowledge required to discriminate CIFAR-100 and Spherical samples is more consistent with the knowledge required for CIFAR-10 samples than that of the latter two modalities. This conclusion matches intuition, since CIFAR-100 is a visual dataset, Spherical is derived from natural images, and NinaPro and FSD50K have low correlation with images.
Viewed the other way around, the results show that CIFAR-100 and Spherical can better reuse the source knowledge in the pre-trained encoder to solve their tasks, whereas NinaPro and FSD50K require greater adaptation of the encoder to fit the target task.
To quantify the reuse (or distortion) of source knowledge during cross-modal transfer more fully, a linear probe on CIFAR-10 is used to assess the quality of the representations extracted by encoders fine-tuned on different target modalities, considering separately: 1) different fine-tuned target modalities, 2) different numbers of training epochs, and 3) different transfer methods. In addition to plain fine-tuning, the following two baselines are considered (a linear-probing sketch follows the list):
- ORCA adds an embedder training stage before fine-tuning. The first stage updates only the target embedder parameters \(\boldsymbol{\theta}_e^\mathcal{T}\) to minimize the optimal transport dataset distance between the source and target embeddings in the shared input space \(\hat{\mathcal{X}}\) (i.e., the two distributions are made as similar as possible after the embedder).
- The paper proposes an alternative baseline modified from previous work, embedder warmup (Emb), which is also a two-stage training method. The first stage updates only the target embedder with the same task loss as plain fine-tuning, while keeping the rest of the network frozen. The second stage fine-tunes the entire network.
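The linear probe mentioned above can be realized, for example, with a simple logistic-regression classifier on the frozen features (a convenience choice for this sketch, not necessarily the paper's exact protocol):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_error(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen CIFAR-10 features and return the error rate,
    used as a proxy for how much source (image) knowledge the encoder retains."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats.numpy(), train_labels.numpy())
    return 1.0 - clf.score(test_feats.numpy(), test_labels.numpy())
```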
Figure 3 shows the linear probing error rates, with the dashed line indicating the linear probing result of the pre-trained encoder as a reference. Note that all these results are error rates on the CIFAR-10 dataset, which reflect how much source-modality knowledge the model retains; comparing performance on the target modalities is not the concern here. From the experiments, the modality is observed to have the largest impact on the linear probing results. Fine-tuning on FSD50K distorts the encoder significantly and impairs its discriminability on image data. Fine-tuning for more epochs on the target dataset leads to greater distortion of source knowledge for all target modalities except the image modality (CIFAR-100). These observations lead to the conclusion that the knowledge that distinguishes samples differs across modalities to varying degrees, which is referred to as a mismatch in modality semantic knowledge. The paper argues that a large discrepancy may impede the effectiveness of cross-modal transfer, and thus whether source-modality pre-training benefits the target modality should depend on this discrepancy.
The paper makes additional observations on how well the two-stage training methods retain source knowledge. Compared with plain fine-tuning, both ORCA and Emb achieve lower source error rates, and Emb outperforms ORCA. This suggests that, in their first training stage, the target embedders implicitly learn a mapping from \(\mathcal{X}^t\) to \(\hat{\mathcal{X}}\) that mitigates the knowledge mismatch between target and source, and consequently reduces the distortion of the model during adaptation to the target task.
Modality semantic knowledge discrepancy
The conditional distribution \(P(Y|X)\) is used to represent the semantic knowledge within a modality; it describes the relationship between the raw data space and the semantic space of the modality. This is because, for a neural network, acquiring semantic knowledge means learning a mapping from the data space to the semantic space that approximates the true conditional distribution.
However, measuring the consistency or "similarity" of this knowledge between two modalities is very challenging. The difficulty is that both the data space \(\mathcal{X}\) and the label space \(\mathcal{Y}\) differ and do not overlap between modalities. Therefore, the conditional distributions need to be modified to make them comparable across modalities. Modifying the input space is relatively easy, since the inputs can be embedded into the shared space \(\hat{\mathcal{X}}\) with the modality-specific embedders. Modifying the label space, however, is more complicated.
Considering that the source modalities (e.g., vision and language) have large pre-trained models and are semantically rich, the paper makes the following assumption: the cardinality of the source label space is larger than that of the target label space, i.e., \(|\mathcal{Y}^s| > |\mathcal{Y}^t|\).
This assumption is easily met in practice. For example, a vision Transformer trained on ImageNet can learn a discriminant function over one thousand categories, whereas an ECG classification task considers only four categories. Based on this assumption, a subset of the source label space \(\mathcal{Y}^s_{\mathcal{B}} \subset \mathcal{Y}^s\) can be chosen such that \(|\mathcal{Y}^s_{\mathcal{B}}| = |\mathcal{Y}^t|\). A class permutation \(\pi(\cdot)\) is further introduced to reorder the source classes. A new source label space can thus be defined, namely the permuted source subset \(\mathcal{Y}^s_{\pi,\mathcal{B}} \triangleq \pi(\mathcal{Y}^s_{\mathcal{B}})\). By measuring the difference between the modified conditional distribution \(P(Y^s_{\pi,\mathcal{B}}|\hat{X})\) and \(P(Y^t|\hat{X})\), the degree of alignment of modality semantic knowledge can be formalized as follows:
Given a source modality \(\mathcal{M}^s\) and a target modality \(\mathcal{M}^t\) satisfying the assumption, let \(\hat{\mathcal{X}}\) be the shared input space produced from the raw data spaces by the modality-specific embedders, and let \(P(Y^s|\hat{X})\) and \(P(Y^t|\hat{X})\) be the conditional distributions of the source and target modalities, respectively. The modality semantic knowledge discrepancy between the two modalities is then

\[D(\mathcal{M}^s, \mathcal{M}^t) \triangleq \min_{\mathcal{B},\,\pi}\; d\big(P(Y^s_{\pi,\mathcal{B}}|\hat{X}),\, P(Y^t|\hat{X})\big),\]

where \(d(\cdot,\cdot)\) is an arbitrary discrepancy measure between two conditional distributions.
The definition essentially states that the knowledge discrepancy is considered small if an optimal subset of source semantics, together with an appropriate one-to-one matching between source and target semantics, can be found such that its conditional distribution is similar to that of the target modality. In other words, within that subset, the source model should be able to distinguish target samples as correctly as it identifies source samples.
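The paper's concrete approximation algorithm is not detailed here, but the definition can be approximated along these lines: summarize each class's conditional distribution on the shared embedding space, then jointly search for the best class subset and one-to-one matching. The sketch below uses Hungarian matching and a squared-distance cost as one such hypothetical instantiation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def knowledge_discrepancy(p_src: np.ndarray, p_tgt: np.ndarray) -> float:
    """Approximate min over subsets B and permutations pi of
    d(P(Y^s_{pi,B} | X_hat), P(Y^t | X_hat)).

    p_src: (Ks, D) per-class summaries of the source conditional (Ks >= Kt)
    p_tgt: (Kt, D) per-class summaries of the target conditional
    d is taken here as a mean squared distance; other divergences could be used.
    """
    # Pairwise cost between every target class and every source class.
    cost = ((p_tgt[:, None, :] - p_src[None, :, :]) ** 2).mean(-1)   # (Kt, Ks)
    # Hungarian matching selects the subset and the one-to-one matching jointly.
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())
```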
With this definition, the paper uses an approximation algorithm to compute the modality semantic knowledge discrepancy between the image modality and the four target tasks. As shown in Figure 4, the result is consistent with the previous observations: the four tasks indeed exhibit different levels of knowledge discrepancy, and FSD50K is the modality least similar to the image modality.
Modality Knowledge Alignment
Having found that modality knowledge may be poorly aligned, with insufficient reuse of source knowledge as a consequence, the paper proposes a new method, MoNA; the complete procedure is shown in Algorithm 1. The method improves modality knowledge alignment and thereby enhances cross-modal transfer.
Embedder Warmup
In the previous experiments, the paper found that embedder warmup, despite its simple training objective, retains source knowledge better than the other methods. Its performance on the target modality was then tested, and it again outperformed its counterparts. The paper argues that during embedder warmup, in order to minimize the task loss, the embedder is explicitly forced to project the target raw inputs into embeddings that are distinguishable by the frozen source model, which extracts features based on the source knowledge.
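A minimal sketch of the embedder warmup stage, reusing the hypothetical `TargetModel` from the earlier fine-tuning sketch (only the target embedder receives gradients):

```python
import torch
import torch.nn as nn

def embedder_warmup(model, target_loader, epochs: int = 5, lr: float = 1e-4):
    """Stage 1: train only the target embedder with the task loss, keeping the
    source-initialized encoder and the predictor frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.embedder.parameters():
        p.requires_grad_(True)

    opt = torch.optim.AdamW(model.embedder.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_t, y_t in target_loader:
            opt.zero_grad()
            criterion(model(x_t), y_t).backward()
            opt.step()
    return model   # stage 2 then fine-tunes all parameters from this initialization
```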
Combined with the previous analysis, the paper hypothesizes that the key to effective transfer is learning a target embedding function \(e^\mathcal{T}: \mathcal{X}\to \hat{\mathcal{X}}\) that makes the target conditional distribution \(P(Y^t|\hat{X})\) better aligned with the source knowledge. Therefore, the paper proposes learning this embedding function before the full fine-tuning process.
Learning to Align Modality Knowledge
Directly using the modality knowledge discrepancy as an optimization objective is difficult, because the target conditional probabilities cannot be estimated without a trained model. As an alternative, a meta-learning procedure is used to simulate the process in Figure 3 and to optimize the representation quality of the source data after fine-tuning. Specifically, an ideal target embedder aligns the modality knowledge such that the encoder maintains its discriminability on image data during target fine-tuning. Thus, if a source dataset is used to evaluate a fine-tuned encoder whose training started from this ideal target embedder, the minimum error on the source data will be obtained.
This is a standard bilevel optimization problem widely studied in meta-learning. In the current scenario, the outer loop updates the target embedder based on the outer-loop loss, which is computed through the target encoder obtained from the inner-loop optimization. Figure 5(a) illustrates a single outer-loop update of the embedder parameters \(\boldsymbol{\phi}_e\) during meta-learning, and Figure 5(b) illustrates the bilevel optimization process.
More specifically, the inner loop optimizes the model on the target dataset, subject to the constraint that the target embedder is initialized from \(\boldsymbol{\phi}_e\):

\[\boldsymbol{\theta}^{\mathcal{T}^*}(\boldsymbol{\phi}_e) = \arg\min_{\boldsymbol{\theta}^{\mathcal{T}}} \mathcal{L}_{inner}(\boldsymbol{\theta}^{\mathcal{T}}), \quad \text{s.t. } \boldsymbol{\theta}^{\mathcal{T}}_e \text{ initialized as } \boldsymbol{\phi}_e,\]

where \(\mathcal{L}_{inner}\) is the same loss function as in Equation 1.
This inner-loop optimization simulates the full fine-tuning process of the second stage and returns an encoder adapted to the target modality. Note that the optimization of the entire target model in the inner loop depends on the initialization of the target embedder, hence \(\boldsymbol{\theta}^{\mathcal{T}^*}(\boldsymbol{\phi}_e) = \{ \boldsymbol{\theta}^{\mathcal{T}^*}_e(\boldsymbol{\phi}_e),\boldsymbol{\theta}^{\mathcal{T}^*}_f(\boldsymbol{\phi}_e),\boldsymbol{\theta}^{\mathcal{T}^*}_h(\boldsymbol{\phi}_e) \}\).
The outer loop is an optimization problem over the target embedder; the goal is to find the optimal embedder parameters \(\boldsymbol{\phi}_e^*\) such that the resulting optimal target encoder \(\boldsymbol{\theta}^{\mathcal{T}^*}_f(\boldsymbol{\phi}_e^*)\) produces high-quality representations of the source data. To compute this loss, a small labeled dataset from the source modality \(\{\boldsymbol{x}_i^s, y_i^s\}\) is used as a proxy, and its features are computed as \(\{\boldsymbol{f}^s_i = f(e(\boldsymbol{x}^s_i;\boldsymbol{\theta}_{e_0}^\mathcal{S});\boldsymbol{\theta}_f^{\mathcal{T}^*}(\boldsymbol{\phi}_e))\}\). These features are then normalized onto the unit sphere, and the alignment and uniformity of the source features are measured: the alignment loss measures how close features from the same class are, while the uniformity loss measures whether features from different classes are uniformly distributed on the sphere.
The outer-loop objective \(\mathcal{L}_{outer}\), which measures the discriminability of the encoder on the source modality, combines these alignment and uniformity losses on the source features.
It is worth noting that the source knowledge cannot be preserved well at the beginning of embedder training. To prevent the embedder from focusing too much on the source modality and to keep the optimization stable, the source and target objectives are minimized jointly, with a trade-off parameter \(\lambda\) balancing source and target knowledge learning; the combined objective is denoted \(\mathcal{L}_{outer}^{'}\).
In practice, a simplified single-step update is used in the inner loop, which makes it possible to reuse the loss \(\mathcal{L}_{inner}\) computed during the simulated inner-loop update to efficiently evaluate the combined objective \(\mathcal{L}_{outer}^{'}\). Accordingly, in the first stage of MoNA, the target embedder is updated by gradient descent on \(\mathcal{L}_{outer}^{'}\).
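Putting the pieces together, below is a heavily simplified single-inner-step sketch of MoNA's first stage. It assumes the `TargetModel` from the earlier sketches, a frozen source embedder `src_embedder` (\(\boldsymbol{\theta}_{e_0}^\mathcal{S}\)), and an optimizer `embedder_opt` over the target embedder parameters only; the alignment/uniformity losses follow the common unit-sphere formulation, and the exact combination and schedule in the paper may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

def alignment_uniformity_loss(feats, labels, t=2.0):
    """Discriminability of source features on the unit sphere: alignment pulls
    same-class features together, uniformity spreads all features apart."""
    z = F.normalize(feats, dim=-1)
    dist2 = torch.cdist(z, z).pow(2)
    same = labels[:, None] == labels[None, :]
    return dist2[same].mean() + torch.log(torch.exp(-t * dist2).mean())

def mona_outer_step(model, src_embedder, x_t, y_t, x_s, y_s,
                    embedder_opt, lam=1.0, inner_lr=1e-3):
    """One meta-update of the target embedder phi_e with a single-step inner loop."""
    criterion = nn.CrossEntropyLoss()
    params = dict(model.named_parameters())

    # Inner loop: one simulated fine-tuning step on the target task, keeping the graph
    # so the adapted parameters stay differentiable functions of the embedder parameters.
    l_inner = criterion(functional_call(model, params, (x_t,)), y_t)
    grads = torch.autograd.grad(l_inner, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Outer loss: source-representation quality under the adapted encoder,
    # reached through the frozen source embedder theta_e0^S.
    enc_params = {k[len("encoder."):]: v for k, v in adapted.items() if k.startswith("encoder.")}
    f_s = functional_call(model.encoder, enc_params, (src_embedder(x_s),))
    l_outer = alignment_uniformity_loss(f_s.flatten(1), y_s)

    # Combined objective with trade-off lambda (the paper's exact weighting may differ);
    # the reused inner loss balances target learning against source-knowledge retention.
    l_total = l_outer + lam * l_inner
    model.zero_grad(set_to_none=True)
    l_total.backward()
    embedder_opt.step()     # updates only the target embedder parameters
    return float(l_total.detach())
```

The single-step inner loop keeps the meta-gradient tractable; the second stage then simply runs plain fine-tuning starting from the learned embedder.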
With the modality knowledge better aligned, MoNA then performs plain fine-tuning in the second stage.
Experiments
If this article was helpful to you, please give it a like or a "Looking" ~~

For more content, please follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].