Paper discussion: Training-Free Model Merging for Multi-target Domain Adaptation
- Paper address: https://arxiv.org/abs/2407.13771
- Code: /ModelMerging
Innovations
- A systematic exploration of mode connectivity in domain-adapted scene understanding models, revealing the conditions under which model merging is effective.
- A model merging technique for multi-target domain adaptation, consisting of parameter merging and buffer merging, which can be applied to models produced by any single-target domain adaptation method.
- Performance comparable to training on the combined datasets of all targets, achieved even when access to the data is restricted.
Content overview
The paper investigates multi-target domain adaptation (MTDA) of scene understanding models. While previous approaches have achieved promising results through inter-domain consistency losses, they typically assume unrealistic simultaneous access to images from all target domains, ignoring issues such as data transfer bandwidth limits and data privacy. In light of these challenges, the paper asks: how can models independently adapted to different domains be merged without direct access to the training data?
The solution consists of two parts: merging model parameters and merging model buffers (i.e., the statistics of the normalization layers). For parameter merging, an empirical analysis of mode connectivity shows, somewhat unexpectedly, that linear merging is sufficient for separate models trained from the same pre-trained backbone weights. For buffer merging, a Gaussian prior is used to model the real-world distribution, and new statistics are estimated from the buffers of the separately trained models.
The paper's approach is simple yet effective, achieving performance comparable to the baseline trained on the combined data while eliminating the need to access the training data.
Method
Previous methods rely on the unrealistic assumption that all target-domain images can be accessed simultaneously during the adaptation phase. In contrast, the paper's pipeline consists of two distinct phases:
- Single-target domain adaptation phase, where a model adapted to each target domain is trained separately. The state-of-the-art unsupervised domain adaptation method HRDA is simply used, with various backbone architectures such as ResNet and Vision Transformer.
- Model merging phase (the main focus), where these adapted models are merged into a single robust model without accessing any training data. The method handles two kinds of model state (see the sketch after this list): the parameters (i.e., the weights and biases of the learnable layers) and the buffers (i.e., the running statistics of the normalization layers).
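To make the parameter/buffer distinction concrete, here is a minimal PyTorch illustration (a toy model for exposition only, not the paper's code or architecture):

```python
import torch.nn as nn

# A toy network standing in for an adapted segmentation backbone:
# one learnable conv layer followed by a batch-norm layer.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8))

# Parameters: weights and biases of learnable layers (handled by parameter merging).
for name, p in model.named_parameters():
    print("parameter:", name, tuple(p.shape))

# Buffers: running statistics of normalization layers (handled by buffer merging).
for name, b in model.named_buffers():
    print("buffer:   ", name, tuple(b.shape))
```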
Parameter merging
Through comparative experiments, the paper finds that when starting from the same pre-trained weights, the domain adaptation models can adapt to diverse target domains while remaining linearly mode-connected in parameter space. Thus, a simple midpoint merge of the adapted models yields a model that is robust in both domains.
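A minimal sketch of this midpoint merge, assuming two adapted models with identical architectures fine-tuned from the same pre-trained initialization (`merge_parameters` is an illustrative helper, not the authors' released code):

```python
import torch

def merge_parameters(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts; alpha = 0.5 gives the midpoint merge."""
    merged = {}
    for name, w_a in state_a.items():
        w_b = state_b[name]
        if torch.is_floating_point(w_a):
            merged[name] = (1.0 - alpha) * w_a + alpha * w_b
        else:
            # Integer entries (e.g. BN num_batches_tracked) are not interpolated;
            # keep one copy here and let the buffer-merging step handle them.
            merged[name] = w_a.clone()
    return merged

# Usage with two single-target adapted models:
# merged_state = merge_parameters(model_a.state_dict(), model_b.state_dict())
# merged_model.load_state_dict(merged_state)
```

Note that BN running means and variances are also float tensors and would be averaged here; the paper instead replaces them with the Gaussian-prior estimate described in the buffer merging section below.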
Buffer merging
The buffers, i.e., the running means and variances of the batch normalization (BN) layers, are closely tied to the domain, since they encapsulate domain-specific statistics. While existing methods mainly deal with merging models trained on different subsets of the same domain, the paper studies merging two models trained on completely different target domains, which makes buffer merging less straightforward.
The BN layer was introduced to mitigate internal covariate shift, where the mean and variance of the inputs change as they pass through the learnable layers. In this context, the basic consideration is that the subsequent learnable layers expect the output of the merged BN layer to follow a normal distribution. Since the BN layer preserves the inductive bias that its inputs follow a Gaussian prior, the merged statistics can be estimated from the means \(\boldsymbol{\mu}^{(i)}\) and variances \([\boldsymbol{\sigma}^{(i)}]^2\) obtained from the buffers \(\mathbf{\Gamma}_A\) and \(\mathbf{\Gamma}_B\). Concretely, each buffer provides the mean and variance of a set of data points drawn from that Gaussian prior; these values, together with the sizes of the two sets, are jointly used to estimate the parameters of the merged distribution.
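Under this reading, writing \(n^{(A)}\) and \(n^{(B)}\) for the numbers of batches tracked by the two models, the pooled estimate amounts to the following (a hedged reconstruction consistent with the weighted averaging described next, not a formula quoted from the paper; the paper's variance estimate may additionally account for the gap between the two means):

\[
\boldsymbol{\mu} = \frac{n^{(A)}\,\boldsymbol{\mu}^{(A)} + n^{(B)}\,\boldsymbol{\mu}^{(B)}}{n^{(A)} + n^{(B)}}, \qquad
\boldsymbol{\sigma}^{2} = \frac{n^{(A)}\,[\boldsymbol{\sigma}^{(A)}]^{2} + n^{(B)}\,[\boldsymbol{\sigma}^{(B)}]^{2}}{n^{(A)} + n^{(B)}}
\]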
When extending the merging method to \(m\) (\(m \geq 2\)) Gaussian distributions, the merged statistics are computed as the averages of the means \(\boldsymbol{\mu}^{(i)}\) and of the variances \([\boldsymbol{\sigma}^{(i)}]^2\), weighted by the numbers of tracked batches \(n^{(i)}\).
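A minimal PyTorch sketch of this weighted buffer merge for \(m\) models (an illustrative reading of the averaging above; `merge_bn_buffers` is a hypothetical helper, not the authors' implementation), writing the merged statistics into the first model:

```python
import torch
import torch.nn as nn

def merge_bn_buffers(models: list) -> nn.Module:
    """Merge BatchNorm running statistics of m adapted models (identical
    architectures) into models[0], weighting each model's statistics by its
    number of tracked batches n^(i)."""
    bn_per_model = [
        [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
        for model in models
    ]
    for layers in zip(*bn_per_model):          # corresponding BN layers across models
        n = torch.stack([bn.num_batches_tracked.float() for bn in layers])
        w = n / n.sum()                        # weights n^(i) / sum_j n^(j)
        mean = sum(w_i * bn.running_mean for w_i, bn in zip(w, layers))
        var = sum(w_i * bn.running_var for w_i, bn in zip(w, layers))
        target = layers[0]                     # BN layer of models[0]
        target.running_mean.copy_(mean)
        target.running_var.copy_(var)
        target.num_batches_tracked.copy_(n.sum().long())
    return models[0]
```

In practice, these merged buffers would be combined with the midpoint-merged parameters from the previous section to form the final multi-target model.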
Main experiments
If this article is helpful to you, please give it a like or a "Wow"~
For more content, please follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].