Good weight initialization can effectively reduce the training cost of deep neural networks (DNNs). However, choosing how to initialize the parameters is challenging and may require manual tuning, which is both time-consuming and error-prone. To address these limitations, the paper builds a weight generator that synthesizes weight initializations for neural networks. Image-to-image translation with a generative adversarial network (GAN) is used as the example task, since collecting model weights for this setting is relatively easy. Specifically, a dataset covering various image-editing concepts and their corresponding trained weights is first collected and then used to train the weight generator. To cope with the differing characteristics across layers and the large number of weights to be predicted, the weights are divided into equal-sized blocks and each block is assigned an index. A diffusion model is then trained on this dataset, conditioned on text (i.e., the concept) and the block index. By initializing the image translation model with the denoised weights predicted by the diffusion model, training takes only 43.3 seconds. Compared with training from scratch (i.e., Pix2pix), this achieves a \(15\times\) speedup in training time for new concepts while obtaining better image generation quality.
Paper: Efficient Training with Denoised Neural Weights
- Paper Address: /abs/2407.11966
- Paper Code: /denoised-weights
Introduction
Efficient training of deep neural networks (DNNs) not only speeds up model development but also reduces the computational resources and cost required. Many previous studies have explored efficient training strategies such as sparse training and low-bit training. However, efficient training is often hindered by the challenge of initializing model weights well. Although some progress has been made on weight initialization, identifying suitable schemes for different tasks remains difficult, and tuning the initialization parameters can be time-consuming and error-prone, leading to poor performance and increased training time.
To address these challenges, and inspired by recent advances in HyperNetworks, the paper investigates for the first time the feasibility of building a weight generator that provides good weight initializations across different tasks, reducing the training time and resource consumption required to obtain a well-trained DNN. The image-to-image translation task with GAN models is used as the example for designing the neural weight prediction. Note that the framework is a general design and is not limited to generating GAN weights; this example was chosen because a large number of diverse weights trained on different datasets is easy to obtain.
More specifically, the weight generator can predict initialization weights for unseen new concepts and styles. To reduce the number of weights to be predicted, low-rank adaptation (LoRA) is applied to the image generation model, drastically reducing the number of model parameters while maintaining high-quality image generation. Since the GAN model consists of different types of layers with different weight sizes and counts, the weights are grouped and divided into blocks of equal size. The diffusion process is used to model the trained weight space of the GAN, and a diffusion model, i.e., the weight generator, is trained for weight estimation. To improve its performance, the block index is further used as a conditioning mechanism: a sinusoidal positional encoding scheme computes an embedding of the block index, which gives the weight generator information about the position of each weight block among all model weights. Once the weight generator is obtained, to train a GAN-based image translation model for a new concept, the weight generator is queried with a fast single-step denoising process, and the predicted weights are used to initialize the GAN. The GAN then only requires an efficient fine-tuning step to produce high-quality image generation results, significantly reducing the time needed to obtain models for novel, unseen concepts.
The paper's contributions are summarized below:
- A framework is proposed for generating weight initializations for different concepts/styles, enabling efficient training of GAN models for image translation.
- With the help of diffusion models (i.e., for preparing paired image datasets), a ground-truth dataset of LoRA weights for a large number of different concepts/styles is collected, laying the foundation for training the weight generator.
- An efficient weight generator design based on the diffusion process is introduced, which takes textual concept information and block indexes as input. To handle different layer types and weight shapes, the weights are organized into one-dimensional blocks of equal size, which significantly reduces computational overhead. The block index embedding is combined with the time step embedding, seamlessly integrating block indexes into the weight generator design, so the weight generator holds information about the position of each weight block among all model weights.
- The proposed framework can predict the initialized neural weights of the GAN model with a single denoising step, which takes only \(1.19\) seconds. Using the predicted weights for initialization, a fast fine-tuning process of \(42.1\) seconds suffices to convey the target style. Compared to training from scratch (i.e., Pix2pix), the total training time is reduced by \(15\times\) while maintaining better image generation quality. Compared to other efficient training methods, it saves \(4.6\times\) training time.
Motivations and Challenges
Effective weight initialization is crucial for stable training and can promote faster learning, accelerate convergence, and enhance generalization. However, determining a good weight initialization for different tasks remains challenging. Inspired by recent progress on hypernetworks (HyperNetwork), the paper investigates whether a weight generator can be constructed to obtain good weight initializations, thereby reducing training time and resource consumption. Unlike popular image/video generation, relatively little research has explored weight generation. Building such a weight generator is promising but challenging.
The first major challenge comes from the different types of layers in DNN architectures. The weights of each layer have different sizes and shapes, which requires a weight generation method that can accommodate this heterogeneity. Second, the weight generator must be able to efficiently generate a large number of parameters to ensure comprehensive coverage of the network. Third, the inference process of the weight generator should be fast and efficient to save time when obtaining weights for new tasks.
Addressing these challenges promises more efficient and effective DNN training. Therefore, this paper investigates the construction of weight generators for better weight initialization. The paper aims to show that weight generation is not limited to initializing a single model architecture on a specific dataset, such as ResNet-18 on CIFAR-10, but applies to multiple models for different tasks. To this end, weight generation for initializing GANs in image-to-image translation tasks is used as the example, since collecting diverse datasets for GAN training is relatively easy; however, the approach is not limited to the GAN architecture or image-to-image translation tasks.
Method
The goal is to train a weight generator that predicts weight initializations for different tasks. Taking GANs applied to the image-to-image translation task as an example, whenever a new concept/style emerges, the weight generator can be queried to provide the weight values needed for initialization. The weight generator is modeled with a diffusion process, as shown in Figure 1.
Unlike image diffusion models, which recover a clean image from pure noise, this framework aims to transform noise into weight values for initialization. By plugging in the predicted weight values, a fast fine-tuning process achieves efficient training of the GAN model for the target style. The core of the framework is the design of the weight generator.
Dataset Collection
Efficiently training a weight generator that produces GAN weight initializations for different concepts requires collecting a large-scale dataset of ground-truth weight values. To obtain such a dataset, a large-scale prompt dataset is particularly important. Using the concepts/styles in the prompt dataset, a diffusion model can be used to collect a rich set of representative images for each target concept. The images of each concept/style are then used to train the GAN and obtain the ground-truth GAN weights.
As the basis of data preparation for weight generator training, the prompt dataset should include diverse visual concepts/styles so that the weight generator can learn comprehensive representations for initializing task-specific GANs. However, collecting such a dataset is challenging: ensuring diversity and representativeness across different concepts/styles requires a large amount of data. The collected prompts are then used to generate images of the target concepts/styles with a diffusion model.
To construct the prompt dataset for training a reliable weight generator for GAN weight initialization, a systematic approach is used that combines large language models (LLMs) for style generation and augmentation, ensuring rich and diverse concept expression. Three broad categories are first outlined: 1) artistic concepts, 2) character concepts, and 3) facial modification concepts. For each category, a large language model (ChatGPT-3.5) is asked to generate a series of textual descriptions covering various concepts. Redundant concepts/styles are filtered out, and another large language model (Vicuna) is queried for augmentation, providing concepts/styles with similar meanings but different phrasings. To further enrich the prompt dataset, concepts/styles are also combined across categories. This process yields a large-scale prompt dataset that not only covers diverse conceptual domains but also captures subtle stylistic differences, providing a better basis for training the weight generator for weight initialization.
After the prompt dataset is collected, real images are edited with the diffusion model to obtain an edited image for each concept/style in the prompt dataset, forming the paired images used for GAN training. A generator mixing ResNet blocks and Transformer blocks (E2GAN) is used as the training model; this effective hybrid architecture lets the framework demonstrate generation capability over different types of layers. After the GAN training process, a dataset of GAN weight checkpoints is built for the different concepts/styles. To further augment the weight value dataset, after the FID metric converges, \(K\) checkpoints are saved for each concept/style.
Data Format Design for Weight Generator
To train a weight generator that can efficiently generate GAN weight initializations for different concepts, designing the weight data format for training and inference is very important. The goal is that whenever a new concept is given as input, the weight generator can produce weight initializations for all layers for that concept. Considering that the model contains many different types of layers, such as fully connected (FC), convolutional (CONV), and batch normalization (BN) layers, with different sizes and dimensionalities, designing an appropriate data format is critical and challenging. In addition, the number of weights in the GAN model is typically in the millions, which creates further challenges for the data format design.
The more weights there are to predict, the harder the weight generator's task becomes. To mitigate this, low-rank adaptation (LoRA) is applied to the different layers to significantly reduce the number of weights to be predicted. For example, for a convolutional layer \(i\) with weight \(\mathbf{w}_i \in \mathbb{R}^{c\times f \times k_h\times k_w}\), two low-rank matrices with rank \(r_i\) are applied: \(\mathbf{w}_{i}^A \in \mathbb{R}^{c\times r_i \times k_h \times k_w}\) as the LoRA down-projection and \(\mathbf{w}_{i}^B \in \mathbb{R}^{r_i \times f \times 1\times1}\) as the LoRA up-projection, approximating the weight change. This reduces the total number of weights to be predicted from 7.06M to 0.22M. Fine-tuning only the LoRA weights is sufficient to shift the generative domain of the GAN model while greatly reducing the number of weights. Even so, directly predicting all 0.22M weights at once with the weight generator remains challenging: it would require a large weight generator and impose a huge computational and memory burden.
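To make the LoRA reduction above concrete, here is a minimal PyTorch sketch of the low-rank decomposition for one convolutional layer (the channel counts, kernel size, and rank are illustrative assumptions, not the paper's configuration):

```python
import torch

# Hypothetical example shapes: a conv layer with c=256 input channels,
# f=256 output channels, a 3x3 kernel, and LoRA rank r=8.
c, f, kh, kw, r = 256, 256, 3, 3, 8

w   = torch.randn(c, r := 256 // 32 * 0 + f, kh, kw)[:, :f]      # placeholder line removed below
```

```python
import torch

c, f, kh, kw, r = 256, 256, 3, 3, 8

w   = torch.randn(c, f, kh, kw)        # pretrained conv weight w_i (frozen, not predicted)
w_A = torch.randn(c, r, kh, kw)        # LoRA down-projection w_i^A
w_B = torch.randn(r, f, 1, 1)          # LoRA up-projection  w_i^B

# The weight change is approximated by composing the two low-rank factors:
# delta_w[c, f, h, w] = sum_r w_A[c, r, h, w] * w_B[r, f]
delta_w = torch.einsum("crhw,rf->cfhw", w_A, w_B.view(r, f))
w_adapted = w + delta_w

full = w.numel()
lora = w_A.numel() + w_B.numel()
print(f"full: {full:,}  lora: {lora:,}  ({full / lora:.1f}x fewer weights to predict)")
```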
To address this, the weights are divided into groups to reduce computational complexity and make it feasible to fit the weight generator in memory during training and inference. Since different layers have different statistical properties, the LoRA down- and up-projections of each layer \(i\) (and, if applicable, the associated BN layers) are placed into one group. Nonetheless, the number and shape of weights in each group still differ. Therefore, the weights are further flattened into one-dimensional vectors and divided into \(N\) blocks, each containing \(b\) weights.
Thus, the data format is represented as \(<n, \mathbf{w}_n, T>\), where \(n\) is the block index, \(\mathbf{w}_n \in \mathbb{R}^b\) is the flattened one-dimensional weight vector of the \(n\)-th weight block, and \(T\) is the textual prompt indicating the current concept/style. The advantages of this data format include: 1) it handles layers of different types and shapes; 2) it reduces computational complexity and prediction difficulty; 3) it makes it easier to fit the weight generator in memory.
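A minimal sketch of this blocking, assuming zero-padding when a group's flattened length is not a multiple of \(b\) (the padding choice, block size, prompt, and global index offset are illustrative assumptions, not the paper's exact procedure):

```python
import torch

def make_blocks(group_weights, block_size, prompt, start_index=0):
    """Flatten a group's weights and split them into equal-size 1-D blocks.

    group_weights: list of tensors (e.g. a layer's LoRA down/up weights and,
    if present, its BN parameters). `block_size` corresponds to b in the text.
    `start_index` lets block indices be global across all groups.
    Returns a list of (block_index, weight_block, prompt) records.
    """
    flat = torch.cat([w.reshape(-1) for w in group_weights])
    pad = (-flat.numel()) % block_size          # zero-pad to a multiple of b
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.split(block_size)
    return [(start_index + n, w_n, prompt) for n, w_n in enumerate(blocks)]

# Hypothetical usage with made-up shapes and prompt:
w_A, w_B = torch.randn(256, 8, 3, 3), torch.randn(8, 256, 1, 1)
records = make_blocks([w_A, w_B], block_size=4096, prompt="van Gogh style")
print(len(records), records[0][1].shape)        # number of blocks, block shape (b,)
```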
Weight Generator Training
The collected weight value dataset is used to train a generative model that learns to provide weight initializations for other concepts/styles. The weight initialization space of the GAN is modeled through a diffusion process. The weight generator is a UNet-based noise prediction network \(\hat{\mathbf{\epsilon}}_\theta\) with parameters \(\theta\), designed for one-dimensional vectors, as shown in Figure 2. A weight block \(\mathbf{w}_n\) drawn from the true weight distribution \(p(\mathbf{w}_n)\) is (iteratively) diffused into a noisy version, and the denoising UNet is trained to gradually reverse this process, generating weights from Gaussian noise. Training can be formalized as the following noise prediction problem:
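In standard denoising-diffusion form, with the notation defined below, this objective is likely:

\[
\min_{\theta}\; \mathbb{E}_{\mathbf{w}_n \sim p(\mathbf{w}_n),\ \epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I}),\ t,\ n,\ T}\ \big\| \hat{\epsilon}_\theta\big(\mathbf{w}_n^t,\, t,\, n,\, \tau(T)\big) - \epsilon \big\|_2^2
\]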
where \(t\) denotes the time step; \(\epsilon\) is the ground-truth noise; \(\mathbf{w}_n^t = \alpha_t \mathbf{w}_n + \sigma_t \epsilon\) is the noised weight of block \(n\); \(\alpha_t\) and \(\sigma_t\) are the signal and noise strengths, respectively, determined by the noise scheduler; and \(\tau\) is a frozen text encoder such as CLIP.
To use the block index as a further conditioning mechanism in the weight generator, the sinusoidal positional encoding commonly used in sequence-to-sequence models is adopted. A sinusoidal block index encoding is computed to give the weight generator information about the position of each weight block among all model weights. Specifically, let \(N\) denote the total number of weight blocks and \(d\) the dimension of the encoding. The sinusoidal block index encoding \(\text{SinEnc}(n, d)\) of block index \(n\) is computed as follows:
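Following the usual sequence-to-sequence convention, this encoding presumably takes the form:

\[
\text{SinEnc}(n, d)_{2i} = \sin\!\left(\frac{n}{10000^{2i/d}}\right), \qquad
\text{SinEnc}(n, d)_{2i+1} = \cos\!\left(\frac{n}{10000^{2i/d}}\right)
\]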
where \(i\) ranges from 0 to \(\left\lfloor\frac{d-1}{2}\right\rfloor\). The sinusoidal encoding is fed into an embedding layer to obtain the block index embedding \(emb\_n\), which is combined with the timestep embedding \(emb\_t\) as \(emb = emb\_n + emb\_t\) and used in each residual block of the generator. Thus, the weight generator has access to the block index throughout the denoising process. The paper observes that conditioning on the block index \(n\) allows weights from different blocks to be modeled efficiently without relying on previously predicted weights, which significantly reduces computation.
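A minimal PyTorch-style sketch of this conditioning (the module names, MLP embedding layers, and dimension are illustrative assumptions, not the paper's exact implementation):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(idx, d):
    """Standard sinusoidal encoding of an integer index tensor into R^d
    (sin and cos halves are concatenated rather than interleaved)."""
    i = torch.arange(d // 2, dtype=torch.float32, device=idx.device)
    freqs = torch.exp(-math.log(10000.0) * 2 * i / d)     # 1 / 10000^(2i/d)
    angles = idx.float().unsqueeze(-1) * freqs             # (batch, d//2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class BlockTimeConditioning(nn.Module):
    """Produces emb = emb_n + emb_t, injected into each residual block of the UNet."""
    def __init__(self, d=256):
        super().__init__()
        self.d = d
        self.block_mlp = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))
        self.time_mlp = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))

    def forward(self, block_idx, timestep):
        emb_n = self.block_mlp(sinusoidal_encoding(block_idx, self.d))
        emb_t = self.time_mlp(sinusoidal_encoding(timestep, self.d))
        return emb_n + emb_t

cond = BlockTimeConditioning(d=256)
emb = cond(block_idx=torch.tensor([3]), timestep=torch.tensor([500]))
print(emb.shape)   # torch.Size([1, 256])
```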
Fast Fine-Tuning with Generated Weight Initializations
When a new concept/style \(T\) arrives, the weight initialization can be obtained by running inference with the trained weight generator \(\hat{\epsilon}_\theta\) for each weight block \(n\). To obtain it quickly, a direct reconstruction method is used that avoids the iterative denoising process. More specifically, at a chosen noise time step \(t\), the denoising diffusion model is run to predict the noise \(\hat{\epsilon}_\theta(\mathbf{w}_n^t, t, n, \tau(T))\), and the true weights \(\mathbf{w}_{n}=\mathbf{w}_{n}^0\) are directly recovered as:
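Given the forward process \(\mathbf{w}_n^t = \alpha_t \mathbf{w}_n + \sigma_t \epsilon\) defined above, this single-step reconstruction presumably takes the form:

\[
\hat{\mathbf{w}}_{n} = \frac{\mathbf{w}_n^t - \sigma_t\, \hat{\epsilon}_\theta(\mathbf{w}_n^t,\, t,\, n,\, \tau(T))}{\alpha_t}
\]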
After running inference for all \(N\) weight blocks, the weight initialization \(\{\mathbf{w}_{n} \}_{n=1}^N\) for concept/style \(T\) is obtained.
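A minimal sketch of this single-step inference loop (the generator's call signature, the noise initialization, and the chosen time step are assumptions for illustration, not the paper's exact procedure):

```python
import torch

@torch.no_grad()
def predict_init_weights(eps_model, text_emb, num_blocks, block_size, t, alpha_t, sigma_t):
    """One-step direct reconstruction of all N weight blocks for a new concept.

    eps_model(w_t, t, n, text_emb) -> predicted noise; this signature is assumed.
    """
    blocks = []
    for n in range(num_blocks):
        w_t = torch.randn(block_size)                      # start from Gaussian noise
        eps_hat = eps_model(w_t, torch.tensor([t]), torch.tensor([n]), text_emb)
        w_0 = (w_t - sigma_t * eps_hat) / alpha_t          # direct x0 reconstruction
        blocks.append(w_0)
    return blocks   # used to initialize the LoRA weights before fast fine-tuning
```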
To better capture the details of the new concept/style, the GAN weights are further fine-tuned with a conditional GAN loss, as follows:
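In the Pix2pix style, with \(\mathbf{x}\) denoting the input image and using the definitions below, this objective is presumably of the form (adversarial term plus reconstruction term, a reconstruction rather than the paper's verbatim Eq. 4):

\[
\min_{\mathbf{w}_{lora}}\ \max_{\mathbf{w}_d}\ \ \mathbb{E}_{\mathbf{x},\, \tilde{\mathbf{x}}^T,\, \mathbf{z}} \Big[ \log \mathcal{D}\big(\mathbf{x}, \tilde{\mathbf{x}}^T\big) + \log\big(1 - \mathcal{D}(\mathbf{x}, \mathcal{G}(\mathbf{x}, \mathbf{z}))\big) \Big] \;+\; \lambda\, \mathbb{E}_{\mathbf{x},\, \tilde{\mathbf{x}}^T,\, \mathbf{z}} \Big[ \big\| \tilde{\mathbf{x}}^T - \mathcal{G}(\mathbf{x}, \mathbf{z}) \big\|_1 \Big]
\]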
where \(\tilde{\mathbf{x}}^T\) denotes the target-style image for concept \(T\) generated by the diffusion model, \(\mathcal{G}\) is the generator with original weights \(\mathbf{w}_g\) and LoRA weights \(\mathbf{w}_{lora}\), \(\mathcal{D}\) denotes the discriminator parameterized by \(\mathbf{w}_d\), \(\mathbf{z}\) is random noise introduced to increase output diversity, and \(\lambda\) adjusts the relative importance of the two loss terms.
During fine-tuning, the generator only optimizes the LoRA weights \(\mathbf{w}_{lora}\), which are initialized with the predicted \(\{\mathbf{w}_{n} \}_{n=1}^N\). By initializing from the predicted GAN weights, the same or better FID performance can be achieved with fewer training epochs. Besides fine-tuning after prediction, the paper also considers incorporating the GAN training loss in Eq. 4 into the weight prediction loss in Eq. 1. However, experiments showed that combining these two loss terms does not improve performance and instead increases the computational cost of training the weight generator.
Experiments
If this article is helpful to you, please give it a like or a "Looking" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].