
DataDream: a better tune-up — LoRA fine-tuned Stable Diffusion as a new scheme for generating training data | ECCV'24


While text-to-image diffusion models have achieved state-of-the-art results in image synthesis, they have yet to prove effective in downstream applications. Previous studies have proposed methods for generating training data for image classifiers when access to real data is limited. However, these methods struggle to generate in-distribution images or to depict fine-grained features, which hinders the generalization of classification models trained on synthetic datasets. The paper proposes DataDream, a framework for synthesizing classification datasets that, guided by a few examples of the target categories, represents the real data distribution more faithfully.

Before generating the training data, DataDream fine-tunes LoRA weights of the image generation model on the few real images. The synthetic data is then used to fine-tune LoRA weights of CLIP, improving downstream image classification performance over previous methods on a variety of datasets.

Extensive experiments demonstrate the effectiveness of DataDream: with only a few example images per class, it surpasses state-of-the-art few-shot classification accuracy on 7 of 10 datasets and achieves competitive performance on the other 3. In addition, the paper provides insights into the impact of several factors, such as the number of real and generated images and the fine-tuning compute, on model performance.

Paper: DataDream: Few-shot Guided Dataset Generation

  • Paper address: /abs/2407.10910
  • Paper code: /ExplainableML/DataDream

Introduction


The emergence of text-to-image generation models such as Stable Diffusion not only enables the creation of photo-realistic synthetic images, but also creates opportunities to augment downstream tasks. One potential application is to train or fine-tune task-specific models on synthetic data. This is particularly useful in domains where access to real data is limited, as generative models provide a cost-effective way to produce large amounts of training data. The paper investigates the impact of synthetic training data on image classification in the few-shot setting, i.e., when only a few images are available per category and collecting a full dataset would be prohibitively expensive.

Previous research has focused on using the class names of a given dataset to guide the data generation process. Specifically, a text-to-image diffusion model generates images with class names as conditional inputs. To better guide the model toward accurate depictions of the target objects, textual descriptions of each class, derived from language models or manually written class descriptions, were incorporated into the prompts. Although intuitive, these approaches leave some generated images lacking the object of interest. For example, for the ImageNet class "clothes iron", the real images show an appliance used for ironing clothes, whereas most images generated by FakeIt depict metal irons or arbitrary objects made of iron (see Figure 1, left). This occurs when the generative model misinterprets ambiguous class names or rare categories. Such inconsistency between real and synthetic images limits the informative value of the generated images for image classification and hinders performance.

To bridge the gap between real and synthetic images, real images can better inform the generative model about the characteristics of the real data distribution. For example, the concurrent DISEF method starts from a partially noised real image when generating the synthetic dataset, feeding the few-shot samples as conditions into a pre-trained diffusion model, and uses a pre-trained image captioning model to diversify the text-to-image prompts. While this improves the alignment of real and synthetic data distributions, it sometimes fails to capture fine-grained features. For example, although the real images of the "DHC-3-800" class in the aircraft dataset contain a propeller in front of the wing, the synthetic images produced by DISEF lack this detail (see Figure 1, right). Accurately representing class-distinguishing features can be crucial for classification, especially on fine-grained datasets.

To this end, the paper proposes a new approach, DataDream, which adapts the generative model using a small amount of real data. Inspired by personalized generation approaches that fine-tune generative models on a few real images depicting the same object, the approach focuses on aligning the generative model to a target dataset with multiple classes and diverse objects within each class. This contrasts with previous few-shot dataset generation approaches, which did not explore fine-tuning the generative model.

Specifically, Stable Diffusion is adapted via LoRA in two ways: \(\text{DataDream}_{\text{cls}}\), which trains one LoRA per class, and \(\text{DataDream}_{\text{dset}}\), which trains a single LoRA for the entire dataset. The paper is the first to propose adapting the generative model with few-shot data to generate synthetic training data, rather than relying on a frozen pre-trained generative model. After training, images are generated using the same prompts that were used to fine-tune DataDream, and the resulting images depict the object of interest (e.g., a clothes iron) or fine-grained features (e.g., the propeller of a DHC-3-800 aircraft), as shown in the last row of Figure 1.

Extensive experiments verify the effectiveness of DataDream, which reaches the state of the art on all datasets when using only synthetic data, and achieves the best performance on 7 of 10 datasets when training with both the real few-shot samples and synthetic data. To understand why the method works, the paper analyzes the alignment between real and synthetic data, revealing that the method aligns with the real data distribution better than the baselines. Finally, the scalability of the method is explored by increasing the number of synthetic data points and real samples, showing the potential benefits of larger datasets.

In summary, the contributions of the paper are as follows:

  1. Introduced DataDream, a novel few-shot method that improves Stable Diffusion to generate better in-distribution images for downstream training. On 10 datasets, DataDream exceeds state-of-the-art few-shot classification performance on 7 and performs comparably on the remaining 3.

  2. Emphasized the importance of reporting results using only synthetic data. When a classifier is trained on synthetic data alone, the method achieves superior performance, in some cases even outperforming classifiers trained on the few real sample images, suggesting that the generated images extract more insightful information from the small amount of real data.

  3. Investigated the method's effectiveness by analyzing the distributional alignment between synthetic and real data. Given only a few samples, the method generates synthetic data that aligns best with the real data.

Methodology


Preliminaries

  • Latent diffusion model

The method is built on Stable Diffusion, a probabilistic generative model that learns to generate realistic images from text prompts. Given data \((x,c) \in {\mathcal{D}}\), where \(x\) is an image and \(c\) is its caption, the model learns the conditional distribution \(p(x|c)\) by gradually denoising Gaussian noise in a latent space. Given a pre-trained encoder \(E\) that encodes the image \(x\) into a latent variable \(z=E(x)\), the objective function is defined as:

\[\begin{equation} \min_{\theta} \,\, \mathbb{E}_{(x,c) \sim {\mathcal{D}}, \, \epsilon \sim {\mathcal{N}}(0,1), \, t} \, \left[\, \left\| \, \epsilon - \epsilon_{\theta} (z_t, \tau(c), t) \, \right\|_2^2 \,\right] \, , \end{equation} \]

where \(t\) is the time step, \(z_t\) is the noised latent obtained from \(z\) after \(t\) noising steps, \(\tau\) is the text encoder, and \(\epsilon_{\theta}\) is the latent diffusion model. Intuitively, the parameters \(\theta\) are trained to denoise the latent \(z_t\) given the text prompt \(c\) as conditioning information. At inference time, a random noise vector \(z_T\) is passed through the latent diffusion model for \(T\) steps, together with the caption \(c\), to obtain the denoised latent \(z_0\). Then \(z_0\) is fed into a pre-trained decoder \(D\) to generate the image \(x'=D(z_0)\), realizing text-to-image generation.
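To make the objective concrete, here is a minimal PyTorch sketch of one training step of this denoising loss, assuming a DDPM-style noise schedule; `eps_model`, `text_enc`, `encoder`, and `alphas_cumprod` are illustrative stand-ins rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def ldm_loss(eps_model, text_enc, encoder, alphas_cumprod, x, c, T=1000):
    """One step of the latent-diffusion objective (sketch, assumed interfaces)."""
    z = encoder(x)                                    # z = E(x), image latent
    t = torch.randint(0, T, (z.shape[0],), device=z.device)  # random time step
    eps = torch.randn_like(z)                         # epsilon ~ N(0, I)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)           # cumulative noise schedule
    z_t = a.sqrt() * z + (1 - a).sqrt() * eps         # noised latent z_t
    eps_pred = eps_model(z_t, text_enc(c), t)         # eps_theta(z_t, tau(c), t)
    return F.mse_loss(eps_pred, eps)                  # || eps - eps_pred ||_2^2
```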

  • Low-rank adaptation

Low-rank adaptation (LoRA) is a fine-tuning method for adapting large pre-trained models to downstream tasks in a parameter-efficient manner. Given pre-trained model weights \(\theta \in \mathbb{R}^{d \times k}\), LoRA introduces a new parameter \(\delta \in \mathbb{R}^{d \times k}\) decomposed into two matrices, \(\delta=BA\), where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), with a small LoRA rank \(r\), i.e., \(r \ll \min (d, k)\). The LoRA weights are added to the model weights to obtain the fine-tuned weights \(\theta^{\text{(ft)}} = \theta + \delta\) adapted to the downstream task. During training, \(\theta\) is kept frozen and only \(\delta\) is updated.
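As a concrete illustration, a minimal LoRA wrapper around a linear layer might look as follows; this is a sketch under the notation above, not the paper's implementation (\(B\) is zero-initialized so training starts exactly from the pre-trained weights, as described later).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update BA (sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # theta stays frozen
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features    # theta in R^{d x k}
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: random init
        self.B = nn.Parameter(torch.zeros(d, r))         # B: zero init

    def forward(self, h):
        # W h + B A h; the bias stays inside the frozen base layer
        return self.base(h) + h @ self.A.T @ self.B.T
```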

DataDream method

The goal is to improve classification performance by utilizing synthetic images generated by the diffusion model; to that end, it is crucial to align the distribution of the synthetic images with that of the real images. This alignment is achieved by adapting the diffusion model to a small dataset of real images.

Assume access to a few-shot dataset \({\mathcal{D}}^{\text{fs}}=\{(x_i, y_i)\}_{i=1}^{KN}\), where \(x_i\) is an image, \(y_i \in \{1,2,\cdots\!, N\}\) is its label, \(K\) is the number of samples per category, and \(N\) is the number of categories. To match the distribution of the real data, the few-shot dataset \({\mathcal{D}}^{\text{fs}}\) is used for fine-tuning. Specifically, LoRA weights are introduced in the text encoder and U-Net of the diffusion model, where they are placed to efficiently adapt the parameters of the attention layers. For each attention layer, consider the query, key, value, and output projection matrices \(W_q\), \(W_k\), \(W_v\), \(W_o\); in each matrix, the linear projection is replaced by:

\[\begin{equation} h_{l,\star} = W_{\star} h_{l-1} + B_{\star} A_{\star} h_{l-1} \end{equation} \]

where \(h\) denotes the input/output activations of the projection, ultimately yielding for each attention layer \(l\) the trainable LoRA weights \(\delta^{(l)} = \{A_{\star}, B_{\star} | \forall \star \in \{q, k, v, o\}\}\). Bias weights are omitted to simplify notation. All other model parameters (including \(W_{\star}\)) remain frozen, while the \(\delta\) weights are optimized by gradient descent.
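Reusing the `LoRALinear` sketch above, injecting LoRA into an attention layer's four projections could look like this; the `to_q`/`to_k`/`to_v`/`to_out` attribute names are assumptions modeled on common diffusion U-Net implementations, not necessarily the paper's code.

```python
def add_lora_to_attention(attn, r: int = 16):
    """Wrap the q, k, v, and output projections of one attention layer (sketch)."""
    attn.to_q = LoRALinear(attn.to_q, r)            # W_q h + B_q A_q h
    attn.to_k = LoRALinear(attn.to_k, r)            # W_k h + B_k A_k h
    attn.to_v = LoRALinear(attn.to_v, r)            # W_v h + B_v A_v h
    attn.to_out[0] = LoRALinear(attn.to_out[0], r)  # output projection W_o
    return attn
```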

To start training from the pre-trained diffusion checkpoint, the weight matrices \(B_{\star}\) are initialized to zero and the \(A_{\star}\) are randomly initialized. Thus, the combined fine-tuning weights \(B_{\star} A_{\star}\) are initially zero and gradually learn modifications to the original pre-trained weights. At test time, the LoRA weights can be merged into the model by updating \(W^{\text{(ft)}}_{\star} =W_{\star} + B_{\star} A_{\star}\), making inference as fast as the pre-trained model. In contrast to DreamBooth, neither are all network weights fine-tuned nor is a preservation loss added, since its regularization would prevent strong alignment with the real images.
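The test-time merge \(W^{\text{(ft)}}_{\star} = W_{\star} + B_{\star} A_{\star}\) can be sketched as follows, again reusing the `LoRALinear` class above:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold BA into the frozen base weight so inference cost is unchanged."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.B @ layer.A)  # W + BA
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged
```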

Two settings are considered: 1) \(\text{DataDream}_{\text{dset}}\), in which the entire few-shot dataset \({\mathcal{D}}^{\text{fs}}\) is used to train one set of LoRA weights for the diffusion model; and 2) \(\text{DataDream}_{\text{cls}}\), in which \(N\) sets of LoRA weights \(\{\delta_n|n=1,\cdots\!,N\}\) are initialized, one per category, and each set is trained on the class subset \({\mathcal{D}}^{\text{fs}}_{n} = \{(x,y)| (x,y) \!\in {\mathcal{D}}^{\text{fs}}, y\!=\!n\}\).

In the \(\text{DataDream}_{\text{dset}}\) setting, the original model parameters \(\theta\) remain frozen and only the LoRA weights are trained, with the objective

\[\begin{equation} \min_{\delta} \mathcal{L}_{\text{D}} = \min_{\delta} \,\, \mathbb{E}_{(x,y) \sim {\mathcal{D}}^{\text{fs}}, \, \epsilon \sim {\mathcal{N}}(0,1), \, t} \, \left[\, || \, \epsilon - \epsilon_{\theta\!, \delta} (z_t, \tau_{\delta}(C(y)), t) \, ||_2^2 \,\right] \, . \label{eq:datadream_loss} \end{equation} \]

In the \(\text{DataDream}_{\text{cls}}\) setting, \({\mathcal{D}}^{\text{fs}}\) and \(\delta\) are replaced by \({\mathcal{D}}^{\text{fs}}_{n}\) and \(\delta_n\). Since a text-to-image diffusion model is used, the text condition is defined via the function \(C\), which maps the label \(y\) (i.e., the class name) to a prompt using the standard template "a photo of a [CLS]". This prompt is passed through the text encoder and conditions the denoising steps of the diffusion model.
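For instance, the condition function \(C\) amounts to something like the following; `class_names` is an assumed list of label names.

```python
def C(y: int, class_names: list[str]) -> str:
    """Map label y in {1, ..., N} to the standard template prompt (sketch)."""
    return f"a photo of a {class_names[y - 1]}"
```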

The two settings have different advantages. In \(\text{DataDream}_{\text{dset}}\), sharing the LoRA weights allows knowledge about common features to transfer across the entire dataset, which is beneficial for fine-grained datasets whose categories share coarse-grained features. \(\text{DataDream}_{\text{cls}}\), on the other hand, devotes more weights to learning the details of each category, allowing the generative model to align better with each class's data distribution.

After adapting the diffusion model to the few-shot dataset, the adapted model generates 500 images per category, conditioned on the same text prompts used to fine-tune DataDream, yielding a synthetic dataset \({\mathcal{D}}^{\text{synth}}\). A classifier is then trained on either the synthetic images alone or a combination of the synthetic and real few-shot images \({\mathcal{D}}^{\text{fs}}\).

For classifier training, a CLIP model is adapted, as in previous few-shot classification work: LoRA adapters are added to the image encoder and text encoder of CLIP ViT-B/16. When training with both synthetic and real images, a weighted average of the losses on real and synthetic data is used:

\[\begin{equation} \mathcal{L}_{\text{C}} = \,\, \lambda \, \mathbb{E}_{(x,y) \sim {\mathcal{D}}^{\text{fs}}} \, \text{CE}(f(x),y) + (1 \!-\! \lambda) \, \mathbb{E}_{(x,y) \sim {\mathcal{D}}^{\text{synth}}} \, \text{CE}(f(x),y) \, , \end{equation} \]

where \(\lambda\) is the weight assigned to the loss on the real data and \(\text{CE}\) is the cross-entropy loss.
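A minimal PyTorch sketch of this mixed objective, approximating the two expectations with mini-batches; `f` stands for the CLIP-based classifier.

```python
import torch.nn.functional as F

def classifier_loss(f, x_real, y_real, x_synth, y_synth, lam: float = 0.8):
    """lambda-weighted cross-entropy over real and synthetic batches (sketch)."""
    loss_real = F.cross_entropy(f(x_real), y_real)     # expectation over D^fs
    loss_synth = F.cross_entropy(f(x_synth), y_synth)  # expectation over D^synth
    return lam * loss_real + (1.0 - lam) * loss_synth
```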

  • Implementation details

DataDream is implemented on Stable Diffusion release 2.1, and results are computed over three random seeds. For each seed, the few-shot images are randomly sampled from each dataset's training split. Training runs for 200 epochs on all datasets with a batch size of 8; the only exception is \(\text{DataDream}_{\text{dset}}\) on ImageNet, which is trained for 100 epochs. As a result, \(\text{DataDream}_{\text{dset}}\) and \(\text{DataDream}_{\text{cls}}\) use the same amount of training compute: each of the \(N\) class-specific \(\text{DataDream}_{\text{cls}}\) adapters (one per class) performs \(S/N\) update steps, where \(S\) is the total number of steps of \(\text{DataDream}_{\text{dset}}\) on the entire dataset.

AdamW is used as the optimizer with a learning rate of \(1e-4\) and a cosine annealing scheduler. All adaptation weights in DataDream use LoRA rank \(r=16\). Synthetic images for DataDream are generated with 50 denoising steps and a guidance scale of 2.0. Unless mentioned otherwise, 500 images are generated per class. For the classifier, CLIP ViT-B/16 is the base model, and LoRA fine-tuning with rank 16 is applied to the CLIP image encoder and text encoder. The weight assigned to the real loss term is set to \(\lambda=0.8\).
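With the Hugging Face diffusers library, the generation stage with these settings could be sketched as below; loading DataDream's fine-tuned LoRA weights is elided, and the model path and loop structure are assumptions rather than the released pipeline.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# pipe.load_lora_weights(...)  # the DataDream LoRA would be loaded here

class_names = ["clothes iron"]               # assumed; one entry per category
for cls in class_names:
    for i in range(500):                     # 500 images per class
        image = pipe(f"a photo of a {cls}",
                     num_inference_steps=50, guidance_scale=2.0).images[0]
        image.save(f"{cls}_{i:03d}.png")
```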

Experiments




If this article is helpful to you, please give it a like or tap "Looking" ~~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].

work-life balance.