While text-to-image diffusion models have achieved state-of-the-art results in image synthesis, they have yet to prove as effective in downstream applications. Previous studies have proposed methods for training image classifiers on generated data when access to real data is limited. However, these methods struggle to generate in-distribution images or to depict fine-grained features, which hinders the generalization of classification models trained on synthetic datasets. The paper proposes DataDream, a framework for synthesizing classification datasets that, guided by a few examples of the target categories, represents the real data distribution more faithfully.
Before generating training data, DataDream fine-tunes LoRA weights of the image generation model on the few available real images. The synthetic data is then used to fine-tune LoRA weights of CLIP, improving downstream image classification performance over previous methods on a variety of datasets. Extensive experiments demonstrate the effectiveness of DataDream: it surpasses state-of-the-art few-shot classification accuracy on 7 of the 10 datasets, while being competitive on the other 3. In addition, the paper provides insights into the impact of several factors, such as the number of real and generated images and the fine-tuning compute, on model performance.
Paper: DataDream: Few-shot Guided Dataset Generation

- Paper address: /abs/2407.10910
- Code: /ExplainableML/DataDream
Introduction
The emergence of text-to-image generation models such as Stable Diffusion not only enables the creation of photo-realistic synthetic images, but also provides opportunities to augment downstream tasks. One potential application is to train or fine-tune task-specific models on synthetic data. This is particularly useful in domains where access to real data is limited, as generative models provide a cost-effective way to produce large amounts of training data. The paper investigates the impact of synthetic training data on image classification in few-shot settings, i.e., when only a small number of images per category are available and collecting a full dataset would be prohibitively expensive.
Previous research has focused on using the class names of a given dataset to guide the data generation process. Specifically, these works use a text-to-image diffusion model to generate images with class names as conditional inputs. To better guide the model toward accurate depictions of the target objects, they incorporate textual descriptions of each class into the prompts, derived from language models or manually curated class descriptions. Although intuitive, these approaches produce some generated images that lack the object of interest. For example, for the ImageNet class name "clothes iron", the real images show the appliance used for ironing clothes, whereas most images generated by FakeIt depict the metal iron or arbitrary objects made of it (see Figure 1, left). This happens when the generative model misinterprets ambiguous class names or rare categories. Such inconsistency between real and synthetic images limits the informative value of the generated images for image classification and hurts performance.
To bridge the gap between real and synthetic images, real images can provide the generative model with information about the characteristics of the real data distribution. For example, the concurrent method DISEF starts from a partially noised real image when generating the synthetic dataset, feeding the few-shot samples as conditions into a pre-trained diffusion model, and uses a pre-trained image captioning model to diversify the text-to-image prompts. While this approach improves the alignment of real and synthetic data distributions, it sometimes fails to capture fine-grained features. For example, although real images of the class name "DHC-3-800" in the aircraft dataset show a propeller in front of the wing, the synthetic images generated by DISEF lack this detail (see Figure 1, right). Accurately representing class-discriminative features can be crucial for classification tasks, especially on fine-grained datasets.
To this end, the paper proposes a new approach, DataDream, which aims to adapt the generative model using a small amount of real data. Inspired by personalized generative modeling approaches that fine-tune generative models on a few real images depicting the same object, the approach focuses on aligning the generative model with a target dataset spanning multiple classes with diverse objects per class. This contrasts with previous few-shot dataset generation approaches, which did not explore fine-tuning the generative model.
Specifically, Stable Diffusion is adapted in two LoRA-based ways: \(\text{DataDream}_{\text{cls}}\), which trains one set of LoRA weights per class, and \(\text{DataDream}_{\text{dset}}\), which trains a single set of LoRA weights on the whole dataset. The paper is the first to propose adapting the generative model with few-shot data before generating synthetic training data, rather than relying on a frozen pre-trained generative model. After training, images are generated with the same prompts used to fine-tune DataDream; the resulting images depict the object of interest (e.g., a clothes iron) and fine-grained features (e.g., the propeller of a DHC-3-800 aircraft), as shown in the last row of Figure 1.
Extensive experiments verify the effectiveness of DataDream: it reaches the state of the art on all datasets when using only synthetic data, and achieves the best performance on 7 of the 10 datasets when training on both real few-shot and synthetic data. To understand why the method is effective, the paper analyzes the alignment between real and synthetic data, revealing that the method outperforms baseline methods in terms of alignment with the real data distribution. Finally, the scalability of the method is explored by increasing the number of synthetic data points and real samples, showing the potential benefits of larger datasets.
In summary, the contributions of the paper are as follows:
- Introduces DataDream, a novel few-shot method that adapts Stable Diffusion to generate in-distribution images better suited for downstream training. On 10 datasets, DataDream exceeds state-of-the-art few-shot classification performance on 7 and achieves comparable performance on the remaining 3.
- Emphasizes the importance of reporting results on synthetic data alone. When training a classifier using only synthetic data, the method achieves superior performance, in some cases even outperforming classifiers trained on the few real sample images, suggesting that the generated images extract more insightful information from the small amount of real data.
- Investigates the effectiveness of the method by analyzing the distributional alignment between synthetic and real data, showing that, given few-shot samples, the method produces synthetic data with the best alignment to the real data.
Methodology
Preliminaries
- Latent diffusion model

The method builds on the Stable Diffusion implementation, a probabilistic generative model that learns to generate realistic images from text prompts. Given data \((x,c) \in \mathcal{D}\), where \(x\) is an image and \(c\) is a caption describing \(x\), the model learns the conditional distribution \(p(x|c)\) by gradually denoising Gaussian noise in a latent space. With a pre-trained encoder \(E\) that encodes the image \(x\) into a latent variable \(z = E(x)\), the objective function is defined as:

\[ \min_{\theta}\; \mathbb{E}_{(x,c)\sim\mathcal{D},\ \epsilon\sim\mathcal{N}(0,I),\ t}\Big[\big\lVert \epsilon-\epsilon_{\theta}\big(z_{t},\, t,\, \tau(c)\big)\big\rVert_{2}^{2}\Big] \]

where \(t\) is the time step, \(z_t\) is the noised version of the latent \(z\) at step \(t\), \(\tau\) is a text encoder, and \(\epsilon_{\theta}\) is the latent diffusion model. Intuitively, the parameters \(\theta\) are trained to denoise the noisy latent \(z_t\), given the text prompt \(c\) as conditioning. At inference time, a random noise vector \(z_T\) is passed through the latent diffusion model over \(T\) steps, together with the caption \(c\), to obtain the denoised latent \(z_0\). Then \(z_0\) is fed into a pre-trained decoder \(D\) to generate the image \(x' = D(z_0)\), enabling text-to-image generation.
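To make the objective concrete, here is a minimal, self-contained PyTorch sketch of one step of this denoising objective. The tiny MLP is a toy stand-in for the actual U-Net \(\epsilon_{\theta}\), and the noise-schedule values are illustrative assumptions, not Stable Diffusion's real configuration:

```python
import torch
import torch.nn as nn

# Toy stand-ins: a real setup would use Stable Diffusion's U-Net and text encoder.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal fraction per step

latent_dim, cond_dim = 4, 8
eps_theta = nn.Sequential(nn.Linear(latent_dim + 1 + cond_dim, 64),
                          nn.ReLU(),
                          nn.Linear(64, latent_dim))

def ldm_loss(z, c_emb):
    """One training step of the objective: z = E(x), c_emb = tau(c)."""
    t = torch.randint(0, T, (z.shape[0],))
    eps = torch.randn_like(z)                            # target noise epsilon
    a = alphas_bar[t].unsqueeze(-1)
    z_t = a.sqrt() * z + (1.0 - a).sqrt() * eps          # forward diffusion to z_t
    inp = torch.cat([z_t, t.float().unsqueeze(-1) / T, c_emb], dim=-1)
    return ((eps - eps_theta(inp)) ** 2).mean()          # ||eps - eps_theta(z_t, t, tau(c))||^2

loss = ldm_loss(torch.randn(2, latent_dim), torch.randn(2, cond_dim))
loss.backward()
```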
- Low-rank adaptation

Low-rank adaptation (LoRA) is a fine-tuning method for adapting large pre-trained models to downstream tasks in a parameter-efficient manner. Given pre-trained model weights \(\theta \in \mathbb{R}^{d \times k}\), LoRA introduces new parameters \(\delta \in \mathbb{R}^{d \times k}\), decomposed into two matrices \(\delta = BA\), where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) with a small LoRA rank \(r\), i.e., \(r \ll \min(d, k)\). The LoRA weights are added to the model weights to obtain the fine-tuned weights \(\theta^{\text{(ft)}} = \theta + \delta\) adapted to the downstream task. During training, \(\theta\) is kept fixed and only \(\delta\) is updated.
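These mechanics are easy to state in code. Below is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the class name is illustrative, and the zero/random initialization shown anticipates the scheme described later in the method section:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight theta plus a trainable low-rank update delta = B A."""
    def __init__(self, base: nn.Linear, r: int = 16):
        super().__init__()
        assert r < min(base.in_features, base.out_features)  # r << min(d, k)
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # theta stays fixed
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}, random init
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero init

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Computes (theta + BA) h without materializing the d x k update.
        return self.base(h) + h @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), r=16)
out = layer(torch.randn(2, 768))
```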
DataDream method
The goal of the paper is to improve classification performance by exploiting synthetic images generated by a diffusion model, for which it is crucial that the distribution of the synthetic images aligns with that of the real images. This alignment is achieved by adapting the diffusion model to a small dataset of real few-shot images.
Assume access to a few-shot dataset \(\mathcal{D}^{\text{fs}} = \{(x_i, y_i)\}_{i=1}^{KN}\), where \(x_i\) is an image, \(y_i \in \{1, 2, \cdots, N\}\) is its label, \(K\) is the number of samples per class, and \(N\) is the number of classes. To match the distribution of the real data, the diffusion model is fine-tuned on the few-shot dataset \(\mathcal{D}^{\text{fs}}\). Specifically, LoRA weights are introduced in the text encoder and the U-Net of the diffusion model, chosen to efficiently tune the parameters of the attention layers. For each attention layer, consider the query, key, value, and output projection matrices \(W_q\), \(W_k\), \(W_v\), \(W_o\); in each matrix, the linear projection is replaced by

\[ h' = W_{\star} h + B_{\star} A_{\star} h, \qquad \star \in \{q, k, v, o\} \]

where \(h\) denotes the input activation of the projection and \(h'\) its output. This yields, for each attention layer \(l\), trainable LoRA weights \(\delta^{(l)} = \{A_{\star}, B_{\star} \mid \forall \star \in \{q, k, v, o\}\}\). Bias weights are omitted to simplify notation. All other model parameters (including \(W_{\star}\)) remain frozen, while the \(\delta\) weights are optimized by gradient descent.
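In practice, this kind of injection can be done with off-the-shelf tooling. The sketch below uses Hugging Face diffusers and peft to attach rank-16 LoRA to the attention projections of both the U-Net and the text encoder; the target module names follow current library releases and are an assumption here, not something the paper specifies:

```python
from diffusers import StableDiffusionPipeline
from peft import LoraConfig

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")

# Attention projections W_q, W_k, W_v, W_o in the U-Net (diffusers naming).
unet_cfg = LoraConfig(r=16, lora_alpha=16,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
# Attention projections in the CLIP text encoder (transformers naming).
text_cfg = LoraConfig(r=16, lora_alpha=16,
                      target_modules=["q_proj", "k_proj", "v_proj", "out_proj"])

pipe.unet.add_adapter(unet_cfg)
pipe.text_encoder.add_adapter(text_cfg)
```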
To start training the diffusion model from the pre-trained checkpoint, the weight matrices \(B_{\star}\) are initialized to zero and \(A_{\star}\) are randomly initialized. Thus, the combined fine-tuning update \(B_{\star} A_{\star}\) is initially zero and gradually learns modifications to the original pre-trained weights. At test time, the LoRA weights can be merged into the model by updating the weights \(W^{\text{(ft)}}_{\star} = W_{\star} + B_{\star} A_{\star}\), making inference time identical to that of the pre-trained model. In contrast to DreamBooth, neither are all network weights fine-tuned nor is a preservation loss added, since its regularization would prevent strong alignment with the real images.
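A small sketch of the test-time merge, using the shapes from the LoRA preliminaries (the dimensions are illustrative): after folding \(B_{\star} A_{\star}\) into \(W_{\star}\) once, the forward pass is exactly a plain linear projection again.

```python
import torch

d, k, r = 320, 320, 16
W = torch.randn(d, k)          # frozen pre-trained projection weight
A = torch.randn(r, k) * 0.01   # trained LoRA factor (random init)
B = torch.zeros(d, r)          # trained LoRA factor (zero init, so BA = 0 at start)

W_ft = W + B @ A               # fold the low-rank update into the weight once
h = torch.randn(2, k)
out = h @ W_ft.T               # inference now costs the same as the base model
```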
Two settings are considered: 1) \(\text{DataDream}_{\text{dset}}\), in which a single set of LoRA weights for the diffusion model is trained on the entire few-shot dataset \(\mathcal{D}^{\text{fs}}\); and 2) \(\text{DataDream}_{\text{cls}}\), in which \(N\) sets of LoRA weights \(\{\delta_n \mid n = 1, \cdots, N\}\) are initialized, one per class, and each set is trained on the class-specific subset \(\mathcal{D}^{\text{fs}}_{n} = \{(x,y) \mid (x,y) \in \mathcal{D}^{\text{fs}},\ y = n\}\).
In the \(\text{DataDream}_{\text{dset}}\) setup, the original model parameters \(\theta\) remain unchanged and only the LoRA weights are trained, with the objective function

\[ \min_{\delta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}^{\text{fs}},\ \epsilon\sim\mathcal{N}(0,I),\ t}\Big[\big\lVert \epsilon-\epsilon_{\theta,\delta}\big(z_{t},\, t,\, \tau_{\delta}(C(y))\big)\big\rVert_{2}^{2}\Big] \]

where \(\epsilon_{\theta,\delta}\) and \(\tau_{\delta}\) denote the U-Net and text encoder with the LoRA weights \(\delta\) applied. In the \(\text{DataDream}_{\text{cls}}\) setup, \(\mathcal{D}^{\text{fs}}_{n}\) and \(\delta_n\) replace \(\mathcal{D}^{\text{fs}}\) and \(\delta\). Since a text-to-image diffusion model is used, the text condition is defined by a function \(C\) that maps the label \(y\) (i.e., the class name) to a prompt using the standard template "a photo of a [CLS]". This prompt is passed through the text encoder and used as conditioning in the denoising steps of the diffusion model.
The two setups have different advantages. In \(\text{DataDream}_{\text{dset}}\), sharing the LoRA weights allows knowledge about common features to transfer across the entire dataset, which benefits fine-grained datasets whose categories share coarse-grained features. \(\text{DataDream}_{\text{cls}}\), on the other hand, devotes more weights to learning the details of each class, allowing the generative model to align better with the data distribution of each individual class.
After adapting the diffusion model to the few-shot dataset, the adapted model is used to generate 500 images per class, conditioned on the same text prompts used to train DataDream, resulting in a synthetic dataset \(\mathcal{D}^{\text{synth}}\). A classifier is then trained either on the synthetic images alone or on a combination of the synthetic images and the real few-shot images \(\mathcal{D}^{\text{fs}}\).
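Generation itself can be sketched with the diffusers pipeline. The LoRA checkpoint path below is hypothetical, and the class names are just the two examples mentioned earlier in the text; the 50 denoising steps and guidance scale 2.0 follow the implementation details given later (a CUDA device is assumed):

```python
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/datadream-lora")   # hypothetical trained checkpoint

class_names = ["clothes iron", "DHC-3-800"]        # example classes from the text
for name in class_names:
    os.makedirs(f"synth/{name}", exist_ok=True)
    for i in range(500):                           # 500 images per class
        image = pipe(f"a photo of a {name}",       # same prompt as in fine-tuning
                     num_inference_steps=50,
                     guidance_scale=2.0).images[0]
        image.save(f"synth/{name}/{i:04d}.png")
```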
For classifier training, a CLIP model is adapted, in line with previous few-shot classification work. LoRA adapters are added to the image encoder and the text encoder of the CLIP ViT-B/16 model. When training with both synthetic and real images, a weighted average of the losses on real and synthetic data is used:

\[ \mathcal{L} = \lambda\, \mathbb{E}_{(x,y)\sim\mathcal{D}^{\text{fs}}}\big[\text{CE}(f(x), y)\big] + (1-\lambda)\, \mathbb{E}_{(x,y)\sim\mathcal{D}^{\text{synth}}}\big[\text{CE}(f(x), y)\big] \]

where \(f\) is the classifier, \(\lambda\) is the weight assigned to the loss on real data, and \(\text{CE}\) is the cross-entropy loss.
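A minimal sketch of this mixed loss, assuming `clf` is the adapted classifier and the two batches come from the real few-shot and synthetic loaders (the toy linear classifier over pre-extracted features is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def mixed_loss(clf, real_batch, synth_batch, lam: float = 0.8):
    """Weighted average of real and synthetic cross-entropy losses."""
    x_real, y_real = real_batch
    x_synth, y_synth = synth_batch
    loss_real = F.cross_entropy(clf(x_real), y_real)
    loss_synth = F.cross_entropy(clf(x_synth), y_synth)
    return lam * loss_real + (1.0 - lam) * loss_synth

# Example with a toy linear classifier over 512-d features:
clf = torch.nn.Linear(512, 10)
real = (torch.randn(8, 512), torch.randint(0, 10, (8,)))
synth = (torch.randn(32, 512), torch.randint(0, 10, (32,)))
loss = mixed_loss(clf, real, synth, lam=0.8)
```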
- Implementation details
DataDream is implemented on Stable Diffusion version 2.1, and results are computed over three random seeds. For each seed, the few-shot images are randomly sampled from the training split of each dataset. Training runs for 200 epochs on all datasets with a batch size of 8; the only exception is \(\text{DataDream}_{\text{dset}}\) on ImageNet, which is trained for 100 epochs. As a result, \(\text{DataDream}_{\text{dset}}\) and \(\text{DataDream}_{\text{cls}}\) use the same amount of training compute: each of the \(N\) per-class \(\text{DataDream}_{\text{cls}}\) adapter weight sets performs \(S/N\) update steps, where \(S\) is the total number of steps of \(\text{DataDream}_{\text{dset}}\) on the entire dataset.
AdamW is used as the optimizer with a learning rate of \(1e-4\) and a cosine annealing scheduler. All adapter weights in DataDream use LoRA rank \(r = 16\). For DataDream, synthetic images are generated with 50 denoising steps and a guidance scale of 2.0. Unless otherwise noted, 500 images are generated per class. For the classifier, CLIP ViT-B/16 serves as the base model, and the image and text encoders of CLIP are fine-tuned with LoRA at rank 16. The weight assigned to the real-data loss term is set to \(\lambda = 0.8\).
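Put together, the optimization setup from this paragraph might look like the sketch below; the stand-in module and total step count are placeholders (the paper specifies 200 epochs, not a step count), while the AdamW learning rate and cosine annealing match the stated details:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)          # stand-in for the LoRA-adapted modules
total_steps = 1000                   # placeholder step budget for illustration

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    loss = model(torch.randn(8, 512)).pow(2).mean()   # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                 # cosine-annealed learning rate
```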
Experiments