From DDPM to DDIM (III): DDPM Training and Inference
Recap of the previous posts
Let us begin by reviewing the results of the previous discussion.
The structure of the diffusion model and the meaning of each probability distribution. The following figure illustrates the bidirectional Markov chain of DDPM.
Here, \(\mathbf{x}_T\) represents pure Gaussian noise, \(\mathbf{x}_t, 0 < t < T\) represents an intermediate latent variable, and \(\mathbf{x}_0\) represents the generated image.
- \(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\) is the single-step transition probability of the noising process. It follows a Gaussian distribution and is easy to handle.
- \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)\) is the true single-step transition probability of the sampling process, but it is difficult to compute.
- \(p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)\) is the probability fitted by the neural network; we want the network to approximate the single-step transition probability of the sampling process as well as possible.
- \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\) is the single-step transition probability of the generative process conditioned on the final result \(\mathbf{x}_{0}\). Like a label in supervised learning, \(\mathbf{x}_{0}\) guides the direction of generation. We use this probability as a surrogate for \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)\) when fitting the neural network. If this is hard to grasp, simply treat it as a mathematical intermediate quantity with no physical meaning.
We use \(p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})\) to denote the fitted transition probability. The subscript \(\theta\) is added because \(p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})\) is the transition probability approximated by a neural network, and \(\theta\) denotes the network parameters.
Joint probability representation. The joint probability of the diffusion model and the conditional joint probability of the forward process are:
Specific expressions of the probability distributions. The specific expressions of the conditional probabilities mentioned above are:
where
In addition, \(p\left(\mathbf{x}_{T}\right)\) follows the standard Gaussian distribution, and \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\) is the neural network we want to train.
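To make the forward process concrete, here is a minimal PyTorch sketch, assuming a typical linear \(\beta_t\) schedule; the variable names (`q_sample`, `alpha_bars`) and schedule values are illustrative choices, not the original DDPM code. It uses the closed-form multi-step transition \(q(\mathbf{x}_{t} | \mathbf{x}_{0})\) derived in the previous post.

```python
import torch

# Noise schedule: a typical linear beta schedule (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t, t = 1..T (stored 0-indexed)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # bar{alpha}_t = alpha_1 * ... * alpha_t

def q_sample(x0, t_idx, noise=None):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(bar{alpha}_t) x_0, (1 - bar{alpha}_t) I).

    t_idx is the 0-indexed schedule position (i.e. t - 1).
    """
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t_idx].view(-1, 1, 1, 1)  # broadcast over (N, C, H, W) images
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```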
According to Bayes' rule, the conditional probability we want to use can be rewritten as follows:
Evidence lower bound. We originally wanted to perform maximum likelihood estimation on the distribution of generated images, but the likelihood cannot be computed directly. Instead, we maximize the evidence lower bound and then simplify it; we now use \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\) to rewrite the evidence lower bound:
3.5 Using \(q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})\) to simplify the evidence lower bound
Picking up where we left off: one idea for simplifying the evidence lower bound is to align \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\) and \(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\) term by term, and to separate the terms containing \(\left(\mathbf{x}_{0}, \mathbf{x}_{1}\right)\) from the others, because \(\mathbf{x}_{0}\) is an image while the other random variables are latent variables. Another way to see it: this time we use \(q(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0})\), and at \(t = 1\) the probability \(q(\mathbf{x}_{0} | \mathbf{x}_{1}, \mathbf{x}_{0})\) is meaningless, which is why we separate the terms containing \(\left(\mathbf{x}_{0}, \mathbf{x}_{1}\right)\) from the rest.
As before, the three terms of the above equation represent three parts: the reconstruction term, the prior matching term, and the consistency term.
- Reconstruction term. As the name suggests, this is the predicted probability of the final generation step: the log-probability of the generated image \(\mathbf{x}_{0}\) given the predicted final latent variable \(\mathbf{x}_{1}\).
- Prior matching term. This term describes how close the Gaussian noise produced in the last step of the diffusion process is to pure Gaussian noise; compared with before, the condition of \(q\) has been changed to \(\mathbf{x}_{0}\). As before, this term contains no neural network parameters, so it does not need to be optimized and can be dropped for network training.
- Consistency term. This term differs from the previous version in two ways. First, there is no longer a mismatched comparison between the noising and denoising directions. Second, the matching target has changed: \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\) is now matched to \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\), whereas before it was matched to the single-step transition probability of the diffusion process, \(q\left(\mathbf{x}_{t}|\mathbf{x}_{t-1}\right)\). This is more reasonable.
Similarly, as in the previous operation, we approximately integrate out the irrelevant random variables in the subscripts of the expectations above (they integrate to 1), and then rewrite everything in the form of KL divergences. Let us first look at the prior matching term and the consistency term.
The reconstruction term is handled similarly: in the subscript of its expectation, every random variable other than \(\mathbf{x}_1\) can be approximately dropped. We finally arrive at the KL-divergence form of the evidence lower bound:
Let us also discuss the physical meaning of the subscripts of these expectations. Take the reconstruction term as an example: its subscript is \(q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right)\), which means we use \(\mathbf{x}_{0}\) to generate \(\mathbf{x}_{1}\) in one noising step, feed \(\mathbf{x}_{1}\) into the neural network to obtain an estimated distribution of \(\mathbf{x}_{0}\), and then maximize this log-likelihood. The expectation is taken over many images; averaging over one epoch approximates the expectation. The consistency term is similar, except that \(\mathbf{x}_{0}\) is used to generate \(\mathbf{x}_{t}\), and the network's output is then compared with \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\) via the KL divergence. This is effectively Monte Carlo estimation.
So there are two quantities we need for the loss: the logarithmic part of the reconstruction term and the KL divergence in the consistency term. As for the expectations and their subscripts, we do not need to expand them; during training we replace them with many images, adding a different level of noise to each.
4. Training process
In the following, we use Eq. (3) to further expand the evidence lower bound, Eq. (6). The previous article, From DDPM to DDIM (II), showed that under the premise that \(\beta_t\) is very small, \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\) also follows a Gaussian distribution. Since the training goal of \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\) is to match \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\), we also write it as a Gaussian distribution and compare it with the form of \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\).
Here the mean of \(p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right)\), \(\textcolor{blue}{\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)}\), is the output of the neural network, and for the variance we use the same variance as \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\). The network \(\textcolor{blue}{\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)}\) has two inputs: \(\mathbf{x}_{t}\), which is obvious, and the time step \(t\). Of course, the variance could also be learned by a neural network, but experiments in the original DDPM paper showed that this is not very effective. Thus, of the two means and two variances above, only the blue \(\textcolor{blue}{\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)}\) is unknown; the other three quantities are known.
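Written out explicitly, this parameterization is the following (a restatement in the notation above; \(\sigma^2(t)\) is the same variance as that of \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\), whose explicit expression appears in the inference section):

\[
p_{\theta}\left(\mathbf{x}_{t-1}|\mathbf{x}_{t}\right) = \mathcal{N}\left(\mathbf{x}_{t-1};\ \tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right),\ \sigma^2(t)\, \mathbf{I}\right)
\]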
According to Eq. (6), we only need to compute the reconstruction term and the consistency term; the prior matching term has no trainable parameters. We compute them separately below:
where \(\text{const}\) denotes a constant.
Next we compute the consistency term, i.e., the KL divergence. There is a closed-form expression for the KL divergence between Gaussian distributions, which we state without proof; for a proof, consult Wikipedia. Let two \(d\)-dimensional random variables follow the Gaussian distributions \(Q = \mathcal{N}(\bm{\mu}_1, \bm{\Sigma}_1)\) and \(P = \mathcal{N}(\bm{\mu}_2, \bm{\Sigma}_2)\), where \(\bm{\mu}_1, \bm{\mu}_2 \in \mathbb{R}^{d}\) and \(\bm{\Sigma}_1, \bm{\Sigma}_2 \in \mathbb{R}^{d \times d}\). The Kullback-Leibler (KL) divergence between them can be computed with the following equation:
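For reference, the standard result reads:

\[
\mathbb{D}_{\mathrm{KL}}\left(Q \,\|\, P\right) = \frac{1}{2}\left[\log \frac{\det \bm{\Sigma}_2}{\det \bm{\Sigma}_1} - d + \operatorname{tr}\left(\bm{\Sigma}_2^{-1} \bm{\Sigma}_1\right) + \left(\bm{\mu}_2 - \bm{\mu}_1\right)^{\mathsf{T}} \bm{\Sigma}_2^{-1}\left(\bm{\mu}_2 - \bm{\mu}_1\right)\right]
\]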
Below we substitute the consistency term into the above equation:
From the last two equations, we can see that for \(t > 0\) the goal of \(\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)\) is to match \(\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right)\). Our guiding principle is: whenever an analytic form is available, keep expanding it until some variable has no analytic solution, and only then fit that variable with a neural network; this gives the best fit. For example, to fit the quadratic function \(f(x) = a x^2 + 3 x + 2\), where \(a\) is the unknown quantity, we should design a neural network to estimate \(a\) rather than estimate \(f(x)\) directly, because the former guarantees that the estimated function is quadratic, while the latter carries more uncertainty.
For a better match, we expand the analytic form of \(\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right)\):
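For reference, in the notation of this series the standard DDPM posterior mean has the form below (the squared coefficient of \(\mathbf{x}_{0}\) here is exactly the \(\overline{\alpha}_{t-1}\left(1-\alpha_t\right)^2 / \left(1-\overline{\alpha}_t\right)^2\) factor that shows up in the loss weight later):

\[
\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right) = \frac{\sqrt{\alpha_t}\left(1-\overline{\alpha}_{t-1}\right)}{1-\overline{\alpha}_{t}}\, \mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}\left(1-\alpha_t\right)}{1-\overline{\alpha}_{t}}\, \mathbf{x}_{0}
\]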
The expanded form of \(\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)\) is similar to that of \(\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right)\). The term involving \(\mathbf{x}_{t}\) stays the same, because \(\mathbf{x}_{t}\) is an input; \(\mathbf{x}_{0}\), however, is unknown, so we replace it with a neural network whose inputs are likewise \(\mathbf{x}_{t}\) and \(t\). Substituting equation (8) into equation (7), we have:
The reconstruction term can be simplified further as well, noting that \(\beta_0 = 0, \alpha_0 = 1, \overline{\alpha}_{0} = 1, \overline{\alpha}_{1} = \alpha_1\):
The last line of the above equation is written that way for consistency with the form of the KL divergence. We have finally reduced the evidence lower bound to its simplest form. Substituting our computed reconstruction and consistency terms into Eq. (6), and discarding the prior matching term since it does not depend on the network parameters, we obtain:
Since there is a negative sign in front, maximizing the evidence lower bound is equivalent to minimizing the following loss function:
Understanding this equation is straightforward. First look at the weight of each term, \(\frac{1}{2 \sigma^2(t)} \frac{\left(1-\alpha_t\right)^2 \overline{\alpha}_{t-1}}{\left(1-\overline{\alpha}_t\right)^2}\), which is the weight of the prediction loss at each stage of the Markov chain. Experiments in the DDPM paper show that ignoring this weight has little effect, so we simplify further to:
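Dropping the weights, the simplified objective looks like the following sketch (predict-image version, constant factors omitted):

\[
\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{q\left(\mathbf{x}_{t} | \mathbf{x}_{0}\right)}\left[\left\| \tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_{t}, t\right) - \mathbf{x}_{0} \right\|^2\right]
\]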
The procedure is as follows: given an image \(\mathbf{x}_0\), add noise at different steps, up to step \(T\), to obtain the latent variables \(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T\). As shown in the figure below, thanks to the multi-step transition probability, we can jump from \(\mathbf{x}_0\) to any noise level in a single noising step.
Each of these latent variables is then fed into the neural network, the output is compared with \(\mathbf{x}_0\) to compute an L2-norm loss, and all the losses are averaged. In practice, however, we have many images, not just one, and they are processed in batches when fed into the network; if we added noise this many times to every image, the training workload would be enormous. So in practice we do the following: for a batch of \(N\) images, each of the \(N\) images is noised to a different stage, with the noise level chosen at random; for example, the first image gets \(10\) noising steps, the second gets \(910\) steps, and so on. The noised latent variables and the time-step information are then fed in, the network output is compared to each original image with an L2-norm loss, and the losses are averaged. Compared with adding \(1000\) different noise levels to a single image, this prevents overfitting to one image and falling into a local minimum. Below we give the specific flow of the training algorithm:
Algorithm 1. Training a Denoising Diffusion Probabilistic Model. (Version: Predict image)
For every image \(\mathbf{x}_0\) in your training dataset:
- Repeat the following steps until convergence.
- Pick a random time stamp \(t \sim \text{Uniform}[1, T]\).
- Draw a sample \(\mathbf{x}_{t} \sim q\left(\mathbf{x}_{t} | \mathbf{x}_{0}\right)\).
- Take gradient descent step on
You can do this in batches, just like how you train any other neural networks. Note that, here, you are training one denoising network \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\) for all noisy conditions.
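Below is a minimal PyTorch-style sketch of one batched training step of Algorithm 1 under the simplified loss. Here `model` is assumed to map \((\mathbf{x}_t, t)\) to a prediction of \(\mathbf{x}_0\), and `q_sample` is the forward-noising helper sketched earlier; the names are illustrative, not the original DDPM implementation, and device handling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, T=1000):
    """One batched training step: predict x_0 from x_t and minimize the L2 loss."""
    n = x0.shape[0]
    t = torch.randint(1, T + 1, (n,))        # t ~ Uniform[1, T], one per image
    xt = q_sample(x0, t - 1)                 # x_t ~ q(x_t | x_0), noised in one step
    x0_pred = model(xt, t)                   # \tilde{x}_theta(x_t, t)
    loss = F.mse_loss(x0_pred, x0)           # || \tilde{x}_theta(x_t, t) - x_0 ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that each image in the batch gets its own random time step, exactly as described above.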
If a batch is used for training, the above operations are performed for every image in parallel. It is worth noting that there is only one set of neural network parameters: denoising at any step \(t\) uses the same network \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\); only the inputs differ. The training schematic is shown below:
As a side note, the original DDPM paper is actually somewhat misleading, as shown in the DDPM figure below. From this figure, some readers might think that the neural network takes \(\mathbf{x}_{t}\) as input to predict \(\mathbf{x}_{t-1}\), but that is not the case: it takes \(\mathbf{x}_{t}\) as input to predict \(\mathbf{x}_{0}\). The reason is that we use \(q\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}, \mathbf{x}_{0}\right)\) as the fitting target, with the goal of matching its mean \(\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right)\), not matching \(\mathbf{x}_{t-1}\). And \(\tilde{\bm{\mu}}_{t}\left(\mathbf{x}_{t}, \mathbf{x}_{0}\right)\) happens to be a function of \(\mathbf{x}_{0}\), so during training we actually input \(\mathbf{x}_{t}\) and use the neural network to predict \(\mathbf{x}_{0}\). It is the sampling process that proceeds step by step. Precisely because the object the network fits during training is not \(\mathbf{x}_{t-1}\), we gain room to accelerate the sampling process, but that is a story for later.
5. Inference process
Before diving back into the papers, consider what the simplest idea for generating images would be. Since the neural network \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\) takes \(\mathbf{x}_{t}\) as input and predicts \(\mathbf{x}_{0}\), can't we just feed it random noise and generate the image in one step? This is an open question: recent work does explore single-step image generation, but the author has not yet read it carefully, so no comment for now.
Following the Markov property, it is better to use \(p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)\) to generate step by step via Monte Carlo sampling:
where \(\sigma^2 \left(t\right) = \frac{\left(1 - \alpha_t\right) \left( 1 - \overline{\alpha}_{t-1} \right)}{ 1 - \overline{\alpha}_{t} }\).
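Putting the pieces together, each sampling step can be sketched as follows, with the mean being \(\tilde{\bm{\mu}}_{\theta}\left(\mathbf{x}_{t}, t\right)\) in the predict-image parameterization and \(\mathbf{z}\) a fresh Gaussian sample (a restatement under the assumptions above, not an additional result):

\[
\mathbf{x}_{t-1} = \frac{\sqrt{\alpha_t}\left(1-\overline{\alpha}_{t-1}\right)}{1-\overline{\alpha}_{t}}\, \mathbf{x}_{t} + \frac{\sqrt{\overline{\alpha}_{t-1}}\left(1-\alpha_t\right)}{1-\overline{\alpha}_{t}}\, \tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_{t}, t\right) + \sigma(t)\, \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)
\]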
What strikes me about diffusion models is how different the training process is from the inference process. Perhaps this is characteristic of generative models: the training algorithm and the inference algorithm differ greatly in form, and autoregressive text generation is the same. It is not like image classification, where inference and training compute things the same way and you simply take the class with the highest probability at the end. The large gap between training and inference also suggests that this form of inference is not the only one; there must be better inference algorithms.
The inference process is described by the following algorithm.
Algorithm 2. Inference on a Denoising Diffusion Probabilistic Model. (Version: Predict image)
Input: the trained model \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\).
- You give us a white noise vector \(\mathbf{x}_{T} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)\)
- Repeat the following for \(t = T, T − 1, ... , 1\).
- Update according to
Output: \(\mathbf{x}_{0}\).
- The resulting \(\mathbf{x}_{0}\) still needs to be de-normalized and discretized to integers between 0 and 255; we will save this for another article.
- Also, the original DDPM paper does not directly predict \(\mathbf{x}_{0}\); instead, \(\mathbf{x}_{0}\) is reparameterized so that the neural network predicts the noise \(\bm{\epsilon}\). How this is done, we will also save for the next post.
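To tie Algorithm 2 together, here is a minimal PyTorch-style sketch of the sampling loop in the predict-image parameterization, reusing the `alphas` / `alpha_bars` schedule from the training sketch. The names and the `model(x_t, t)` signature are illustrative assumptions, and de-normalization is omitted as noted above.

```python
import torch

@torch.no_grad()
def sample(model, shape, T=1000):
    """Generate x_0 by stepwise Monte Carlo sampling from p_theta(x_{t-1} | x_t)."""
    x = torch.randn(shape)                                 # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        i = t - 1                                          # 0-indexed schedule position
        alpha_t, ab_t = alphas[i], alpha_bars[i]
        ab_prev = alpha_bars[i - 1] if t > 1 else torch.tensor(1.0)  # bar{alpha}_0 = 1
        x0_pred = model(x, torch.full((shape[0],), t))     # \tilde{x}_theta(x_t, t)
        # Mean of p_theta(x_{t-1} | x_t): posterior mean with x_0 replaced by the prediction.
        mean = (alpha_t.sqrt() * (1 - ab_prev) * x
                + ab_prev.sqrt() * (1 - alpha_t) * x0_pred) / (1 - ab_t)
        if t > 1:
            sigma = ((1 - alpha_t) * (1 - ab_prev) / (1 - ab_t)).sqrt()
            x = mean + sigma * torch.randn_like(x)         # add sigma(t) * z
        else:
            x = mean                                       # no noise on the final step
    return x                                               # still needs de-normalization
```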
Next post: From DDPM to DDIM (IV) Predicting the Noise and Post-Processing the Generated Image