From DDPM to DDIM (i): Maximum likelihood estimation and the evidence lower bound
There are many explanations of DDPM and DDIM on the Internet, but no explanation substitutes for working painstakingly through the derivation yourself at least once. In this article the author aims to give a complete derivation of the diffusion model from beginning to end. Many parts of this article reference the classic tutorials by Calvin Luo[1] and Stanley Chan[2], which we also recommend reading.
DDPM[3] is a bidirectional Markov model, consisting of a diffusion process and a sampling process.
The diffusion process continually adds noise to the image: a small amount of Gaussian noise is added at each step until the image becomes pure Gaussian noise. Why add small amounts of Gaussian noise incrementally rather than adding very strong noise in a single step? We will leave that question for later.
The sampling process, conversely, continually denoises a pure Gaussian noise image, gradually restoring the original image.
The following figure illustrates the Markov model in the original DDPM.
Here \(\mathbf{x}_T\) represents pure Gaussian noise, \(\mathbf{x}_t, 0 < t < T\) represent the intermediate latent variables, and \(\mathbf{x}_0\) represents the generated image. The process of adding noise step by step from \(\mathbf{x}_0\) to \(\mathbf{x}_T\) involves no neural network parameters: it is simply a linear combination of Gaussian noise with the image or latent variable, and we denote the single-step noising process by \(q(\mathbf{x}_t | \mathbf{x}_{t-1})\). The denoising process, however, is unknown to us; we denote the single-step denoising process by \(p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})\). The subscript \(\theta\) is added because \(p_{\theta}(\mathbf{x}_{t-1} | \mathbf{x}_{t})\) is a transition probability approximated by a neural network, where \(\theta\) denotes the neural network parameters.
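To make the single-step noising concrete, here is a minimal sketch in PyTorch. The DDPM paper[3] defines \(q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}\left(\sqrt{1-\beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)\); the function name and the schedule values below are illustrative choices, not part of the derivation:

```python
import torch

def q_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One noising step: sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I)."""
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise

# Illustrative usage: repeated noising drives any image toward pure Gaussian noise.
x = torch.rand(3, 64, 64)                      # a stand-in "image"
for beta in torch.linspace(1e-4, 0.02, 1000):  # a hypothetical linear schedule
    x = q_step(x, beta.item())
```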
A diffusion model first requires a large number of images for training, and the goal of training is to estimate the probability distribution of the images. After training, generating images amounts to sampling from the estimated probability distribution. Generative models therefore generally come with a training algorithm and a sampling algorithm; VAEs, GANs, diffusion models, and today's highly popular large language models (LLMs) are no exception. The DDPM and DDIM discussed in this article share the same training method; DDIM differs from DDPM only in its sampling method[4].
The most classical method for estimating the probability distribution of generated samples is maximum likelihood estimation, so that is where we start.
1. Starting from maximum likelihood estimation
We begin with a brief review of some basic concepts in probability theory: marginal probability densities, joint probability densities, the probability multiplication formula, and Markov chains, ending with a powerful mathematical tool, Jensen's inequality. Readers familiar with these can skip Section 1.1.
1.1 Conceptual review
Marginal probability density and joint probability density: You may remember the marginal probability density from probability theory; if you have forgotten it, that is fine, let us briefly review. For a two-dimensional random variable \((X, Y)\) with joint probability density function \(f(x, y)\), the probability density of \(X\) alone, disregarding \(Y\), is called the marginal probability density of \(X\), and it is computed as follows:
\[
f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\, dy
\]
Probability multiplication formula: For the joint probability \(P(A_1 A_2 \cdots A_{n})\), if \(P(A_1 A_2 \cdots A_{n-1}) > 0\), then:
\[
P\left(A_1 A_2 \cdots A_{n}\right) = P\left(A_n | A_1 A_2 \cdots A_{n-1}\right) P\left(A_{n-1} | A_1 A_2 \cdots A_{n-2}\right) \cdots P\left(A_2 | A_1\right) P\left(A_1\right)
\]
The probability multiplication formula can be proved using the definition of conditional probability and mathematical induction.
Markov chain definition: A stochastic process \(\left\{X_n, n = 0,1,2,...\right\}\) is called a Markov chain if the random variable \(X_n\) at each time takes only finitely or countably many values (unless otherwise specified, this set of values is denoted \(\mathcal{S}\), e.g., the set of non-negative integers), and for arbitrary \(n \geq 0\) and arbitrary states \(i, j, i_0, i_1, ..., i_{n-1} \in \mathcal{S}\), we have
\[
P\left(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0\right) = P\left(X_{n+1} = j | X_n = i\right)
\]
where \(X_n = i\) means that the process is in state \(i\) at time \(n\), and \(\mathcal{S}\) is called the state space of the process. The equation above captures the defining property of a Markov chain, called the Markov property.
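As a quick numerical illustration of the Markov property (a hypothetical two-state chain, unrelated to the diffusion derivation itself), the snippet below checks empirically that conditioning on an additional past state does not change the transition law:

```python
import numpy as np

# A hypothetical 2-state Markov chain with transition matrix P[i, j] = P(next=j | current=i).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
rng = np.random.default_rng(0)

states = [0]
for _ in range(100_000):
    states.append(rng.choice(2, p=P[states[-1]]))
s = np.array(states)

# Conditioning on the extra past state X_{n-1} should not change the transition law.
mask_full = (s[:-2] == 0) & (s[1:-1] == 1)   # X_{n-1}=0, X_n=1
mask_markov = (s[1:-1] == 1)                 # X_n=1 only
print(s[2:][mask_full].mean())    # empirical P(X_{n+1}=1 | X_n=1, X_{n-1}=0), close to 0.7
print(s[2:][mask_markov].mean())  # empirical P(X_{n+1}=1 | X_n=1), also close to 0.7
```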
Jensen's inequality: Jensen's inequality has several forms; we use its integral form here.
If \(f(x)\) is a convex function, and another function \(q(x)\) satisfies:
\[
q(x) \geq 0, \qquad \int q(x)\, dx = 1
\]
Then we have:
\[
f\left(\int x\, q(x)\, dx\right) \leq \int f(x)\, q(x)\, dx
\]
Further, if \(x\) is the value of a random variable \(X\) and \(q(x)\) is the probability density function of \(X\), then Jensen's inequality reads:
\[
f\left(\mathbb{E}\left[X\right]\right) \leq \mathbb{E}\left[f(X)\right]
\]
Jensen's inequality can be proved directly from the definition of a convex function. Many proofs are available online, so we do not repeat one here.
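As a sanity check rather than a proof, the following snippet verifies \(f\left(\mathbb{E}[X]\right) \leq \mathbb{E}\left[f(X)\right]\) numerically for an arbitrary convex choice \(f = \exp\) and \(X \sim \mathcal{N}(0, 1)\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # X ~ N(0, 1)

f = np.exp               # exp is convex
lhs = f(x.mean())        # f(E[X]), approximately exp(0) = 1
rhs = f(x).mean()        # E[f(X)], approximately exp(1/2) = 1.6487 (lognormal mean)
print(lhs, rhs, lhs <= rhs)  # Jensen: f(E[X]) <= E[f(X)]
```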
1.2. Probability distribution representation
The main goal of a generative model is to estimate the probability distribution of the data to be generated, here \(p\left(\mathbf{x}_0\right)\). How do we estimate \(p\left(\mathbf{x}_0\right)\)? A straightforward idea is to treat \(p\left(\mathbf{x}_0\right)\) as the marginal probability of the entire Markov model:
\[
p\left(\mathbf{x}_0\right) = \int p\left(\mathbf{x}_{0:T}\right) d \mathbf{x}_{1:T} \tag{1}
\]
Here \(p\left(\mathbf{x}_{0:T}\right)\) denotes the joint probability distribution of the random variables \(\mathbf{x}_{0}, \mathbf{x}_{1}, ..., \mathbf{x}_{T}\), and \(d \mathbf{x}_{1:T}\) denotes a multiple integral over the \(T\) random variables \(\mathbf{x}_{1}, \mathbf{x}_{2}, ..., \mathbf{x}_{T}\).
Obviously, this integral is hard to evaluate. Evaluating it directly is essentially the approach taken in the pioneering 2015 diffusion-model work of Sohl-Dickstein et al.[5].
Sohl-Dickstein et al. borrowed techniques from statistical physics: annealed importance sampling and the Jarzynski equality. Both are blind spots in the author's knowledge; interested readers can look up the relevant material on their own. (It is indeed true that without a solid foundation in mathematics and physics, doing research is hard~)
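The route taken by later work, and by this article, is different: multiply and divide the integrand of Eq. (1) by \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\), the joint distribution of the diffusion process conditioned on \(\mathbf{x}_0\):
\[
p\left(\mathbf{x}_0\right) = \int p\left(\mathbf{x}_{0:T}\right) \frac{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T} = \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right]
\]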
Some readers may wonder here: why multiply both the numerator and the denominator by this factor of \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\)? The author offers the following explanation: when computing a marginal distribution, we can try to split the joint probability distribution, find a known and similar quantity to multiply in, and then compare the two in the numerator and the denominator, because the result takes the form of a KL divergence, and KL divergences are easier to compute. The advantage of \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) is that it, too, can be decomposed via the Bayes formula and the Markov property into a product of conditional probabilities, and these match the decomposed conditional probabilities of \(p\left(\mathbf{x}_{0:T}\right)\) almost one-to-one; as we all know, each conditional probability represents a single-step transition probability of the diffusion process. So why not use \(q\left(\mathbf{x}_{0:T}\right)\) instead? In fact, \(p\) and \(q\) are essentially just symbols: \(q\left(\mathbf{x}_{0:T}\right)\) and \(p\left(\mathbf{x}_{0:T}\right)\) denote the same thing.
A natural question then arises: we still do not know the joint probability density of this pile of random variables. How should \(p\left(\mathbf{x}_{0:T}\right)\) and \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) be expressed?
Using the probability multiplication formula, we have:
\[
p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{T}\right) p\left(\mathbf{x}_{T-1} | \mathbf{x}_{T}\right) p\left(\mathbf{x}_{T-2} | \mathbf{x}_{T-1:T}\right) \cdots p\left(\mathbf{x}_{0} | \mathbf{x}_{1:T}\right) \tag{2}
\]
Here we have pulled \(p\left(\mathbf{x}_{T}\right)\) out on its own because \(\mathbf{x}_{T}\) obeys a Gaussian distribution, which is a distribution we know. If we factor in the opposite direction instead, the expression becomes:
\[
p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{0}\right) p\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right) p\left(\mathbf{x}_{2} | \mathbf{x}_{0:1}\right) \cdots p\left(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}\right) \tag{3}
\]
Written this way, Eq. (3) is clearly inferior to Eq. (2): we set out to find \(p\left(\mathbf{x}_{0}\right)\) in the first place, yet evaluating Eq. (3) requires already knowing \(p\left(\mathbf{x}_{0}\right)\), which is circular. The literature therefore uses Eq. (2) to decompose the joint probability.
Because the diffusion model is a Markov chain, the random variable at a given step depends only on its neighboring step, so:
\[
p\left(\mathbf{x}_{t-1} | \mathbf{x}_{t:T}\right) = p\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)
\]
So we have:
\[
p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)
\]
As stated at the beginning of the article, the single-step transition probabilities of the sampling process are unknown and must be fitted with a neural network, so we add the subscript \(\theta\) to each single-step transition probability of the sampling process, which gives the final joint probability:
\[
p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right) \tag{4}
\]
Similarly, we compute the decomposition of \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\):
\[
q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) = q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right) q\left(\mathbf{x}_{2} | \mathbf{x}_{0:1}\right) \cdots q\left(\mathbf{x}_{T} | \mathbf{x}_{0:T-1}\right)
\]
Applying the Markov property \(q\left(\mathbf{x}_{t} | \mathbf{x}_{0:t-1}\right) = q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\), we obtain the joint probability distribution of the diffusion process conditioned on \(\mathbf{x}_0\):
\[
q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) = \prod_{t=1}^{T} q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right) \tag{5}
\]
A very important part of mathematical derivation is separating out which quantities are known and which are unknown. Here \(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\) is known, because by the definition of the diffusion model \(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\) obeys a Gaussian distribution; in addition, \(p\left(\mathbf{x}_{T} \right)\) is also known, because \(\mathbf{x}_{T}\) represents pure Gaussian noise, so \(p\left(\mathbf{x}_{T} \right)\) likewise obeys a Gaussian distribution.
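To make the known/unknown split concrete, here is a minimal sketch of how the unknown transition \(p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)\) can be parameterized: a Gaussian whose mean comes from a neural network. The network `DenoiseNet`, its timestep conditioning, and the fixed variance are all illustrative assumptions, not the DDPM parameterization derived in the next post:

```python
import torch
import torch.nn as nn

class DenoiseNet(nn.Module):
    """Hypothetical network that predicts the mean of p_theta(x_{t-1} | x_t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by simple concatenation (an illustrative choice).
        return self.net(torch.cat([x_t, t], dim=-1))

def p_theta_step(model: nn.Module, x_t: torch.Tensor, t: torch.Tensor, sigma: float) -> torch.Tensor:
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma^2 I): the mean is learned, the variance fixed here."""
    mu = model(x_t, t)
    return mu + sigma * torch.randn_like(mu)

# Illustrative usage on flattened 8x8 "images":
model = DenoiseNet(dim=64)
x_t = torch.randn(16, 64)        # a batch of noisy latents
t = torch.full((16, 1), 0.5)     # a normalized timestep
x_prev = p_theta_step(model, x_t, t, sigma=0.1)
```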
1.3. Maximum likelihood estimation
Now that we have expressions for \(p\left(\mathbf{x}_{0:T}\right)\) and \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\), we can proceed to work on Eq. (1). Since Eq. (1) itself is difficult to compute, we bound it instead: we derive a lower bound of Eq. (1) and then maximize that lower bound as a surrogate for maximizing Eq. (1). This is indeed an ingenious method, and the same method is used in the derivation of the VAE. It does not matter if you are not familiar with the VAE; we re-derive everything here.
There are two general approaches to deriving the lower bound of Eq. (1): the Jensen's inequality method and the KL divergence method. We give both derivations below.
Jensen's inequality method. When performing maximum likelihood estimation, one generally works with the logarithm of the probability distribution, so taking the logarithm of Eq. (1):
\[
\begin{aligned}
\log p\left(\mathbf{x}_0\right) &= \log \int p\left(\mathbf{x}_{0:T}\right) d \mathbf{x}_{1:T} \\
&= \log \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T} \\
&= \log \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right] \\
&\geq \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right]
\end{aligned}
\]
where the last step applies Jensen's inequality to the logarithm; since \(\log\) is concave, the direction of the inequality is the reverse of the convex form stated above.
\(\mathcal{L} = \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} \left[ \log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right]\) is known as the evidence lower bound (ELBO) of the random variable \(\mathbf{x}_0\).
KL divergence method. Of course, we can also dispense with Jensen's inequality and derive the evidence lower bound from the non-negativity of the KL divergence. Expanding the expectation in the evidence lower bound into integral form:
\[
\mathcal{L} = \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T}
\]
In addition, we define a KL divergence:
\[
\text{KL}\left(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) || p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\right) = \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log \frac{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}{p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T}
\]
Below we verify that:
\[
\log p\left(\mathbf{x}_{0}\right) = \mathcal{L} + \text{KL}\left(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) || p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\right) \tag{6}
\]
Specifically, we have:
\[
\begin{aligned}
&\mathcal{L} + \text{KL}\left(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) || p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\right) \\
&= \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T} + \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log \frac{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}{p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T} \\
&= \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log \frac{p\left(\mathbf{x}_{0:T}\right)}{p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)} d \mathbf{x}_{1:T} \\
&= \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log p\left(\mathbf{x}_{0}\right) d \mathbf{x}_{1:T} = \log p\left(\mathbf{x}_{0}\right)
\end{aligned}
\]
where the third line uses \(p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) p\left(\mathbf{x}_{0}\right)\), and the last step uses the fact that \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) integrates to 1.
Therefore, equation (6) holds. Since \(\text{KL}\left(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) || p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\right) \geq 0\), we have:
\[
\log p\left(\mathbf{x}_{0}\right) \geq \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right] \tag{7}
\]
Personally, I still prefer the Jensen's inequality method, because its reasoning runs straight from start to finish, whereas the KL divergence method feels like knowing the final answer first and then verifying its correctness. Moreover, the non-negativity of the KL divergence can itself be proved with Jensen's inequality, so the two methods rest on the same mathematical principle. The KL divergence method does have one advantage: it tells us exactly how large the gap is between \(\log p\left(\mathbf{x}_{0}\right)\) and the evidence lower bound, namely exactly one KL divergence: \(\text{KL}\left(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) || p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\right)\). As for the physical meaning of these two probability distributions, the author understands them as follows: \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) is the true forward joint probability conditioned on the given image \(\mathbf{x}_0\), while \(p\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) is the conditional forward joint probability that we estimate. The KL divergence describes the distance between two probability distributions, and if our estimate is accurate, this distance is small. This corroborates the reasonableness of using the evidence lower bound in place of \(\log p\left(\mathbf{x}_0\right)\) for maximum likelihood estimation.
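To build intuition for equation (6), here is a minimal numerical check on a hypothetical model with a single discrete latent variable (all distributions below are made-up toy numbers): the exact log-evidence, the ELBO under an arbitrary \(q\), and the KL gap satisfy the identity exactly.

```python
import numpy as np

# Toy discrete example (illustrative, not a diffusion model): latent z in {0, 1}.
p_z = np.array([0.6, 0.4])          # prior p(z)
p_x_given_z = np.array([0.2, 0.7])  # likelihood p(x0 | z) for one fixed observation x0
q_z = np.array([0.5, 0.5])          # an arbitrary variational distribution q(z)

log_px = np.log(np.sum(p_z * p_x_given_z))                   # exact log p(x0)
elbo = np.sum(q_z * np.log(p_z * p_x_given_z / q_z))         # E_q[log p(x0, z) / q(z)]
p_z_given_x = p_z * p_x_given_z / np.sum(p_z * p_x_given_z)  # posterior p(z | x0)
kl = np.sum(q_z * np.log(q_z / p_z_given_x))                 # KL(q || p(z | x0))

print(log_px, elbo + kl)  # identical: log p(x0) = ELBO + KL, as in Eq. (6)
print(log_px >= elbo)     # True: the ELBO is indeed a lower bound
```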
From here on, our goal is to progressively simplify the evidence lower bound until it reaches a form that we can implement in code.
2. Simplifying the evidence lower bound
The simplification of the evidence lower bound requires the three expressions we derived earlier. For ease of reading, we restate equation (4), equation (5), and the evidence lower bound (7) here:
\[
p\left(\mathbf{x}_{0:T}\right) = p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right) \tag{4}
\]
\[
q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) = \prod_{t=1}^{T} q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right) \tag{5}
\]
\[
\log p\left(\mathbf{x}_{0}\right) \geq \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\right] \tag{7}
\]
In the following, we substitute equations (4) and (5) into (7), giving:
\[
\begin{aligned}
\mathcal{L} &= \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right) \prod_{t=1}^{T} p_{\theta}\left(\mathbf{x}_{t-1} | \mathbf{x}_{t}\right)}{\prod_{t=1}^{T} q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}\right] \\
&= \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right) p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right) {\color{blue} \prod_{t=1}^{T-1} p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)}}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) {\color{blue} \prod_{t=1}^{T-1} q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}}\right]
\end{aligned}
\]
The step highlighted in blue keeps the random variable described by the numerator and the denominator consistent, so that we compare probability distributions of the same random variable; we also align the index ranges of the two products so that further simplification is possible. We continue below:
\[
\mathcal{L} = \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right)\right] + \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)}\right] + \sum_{t=1}^{T-1} \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)}{q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}\right]
\]
The first, second, and third terms of the above equation are called the reconstruction term, the prior matching term, and the consistency term, respectively.
- Reconstruction term. As the name suggests, this is the predicted probability of the final generated image: given the predicted final latent variable \(\mathbf{x}_1\), it is the log-probability of the generated image \(\mathbf{x}_0\).
- Prior matching term. This term describes how close the noise produced at the final step of the diffusion process is to pure Gaussian noise. Since it contains no neural network parameters, it does not need to be optimized and can be dropped for subsequent network training.
- Consistency term. This term describes the distance between the single-step transition probability of the sampling process, \(p_{\theta}\left(\mathbf{x}_{t}|\mathbf{x}_{t+1}\right)\), and the single-step transition probability of the diffusion process, \(q\left(\mathbf{x}_{t}|\mathbf{x}_{t-1}\right)\). Since \(q\left(\mathbf{x}_{t}|\mathbf{x}_{t-1}\right)\) follows a Gaussian distribution (by the very definition of the noising process), we want the sampling-process transition probability \(p_{\theta}\left(\mathbf{x}_{t}|\mathbf{x}_{t+1}\right)\) to follow a Gaussian distribution as well, so that the KL divergence between the two can be made small. We will see later that minimizing this KL divergence is equivalent to maximum likelihood estimation.
At this point, observe that the expectation is taken over \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\), which contains many random variables that each term does not actually depend on, so all three terms can be simplified further. Taking the reconstruction term of the above equation as an example, we write it in integral form:
\[
\mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right)\right] = \int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right) d \mathbf{x}_{1:T}
\]
Some readers may be unsure which variables are the variables of integration. If in doubt, treat all random variables as variables of integration: any that the integrand does not depend on can simply be integrated away. Note that \(\mathbf{x}_0\) is a real image that we condition on, not a random variable here, so the random variables are at most \(\mathbf{x}_{1:T}\). Concretely:
\[
\begin{aligned}
\int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) \log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right) d \mathbf{x}_{1:T} &= \int \log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right) \left[\int q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right) d \mathbf{x}_{2:T}\right] d \mathbf{x}_{1} \\
&= \int q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right) \log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right) d \mathbf{x}_{1} \\
&= \mathbb{E}_{q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right)\right]
\end{aligned}
\]
The integrand does not depend on \(\mathbf{x}_{2:T}\), so those variables integrate out into the marginal \(q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right)\).
Similarly, consider the prior matching term and the consistency term. In the prior matching term, all random variables other than \(\mathbf{x}_T\) and \(\mathbf{x}_{T-1}\) are integrated away, so:
\[
\mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)}\right] = \mathbb{E}_{q\left(\mathbf{x}_{T-1}, \mathbf{x}_{T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)}\right]
\]
The same operation applies to the consistency term:
\[
\mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)}\left[\log \frac{p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)}{q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}\right] = \mathbb{E}_{q\left(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1} | \mathbf{x}_{0}\right)}\left[\log \frac{p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)}{q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}\right]
\]
We continue the simplification below by rewriting these terms in KL divergence form. Since the KL divergence between two Gaussian distributions can be written as an \(\ell_2\) loss, this gives us something we can implement in code. We first give the definition of the KL divergence. Let \(Q\) and \(P\) be two probability distributions; in the case of continuous random variables, their probability density functions are \(q(x)\) and \(p(x)\) respectively, and the KL divergence between them is:
\[
\text{KL}\left(Q || P\right) = \int q(x) \log \frac{q(x)}{p(x)} dx
\]
Note that the KL divergence is not symmetric, i.e., \(\text{KL}\left(Q || P\right)\) and \(\text{KL}\left(P || Q\right)\) are different.
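Since the terms above will ultimately reduce to KL divergences between Gaussians, it is worth seeing the closed form once. The sketch below evaluates \(\text{KL}\left(\mathcal{N}(\mu_1, \sigma_1^2) || \mathcal{N}(\mu_2, \sigma_2^2)\right) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}\) for one-dimensional Gaussians and confirms the asymmetry; note that for equal variances \(\sigma^2\) it reduces to \(\frac{(\mu_1 - \mu_2)^2}{2\sigma^2}\), which is exactly the \(\ell_2\) form mentioned above:

```python
import math

def kl_gauss(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for 1-D Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

print(kl_gauss(0.0, 1.0, 1.0, 2.0))  # about 0.443
print(kl_gauss(1.0, 2.0, 0.0, 1.0))  # about 1.307; KL(Q||P) != KL(P||Q)
print(kl_gauss(0.0, 1.0, 3.0, 1.0))  # equal variances: (0-3)^2 / 2 = 4.5, the l2 form
```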
Below, taking one of the consistency terms as an example, we write it in KL divergence form:
\[
\mathbb{E}_{q\left(\mathbf{x}_{t-1}, \mathbf{x}_{t}, \mathbf{x}_{t+1} | \mathbf{x}_{0}\right)}\left[\log \frac{p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)}{q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)}\right] = {\color{red} -\mathbb{E}_{q\left(\mathbf{x}_{t-1}, \mathbf{x}_{t+1} | \mathbf{x}_{0}\right)}\left[\text{KL}\left(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right) || p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)\right)\right]}
\]
The part of the above equation shown in red follows the two tutorials mentioned at the beginning of this article; the author tried to derive it himself and could not arrive at this result. If you have a good explanation, feel free to discuss it. But the lack of rigor here does not matter much: we will explain it later, and in fact we will ultimately use a different derivation.
Similarly, the prior matching term can also be expressed in KL divergence form:
\[
\mathbb{E}_{q\left(\mathbf{x}_{T-1}, \mathbf{x}_{T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)}\right] = {\color{red} -\mathbb{E}_{q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right)}\left[\text{KL}\left(q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) || p\left(\mathbf{x}_{T}\right)\right)\right]}
\]
The part in red here we can verify in detail:
\[
\begin{aligned}
\mathbb{E}_{q\left(\mathbf{x}_{T-1}, \mathbf{x}_{T} | \mathbf{x}_{0}\right)}\left[\log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)}\right] &= \iint q\left(\mathbf{x}_{T-1}, \mathbf{x}_{T} | \mathbf{x}_{0}\right) \log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)} d \mathbf{x}_{T-1} d \mathbf{x}_{T} \\
&= \int q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right) \left[\int q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) \log \frac{p\left(\mathbf{x}_{T}\right)}{q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)} d \mathbf{x}_{T}\right] d \mathbf{x}_{T-1} \\
&= -\int q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right) \text{KL}\left(q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) || p\left(\mathbf{x}_{T}\right)\right) d \mathbf{x}_{T-1} \\
&= -\mathbb{E}_{q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right)}\left[\text{KL}\left(q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) || p\left(\mathbf{x}_{T}\right)\right)\right]
\end{aligned}
\]
Here the second line uses the Markov factorization \(q\left(\mathbf{x}_{T-1}, \mathbf{x}_{T} | \mathbf{x}_{0}\right) = q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right) q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right)\).
So this step is indeed correct.
We now organize the results. Our simplified evidence lower bound is:
\[
\mathcal{L} = \mathbb{E}_{q\left(\mathbf{x}_{1} | \mathbf{x}_{0}\right)}\left[\log p_{\theta}\left(\mathbf{x}_{0} | \mathbf{x}_{1}\right)\right] - \mathbb{E}_{q\left(\mathbf{x}_{T-1} | \mathbf{x}_{0}\right)}\left[\text{KL}\left(q\left(\mathbf{x}_{T} | \mathbf{x}_{T-1}\right) || p\left(\mathbf{x}_{T}\right)\right)\right] - \sum_{t=1}^{T-1} \mathbb{E}_{q\left(\mathbf{x}_{t-1}, \mathbf{x}_{t+1} | \mathbf{x}_{0}\right)}\left[\text{KL}\left(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right) || p_{\theta}\left(\mathbf{x}_{t} | \mathbf{x}_{t+1}\right)\right)\right]
\]
Let us look at the third term, the consistency term. We find that the two probability distributions \(q\left(\mathbf{x}_{t} | \mathbf{x}_{t-1}\right)\) and \(p_{\theta}\left(\mathbf{x}_{t}|\mathbf{x}_{t+1}\right)\) are mismatched in time, as shown by the pink and green lines in the figure below. This amounts to minimizing the divergence between two distributions that are offset by one step, which is clearly not the desired result. Although being off by one step has little effect, it is still imperfect. Moreover, the two transition probabilities run in opposite directions. Therefore, in what follows, we will look for a better way to rewrite the evidence lower bound, to see whether we can derive KL divergences between probability distributions that are perfectly aligned in time.
As for how to optimize the evidence lower bound, we leave that to the next post: From DDPM to DDIM (ii): DDPM training and inference.
[1] Luo C. Understanding diffusion models: A unified perspective[J]. arXiv preprint arXiv:2208.11970, 2022.
[2] Chan S H. Tutorial on diffusion models for imaging and vision[J]. arXiv preprint arXiv:2403.18103, 2024.
[3] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.
[4] Song J, Meng C, Ermon S. Denoising diffusion implicit models[J]. arXiv preprint arXiv:2010.02502, 2020.
[5] Sohl-Dickstein J, Weiss E, Maheswaranathan N, et al. Deep unsupervised learning using nonequilibrium thermodynamics[C]//International Conference on Machine Learning. PMLR, 2015: 2256-2265.