From DDPM to DDIM (IV): Predicting Noise and Post-Processing
Recap of the previous posts
The following figure illustrates the bidirectional Markov chain of DDPM.
Training objective: maximizing the evidence lower bound is equivalent to minimizing the following loss function:

\[
\boldsymbol{\theta}^\star = \underset{\boldsymbol{\theta}}{\arg\min}\, \sum_{t=1}^{T} \mathbb{E}_{q\left(\mathbf{x}_t | \mathbf{x}_0\right)}\left[\frac{\overline{\alpha}_{t-1}\left(1-\alpha_t\right)^2}{2\sigma_t^2\left(1-\overline{\alpha}_t\right)^2}\left\|\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)-\mathbf{x}_0\right\|^2\right] \tag{1}
\]
Inference process: inference proceeds by Markov chain Monte Carlo sampling.
1. Predicting the noise
In the previous post we mentioned that the diffusion model's neural network is used to predict \(\mathbf{x}_{0}\). DDPM, however, does not do this directly; instead it uses the neural network to predict noise, which is the meaning of the first D (Denoising) in DDPM. The DDPM authors mention denoising score matching (DSM) in the original paper and note that this training objective is equivalent to that of DSM, so the choice was presumably inspired by DSM. We will discuss another explanation in a moment.
Following the simplification technique of the previous post, the predicted output of the neural network \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\) can be parameterized one step further. It is known that:

\[
\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\, \bm{\epsilon}, \qquad \bm{\epsilon} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right) \tag{3}
\]

And so:

\[
\mathbf{x}_0 = \frac{1}{\sqrt{\overline{\alpha}_t}} \left(\mathbf{x}_t - \sqrt{1-\overline{\alpha}_t}\, \bm{\epsilon}\right) \tag{4}
\]

The network output can therefore be parameterized in the same form, with a noise-prediction network \(\tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\) in place of the true noise:

\[
\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right) = \frac{1}{\sqrt{\overline{\alpha}_t}} \left(\mathbf{x}_t - \sqrt{1-\overline{\alpha}_t}\, \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right) \tag{5}
\]
Now we can explain the second reason for adopting the noise-prediction approach. As can be seen from equations (4) and (5), the noise term can be viewed as the residual between \(\mathbf{x}_{0}\) and \(\mathbf{x}_{t}\). Recall the classical ResNet structure:
ResNet likewise uses a neural network to learn a residual term. DDPM's way of predicting noise is quite similar to ResNet's residual learning, as the sketch below illustrates.
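To make the analogy concrete, here is a minimal PyTorch sketch of a ResNet-style residual block (the layer sizes are made up for illustration): the network learns only the residual \(F(\mathbf{x})\) on top of the identity, just as the DDPM network learns only the noise term relating \(\mathbf{x}_0\) and \(\mathbf{x}_t\).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A bare-bones ResNet-style block: the network learns only F(x)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.f = nn.Sequential(  # F(x), the learned residual branch
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: the output is the input plus the learned residual,
        # just as x_t differs from (scaled) x_0 by the noise term in (3)/(4).
        return x + self.f(x)
```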
Below we substitute equations (3) and (4), together with the parameterization (5), into equation (1) and continue simplifying. Subtracting (4) from (5) gives \(\tilde{\mathbf{x}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right) - \mathbf{x}_0 = \frac{\sqrt{1-\overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}}\left(\bm{\epsilon} - \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right)\). Noting further that \(\overline{\alpha}_t = \overline{\alpha}_{t-1} \alpha_t\), a new optimization objective can be derived:

\[
\boldsymbol{\theta}^\star = \underset{\boldsymbol{\theta}}{\arg\min}\, \sum_{t=1}^{T} \mathbb{E}_{\mathbf{x}_0,\, \bm{\epsilon}}\left[\frac{\left(1-\alpha_t\right)^2}{2\sigma_t^2\, \alpha_t \left(1-\overline{\alpha}_t\right)}\left\|\bm{\epsilon} - \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\sqrt{\overline{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\, \bm{\epsilon},\, t\right)\right\|^2\right] \tag{6}
\]
Equation (6) shows that the neural network \(\tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\sqrt{\overline{\alpha}_t} \mathbf{x}_{0} + \sqrt{1 - \overline{\alpha}_t} \bm{\epsilon}, t\right)\) is used to predict the initial noise \(\bm{\epsilon}\). Ignoring the leading coefficient, the corresponding training algorithm is as follows:
Algorithm 3. Training a Denoising Diffusion Probabilistic Model. (Version: Predict noise)
Repeat the following steps until convergence.
- For every image \(\mathbf{x}_0\) in your training dataset \(\mathbf{x}_0 \sim q\left(\mathbf{x}_0\right)\)
- Pick a random time step \(t \sim \text{Uniform}[1, T]\).
- Generate normalized Gaussian random noise \(\bm{\epsilon} \sim \mathcal{N} \left(\mathbf{0}, \mathbf{I}\right)\)
- Take a gradient descent step on \(\nabla_{\boldsymbol{\theta}}\left\|\bm{\epsilon} - \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\sqrt{\overline{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\, \bm{\epsilon},\, t\right)\right\|^2\)
You can do this in batches, just like how you train any other neural network. Note that, here, you are training one denoising network \(\tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\) for all noise levels.
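To make Algorithm 3 concrete, here is a minimal PyTorch sketch of a single training step. The names `eps_model` (the noise-prediction network) and `alpha_bar` (the precomputed \(\overline{\alpha}\) schedule, a 1-D tensor of length \(T\)) are assumptions of this sketch, and time steps are 0-indexed:

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, optimizer, x0, alpha_bar):
    """One training step of Algorithm 3 (0-indexed time steps)."""
    T = alpha_bar.shape[0]
    # Pick a random time step for each image: t ~ Uniform{0, ..., T-1}.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    # Generate standard Gaussian noise eps ~ N(0, I).
    eps = torch.randn_like(x0)
    # Form x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, i.e. equation (3).
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
    # Gradient descent step on ||eps - eps_theta(x_t, t)||^2 (leading coefficient dropped).
    loss = F.mse_loss(eps_model(x_t, t), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice one wraps this in the usual epoch loop and draws a fresh `t` and `eps` for every batch, which is exactly the "repeat until convergence" of Algorithm 3.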
The inference process is still based on Markov chain Monte Carlo sampling. Because the network now predicts noise, and the inference process also has to inject noise, we distinguish the two: the noise added during inference is denoted \(\mathbf{z} \sim \mathcal{N} \left(\mathbf{0}, \mathbf{I}\right)\). The noise \(\mathbf{z}\) drawn at each inference step is different, whereas the initial target noise \(\bm{\epsilon}\) fitted during training is one and the same.
Substituting equation (5) into the posterior mean gives:

\[
\tilde{\bm{\mu}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t\right) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\, \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right)
\]

So the inference expression is:

\[
\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\, \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)
\]
The inference algorithm using the noise-fitting strategy can now be written down:
Algorithm 4. Inference on a Denoising Diffusion Probabilistic Model. (Version: Predict noise)
You give us a white noise vector \(\mathbf{x}_T \sim \mathcal{N} \left(\mathbf{0}, \mathbf{I}\right)\)
Repeat the following for \(t = T, T-1, \ldots, 1\).
- Generate \(\mathbf{z} \sim \mathcal{N} \left(\mathbf{0}, \mathbf{I}\right)\) if \(t > 1\), else \(\mathbf{z} = \mathbf{0}\).
- Update \(\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\, \tilde{\bm{\epsilon}}_{\boldsymbol{\theta}}\left(\mathbf{x}_t, t\right)\right) + \sigma_t \mathbf{z}\).
Return \(\mathbf{x}_{0}\)
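Correspondingly, here is a minimal PyTorch sketch of Algorithm 4 under the same assumptions: a hypothetical `eps_model`, plus precomputed 1-D schedules `alpha`, `alpha_bar`, and per-step noise scales `sigma`, all of length \(T\) and 0-indexed.

```python
import torch

@torch.no_grad()
def sample(eps_model, alpha, alpha_bar, sigma, shape, device="cpu"):
    """Algorithm 4: ancestral sampling with a noise-prediction network."""
    x = torch.randn(shape, device=device)  # x_T ~ N(0, I)
    T = alpha.shape[0]
    for t in reversed(range(T)):  # t = T-1, ..., 0 (0-indexed)
        # No noise is injected at the very last step (z = 0 when t = 1 in the
        # 1-indexed algorithm, i.e. t = 0 here).
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        # x_{t-1} = (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t) + sigma_t * z
        x = (x - (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt() + sigma[t] * z
    return x  # the final x_0 estimate
```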
2. Post-processing
The first thing to notice is that in the last step of the inference algorithm, when the image is finally generated, no noise is added: the predicted mean is used directly as the estimate of \(\mathbf{x}_0\).
In addition, the generated image values were normalized to \([-1, 1]\), so they have to be back-normalized to \([0, 255]\). This is easiest to see directly in the code of the diffusers library:
```python
image = (image / 2 + 0.5).clamp(0, 1)            # back-normalize [-1, 1] -> [0, 1]
image = image.cpu().permute(0, 2, 3, 1).numpy()  # NCHW tensor -> NHWC numpy array
if output_type == "pil":
    image = self.numpy_to_pil(image)
if not return_dict:
    return (image,)
```
```python
from PIL import Image

def numpy_to_pil(images):
    """
    Convert a numpy image or a batch of images to a PIL image.
    """
    if images.ndim == 3:
        images = images[None, ...]
    images = (images * 255).round().astype("uint8")
    if images.shape[-1] == 1:
        # special case for grayscale (single channel) images
        pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
    else:
        pil_images = [Image.fromarray(image) for image in images]
    return pil_images
```
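Putting the pieces together, a hypothetical end-to-end usage could look as follows, reusing the `sample` sketch from earlier (with its assumed `eps_model` and schedules) together with `numpy_to_pil`:

```python
# Hypothetical end-to-end usage: sample one 3x32x32 image with a trained
# eps_model and the precomputed schedules, then post-process it to a PNG.
images = sample(eps_model, alpha, alpha_bar, sigma, shape=(1, 3, 32, 32))
images = (images / 2 + 0.5).clamp(0, 1)            # back-normalize [-1, 1] -> [0, 1]
images = images.cpu().permute(0, 2, 3, 1).numpy()  # NCHW -> NHWC
numpy_to_pil(images)[0].save("sample.png")
```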
3. Summary
Our initial goal was to estimate the probability distribution of images, using maximum likelihood estimation of \(\log p\left(\mathbf{x}_0\right)\). But this is hard to solve directly:

\[
\log p\left(\mathbf{x}_0\right) = \log \int p\left(\mathbf{x}_{0:T}\right)\, \mathrm{d}\mathbf{x}_{1:T}
\]

Not only is this high-dimensional integral intractable, we do not know \(p\left(\mathbf{x}_{0:T}\right)\) either. So we chose to estimate its evidence lower bound instead. In computing the evidence lower bound, we worked out many of the distributions and variables in the bidirectional Markov chain and eventually derived an expression for the evidence lower bound in terms of KL divergences. Doing so essentially uses the known distribution \(q\left(\mathbf{x}_{1:T} | \mathbf{x}_{0}\right)\) to approximate the unknown distribution. This is exactly the idea of variational inference: the calculus of variations looks for a function that best satisfies given conditions, while variational inference looks for a distribution that best approximates a known distribution.
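For reference, the evidence lower bound in question is the standard one, obtained from Jensen's inequality:

\[
\log p\left(\mathbf{x}_0\right) \geq \mathbb{E}_{q\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right)}\left[\log \frac{p\left(\mathbf{x}_{0:T}\right)}{q\left(\mathbf{x}_{1:T} | \mathbf{x}_0\right)}\right]
\]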
Then, under the Gaussian assumption, the KL divergence is exactly equivalent to a squared two-norm, so maximum likelihood estimation reduces to minimizing a two-norm loss; from there the training method and the Markov chain Monte Carlo based inference algorithm follow naturally. The reader is encouraged to look up material on variational inference and Markov chain Monte Carlo; I will write an article about them sometime.
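Concretely, for two Gaussians sharing the covariance \(\sigma_t^2 \mathbf{I}\), the KL divergence reduces to a scaled squared two-norm of the mean difference, which is what turns the ELBO into the loss functions above:

\[
D_{\mathrm{KL}}\left(\mathcal{N}\left(\bm{\mu}_1, \sigma_t^2 \mathbf{I}\right) \,\middle\|\, \mathcal{N}\left(\bm{\mu}_2, \sigma_t^2 \mathbf{I}\right)\right) = \frac{1}{2\sigma_t^2}\left\|\bm{\mu}_1 - \bm{\mu}_2\right\|^2
\]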
That wraps up DDPM. I have used four articles to derive DDPM in detail, and in the course of writing them I also figured out some details I had not understood before. My biggest takeaway is that beginners should never trust articles like "understand DDPM in one read"; if you want to really understand DDPM, the only way is to push through all the formulas by hand yourself.
In the next article we begin to introduce a classical inference-acceleration method for DDPM: DDIM.