What is VAE?
VAE, or Variational Autoencoder, is a generative model that reconstructs the input data by learning its latent representation.
In a Stable Diffusion 1.4 or 1.5 model, the VAE portion can be updated with a fine-tuned version to improve the model's ability to render eyes. With this update, the model captures and reproduces eye details more accurately, which improves the realism and overall quality of generated images.
A VAE consists of two parts: an encoder and a decoder. The encoder maps the input data into a latent space, and the decoder reconstructs the data from that latent space.
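The encoder/decoder round trip can be sketched in a few lines. This is a toy illustration only: the dimensions are arbitrary and random weights stand in for a trained model, but it shows the essential shape of a VAE forward pass, including the reparameterization step that makes sampling from the latent Gaussian differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: an 8-value "image" in, a 2-value latent code out.
x_dim, z_dim = 8, 2

# Randomly initialised weights stand in for trained parameters.
W_enc_mu = rng.normal(size=(z_dim, x_dim))
W_enc_logvar = rng.normal(size=(z_dim, x_dim))
W_dec = rng.normal(size=(x_dim, z_dim))

def encode(x):
    """Map the input to the parameters of a Gaussian in latent space."""
    return W_enc_mu @ x, W_enc_logvar @ x

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps; sampling stays differentiable in mu, sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Reconstruct the input from the latent code."""
    return W_dec @ z

x = rng.normal(size=x_dim)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
print(z.shape, x_hat.shape)  # the latent code is much smaller than the input
```

Stable Diffusion works the same way at a larger scale: the diffusion process runs in the compact latent space, and the VAE decoder is what turns the final latent back into pixels, which is why fine-tuning only the decoder can sharpen details.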
Do I need a VAE?
In fact, you don't need to install a separate VAE file to run Stable Diffusion: any model you use (whether v1, v2, or custom) already ships with a default VAE.
When people say to download and use a VAE, they mean using a fine-tuned version of it.
These exist when a model trainer has used additional data to further fine-tune the VAE part of the model. In that case, there is no need to release the entire large model; only the VAE portion is published.
What are the effects of using VAE?
Improvements to the VAE (Variational Autoencoder) generally mean that it can decode images from latent space more accurately, especially for fine details such as eyes and text.
In the context of the Stable Diffusion model, the improved VAE decoder can more efficiently capture and reproduce subtle features in an image, which is essential for generating high-quality images.
Stability AI has released two fine-tuned variants of the VAE decoder, namely:
- EMA (Exponential Moving Average): a statistical method usually used for smoothing time-series data. In machine learning, an exponential moving average of the model's weights is often maintained during training and used at inference, which tends to give more stable results.
- MSE (Mean Square Error): This is a commonly used error metric to measure the difference between a model's predicted and actual values. In the context of an autoencoder, MSE can be used as an optimization objective to help the model learn to reconstruct the input data more accurately.
These two variants correspond to different training objectives for optimizing the performance of the VAE decoder, especially in terms of rendering details. Which one to choose depends on the particular application and the desired output quality.
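Both ideas are simple to state in code. Below is a minimal sketch (with made-up numbers, not anything from the actual fine-tuning runs) of the MSE reconstruction objective and of an EMA weight update, where the smoothed copy of the weights lags behind the raw training weights:

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error: the reconstruction objective of the MSE variant."""
    return np.mean((x - x_hat) ** 2)

def ema_update(ema_w, w, decay=0.999):
    """Blend the smoothed weights toward the current weights.

    The EMA variant keeps such an averaged copy of the weights during
    training and uses it at inference time.
    """
    return decay * ema_w + (1.0 - decay) * w

w = np.zeros(3)       # "raw" training weights
ema_w = np.zeros(3)   # smoothed copy

for step in range(5):
    w = w + 1.0                             # stand-in for a gradient step
    ema_w = ema_update(ema_w, w, decay=0.9)

# A perfect reconstruction has zero MSE loss.
print(mse(np.array([1.0, 2.0]), np.array([1.0, 2.0])))  # 0.0
# The EMA weights trail the raw weights, smoothing out training noise.
print(w, ema_w)
```

The decay constant (0.9 here, for a visible effect in five steps) is a hypothetical value; real training runs use a decay much closer to 1.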
Using these fine-tuned variants of the VAE decoder, one can expect to see the following improvements in the resulting images:
- Clearer text: Text edges and letterforms can be sharper and more accurate.
- More realistic eye rendering: details of the eye, such as the iris, pupil, and reflections, can be more fine-grained and lifelike.
So which one should be used?
Stability AI's evaluation on 256×256 images indicates that the EMA (Exponential Moving Average) decoder produces images with sharper detail, while the MSE (Mean Squared Error) decoder produces visually smoother images.
In the tests of Stable Diffusion v1.4 and v1.5 on 512×512 resolution images, it can be observed that the quality of the eye rendering improves in some cases, especially when the face is a smaller part of the image. However, there is not much improvement in text rendering.
So to summarize, the new VAE update at worst leaves the model's performance unchanged: it either improves render quality or keeps it at the original level.
The two fine-tuned VAE decoder variants, EMA (Exponential Moving Average) and MSE (Mean Squared Error), are also compatible with the Stable Diffusion v2.0 model, though the improvements they bring in v2.0 are relatively minor, as v2.0 is already quite good at rendering eyes.
Should I use VAE?
The decision to use a VAE (Variational Autoencoder) really depends on how satisfied you are with your current results and how much detail improvement you are looking for.
- If you are already satisfied with the results: if a tool or technique such as CodeFormer face restoration already achieves the image quality you expect, especially in detailed areas such as the eyes, then you probably don't need an additional VAE to enhance the result further.
- If you're after every possible improvement: if you want every achievable gain in quality, even a small one, then using a fine-tuned VAE is an option worth considering.
How do I use VAE?
Downloading
Stability AI has released two improved versions of the VAE. The direct download links are below.
/stabilityai/sd-vae-ft-ema-original/resolve/main/
/stabilityai/sd-vae-ft-mse-original/resolve/main/
Installing
If you are using the webUI, just put the downloaded VAE file in the directory: 'stable-diffusion-webui/models/VAE'.
Linux and macOS users
For convenience, run the following commands in the stable-diffusion-webui directory on Linux or macOS, which will download the VAE files into place.
wget /stabilityai/sd-vae-ft-ema-original/resolve/main/ -O models/VAE/
wget /stabilityai/sd-vae-ft-mse-original/resolve/main/ -O models/VAE/
Using VAE in the webUI
To use VAE in the AUTOMATIC1111 GUI, click the Settings tab on the left and then click the VAE section.
In the SD VAE drop-down menu, select the VAE file you want to use.
If your page doesn't have this option, go to Settings → User Interface → Quick Settings List and add sd_vae to it.
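If you prefer editing the configuration file directly, the quick settings list is stored in the webUI's config.json. The exact key name varies by webUI version, so treat the snippet below as a hedged example rather than a definitive reference; in recent AUTOMATIC1111 builds it looks roughly like:

```json
{
  "quicksettings_list": ["sd_model_checkpoint", "sd_vae"]
}
```

After saving the change, restart the webUI so the SD VAE drop-down appears at the top of the page.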