Mutual Information
Mutual information measures how strongly two random variables depend on each other. The mutual information of random variables \(X\) and \(Y\) is defined as

\[
I(X, Y) = \int\!\!\int p(\boldsymbol{x}, \boldsymbol{y}) \log \frac{p(\boldsymbol{x}, \boldsymbol{y})}{p(\boldsymbol{x})p(\boldsymbol{y})} \,\mathrm{d}\boldsymbol{x}\,\mathrm{d}\boldsymbol{y} = \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim p(\boldsymbol{x}, \boldsymbol{y})}\left[\log\frac{p(\boldsymbol{x}, \boldsymbol{y})}{p(\boldsymbol{x})p(\boldsymbol{y})}\right]
\]
where \(p(\boldsymbol{x}, \boldsymbol{y})\) denotes the joint probability density of \(X\) and \(Y\), and \(p(\boldsymbol{x})\) and \(p(\boldsymbol{y})\) denote the marginal probability densities of \(X\) and \(Y\), respectively.
Mutual information is non-negative, and it attains its minimum value \(0\) if and only if \(X\) and \(Y\) are independent, i.e. \(p(\boldsymbol{x}, \boldsymbol{y}) = p(\boldsymbol{x})p(\boldsymbol{y})\).
In machine learning the joint distribution \(p(\boldsymbol{x}, \boldsymbol{y})\) is often hard to obtain, so Bayes' rule is used to rewrite the mutual information in the following two equivalent forms:

\[
I(X, Y) = \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim p(\boldsymbol{x}, \boldsymbol{y})}\left[\log\frac{p(\boldsymbol{y}|\boldsymbol{x})}{p(\boldsymbol{y})}\right] = \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim p(\boldsymbol{x}, \boldsymbol{y})}\left[\log\frac{p(\boldsymbol{x}|\boldsymbol{y})}{p(\boldsymbol{x})}\right]
\]
Information Bottleneck
Let the random variable \(X\) denote the input data, \(Z\) the encoded features, and \(Y\) the labels. Information Bottleneck (IB) theory suggests that neural network training proceeds in two stages:
- Fast fitting phase: \(I(Z, X)\) increases.
- Compression phase: \(I(Z, X)\) decreases while \(I(Z, Y)\) continues to increase.
The figure above visualizes the trajectory of the mutual information during neural network training: the horizontal axis is the mutual information between features and inputs \(I(Z, X)\), and the vertical axis is the mutual information between features and labels \(I(Z, Y)\) (the figure uses \(T\) to denote the features); the color gradient from purple to yellow indicates training from epoch 0 to epoch 10000. As the figure shows, as training proceeds, \(I(Z, X)\) first increases and then decreases.
The illustration is from the paper [1703.00810] Opening the Black Box of Deep Neural Networks via Information. Further reading: Anatomize Deep Learning with Information Theory | Lil'Log.
Can this phenomenon be exploited to regularize the training of neural networks? This is the motivation behind the Variational Information Bottleneck (VIB), whose optimization objective is

\[
\max_{\boldsymbol{\theta}}\; I(Z, Y; \boldsymbol{\theta}) - \beta\, I(Z, X; \boldsymbol{\theta})
\]

where \(\beta\) is a weight coefficient that trades off the two terms.
We want \(Z\) to predict \(Y\) as accurately as possible while forgetting as much of the information in \(X\) as possible. In other words, we want \(Z\) to discard the information in \(X\) that is redundant for the prediction and keep only the information useful for predicting \(Y\). Here, minimizing \(I(Z, X; \boldsymbol{\theta})\) plays the role of regularization.
Unfortunately, estimating mutual information directly from high-dimensional data is difficult. The key idea of the variational information bottleneck is to estimate the mutual information via variational approximation.
Minimize I(Z, X)
We use the mutual information \(I(Z, X)\) in the following form:

\[
I(Z, X; \boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}{p(\boldsymbol{z})}\right]
\]
Note that we need \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\). A convenient choice is to use a probabilistic encoder instead of the usual deterministic encoder, i.e. \(X \mapsto Z\) is a stochastic mapping rather than a deterministic function. Following the approach of VAE, we predefine \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) as a parameterized Gaussian distribution and let a neural network output the parameters of this Gaussian:

\[
p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta}) = N\!\left(\boldsymbol{z};\, \boldsymbol{\mu}(\boldsymbol{x}; \boldsymbol{\theta}),\, \mathrm{diag}(\boldsymbol{\sigma}^2(\boldsymbol{x}; \boldsymbol{\theta}))\right)
\]
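As a concrete illustration, here is a minimal PyTorch-style sketch of such a probabilistic encoder (the class name, layer sizes, and MLP backbone are illustrative choices, not the architecture used in the paper); as in VAE, sampling uses the reparameterization trick so that gradients can flow through it:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Probabilistic encoder: maps x to the parameters of a diagonal Gaussian p(z|x; theta)."""

    def __init__(self, x_dim: int, z_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)       # mean mu(x; theta)
        self.logvar_head = nn.Linear(hidden, z_dim)   # log-variance log sigma^2(x; theta)

    def forward(self, x: torch.Tensor):
        h = self.backbone(x)
        return self.mu_head(h), self.logvar_head(h)

def sample_z(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```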
Having settled \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\), the next question is how to obtain \(p(\boldsymbol{z})\). A natural idea is Monte Carlo (MC) estimation by sampling:

\[
p(\boldsymbol{z}) = \int p(\boldsymbol{x})\, p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta}) \,\mathrm{d}\boldsymbol{x} \approx \frac{1}{N}\sum_{i=1}^{N} p(\boldsymbol{z}|\boldsymbol{x}_i; \boldsymbol{\theta}), \quad \boldsymbol{x}_i \sim p(\boldsymbol{x})
\]
However, the authors of the paper did not use this method, probably because the variance of this MC estimate is too large: estimating it accurately would require a large number of samples, which is inefficient. To estimate the expectation \(\mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}{p(\boldsymbol{z})}\right]\), one first needs to sample \(\boldsymbol{x}\) from \(p(\boldsymbol{x})\) and then sample \(\boldsymbol{z}\) from \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\). Worse, the function inside the brackets cannot be evaluated analytically either: \(p(\boldsymbol{z})\) itself must first be estimated by sampling before the bracket can be computed. With this many nested sampling steps, the variance of the estimate is naturally large.
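For intuition, here is a rough sketch (with hypothetical helper names, not code from the paper) of what that nested estimate would look like when the encoder outputs Gaussian parameters for a batch: every sampled \(\boldsymbol{z}_i\) has to be evaluated under every mixture component just to approximate \(p(\boldsymbol{z}_i)\) before the log-ratio can even be formed:

```python
import math
import torch

def log_gaussian(z, mu, logvar):
    """Log density of a diagonal Gaussian N(z; mu, diag(exp(logvar))), summed over the last dim."""
    return -0.5 * (logvar + (z - mu) ** 2 / logvar.exp() + math.log(2 * math.pi)).sum(-1)

@torch.no_grad()
def naive_mi_estimate(mu, logvar):
    """Nested MC estimate of E[log p(z|x) - log p(z)].

    mu, logvar: (N, z_dim) encoder outputs for N inputs x_1..x_N sampled from p(x).
    p(z) is approximated by the mixture (1/N) * sum_j p(z|x_j; theta).
    """
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # z_i ~ p(z|x_i; theta)
    log_pz_x = log_gaussian(z, mu, logvar)                      # log p(z_i|x_i), shape (N,)
    # log p(z_i) ~= log (1/N) sum_j p(z_i|x_j): evaluate every z_i under every mixture component
    pairwise = log_gaussian(z[:, None, :], mu[None, :, :], logvar[None, :, :])  # shape (N, N)
    log_pz = torch.logsumexp(pairwise, dim=1) - math.log(mu.shape[0])
    return (log_pz_x - log_pz).mean()
```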
The variational information bottleneck, as the name suggests, uses a variational approximation to work around the fact that \(p(\boldsymbol{z})\) is unavailable. Suppose there is an unparameterized distribution of known form \(q(\boldsymbol{z})\) that is very close to \(p(\boldsymbol{z})\); can we substitute this \(q(\boldsymbol{z})\) for \(p(\boldsymbol{z})\) in the formula above and thereby approximate the mutual information \(I(Z, X)\)? Here \(q(\boldsymbol{z})\) is chosen to be the standard Gaussian distribution, i.e. \(q(\boldsymbol{z}) := N(\boldsymbol{z}; \boldsymbol{0}, \boldsymbol{I})\).
Next we need to justify this substitution. Borrowing the derivation from VAE, we replace \(p(\boldsymbol{z})\) with \(q(\boldsymbol{z})\) and collect the leftover part into a KL divergence:

\[
I(Z, X; \boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}{q(\boldsymbol{z})}\right] - \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})}\right]
\]
For the first term, both \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) and \(q(\boldsymbol{z})\) have analytic densities, so the function inside the brackets can be computed in closed form. Using \(p(\boldsymbol{x}, \boldsymbol{z}) = p(\boldsymbol{x})p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\), the first term can be written in a nicer form:

\[
\mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}{q(\boldsymbol{z})}\right] = \mathbb{E}_{\boldsymbol{x} \sim p(\boldsymbol{x})}\big[\mathrm{KL}[p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta}) \parallel q(\boldsymbol{z})]\big]
\]
This quantity \(R(Z, X; \boldsymbol{\theta}) := \mathbb{E}_{\boldsymbol{x} \sim p(\boldsymbol{x})}[\mathrm{KL}[p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta}) \parallel q(\boldsymbol{z})]]\) is often called the rate, as in rate-distortion theory. The rate can be optimized with mini-batch gradient descent: sample a batch \(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N\) from the training set and minimize \(\mathrm{KL}[p(\boldsymbol{z}|\boldsymbol{x}_i; \boldsymbol{\theta}) \parallel q(\boldsymbol{z})]\) for each \(\boldsymbol{x}_i\). Since both distributions are Gaussian, this KL has an analytic solution:

\[
\mathrm{KL}[p(\boldsymbol{z}|\boldsymbol{x}_i; \boldsymbol{\theta}) \parallel q(\boldsymbol{z})] = \frac{1}{2}\sum_{j}\left(\mu_j^2(\boldsymbol{x}_i) + \sigma_j^2(\boldsymbol{x}_i) - \log \sigma_j^2(\boldsymbol{x}_i) - 1\right)
\]
A detailed derivation can be found in the section "Analytic solution of the KL divergence" in From Maximum Likelihood Estimation to Variational Autoencoders: VAE Formula Derivation. The advantage of this nicer form over the original one is that the integral over \(\boldsymbol{z}\) has been solved analytically, so when estimating \(R(Z, X; \boldsymbol{\theta})\) with MC we only need to sample \(\boldsymbol{x}\) from \(p(\boldsymbol{x})\) and no longer need to sample \(\boldsymbol{z}\) from \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\), which reduces the error introduced by sampling.
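Continuing the illustrative encoder above, the rate for a mini-batch can therefore be computed from the analytic formula alone, without ever sampling \(\boldsymbol{z}\); a minimal sketch:

```python
import torch

def rate_term(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """MC estimate of R(Z, X; theta) = E_x[ KL[p(z|x; theta) || N(0, I)] ] over a mini-batch.

    mu, logvar: (batch, z_dim) outputs of the probabilistic encoder.
    Per sample: 0.5 * sum_j (mu_j^2 + sigma_j^2 - log sigma_j^2 - 1), computed analytically.
    """
    kl_per_sample = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)
    return kl_per_sample.mean()
```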
For the second term, notice that the function inside the expectation's brackets does not depend on \(\boldsymbol{x}\), therefore:

\[
\mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})}\right] = \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})}\right] = \mathrm{KL}[p(\boldsymbol{z}) \parallel q(\boldsymbol{z})]
\]
In more detail:

\[
\mathbb{E}_{(\boldsymbol{x}, \boldsymbol{z}) \sim p(\boldsymbol{x}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})}\right]
= \int\!\!\int p(\boldsymbol{x}, \boldsymbol{z}) \log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})} \,\mathrm{d}\boldsymbol{x}\,\mathrm{d}\boldsymbol{z}
= \int \left(\int p(\boldsymbol{x}, \boldsymbol{z}) \,\mathrm{d}\boldsymbol{x}\right) \log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})} \,\mathrm{d}\boldsymbol{z}
= \int p(\boldsymbol{z}) \log\frac{p(\boldsymbol{z})}{q(\boldsymbol{z})} \,\mathrm{d}\boldsymbol{z}
= \mathrm{KL}[p(\boldsymbol{z}) \parallel q(\boldsymbol{z})]
\]
So this term is precisely the KL divergence we set out to collect. Since the analytic form of \(p(\boldsymbol{z})\) is unavailable, this KL term cannot be optimized directly; it appears here only to justify the variational approximation, as explained below.
In summary, the mutual information \(I(Z, X)\) splits into two parts:

\[
I(Z, X; \boldsymbol{\theta}) = R(Z, X; \boldsymbol{\theta}) - \mathrm{KL}[p(\boldsymbol{z}) \parallel q(\boldsymbol{z})]
\]
By the non-negativity of the KL divergence, the rate \(R\) is an upper bound on the mutual information \(I(Z, X; \boldsymbol{\theta})\):

\[
I(Z, X; \boldsymbol{\theta}) \le R(Z, X; \boldsymbol{\theta})
\]
This is exactly what we want: since the goal is to minimize the mutual information \(I(Z, X; \boldsymbol{\theta})\), we can instead minimize its upper bound \(R(Z, X; \boldsymbol{\theta})\), thereby minimizing the mutual information indirectly.
Maximize I(Z, Y)
We use the mutual information \(I(Z, Y)\) in the following form:

\[
I(Z, Y; \boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log\frac{p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})}{p(\boldsymbol{y})}\right]
\]
The label distribution \(p(\boldsymbol{y})\) may be unknown: if \(\boldsymbol{y}\) is a categorical label, the discrete distribution \(p(\boldsymbol{y})\) is easy to estimate; but if \(\boldsymbol{y}\) is numerical, the continuous distribution \(p(\boldsymbol{y})\) is hard to obtain. However, an intractable \(p(\boldsymbol{y})\) does not affect the optimization, because

\[
I(Z, Y; \boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\right] - \mathbb{E}_{\boldsymbol{y} \sim p(\boldsymbol{y})}\left[\log p(\boldsymbol{y})\right] = \mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\right] + \mathrm{H}(Y)
\]
where \(\mathrm{H}(Y)\) denotes the information entropy of the random variable \(Y\). Since the labels \(Y\) come from the dataset and are not optimization variables, \(\mathrm{H}(Y)\) is a constant and does not affect the optimization.
The next problem to address is that \(p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\) is hard to obtain. Note the difference from \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) in the previous section: \(X\) is data from the dataset and \(Z\) is a feature we are free to design, so for the mapping \(X \mapsto Z\) we can specify the form of \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) arbitrarily, and \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) is therefore not hard to obtain. In contrast, \(Y\) is data from the dataset, so for the mapping \(Z \mapsto Y\) the form of \(p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\) is objectively determined and not ours to choose; hence \(p(\boldsymbol{y}|\boldsymbol{z})\) is hard to obtain.
We can use a distribution of known form \(q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\) to approximate \(p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\):

\[
\mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta})\right]
= \mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\right]
+ \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z})}\big[\mathrm{KL}[p(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\theta}) \parallel q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})]\big]
\]
Using the non-negativity of the KL divergence, we obtain a lower bound \(I_{\text{BA}}\) on the mutual information \(I(Z, Y; \boldsymbol{\theta})\), known as the Barber & Agakov lower bound:

\[
I(Z, Y; \boldsymbol{\theta}) \ge \mathrm{H}(Y) + \mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\right] =: I_{\text{BA}}
\]
Using \(p(\boldsymbol{y}, \boldsymbol{z}) = \int p(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}) \,\mathrm{d}\boldsymbol{x}\) together with the factorization \(p(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}) = p(\boldsymbol{x})p(\boldsymbol{y}|\boldsymbol{x})p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) (the features \(Z\) are computed from \(X\) only), we obtain

\[
\mathbb{E}_{(\boldsymbol{y}, \boldsymbol{z}) \sim p(\boldsymbol{y}, \boldsymbol{z})}\left[\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\right]
= \mathbb{E}_{\boldsymbol{x} \sim p(\boldsymbol{x})}\,\mathbb{E}_{\boldsymbol{y} \sim p(\boldsymbol{y}|\boldsymbol{x})}\,\mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}\left[\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\right]
\]
If \(Y\) is continuous (a regression problem), we choose a Gaussian model as the approximate distribution \(q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\), and maximizing \(\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\) corresponds to minimizing the MSE loss. If \(Y\) is discrete (a classification problem), we choose a Bernoulli (binary classification) or categorical (multi-class classification) model as \(q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\), and maximizing \(\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\) corresponds to minimizing the cross-entropy loss. A detailed derivation can be found in the section "Reconstruction loss" in From Maximum Likelihood Estimation to Variational Autoencoders: VAE Formula Derivation.
The nested expectation is estimated by Monte Carlo:

\[
\mathbb{E}_{\boldsymbol{x} \sim p(\boldsymbol{x})}\,\mathbb{E}_{\boldsymbol{y} \sim p(\boldsymbol{y}|\boldsymbol{x})}\,\mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})}\left[\log q(\boldsymbol{y}|\boldsymbol{z}; \boldsymbol{\phi})\right]
\approx \frac{1}{N}\sum_{i=1}^{N}\frac{1}{M}\sum_{j=1}^{M}\log q(\boldsymbol{y}_i|\boldsymbol{z}_{ij}; \boldsymbol{\phi}), \quad \boldsymbol{z}_{ij} \sim p(\boldsymbol{z}|\boldsymbol{x}_i; \boldsymbol{\theta})
\]

Here \(N\) means sampling \(N\) training pairs \((\boldsymbol{x}_1, \boldsymbol{y}_1), \ldots, (\boldsymbol{x}_N, \boldsymbol{y}_N)\) from the dataset, and \(M\) means that for each sample \(\boldsymbol{x}_i\) we draw \(M\) features \(\boldsymbol{z}\) from the distribution \(p(\boldsymbol{z}|\boldsymbol{x}_i; \boldsymbol{\theta})\) and compute the MSE/cross-entropy loss \(M\) times.
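Putting the pieces together for the classification case, here is a minimal end-to-end sketch (the architecture, the default \(\beta\), and all names are illustrative, not the settings of the paper); in a training loop one would call `model.loss(x_batch, y_batch)` and backpropagate, and with \(M = 1\) it reduces to the usual single-sample reparameterized estimate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Illustrative VIB model: probabilistic encoder p(z|x; theta) + categorical decoder q(y|z; phi)."""

    def __init__(self, x_dim: int, z_dim: int, n_classes: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)
        self.logvar_head = nn.Linear(hidden, z_dim)
        self.decoder = nn.Linear(z_dim, n_classes)    # logits of q(y|z; phi)

    def loss(self, x: torch.Tensor, y: torch.Tensor, beta: float = 1e-3, m: int = 1) -> torch.Tensor:
        h = self.backbone(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)

        # -E[log q(y|z; phi)]: cross-entropy averaged over M reparameterized draws of z per x_i
        ce = x.new_zeros(())
        for _ in range(m):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            ce = ce + F.cross_entropy(self.decoder(z), y)
        ce = ce / m

        # rate R(Z, X; theta): analytic KL to the standard Gaussian, averaged over the batch
        rate = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1).mean()

        # minimize: cross-entropy + beta * KL regularizer
        return ce + beta * rate
```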
Some Remarks
Overall, maximizing \(I(Z, Y)\) corresponds to minimizing the cross-entropy loss, and minimizing \(I(Z, X)\) corresponds to minimizing the KL-divergence regularization term (i.e., the rate \(R\)).
Compared with an ordinary discriminative model, the variational information bottleneck makes two changes:
- The deterministic encoder of the ordinary discriminative model is replaced with a probabilistic encoder: given \(\boldsymbol{x}\), an ordinary discriminative model produces a single, fixed \(\boldsymbol{z}\), whereas in VIB \(\boldsymbol{z}\) is sampled from a distribution and is a random variable (see the sketch after this list).
- A KL-divergence regularization term (the rate \(R\)) is added, which encourages the posterior distribution of the features \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) to stay close to the standard Gaussian distribution.
With these two changes, the variational information bottleneck is very similar to a VAE.
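To make the first difference concrete, a minimal side-by-side sketch (the sizes and both encoders here are arbitrary placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 784)                          # a dummy batch of inputs

# Ordinary discriminative model: a deterministic feature, one fixed z per x
deterministic_encoder = nn.Linear(784, 32)
z_det = deterministic_encoder(x)

# VIB: the encoder outputs distribution parameters; z is a reparameterized sample from p(z|x; theta)
probabilistic_encoder = nn.Linear(784, 2 * 32)   # outputs [mu, logvar] jointly
mu, logvar = probabilistic_encoder(x).chunk(2, dim=-1)
z_vib = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```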
Why does minimizing this KL divergence act as regularization? Why does encouraging closeness to the standard Gaussian have a regularizing effect? If the KL regularization term were exactly 0, then \(p(\boldsymbol{z}|\boldsymbol{x}; \boldsymbol{\theta})\) would be exactly the standard Gaussian, which carries no information about the sample \(\boldsymbol{x}\) at all, i.e. it completely forgets \(\boldsymbol{x}\). Such a feature of course has no discriminative power, so the weight coefficient \(\beta\) must be tuned to strike a balance between forgetting and prediction.
Further, note that

\[
R(Z, X; \boldsymbol{\theta}) = I(Z, X; \boldsymbol{\theta}) + \mathrm{KL}[p(\boldsymbol{z}) \parallel N(\boldsymbol{0}, \boldsymbol{I})]
\]

Thus, when minimizing the regularization term \(R(Z, X; \boldsymbol{\theta})\), we are not only minimizing the mutual information \(I(Z, X; \boldsymbol{\theta})\) but also minimizing \(\mathrm{KL}[p(\boldsymbol{z}) \parallel N(\boldsymbol{0}, \boldsymbol{I})]\), which pushes the feature distribution \(p(\boldsymbol{z})\) toward the standard Gaussian. The standard Gaussian has a number of nice properties; for example, its dimensions are mutually independent, which encourages disentanglement across the dimensions of the features \(Z\).
References
Original paper: Deep Variational Information Bottleneck
- OpenReview (ICLR 2017)
- arXiv
From variational encoding to the information bottleneck to the normal distribution: on the importance of forgetting - Scientific Spaces
Variational Information Bottleneck - Sphinx Garden
Transfer learning: variational upper and lower bounds on mutual information - orion-orion - cnblogs; Transfer learning: variational upper and lower bounds on mutual information - Orion's articles - Zhihu