Two function names come up often in machine learning: sigmoid and softmax.
The former recurs in neural networks, where it serves as a neuron's activation function; the latter appears in many classification algorithms, especially multi-class scenarios, where it is used to determine which class has the highest probability.
This article covers the definition and shape of these two functions, their role in algorithms, and the connection between them.
1. sigmoid function
1.1 Function definition
The sigmoid function is a collective term for a class of functions. A commonly used sigmoid function is: \(y=\frac{1}{1+e^{-x}}\)
It is sometimes referred to as the S-function, because its graph is S-shaped.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)      # 100 evenly spaced points in [-10, 10]
y = 1 / (1 + np.exp(-x))           # sigmoid
plt.figure(figsize=(6, 4))
plt.plot(x, y)
plt.title("Sigmoid function")
plt.grid(True)
plt.show()
As can be seen from the graph, the sigmoid function's output is confined to a limited range (between 0 and 1 for the function above).
It is precisely this property that makes it well suited to representing probabilities, or to use in the output layer of a binary classification problem.
Note that the output of a sigmoid function does not have to lie in the interval (0, 1).
For example, another commonly used S-shaped function is \(\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\), whose output lies in the interval (-1, 1).
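A quick numerical check of the two output ranges (a minimal sketch using NumPy's built-in tanh):

import numpy as np

x = np.array([-5.0, 0.0, 5.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: values stay in (0, 1)
print(np.tanh(x))             # tanh: values stay in (-1, 1)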
1.2 Application scenarios
The main usage scenarios of the sigmoid function are:
- Logistic regression: the sigmoid function converts the output of a linear model into a probability value for binary classification. The probability indicates the likelihood that a sample belongs to a particular class (see the sketch after this list).
- Activation function for neural networks: its nonlinear transformation helps a neural network learn complex decision boundaries and increases the expressive power of the model.
- Gating mechanisms: in recurrent neural networks such as LSTM (long short-term memory) networks, the sigmoid function (or variants thereof) is used as part of a gating mechanism to control the flow of information.
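Here is a minimal sketch of the logistic regression point above; the weights, bias, and sample values are hypothetical, chosen only for illustration:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([0.8, -0.4])          # hypothetical learned weights
b = 0.1                            # hypothetical learned bias
sample = np.array([2.0, 1.5])      # one input sample

p = sigmoid(sample @ w + b)        # probability of the positive class
print(f"P(y=1) = {p:.3f}")         # predict class 1 if p >= 0.5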
The sigmoid function played an important role in machine learning and early deep learning, especially for binary classification problems and as an activation function in neural networks.
However, as deep learning developed, other activation functions have gradually replaced the sigmoid function in certain scenarios.
2. softmax function
2.1 Function definition
Next, the softmax function. The softmax function is widely used in machine learning and deep learning, especially in multi-class classification problems, whereas the sigmoid function described above is more often used in binary classification scenarios.
The main purpose of the softmax function is to convert a K-dimensional vector (which typically represents the raw predicted scores for each class) into a K-dimensional vector whose elements all lie in the range (0, 1) and sum to 1.
This description is a bit abstract, so consider an example. Suppose there is a 3-dimensional vector: \((x_1,x_2,x_3) = (3,1,-2)\)
Its elements do not lie in the interval (0, 1), nor do they sum to 1.
So how does the softmax function convert it?
First, compute the sum of the exponentials of the elements: \(m=e^{x_1}+e^{x_2}+e^{x_3}\).
Then, convert the vector \(x\) into the vector \(y\): \((y_1,y_2,y_3)= (\frac{e^{x_1}}{m},\frac{e^{x_2}}{m},\frac{e^{x_3}}{m})\approx(0.876,0.118,0.006)\)
Each element of the converted vector \(y\) lies in the interval (0, 1), and all the elements sum to 1.
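A minimal sketch that reproduces the example above in NumPy:

import numpy as np

def softmax(x):
    e = np.exp(x)        # element-wise exponentials
    return e / e.sum()   # normalize so the outputs sum to 1

y = softmax(np.array([3.0, 1.0, -2.0]))
print(y)          # -> approximately [0.876 0.118 0.006]
print(y.sum())    # -> 1.0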
The softmax function can also be plotted:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def softmax(x0, x1, x2):
    m = np.exp(x0) + np.exp(x1) + np.exp(x2)
    return np.exp(x0) / m, np.exp(x1) / m, np.exp(x2) / m

count = 30
x0 = np.linspace(-10, 10, count)
x1 = np.linspace(-5, 5, count)
y = np.zeros((count, count, 3))   # softmax outputs on the grid, with x2 fixed at 1
for i0 in range(count):
    for i1 in range(count):
        y[i1, i0, :] = softmax(x0[i0], x1[i1], 1)
xx0, xx1 = np.meshgrid(x0, x1)
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(xx0, xx1, y[:, :, 0], color="g")
ax1.set_xlabel("$x_0$", color="g")
ax1.set_ylabel("$x_1$", color="g")
ax1.set_zlabel("$y_0$", color="g")
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(xx0, xx1, y[:, :, 1], color="r", cstride=1)
ax2.set_xlabel("$x_0$", color="r")
ax2.set_ylabel("$x_1$", color="r")
ax2.set_zlabel("$y_1$", color="r")
plt.tight_layout()
plt.show()
As can be seen in the figure, \(y_0\) and \(y_1\) are both mapped into the interval (0, 1).
2.2 Application scenarios
The softmax function can be applied in:
- Multi-class classification problems: it is the standard output-layer activation function for multi-class problems. It converts the raw output of the model (usually the output of a linear layer) into a probability distribution, which facilitates subsequent training with the cross-entropy loss function (see the sketch after this list).
- Output layer of a neural network: it is often used as the output-layer activation function when building neural networks for classification tasks, in particular when generating the final class predictions in convolutional neural networks (CNN), recurrent neural networks (RNN), and their variants.
- Reinforcement learning: in some reinforcement learning scenarios, it converts Q-values (estimates of an action's value) into probabilities of selecting each action, realizing a probability-based action-selection strategy.
- Natural language processing: it is used to compute attention weights, which determine which parts of the input the model should pay more attention to.
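A minimal sketch of the first point, pairing softmax with the cross-entropy loss on hypothetical 3-class logits; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition above:

import numpy as np

def softmax(z):
    z = z - z.max()      # standard trick to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical linear-layer output
probs = softmax(logits)

true_class = 0                        # assumed ground-truth label
loss = -np.log(probs[true_class])     # cross-entropy loss for this sample
print(probs, loss)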
The softmax function is an important tool in machine learning and deep learning for handling multi-class problems, generating probability distributions, and making probabilistic decisions.
3. The connection between the two functions
Finally, let's analyze the relationship between the two functions.
As presented above, the sigmoid function suits binary classification problems, while the softmax function suits multi-class problems.
So, is the sigmoid function a simplified version of the softmax function?
Consider a softmax function with only two variables, in which \(y_0=\frac{e^{x_0}}{e^{x_0}+e^{x_1}}\).
Multiplying the numerator and denominator by \(e^{-x_0}\) gives: \(y_0=\frac{e^{x_0}e^{-x_0}}{e^{x_0}e^{-x_0}+e^{x_1}e^{-x_0}}=\frac{e^{x_0-x_0}}{e^{x_0-x_0}+e^{x_1-x_0}}=\frac{1}{1+e^{-(x_0-x_1)}}\)
Letting \(y=y_0\) and \(x = x_0-x_1\), we get: \(y=\frac{1}{1+e^{-x}}\),
which is a typical sigmoid function.
Therefore, the softmax function can be regarded as the sigmoid function extended to multiple variables.
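A quick numerical check of this derivation (the inputs x0 and x1 are arbitrary values chosen for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

x0, x1 = 1.7, -0.3
y0 = softmax(np.array([x0, x1]))[0]   # two-variable softmax, first output
print(y0, sigmoid(x0 - x1))           # the two values should agree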