Two function names come up often in machine learning: sigmoid and softmax.
The former recurs in neural networks, where it serves as a neuron's activation function; the latter appears in many classification algorithms, especially multi-class scenarios, where it is used to determine which class has the highest probability.
This article covers the definition and shape of these two functions, their role in algorithms, and the connection between them.
1. sigmoid function
1.1 Function definition
The sigmoid function is a collective term for a class of functions. A commonly used sigmoid function is: \(y=\frac{1}{1+e^{-x}}\)
It is sometimes referred to as the S-function, because its graph is S-shaped.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 100)      # 100 evenly spaced points in [-10, 10]
y = 1 / (1 + np.exp(-x))           # sigmoid
plt.figure(figsize=(6, 4))
plt.plot(x, y)
plt.title("Sigmoid function")
plt.grid(True)
plt.show()
As can be seen from the graph, the sigmoid function's output is confined to a limited range (between 0 and 1 for the function above).
It is precisely this property that makes it well suited to representing probabilities, or to use in the output layer of a binary classification problem.
Note that the output of a sigmoid function does not have to lie in the interval (0, 1).
For example, another commonly used S-shaped function is \(\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}\), whose output lies in the interval (-1, 1).
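A quick numerical check of the two output ranges (a minimal sketch using NumPy's built-in tanh):

import numpy as np

x = np.array([-5.0, 0.0, 5.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: values stay in (0, 1)
print(np.tanh(x))             # tanh: values stay in (-1, 1)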
1.2 Application scenarios
The main usage scenarios of the sigmoid function are:
- Logistic regression: the sigmoid function converts the output of a linear model into a probability value for binary classification. The probability indicates the likelihood that a sample belongs to a particular class (see the sketch after this list).
- Activation function for neural networks: its nonlinear transformation helps a neural network learn complex decision boundaries and increases the expressive power of the model.
- Gating mechanisms: in recurrent neural networks such as LSTM (long short-term memory) networks, the sigmoid function (or variants thereof) is used as part of a gating mechanism to control the flow of information.
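Here is a minimal sketch of the logistic regression point above; the weights, bias, and sample values are hypothetical, chosen only for illustration:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

w = np.array([0.8, -0.4])          # hypothetical learned weights
b = 0.1                            # hypothetical learned bias
sample = np.array([2.0, 1.5])      # one input sample

p = sigmoid(sample @ w + b)        # probability of the positive class
print(f"P(y=1) = {p:.3f}")         # predict class 1 if p >= 0.5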
The sigmoid function played an important role in machine learning and early deep learning, especially for binary classification problems and as an activation function in neural networks.
However, as deep learning developed, other activation functions have gradually replaced the sigmoid function in certain scenarios.
2. softmax function
2.1 Function definition
Next, the softmax function. The softmax function is widely used in machine learning and deep learning, especially in multi-class classification problems, whereas the sigmoid function described above is more often used in binary classification scenarios.
The main purpose of the softmax function is to convert a K-dimensional vector (which typically represents the raw predicted scores for each class) into a K-dimensional vector whose elements all lie in the range (0, 1) and sum to 1.
This description is a bit abstract, so consider an example. Suppose there is a 3-dimensional vector: \((x_1,x_2,x_3) = (3,1,-2)\)
Its elements do not lie in the interval (0, 1), nor do they sum to 1.
So how does the softmax function convert it?
First, compute the sum of the exponentials of the elements: \(m=e^{x_1}+e^{x_2}+e^{x_3}\).
Then, convert the vector \(x\) into the vector \(y\): \((y_1,y_2,y_3)= (\frac{e^{x_1}}{m},\frac{e^{x_2}}{m},\frac{e^{x_3}}{m})\approx(0.876,0.118,0.006)\)
Each element of the converted vector \(y\) lies in the interval (0, 1), and all the elements sum to 1.
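A minimal sketch that reproduces the example above in NumPy:

import numpy as np

def softmax(x):
    e = np.exp(x)        # element-wise exponentials
    return e / e.sum()   # normalize so the outputs sum to 1

y = softmax(np.array([3.0, 1.0, -2.0]))
print(y)          # -> approximately [0.876 0.118 0.006]
print(y.sum())    # -> 1.0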
The softmax function can also be plotted:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def softmax(x0, x1, x2):
    m = np.exp(x0) + np.exp(x1) + np.exp(x2)
    return np.exp(x0) / m, np.exp(x1) / m, np.exp(x2) / m

count = 30
x0 = np.linspace(-10, 10, count)
x1 = np.linspace(-5, 5, count)
y = np.zeros((count, count, 3))   # softmax outputs on the grid, with x2 fixed at 1
for i0 in range(count):
    for i1 in range(count):
        y[i1, i0, :] = softmax(x0[i0], x1[i1], 1)
xx0, xx1 = np.meshgrid(x0, x1)
fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(xx0, xx1, y[:, :, 0], color="g")
ax1.set_xlabel("$x_0$", color="g")
ax1.set_ylabel("$x_1$", color="g")
ax1.set_zlabel("$y_0$", color="g")
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(xx0, xx1, y[:, :, 1], color="r", cstride=1)
ax2.set_xlabel("$x_0$", color="r")
ax2.set_ylabel("$x_1$", color="r")
ax2.set_zlabel("$y_1$", color="r")
plt.tight_layout()
plt.show()
As can be seen in the figure, \(y_0\) and \(y_1\) are both mapped into the interval (0, 1).
2.2 Application scenarios
The softmax function can be applied in:
- Multi-class classification problems: it is the standard output-layer activation function for multi-class problems. It converts the raw output of the model (usually the output of a linear layer) into a probability distribution, which facilitates subsequent training with the cross-entropy loss function (see the sketch after this list).
- Output layer of a neural network: it is often used as the output-layer activation function when building neural networks for classification tasks, in particular when generating the final class predictions in convolutional neural networks (CNN), recurrent neural networks (RNN), and their variants.
- Reinforcement learning: in some reinforcement learning scenarios, it converts Q-values (estimates of an action's value) into probabilities of selecting each action, realizing a probability-based action-selection strategy.
- Natural language processing: it is used to compute attention weights, which determine which parts of the input the model should pay more attention to.
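A minimal sketch of the first point, pairing softmax with the cross-entropy loss on hypothetical 3-class logits; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the definition above:

import numpy as np

def softmax(z):
    z = z - z.max()      # standard trick to avoid overflow in exp
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # hypothetical linear-layer output
probs = softmax(logits)

true_class = 0                        # assumed ground-truth label
loss = -np.log(probs[true_class])     # cross-entropy loss for this sample
print(probs, loss)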
The softmax function is an important tool in machine learning and deep learning for handling multi-class problems, generating probability distributions, and making probabilistic decisions.
3. The connection between the two functions
Finally, let's analyze the relationship between the two functions.
As presented above, the sigmoid function suits binary classification problems, while the softmax function suits multi-class problems.
So, is the sigmoid function a simplified version of the softmax function?
Consider a softmax function with only two variables, in which \(y_0=\frac{e^{x_0}}{e^{x_0}+e^{x_1}}\).
Multiplying the numerator and denominator by \(e^{-x_0}\) gives: \(y_0=\frac{e^{x_0}e^{-x_0}}{e^{x_0}e^{-x_0}+e^{x_1}e^{-x_0}}=\frac{e^{x_0-x_0}}{e^{x_0-x_0}+e^{x_1-x_0}}=\frac{1}{1+e^{-(x_0-x_1)}}\)
Letting \(y=y_0\) and \(x = x_0-x_1\), we get: \(y=\frac{1}{1+e^{-x}}\),
which is a typical sigmoid function.
Therefore, the softmax function can be regarded as the sigmoid function extended to multiple variables.
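A quick numerical check of this derivation (the inputs x0 and x1 are arbitrary values chosen for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

x0, x1 = 1.7, -0.3
y0 = softmax(np.array([x0, x1]))[0]   # two-variable softmax, first output
print(y0, sigmoid(x0 - x1))           # the two values should agree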