
Mathematical Foundations of Machine Learning - Calculus


Differentiation and integration play a crucial role in machine learning.
Not only are they at the heart of many of the underlying algorithms and models, they also profoundly affect model optimization, performance evaluation, and the development of new algorithms.

Mastering calculus not only gives us one more way to calculate; it also helps us understand how various machine learning algorithms and models find their optimal parameters.

1. Why do we need calculus?

Some people find calculus difficult, probably because what we usually calculate are fixed, static things, for which addition, subtraction, multiplication and division are enough.
For example:

  • Calculating prices and discounts when ordering takeout;
  • Calculating the time needed to get somewhere based on distance and means of transportation;
  • Estimating area and volume when arranging things, etc.

For things that are constantly changing, however, traditional calculations make it difficult to give an accurate description.
For example:

  • Calculating the distance traveled by a car over a period of time. How do we relate distance to speed when, in real life, the car's speed is constantly and irregularly changing, depending on road conditions, load and many other factors?
  • Forecasting population growth. Population growth is a continuous process of change, influenced by birth rates, death rates, migration rates and other factors, and the population at a future point in time is projected from data on past growth.
  • Analyzing the timing of stock trades: predicting at which price points a stock is more likely to go up or down, based on past price movements.

Differentiation and integration are essentially another form of arithmetic (like addition, subtraction, multiplication, division, exponentiation and logarithms).
Compared with the other operations, their advantage is the ability to describe precisely how things change.

2. Differentiation

Calculus actually consists of two operations, differentiation and integration, which are inverses of each other, like addition and subtraction, or multiplication and division.

Differentiation studies the rate of change of a function in the neighborhood of a point, while integration studies the cumulative effect of a function over an interval.
Differentiation, also known as taking the derivative, is the one I use more often.

2.1 What is differentiation

The birth of calculus also came with a "struggle": Isaac Newton and Leibniz both claimed to have invented calculus first. Newton ultimately won the dispute, but both were very great scientists.
It also shows that European science and technology were already quite advanced at that time, which created an urgent need in scientific research for calculus, a new way of calculating.

Here's an example of speed versus time to see how calculus can help us accurately calculate change.
First, consider two objects moving at constant speed.

Time (\(t\)) | Speed (\(v_1\)) | Speed (\(v_2\))
0 | 5 | 8
1 | 5 | 8
3 | 5 | 8
5 | 5 | 8
10 | 5 | 8


With constant speeds, you can tell at a glance which object is faster; no calculus is needed.

Next, look at the case of speed that changes uniformly (constant acceleration):

Time (\(t\)) | Speed (\(v_1\)) | Speed (\(v_2\))
0 | 0 | 0
1 | 2 | 3
3 | 6 | 9
5 | 10 | 15
10 | 20 | 30

where \(v_1=2t\) and \(v_2=3t\).
In this case, we can still see which of the two objects is faster, and we can also calculate their accelerations, 2 and 3, without any calculus.

Finally, look at the case of non-uniformly changing speed, which is the closest to reality.
Anyone with actual driving experience knows it is almost impossible to keep a constant speed or a constant acceleration; all sorts of factors affect speed, and the throttle you control is only one of them.
Simulate two non-uniformly varying speeds:

Time (\(t\)) | Speed (\(v_1\)) | Speed (\(v_2\))
0 | 0 | 0
1 | 10 | 1
3 | 90 | 27
5 | 250 | 125

where \(v_1=10t^2\) and \(v_2=t^3\).
Now it's not so easy to see which speed is increasing faster, is it? Nor is it easy to calculate the acceleration of either object at a given moment.

From the graph, though, you can see that before 10 seconds \(v_1\) is greater than \(v_2\), and after 10 seconds \(v_2\) is greater than \(v_1\).
But which of \(v_1\) and \(v_2\) is growing faster? That's not so easy to tell, even from the graph above.
At this point, differentiation will tell you whose speed is changing faster.
The rules for calculating derivatives are described in the next section; for now, here are the results: \(v_1^{'} = 20t\) and \(v_2^{'}=3t^2\).
The derivative of the speed is the rate of change of the speed, i.e. the acceleration:

After differentiating, you can see exactly how the two speeds change:

  • Before the purple intersection point, \(v_1\) increases faster than \(v_2\);
  • After the purple intersection point, \(v_2\) increases faster than \(v_1\).
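
To make this concrete, here is a minimal sketch (added here, not part of the original article) that uses sympy to check the two derivatives quoted above and to find where the two accelerations cross:

```python
# A minimal sketch: verify the derivatives of v1 = 10*t^2 and v2 = t^3,
# and find where the two accelerations become equal.
import sympy as sp

t = sp.symbols('t', positive=True)
v1 = 10 * t**2          # first speed curve from the table
v2 = t**3               # second speed curve from the table

a1 = sp.diff(v1, t)     # 20*t
a2 = sp.diff(v2, t)     # 3*t**2

print(a1, a2)                       # 20*t 3*t**2
print(sp.solve(sp.Eq(a1, a2), t))   # [20/3]
```

The accelerations \(20t\) and \(3t^2\) are equal at \(t = 20/3 \approx 6.7\); before that point \(v_1\) grows faster, after it \(v_2\) does.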

2.2 Calculation rules

As the example above shows, for complex (non-uniform) change, the pattern of change can be quickly identified by differentiation, so that the exact rate of change at each point in time can be calculated accurately.

Calculating a derivative is not as intuitive as adding, subtracting, multiplying and dividing, but it is not complicated either.
For polynomials, the differentiation rule is \((ax^n)^{'} = anx^{n-1}\), where \(a\) is a constant and \(n\) is the exponent of the variable \(x\).

For other special functions (e.g. trigonometric functions, logarithmic functions, etc.), see Wikipedia's table of differentiation rules:
/wiki/Differentiation_rules
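
As a quick sanity check of the polynomial rule (this snippet is an illustration added here, not from the original article), a central finite difference can be compared against \(anx^{n-1}\):

```python
# Numerically check the rule (a*x**n)' = a*n*x**(n-1) with a central difference.
def numeric_derivative(f, x, h=1e-6):
    """Approximate f'(x) using a central finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

a, n = 5.0, 3                        # f(x) = 5*x**3
f = lambda x: a * x**n

x = 2.0
approx = numeric_derivative(f, x)
exact = a * n * x**(n - 1)           # 5 * 3 * 2**2 = 60
print(approx, exact)                 # ~60.000000, 60.0
```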

2.3 Chain rule

Differentiation has a useful property known as the chain rule.
This rule is very useful when you need to differentiate a nested (composite) function, for example:

One way to do this is to substitute the polynomial \(y\) into the function \(f\) and differentiate directly:

The other way is to use the chain rule:

Differentiating the function \(f\) with respect to \(x\) can be converted into differentiating \(f\) with respect to \(y\) and multiplying by the derivative of \(y\) with respect to \(x\), i.e. \(\frac{df}{dx} = \frac{df}{dy} \cdot \frac{dy}{dx}\). From this it follows:

Both ways give the same result.
However, if the function \(f\) and \(y\) are both complicated, applying the chain rule can greatly simplify the differentiation.
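
Since the article's original nested example isn't reproduced above, here is a sketch with hypothetical stand-ins \(f(y)=y^3\) and \(y(x)=x^2+1\), showing that the direct route and the chain-rule route give the same derivative:

```python
# Verify the chain rule df/dx = (df/dy) * (dy/dx) on hypothetical f and y.
import sympy as sp

x, y = sp.symbols('x y')
y_expr = x**2 + 1                    # hypothetical inner polynomial y(x)
f = y**3                             # hypothetical outer function f(y)

# Route 1: substitute y(x) into f, then differentiate with respect to x.
direct = sp.diff(f.subs(y, y_expr), x)

# Route 2: chain rule, (df/dy evaluated at y(x)) * (dy/dx).
chained = sp.diff(f, y).subs(y, y_expr) * sp.diff(y_expr, x)

print(sp.simplify(direct - chained) == 0)   # True: both routes agree
```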

3. Partial differentiation

In machine learning algorithms, a formula rarely involves just one variable \(f(x)\); it is almost always a function of several variables, \(f(x_0,x_1,...,x_n)\).
In this case, how do we calculate how the function \(f\) changes with respect to each of its variables?

This is where partial differentiation (also known as partial derivatives) comes in: taking the derivative of the function \(f\) with respect to one of its variables while the others are held fixed.

3.1 Calculation rules

Once you understand ordinary differentiation, computing partial derivatives is easy. For example, take a function of two variables: \(f(x_0,x_1)=x_0^{2}+2x_0x_1+3\)

Taking the derivatives with respect to \(x_0\) and \(x_1\) separately gives: \(f^{'}(x_0)=2x_0+2x_1\) and \(f^{'}(x_1)=2x_0\)

Simply put, to take the derivative with respect to \(x_0\), treat \(x_1\) as a constant; to take the derivative with respect to \(x_1\), treat \(x_0\) as a constant.

Partial derivative calculations let us find out which variable's change has the greatest impact on the value of the function \(f\).
The corresponding machine learning scenario: for a model (\(f\)), which feature (\(x_0, x_1\)) has the greatest impact on the model's output.
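
As an illustration (added here, not from the original article), the partial derivatives of the example function above can be approximated numerically by nudging one variable while holding the other fixed:

```python
# Approximate the partial derivatives of f(x0, x1) = x0**2 + 2*x0*x1 + 3
# by central differences, holding the other variable fixed.
def f(x0, x1):
    return x0**2 + 2 * x0 * x1 + 3

def partial_x0(x0, x1, h=1e-6):
    return (f(x0 + h, x1) - f(x0 - h, x1)) / (2 * h)   # should approach 2*x0 + 2*x1

def partial_x1(x0, x1, h=1e-6):
    return (f(x0, x1 + h) - f(x0, x1 - h)) / (2 * h)   # should approach 2*x0

x0, x1 = 1.5, 4.0
print(partial_x0(x0, x1), 2 * x0 + 2 * x1)   # ~11.0, 11.0
print(partial_x1(x0, x1), 2 * x0)            # ~3.0, 3.0
```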

3.2 Graphical meaning of partial derivatives

Partial derivatives can also be examined graphically, although graphs in more than 3 dimensions cannot be drawn, so only a function of two parameters is plotted here.
Plot the function from the example above: \(f(x_0,x_1)=x_0^{2}+2x_0x_1+3\)

Then take the partial derivatives with respect to \(x_0\) and \(x_1\) separately. The partial derivative with respect to \(x_0\) is \(f^{'}(x_0)=2x_0+2x_1\).
Taking different values of \(x_1\), the graph of this partial derivative is:

As the figure shows, the partial derivative \(f^{'}(x_0)\) increases linearly, and \(x_1\) only affects its starting value (the intercept).
If \(f(x_0,x_1)=x_0^{2}+2x_0x_1+3\) is viewed as a machine learning model,
then as the feature \(x_0\) increases, its impact on the model keeps growing.

The partial derivative with respect to \(x_1\) is \(f^{'}(x_1)=2x_0\). Taking different values of \(x_0\), the graph of this partial derivative is:

As the figure shows, the partial derivative \(f^{'}(x_1)\) is constant, and its value is determined by \(x_0\).
If \(f(x_0,x_1)=x_0^{2}+2x_0x_1+3\) is viewed as a machine learning model,
then the feature \(x_1\) has a stable impact on the model, and how large that impact is depends on the value of \(x_0\).
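
For readers who want to reproduce these plots, here is a rough matplotlib sketch (an assumption about how the original figures were drawn, not the article's own code) of the two families of partial-derivative lines:

```python
# Plot f'(x0) = 2*x0 + 2*x1 for several fixed x1, and f'(x1) = 2*x0 for several fixed x0.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))

for x1 in (-2, 0, 2):
    ax0.plot(x, 2 * x + 2 * x1, label=f"x1 = {x1}")          # same slope, intercept set by x1
ax0.set_title("partial derivative w.r.t. x0")
ax0.legend()

for x0 in (-2, 0, 2):
    ax1.plot(x, np.full_like(x, 2 * x0), label=f"x0 = {x0}")  # flat lines, height set by x0
ax1.set_title("partial derivative w.r.t. x1")
ax1.legend()

plt.show()
```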

4. Summary

In the machine learning algorithms I usually work with, differentiation is used far more often, so only the relevant differentiation operations are introduced here; integration is the inverse operation of differentiation and will not be repeated.
For integrals of complex functions, tables of integrals are also available (/wiki/Lists_of_integrals).

Finally, a summary of where calculus operations are encountered in machine learning algorithms:

  1. Gradient descent algorithm: used to find a local minimum of a function. It computes the gradient (i.e., the partial derivatives) of the loss function with respect to the model parameters and updates the parameters in the opposite direction of the gradient (a minimal sketch follows after this list).
  2. Backpropagation algorithm: based on the chain rule, the gradient is computed layer by layer from the output layer to the input layer, thus updating the weights and biases in the network.
  3. Design of the loss function: Both the mean square error loss function and the cross-entropy loss function are derivable, and their gradients can be easily computed by calculus.
  4. Regularization techniques: to prevent overfitting, regularization terms are often used to penalize model complexity. The design of these terms relies on calculus, since their gradients with respect to the model parameters must be computed so their effect can be taken into account during optimization.
  5. Probabilistic models and Bayesian methods: In probabilistic models for machine learning, calculus is used to compute probability distributions, conditional probabilities, marginal probabilities, and expectations.
  6. Feature selection and dimensionality reduction: In feature selection and dimensionality reduction techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA), calculus is used to compute covariance matrices, eigenvalues, eigenvectors, etc., of the data to help identify the most important features or to reduce the dimensionality of the data.
  7. Others.
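
To tie the list back to the calculus above, here is a minimal gradient-descent sketch; the quadratic loss below is a hypothetical example chosen only for illustration, and its gradient is just the pair of partial derivatives computed with the rules from sections 2.2 and 3.1:

```python
# Minimal gradient descent on a hypothetical quadratic loss with minimum at (3, -1).
def loss(w0, w1):
    return (w0 - 3)**2 + (w1 + 1)**2

def gradient(w0, w1):
    return 2 * (w0 - 3), 2 * (w1 + 1)        # partial derivatives w.r.t. w0 and w1

w0, w1, lr = 0.0, 0.0, 0.1                   # starting point and learning rate
for _ in range(100):
    g0, g1 = gradient(w0, w1)
    w0 -= lr * g0                            # step against the gradient
    w1 -= lr * g1

print(round(w0, 4), round(w1, 4))            # ≈ 3.0, -1.0
```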