C# Getting Started with Deep Learning: 10,000 Words on Calculus and Gradient Descent

Tutorial name: Getting Started with Deep Learning in C#

Author: Kouryou the Idiot

Address:

Calculus
  - Limits
  - Derivatives
    - Derivative formulas
    - Examples of product and quotient derivatives
    - Chain rule for derivatives of composite functions
    - Derivatives of Sigmoid functions
    - Minimization problems
  - Differentials
  - Integrals
  - Partial derivatives
    - Domain of a multivariate function
    - Values of multivariate functions
    - Limits of multivariate functions
    - Partial derivatives
    - Total differential
    - Finding minimum values with partial derivatives
    - Lagrange multiplier method
  - Gradient
    - Directional derivative
    - Gradient
    - The basic formula for the gradient descent method
    - Hamiltonian operator $\bigtriangledown$
    - Gradient descent method to find an approximation of the minimum value

Calculus

Learning calculus requires some background knowledge, but this tutorial is not a math textbook and cannot introduce every basic concept in detail, so readers should learn the basics of elementary functions, trigonometric functions and so on by themselves.

Limits

The symbol for a limit is $\lim$. Higher mathematics mainly deals with limits of sequences and limits of functions; for reasons of space, this article only discusses some cases in which the limit of a function exists.

Mathematics has the concepts of positive infinity ($+\infty$) and negative infinity ($-\infty$). We all know what infinity means, but infinitesimals are easily confused with negative infinity: an infinitesimal is a quantity that gets infinitely close to 0, not negative infinity.

As an example, consider $\frac{1}{x}$ as $x \to +\infty$. We all know that the larger x becomes, the smaller $\frac{1}{x}$ gets; it moves closer and closer to 0 but can never equal 0.

When solving for limits you will generally meet questions of this kind: what does y approach as x goes to infinity?

For example, as shown in the figure below, as x grows without bound, y gets closer and closer to the x-axis; we write $y \to 0$ to indicate that y approaches 0.

Image from Chapter 1, Section 3, Definition and Calculation of the Limit of a Function, Higher Mathematics, edited by the Department of Mathematics, Tongji University.

So:

\[\lim_{x \to \infty} f(x) = \lim_{x \to \infty} \frac{1}{x} = 0 \]

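Expressed in C#, we use an extremely large number to represent infinity.

// Assumes the TorchSharp package, with: using TorchSharp;
var x = torch.tensor(double.MaxValue);
var y = 1 / x;

// 1 / x is tiny but not exactly zero; the cast to int truncates it to 0
var lim = (int)y.item<double>();
Console.WriteLine(lim);

The above uses y.item<double>() to convert a tensor to a scalar; we can also use y.ToScalar().ToInt32() for the conversion.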

As another example, shown in the figure below, as x goes to infinity, y gets closer and closer to $\frac{\pi}{2}$, so:

\[\lim_{x \to \infty} \arctan x = \frac{\pi}{2} \]

Image from Chapter 1, Section 3, Definition and Calculation of the Limit of a Function, Higher Mathematics, edited by the Department of Mathematics, Tongji University.

The limits above are all of the form $\lim_{x \to \infty}$ or $\lim_{x \to 0}$. In practice it is more common to be given a point and asked for the limit there, for example:

\[\lim_{x \to x_{0}} f(x) = \lim_{x \to x_{0}} \frac{1}{x} \]

When $x_{0} = 1$, direct calculation gives $y = 1$, so the limit is 1. In other words, when we ask for the limit of a function at $x_{0}$ and the value $y_{0}$ can be computed directly there (as it can for a continuous function like this one), that value is the limit at that point.

Calculating the limit of such a function is simple, since it can be computed directly from $y=f(x)$.

The following are two exercises from Tongji University's Advanced Mathematics, Volume 1.

When x approaches 0, the numerator is 0. Since 0 divided by any number is 0, is the limit 0?

When $x \to 0$ makes the numerator or denominator 0 like this, the limit cannot be calculated by direct substitution. The process of answering these two questions is:

Since this article is not a math tutorial, the details will not be discussed in depth here.

In advanced mathematics, there are two very important limits:

\[\lim_{x \to 0} \frac{\sin x}{x} = 1 ,x \in (0,\frac{\pi }{2}) \]

\[\lim_{x \to 0} (1 + x)^{\frac{1}{x}} = e \]
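Both limits can be checked numerically. The following is a minimal sketch in plain C# (top-level statements, no tensor library); the helper names SinRatio and OnePlusXPow are made up for this illustration. It evaluates each expression for x shrinking toward 0 from the right.

```csharp
using System;

// The two expressions whose limits are stated above.
double SinRatio(double x) => Math.Sin(x) / x;
double OnePlusXPow(double x) => Math.Pow(1 + x, 1 / x);

// Evaluate both for x getting closer and closer to 0.
foreach (var x in new[] { 0.1, 0.01, 0.001, 0.0001 })
{
    Console.WriteLine($"x = {x}: sin(x)/x = {SinRatio(x):F8}, (1+x)^(1/x) = {OnePlusXPow(x):F8}");
}

// Reference value of e for comparison.
Console.WriteLine($"Math.E = {Math.E:F8}");
```

As x shrinks, the first column should settle toward 1 and the second toward e; a numerical table like this only suggests the limit, it does not prove it.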

Derivatives

Given a function, how do you calculate the rate of change of the function over some interval?

As shown in the figure, take the function $y = x^{2}$ on the interval $[1,3]$, with starting point A and ending point B.

Then the average rate of change is:

\[\frac{\bigtriangleup y}{\bigtriangleup x} = \frac{9-1}{3-1} = \frac{8}{2} = 4 \]

But things get subtler when $\bigtriangleup{x}$ and $\bigtriangleup{y}$ are very small. If we ask for the average rate of change in a small neighborhood of $x=3$ (where $y=9$), then:

\[\frac{\bigtriangleup y}{\bigtriangleup x} = \frac{(3 + \bigtriangleup x)^2 - 9}{\bigtriangleup x} = 6 + \bigtriangleup x \]

When $\bigtriangleup x$ is very small, this ratio reflects the instantaneous rate of change of the function at $x=3$. Geometrically, the instantaneous rate of change at a point is the slope of the tangent line there, such as the tangent lines through points A and B.

The tangent line is a line that just touches the curve at a point. From the figure, as x gets larger, the value of the function at $x+1$ exceeds the value at $x$ by more and more; for example, the gaps between $3^{2}$, $4^{2}$ and $5^2$ keep growing.

The tangent line reflects this rate of change. As shown in the figure, the tangent line at point B is steeper than the tangent line at point A.

As a result, a new function has emerged called the derivative of the original function, or simply the derivative, which is also a function by which the instantaneous rate of change of the original function at any point can be calculated.
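The step from average to instantaneous rate of change can be illustrated numerically. The sketch below is plain C# with made-up helper names (F, DifferenceQuotient); it computes the average rate of change of $y = x^2$ over $[1,3]$ and then over ever smaller intervals starting at $x = 3$.

```csharp
using System;

// y = x^2, the function used in the example above.
double F(double x) => x * x;

// Average rate of change of F between x and x + dx.
double DifferenceQuotient(double x, double dx) => (F(x + dx) - F(x)) / dx;

// Over [1, 3] the average rate of change is (9 - 1) / (3 - 1).
Console.WriteLine($"[1, 3]: {DifferenceQuotient(1, 2)}");

// Near x = 3 the quotient settles toward the slope of the tangent line.
foreach (var dx in new[] { 0.1, 0.01, 0.001 })
{
    Console.WriteLine($"dx = {dx}: {DifferenceQuotient(3, dx)}");
}
```

The printed quotients approach 6, which matches the derivative $2x$ of $x^2$ evaluated at $x = 3$.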

There are various symbols for the representation of derivatives, for example:

\[\lim_{\bigtriangleup x \to 0} \frac{\bigtriangleup y}{\bigtriangleup x} = f'(x) = y' = \frac{dy}{dx} = \frac{df(x)}{dx}= \frac{df}{dx} \]

d is the differential symbol, e.g., dy is the differential of y and dx is the differential of x.

If we want the instantaneous rate of change at a point $x_{0}$, then:

\[\lim_{\bigtriangleup x \to 0} \frac{\bigtriangleup y}{\bigtriangleup x} \big|_{x_{0}} = f'(x) \big|_{x_{0}} = y' \big|_{x_{0}} = \frac{dy}{dx} \big|_{x_{0}} = \frac{df(x)}{dx} \big|_{x_{0}} = \frac{df}{dx} \big|_{x_{0}} \]

Notations such as $f'(x)$ and $y'$ should be easy to understand for readers with some basic math. The last three, written with $d$, are important, and we will use them a lot in differentials and integrals.

Loosely speaking, for very small increments we can understand it like this:

\[dy = \bigtriangleup y \]

\[dx = \bigtriangleup x \]

In PyTorch, we can perform this calculation with the automatic differentiation system. For example, suppose we want to compute the derivative of $x^2$ at $x=3$, i.e. $\frac{d(x^2)}{dx} \big|_{x=3}$.

// Assumes the TorchSharp package (using System; using TorchSharp;)
// Define the y = x^2 function; the parameter is named t so it does not clash with x below
Func<torch.Tensor, torch.Tensor> func = t => t.pow(2);

var x = torch.tensor(3.0, requires_grad: true);
var y = func(x);

// Calculate the derivative
y.backward();

// Convert to a scalar value; the derivative is stored on x, not on y
// (depending on the TorchSharp version, grad may be a method: x.grad())
var grad = x.grad.ToScalar().ToDouble();
Console.WriteLine(grad);

Make no mistake: after calculating the derivative, the derivative value is read from x, not from y, which holds the result of the function. Why isn't the derivative simply returned when it is calculated? Because PyTorch's automatic differentiation system is general and computes partial derivatives, and for a function of one variable the partial derivative with respect to x is simply the derivative of y. We'll talk more about this later, when we discuss partial derivatives and gradients.

In addition, when creating the tensor for x, you need to pass the requires_grad: true parameter.

Derivative formulas

Here are some basic derivative formulas from Tongji University's Advanced Mathematics.

For example, to find the derivative of $y = x^2$, using formula (2) above yields $y' = 2x$.

Differentiating complicated functions can be troublesome and is not covered in detail here. There are also higher-order derivatives, i.e., derivatives of derivatives; the notation for the second-order derivative is as follows:

\[f''(x) = y'' = \frac{d^2y}{dx^2} = \frac{d^2f(x)}{dx^2}= \frac{d^2f}{dx^2} \]

Examples of product and quotient derivatives

The main examples are derivatives of products and derivatives of quotients, involving some exponential and logarithmic functions.

① Find the derivative of the following function.

\[f(x) = e^x \cos x \]

Solution:

\[\begin{align} f'(x) &= (e^x \cos{x})' \\ &= (e^x)'\cos{x} + e^x (\cos{x})' \\ &= e^x \cos{x} - e^x \sin{x} \end{align} \]

② Find the derivative of the following function:

\[y = \frac{x+1}{\ln x} \]

Solution:

\[\begin{align} y' &= \frac{(x+1)'\ln{x} - (x+1)(\ln{x})'}{(\ln{x})^2} \\ &= \frac{\ln{x} - (x+1) \cdot \frac{1}{x}}{(\ln{x})^2} \\ &= \frac{x \ln{x} -(x+1)}{x(\ln{x})^2} \end{align} \]

Chain rule for derivatives of composite functions

If $u=g(x)$ is differentiable at a point $x$ and $y=f(u)$ is differentiable at the corresponding point $u$, then the composite function $y=f[g(x)]$ is differentiable at $x$, and:

\[\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \]

If the function is more complex, it can also be generalized to the case of finitely many composite functions, for example:

\[\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} \]

Example: find the derivative of $y = e^{2x}$.

Let $u=2x$. Then:

\[y' = (e^u)' = e^u \cdot (u)' = e^{2x} \cdot (2x)' = 2e^{2x} \]

Derivatives of Sigmoid Functions

From the above, we know the quotient rule for derivatives:

\[\big (\frac{u}{v} \big )' = \frac{u' \cdot v - u \cdot v'}{v^2} \]

So for functions of the form $\frac{1}{f(x)}$, the derivative is:

\[\big ( \frac{1}{f(x)} \big )' = - \frac{f'(x)}{f(x)^2} \]

The Sigmoid function $σ(x)$ is one of the best-known activation functions in neural networks and is defined as follows:

\[\sigma (x) = \frac{1}{1+e^{-x}} \]

When learning gradient descent later, you will need the derivative of the Sigmoid function, and it is easiest to use the following formula:

\[\sigma '(x) = \sigma (x)(1 - \sigma (x) ) \]

Of course, you can also derive it step by step with the rule for $\frac{1}{f(x)}$ above.

\[\begin{align} \sigma '(x) &= (\frac{1}{1+e^{-x}})' \\ &= \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \frac{(1+e^{-x}) - 1}{(1+e^{-x})^2} \\ &= \frac{1}{1+e^{-x}} - \frac{1}{(1+e^{-x})^2} \\ &= \frac{1}{1+e^{-x}} (1 - \frac{1}{1+e^{-x}}) \\ &= \sigma (x)(1 - \sigma (x) ) \end{align} \]
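The formula $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be cross-checked against a numerical derivative. This is a minimal sketch in plain C#; the function names and the step size h are choices made for this illustration, not part of any library.

```csharp
using System;

double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

// Closed form from the text: sigma'(x) = sigma(x) * (1 - sigma(x)).
double SigmoidDerivative(double x)
{
    var s = Sigmoid(x);
    return s * (1 - s);
}

// Central difference as an independent numerical check.
double NumericDerivative(Func<double, double> f, double x, double h = 1e-5)
    => (f(x + h) - f(x - h)) / (2 * h);

foreach (var x in new[] { -2.0, 0.0, 2.0 })
{
    Console.WriteLine($"x = {x}: formula = {SigmoidDerivative(x):F6}, numeric = {NumericDerivative(Sigmoid, x):F6}");
}
```

The two columns should agree to several decimal places, and the derivative is largest at $x = 0$, where it equals 0.25.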

Minimization problems

The slope of a function has the following property: if a differentiable function attains an extreme value at a point a, the slope there is 0, i.e. $f'(a) = 0$, and the tangent line at this point is parallel to the x-axis.

The graph of $y = 3x^4 - 4x^3 - 12x^2 + 32$ is shown below. From the graph it is clear that there are extremes at the three points $x = -1$, $0$ and $2$, where the slope is 0. The function attains its minimum value at $x = 2$, and it has no maximum value.

So, given a function, how do we find all of its extremes and its minimum and maximum values? Here you can use the sign-chart method (also called the threading method).

The function is first differentiated and the derivative factored.

\[\begin{align} y' &= (3x^4 - 4x^3 - 12x^2 + 32)' \\ &= 12x^3 - 12x^2 - 24x \\ &= 12x(x^2 - x - 2) \\ &= 12x(x + 1)(x - 2) \end{align} \]

It follows that the slope of the function is 0 at the three points $x = -1,0,2$. The function values at these three points are then calculated and put into a table, as shown below.

Next, determine whether the derivative is positive or negative in each interval. For example, when $x<-1$ the derivative is negative, so $f(x)$ is decreasing on that interval. Based on the sign of the derivative, determine the increasing and decreasing intervals of $f(x)$ and make a new table.

The next step is simple: plot the three points on the axes according to their values, then connect them following the increasing and decreasing intervals. The minimum value is therefore 0.

Hand-drawn graphs don't need to be accurate; the main thing is to know the increasing and decreasing intervals and the extremes.
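The table-based procedure can be mirrored in a few lines of code. The sketch below is plain C# with illustrative names (F, DF); the sample points inside each interval are chosen arbitrarily. It evaluates the function at the three points where the derivative is 0 and prints the sign of the derivative in each interval.

```csharp
using System;

double F(double x) => 3 * Math.Pow(x, 4) - 4 * Math.Pow(x, 3) - 12 * x * x + 32;

// Factored derivative from the text: y' = 12x(x + 1)(x - 2).
double DF(double x) => 12 * x * (x + 1) * (x - 2);

// Function values at the three points where the derivative is 0.
foreach (var x in new[] { -1.0, 0.0, 2.0 })
{
    Console.WriteLine($"f({x}) = {F(x)}");
}

// One sample point per interval shows where f decreases or increases.
foreach (var x in new[] { -2.0, -0.5, 1.0, 3.0 })
{
    var trend = DF(x) < 0 ? "decreasing" : "increasing";
    Console.WriteLine($"f'({x}) = {DF(x)}, so f is {trend} there");
}
```

The smallest of the three function values is the one at $x = 2$, and the sign pattern (decreasing, increasing, decreasing, increasing) matches the shape described above.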


Differentials

Below is a diagram from Tongji University's Advanced Mathematics.

From the figure, the area of square A is $A = (x_{0})^2$ and the area of the large square is $(x_{0}+ \bigtriangleup x)^2$. Equivalently, adding up the areas of the rectangles gives the area of the large square:

\[S = x_{0}^2 + 2x_{0} \bigtriangleup x + (\bigtriangleup x)^2 \]

Then, after the side length has increased by $\bigtriangleup x$, how much has the area increased?

\[\bigtriangleup S = 2x_{0} \bigtriangleup x + (\bigtriangleup x)^2 \]

When $y = f(x)$ satisfies certain conditions (it is differentiable at x), its increment can be expressed as:

\[\bigtriangleup y = A \bigtriangleup x + o(\bigtriangleup x) \]

In the earlier discussion of derivatives, we saw that $\bigtriangleup y =f(x + \bigtriangleup x) - f(x)$, so:

\[\bigtriangleup y =f(x + \bigtriangleup x) - f(x) = A \bigtriangleup x + o(\bigtriangleup x) \]

When $\bigtriangleup x$ is very small and $A \not= 0$, the term $o(\bigtriangleup x)$ can be ignored, and we use $A \bigtriangleup x$ to approximate the value of $\bigtriangleup y$. This is the definition of the differential, where $A = f'(x)$.

\[dy = f'(x)\bigtriangleup x \]

As an example, find the increment of $y = x^3$ at $x=1$ when $\bigtriangleup x = 0.01$ and when $\bigtriangleup x = 0.001$.

The essence of this question is to find $\bigtriangleup y$ between $x = 1.01$ and $x = 1$, and $\bigtriangleup y$ between $x = 1.001$ and $x = 1$.

First compute:

\[(1.01)^3 = 1.030301 \]

\[(1.001)^3 = 1.003003001 \]

So the two increments are 0.030301 and 0.003003001, respectively.

But if we only need an approximation, we can use the differential, by first finding the derivative:

\[y' = (x^3)' = 3x^2 \]

So:

\[dy = 3x^2 \bigtriangleup x \]

So when $\bigtriangleup x = 0.01$:

\[dy = 3*(1)^2 * 0.01 = 3 * 0.01 = 0.03 \]

And when $\bigtriangleup x = 0.001$:

\[dy = 3*(1)^2 * 0.001 = 3 * 0.001 = 0.003 \]

So the increment of the function can be approximated in this way by the differential dy.

Because:

\[\bigtriangleup y \approx dy = f'(x) \bigtriangleup x \]

We use dy as an approximation of $\bigtriangleup y$, which is one of the application scenarios of differentials.
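The comparison between the true increment and the differential can be reproduced in code. This is a small sketch in plain C#; F and DF are illustrative names for $y = x^3$ and its derivative.

```csharp
using System;

double F(double x) => x * x * x;    // y = x^3
double DF(double x) => 3 * x * x;   // y' = 3x^2

foreach (var dx in new[] { 0.01, 0.001 })
{
    var exact = F(1 + dx) - F(1);   // true increment
    var approx = DF(1) * dx;        // differential dy = f'(x) * dx
    Console.WriteLine($"dx = {dx}: increment = {exact:F9}, dy = {approx:F9}, error = {exact - approx:E2}");
}
```

The error shrinks much faster than the step itself, which is why the differential is a useful approximation for small steps.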

Integrals

Having introduced derivatives earlier, we know that the derivative of $y = x^3$ is $y' = 3x^2$.

Conversely, suppose we know that the derivative of some function $F(x)$ is $x^3$. For a power function it is easy to work backwards: the derivative of $\frac{1}{4} x^4$ is $x^3$. But the derivatives of $\frac{1}{4} x^4 + 1$ and $\frac{1}{4} x^4 + 666$ are also $x^3$, so the original function of $x^3$ is not unique. The integral obtained by working backwards in this way is therefore called the indefinite integral, and we use $C$ to represent the undetermined constant.

Suppose the original function is $F(x)$ and its derivative is $f(x)$. Since constants vanish under differentiation, this undetermined constant must be added back when integrating, so:

\[\int f(x)dx = F(x) + C \]

Here are some integral formulas given in Advanced Mathematics at Tongji University.

The role of differentials was described earlier; here is a simple application of integrals in the plane.

As shown in the figure, the curve $y = x^2$, the two straight lines $x=1$ and $x=2$, and the x-axis enclose a closed region. Find the area of this closed region ABC.

First find its original function, $y = \frac{1}{3}x^3$. The area is then expressed as an integral over the interval:

\[\int_{1}^{2} x^2 dx=\frac{1}{3}x^3 \big|_{1}^{2} =\frac{1}{3} \cdot 2^3 - \frac{1}{3} \cdot 1^3=\frac{7}{3} \]

The problem above uses the definite integral, written as in the following expression: ∫ is the integral sign, $f(x)$ is the integrand, $dx$ is the differential of the integration variable, and $a$ and $b$ are the lower and upper limits of the integral, i.e., the integration interval.

\[\int_{a}^{b} f(x) dx \]

Here's another easy one, find the area enclosed by $y = 2x+3$ and $y = x^2$.

First find the interval of integration, i.e., the two points where the curves intersect. From $x^2=2x+3$ we get:

\[x^2 - 2x -3 = 0 \]

Factoring (by the cross-multiplication method) gives:

\[(x + 1)(x - 3) = 0 \]

So $x_{1} = -1$, $x_{2} = 3$.

We start by finding the area under $y = 2x + 3$ between these two points.

\[\int_{-1}^{3} (2x+3) dx = (x^2+3x) \big|_{-1}^{3} = (9 + 9) - (1 - 3) = 20 \]

Then find the area under $y = x^2$ between these two points.

\[\int_{-1}^{3} x^2 dx = \frac{1}{3}x^3 \big|_{-1}^{3} = 9 - (-\frac{1}{3}) = 9 + \frac{1}{3} \]

So the enclosed area is: $20 - (9+\frac{1}{3}) = \frac{32}{3}$.

Mathematically, the difference of the two integrals can be written more conveniently as a single integral:

\[\int_{-1}^{3} (2x+3 - x^2) dx = \int_{-1}^{3} (2x+3) dx - \int_{-1}^{3} x^2 dx \]
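A definite integral can also be approximated numerically by slicing the interval into thin rectangles. The sketch below uses the midpoint rule in plain C#; the Integrate helper and the slice count are assumptions made for this illustration, not a library API.

```csharp
using System;

// Midpoint rule: sum of f(midpoint) * width over n equal slices of [a, b].
double Integrate(Func<double, double> f, double a, double b, int n = 100_000)
{
    var width = (b - a) / n;
    var sum = 0.0;
    for (var i = 0; i < n; i++)
    {
        sum += f(a + (i + 0.5) * width);
    }
    return sum * width;
}

// Area between y = 2x + 3 and y = x^2 over [-1, 3].
var area = Integrate(x => 2 * x + 3 - x * x, -1, 3);
Console.WriteLine($"numeric area = {area:F6}, exact 32/3 = {32.0 / 3:F6}");
```

With this many slices the numerical value agrees with $\frac{32}{3}$ to many decimal places; fewer slices make the rectangle approximation visibly coarser.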

Partial derivatives

Partial derivatives belong to the differential calculus of multivariate functions and are most commonly used for problems in space. Middle and high school basically involve only functions of one variable; here we introduce functions of two variables (bivariate functions), written as:

\[z = f(x,y) \]

For a function of one variable, the derivative is the rate of change of the function along the x-axis. A multivariate function has several variables, so there is no single derivative; instead we differentiate along the direction of one axis at a time, which is why it is called a partial derivative. Next we will gradually learn some basics of partial derivatives.

Domain of a multivariate function

The graph of the following bivariate function is shown below.

\[z=\sqrt{1-x^2-y^2} \]

Here's a question: how do we find the domain of $z=\sqrt{1-x^2-y^2}$?

The expression under the square root must be non-negative, so $1 \ge x^2 + y^2$. Setting $y = 0$ gives $1 \ge x^2$, so $-1 \le x \le 1$. And since $1-x^2 \ge y^2$, we have $-\sqrt{1-x^2} \le y \le \sqrt{1 - x^2}$.

So the domain (together with the resulting range of z) is:

\[-1 \le x \le 1 \\ -\sqrt{1-x^2} \le y \le \sqrt{1 - x^2} \\ z \ge 0 \]

This is a bivariate function, and finding its domain is still relatively simple: z is a function $f(x,y)$ of x and y, so we first find the range of x and then the range of y. Extending this to a ternary function $u=f(x,y,z)$, the bounds on x are generally constants, the bounds on y are functions of x, and the bounds on z are functions of x and y.

Computing the volume of a closed region enclosed by two three-dimensional figures uses the definite integral, and evaluating that integral requires knowing the domain, which is found in exactly this way; this article will not go into it.

Values of multivariate functions

Given the function $f(x,y) = \frac{xy}{x^2+y^2}$, find $f(1,2)$.

This is quite simple: just substitute $x=1,y=2$:

\[f(1,2) = \frac{1 \cdot 2}{1^2+2^2} = \frac{2}{5} \]

Limits of multivariate functions

The limits discussed earlier involved functions of one variable. Limits of multivariate functions are a little more complicated to compute; the following formula expresses the limit of a bivariate function at a point.

\[\lim_{_{y \longrightarrow y_{0}}^{x \longrightarrow x_{0}}} f(x,y) = A \]

The limit of a bivariate function is called a double limit.

For example, find the following double limit.

\[\lim_{_{y \longrightarrow 2}^{x \longrightarrow 1}} \ln{(x+y^2)} = \ln{(1+2^2)} = \ln{5} \]

Partial derivatives

A multivariate function has several independent variables. Take $z = x^2 + y^2$: it has two variables, x and y, so the function can change in two directions, and when differentiating we must choose a direction. For example, to know the rate of change along the x-axis at the point $(x_{0},y_{0})$, we differentiate $z=f(x,y)$ with respect to x while holding $y = y_{0}$ fixed; this is called the partial derivative with respect to x.

Partial derivatives are written with the symbol $\partial$, so the partial derivative with respect to x can be denoted:

\[\frac{\partial z}{\partial x} \big|_{y=y_{0}}^{x=x_{0}} \]

There are many notational variants; typing math formulas in Markdown is tiring, so here is a figure instead.

Here is how to take the partial derivatives of a simple function. The method is simple: when taking the partial derivative with respect to x, just treat y as a constant.

\[z = x^2 + y^2 \]

\[\frac{\partial z}{\partial x} = 2x,\frac{\partial z}{\partial y} = 2y \]

Another example:

\[z = x^2 + yx + y^2 \]

\[\frac{\partial z}{\partial x} = 2x + y,\frac{\partial z}{\partial y} = 2y +x \]
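Treating one variable as a constant is also how a partial derivative can be estimated numerically. The following sketch is plain C#; the helper names PartialX and PartialY, the step h and the sample point (1, 2) are choices made for this illustration.

```csharp
using System;

double F(double x, double y) => x * x + y * x + y * y;   // z = x^2 + yx + y^2

// Central differences: vary one variable, hold the other fixed.
double PartialX(Func<double, double, double> f, double x, double y, double h = 1e-6)
    => (f(x + h, y) - f(x - h, y)) / (2 * h);

double PartialY(Func<double, double, double> f, double x, double y, double h = 1e-6)
    => (f(x, y + h) - f(x, y - h)) / (2 * h);

// At the point (1, 2) the formulas above give 2x + y = 4 and 2y + x = 5.
Console.WriteLine($"dz/dx at (1,2) is about {PartialX(F, 1, 2):F6}");
Console.WriteLine($"dz/dy at (1,2) is about {PartialY(F, 1, 2):F6}");
```

The numerical estimates should match the values obtained from $2x + y$ and $2y + x$ at the same point.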

It was mentioned earlier that integrals can solve for the area of a closed region bounded by two functions in the plane. In the same way, the calculus of multivariate functions can be used to compute the volume of a closed region bounded by surfaces and planes in space, but we won't go any further here.

Total differential

Let $z = f(x,y)$ be a bivariate function. Its total increment is:

\[\bigtriangleup z =A\bigtriangleup x + B \bigtriangleup y + o(\rho) \]

where $\rho = \sqrt{(\bigtriangleup x)^2 + (\bigtriangleup y)^2}$. The differential of z is then:

\[dz=f_{x}(x,y)dx + f_{y}(x,y)dy \]

To find the total differential, you first find all the partial derivatives and then substitute them into this formula.

For example, find the total differential of $z = e^{2x+3y}$.
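One way to work this out, applying the total differential formula above together with the chain rule, is the following sketch of the steps:

\[\frac{\partial z}{\partial x} = 2e^{2x+3y},\frac{\partial z}{\partial y} = 3e^{2x+3y} \]

\[dz = 2e^{2x+3y}dx + 3e^{2x+3y}dy \]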


As an example problem, find the total differential (the approximate total increment) of the function $z = f(x,y) = \frac{x^2}{y}$ at the point $(1,-2)$ when $\bigtriangleup x=0.02$, $\bigtriangleup y = -0.01$.

First find the two partial derivatives of the function, which gives:

\[dz = \frac{2x}{y} \bigtriangleup x - \frac{x^2}{y^2} \bigtriangleup y \]

Substituting $x = 1$, $y = -2$, $\bigtriangleup x=0.02$, $\bigtriangleup y = -0.01$, we get $-0.0175$.
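To see how close the total differential is to the actual change of the function, both can be computed side by side. This is a minimal sketch in plain C#; the variable names are illustrative.

```csharp
using System;

double F(double x, double y) => x * x / y;   // z = x^2 / y

double x0 = 1, y0 = -2, dx = 0.02, dy = -0.01;

// Total differential: dz = (2x / y) * dx - (x^2 / y^2) * dy.
var dz = 2 * x0 / y0 * dx - x0 * x0 / (y0 * y0) * dy;

// True total increment of the function for comparison.
var deltaZ = F(x0 + dx, y0 + dy) - F(x0, y0);

Console.WriteLine($"dz = {dz:F6}, actual increment = {deltaZ:F6}");
```

The two values differ only in the fourth decimal place, which is the small term that the total differential ignores.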

Below is an image of this function.

From the basics of differentials and total differentials, it is clear that such calculations are approximations and lose some accuracy.

Finding minimum values with partial derivatives

When studying derivatives, we learned that a function can attain an extreme value where $f'(a) = 0$. This generalizes to multivariate functions, where extreme values are found by taking partial derivatives. For example, the bivariate function $z = f(x,y)$ can attain an extreme value only where the following conditions are met:

\[\frac{\partial z}{\partial x} = 0,\frac{\partial z}{\partial y} = 0 \]

This is because the slopes of the tangent lines in both the x and y directions are 0 there. The reasoning is not given here; it is enough to remember the method.

The following figure shows the function $z=x^2 + y^2$. Find the values of x and y that minimize the function.

Obviously the function attains its minimum when x and y are both 0, but we should deduce this mathematically rather than just read it off the image.

Find the partial derivatives first:

\[\frac{\partial z}{\partial x} = 2x \\ \frac{\partial z}{\partial y} = 2y \]

Both partial derivatives are 0 only when $x=0,y=0$, so $z=f(x,y)$ has a single candidate extreme point, at $(0,0)$.

Since $z=x^2 + y^2 \ge 0$, the minimum value is attained at $z=f(0,0)$.

As the slopes in the x and y directions get closer and closer to 0, you can see the tangents to the surface becoming flatter and flatter.

Lagrange multiplier method

Given a bivariate function $z=f(x,y)$ and an additional condition $\varphi (x,y) = 0$, the Lagrange multiplier method is used to solve such constrained extreme value problems for multivariate functions.

The formula is as follows:

\[F(x,y,\lambda) = f(x,y) + \lambda{\varphi{(x,y)}} \]

where $\lambda$ is a parameter. Once x, y and $\lambda$ have been solved for, the minimum value of z can be found.

First, take the partial derivatives of the above function and set each of them to 0:

\[F'_{x}(x,y,\lambda )= f'_{x}(x,y) +\lambda{\varphi{'_{x}(x,y)}} = 0 \]

\[F'_{y}(x,y,\lambda ) = f'_{y}(x,y) + \lambda{\varphi{'_{y}(x,y)}} = 0 \]

\[F'_{\lambda}(x,y,\lambda ) = \varphi{(x,y)} = 0 \]

Solve the above equations for x, y and $\lambda$, then substitute into $f(x,y)$ to find the extreme values.

Example: given $a + b = 1$ (with $a, b > 0$), find the minimum value of $\frac{1}{a} + \frac{4}{b}$.

First, the bivariate function is $z=f(a,b) = \frac{1}{a} + \frac{4}{b}$.

The constraint is $\varphi (a,b) = a + b - 1=0$.

So:

\[F(a,b,\lambda) = f(a,b)+ \lambda{\varphi{(a,b)}} = \frac{1}{a} + \frac{4}{b} + \lambda{(a + b - 1)} \]

Now start taking partial derivatives.

\[F'_{a}(a,b,\lambda )= f'_{a}(a,b) +\lambda{\varphi{'_{a}(a,b)}} = -\frac{1}{a^2} + \lambda = 0 \qquad (1) \]

\[F'_{b}(a,b,\lambda ) = f'_{b}(a,b) + \lambda{\varphi{'_{b}(a,b)}} = -\frac{4}{b^2} + \lambda= 0 \qquad (2) \]

\[F'_{\lambda}(a,b,\lambda ) = \varphi{(a,b)} = a + b - 1= 0 \qquad (3) \]

Solving (1), (2) and (3) together:
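One way to carry out this step, assuming $a > 0$ and $b > 0$ so that the positive root is the relevant one, is sketched below:

\[\lambda = \frac{1}{a^2} = \frac{4}{b^2} \Rightarrow b^2 = 4a^2 \Rightarrow b = 2a \]

\[a + 2a - 1 = 0 \Rightarrow a = \frac{1}{3}, b = \frac{2}{3}, \lambda = 9 \]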

Substituting into $z=f(a,b) = \frac{1}{a} + \frac{4}{b}$, we find $z_{min} = f(\frac{1}{3},\frac{2}{3}) = 9$, so the minimum value is 9.

Gradient

In this section we learn about the gradient, which prepares for the gradient descent method, one of the key topics in deep learning. Gradient descent requires quite a lot of background knowledge, and most of this article lays the groundwork for it.

Baidu Encyclopedia: The directional derivative essentially studies the rate of change of a function at a point along a particular direction, while the gradient reflects the maximum rate of change of a spatial variable and the direction in which it occurs.

Directional derivative

Derivatives were introduced earlier: for a function of one variable $y=f(x)$, the derivative reflects its rate of change at a given point. For $z = f(x,y)$, the two partial derivatives $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ reflect the rate of change of the function along directions parallel to the x-axis and the y-axis. A partial derivative is thus the rate of change along a particular axis, while the directional derivative is the rate of change along an arbitrary chosen direction, not necessarily an axis direction.

As shown above, let $l$ be a ray drawn from the point $P(x,y)$, let $Q(x + \bigtriangleup x,y + \bigtriangleup y)$ be a point on $l$, and let $\rho$ be the distance between $P$ and $Q$. Then:

\[\frac{\bigtriangleup z}{\rho} \]

This ratio reflects the average rate of change of the function between $P$ and $Q$ along the direction $l$. If the limit exists as $Q$ approaches $P$, that limit is called the directional derivative at the point $P$ along the direction $l$.

Since, with $\alpha$ and $\beta$ the angles that $l$ makes with the x-axis and the y-axis:

\[\bigtriangleup x = \rho \cos \alpha , \bigtriangleup y = \rho \cos \beta \]

So the directional derivative can be expressed as:

\[\begin{align} \frac{\partial z}{\partial l} &= \lim_{\rho \to 0} \frac{\bigtriangleup z}{\rho} \\ &= \lim_{\rho \to 0} \big( \frac{\partial z}{\partial x} \frac{\bigtriangleup x}{\rho} + \frac{\partial z}{\partial y} \frac{\bigtriangleup y}{\rho} \big) \\ &= \frac{\partial z}{\partial x} \cos \alpha + \frac{\partial z}{\partial y} \cos \beta \end{align} \]

The two partial derivatives can be collected into a vector. Using $i$ and $j$ to denote the unit vectors along x and y, it can be written as:

\[\frac{\partial z}{\partial x}i + \frac{\partial z}{\partial y}j \]

or, in coordinate form:

\[(\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}) \]

The directional derivative is the dot product of this vector with the unit direction vector $(\cos \alpha ,\cos \beta)$.

Gradient

The gradient points in the direction in which the value of the function grows fastest; gradient descent, studied later, goes the opposite way, in the direction in which the value of the function falls fastest.

For a fixed point $P$ in space, the directional derivative $\frac{\partial u}{\partial l}$ changes as the direction $l$ changes, which shows that at a fixed point the function changes at different rates in different directions. So at the point $P$, in which direction is the rate of change of the function greatest? This is where the concept of the gradient comes in.

The picture below shows a hemisphere.

Question: starting from any point, how do you get to the top the fastest? Obviously, heading straight up the slope gets you to the top the fastest. For uneven surfaces in practice this is not so easy to see, but we can discuss the simple case here first.

As in the figure above, take a point $(x_{0},y_{0})$ of a differentiable bivariate function $z = f(x,y)$. From this point we can move in various directions, each with a different directional derivative. Now suppose there is a direction that maximizes the directional derivative; the vector pointing in that direction, with that maximal rate as its length, is the gradient $gradf(x_{0},y_{0})$.

As shown in the figure, starting from $A(x_{0},y_{0})$ and going in the direction of B gets A to the vertex the fastest, i.e., with the highest rate of change. A can go in various other directions as well, one of which is toward C.

The directional derivative in the direction toward B is the largest, and this direction is that of the gradient $gradf(x_{0},y_{0})$. As shown in the figure, any other direction from A makes an angle with $\overrightarrow{AB}$. Since we are in space, it is convenient to describe a direction by its direction cosines, collected into a unit vector $n_{e} = (\cos \alpha ,\cos \beta)$. The relationship between the directional derivative and the gradient is then:

\[\frac{\partial z}{\partial l} =gradf(x_{0},y_{0}) \cdot n_{e} \]

\[\frac{\partial z}{\partial l} = \frac{\partial z}{\partial x} \cos \alpha + \frac{\partial z}{\partial y} \cos \beta = gradf(x_{0},y_{0})\cdot n_{e} \]

As shown in the following figure, when the angle between $gradf(x_{0},y_{0})$ and the direction vector $e_{1}$ is 0, the two coincide and the cosine of that angle is 1, so the directional derivative reaches its maximum value $|gradf(x_{0},y_{0})|$. That is, the directional derivative is maximized along the direction of the gradient.

So:

\[\begin{align} gradf(x_{0},y_{0}) &= \frac{\partial z}{\partial x}i + \frac{\partial z}{\partial y}j \\ &= (\frac{\partial z}{\partial x}, \frac{\partial z}{\partial y}) \end{align} \]
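The claim that the directional derivative is largest along the gradient can be checked numerically. The sketch below is plain C# and uses $z = x^2 + y^2$ at the point (1, 2) purely as a sample; in the plane the direction cosines of a direction at angle theta are cos(theta) and sin(theta), since $\cos \beta = \sin \alpha$ there. All names are made up for this illustration.

```csharp
using System;

// Gradient of the sample function z = x^2 + y^2 at (x, y): (2x, 2y).
(double X, double Y) Gradient(double x, double y) => (2 * x, 2 * y);

// Directional derivative = gradient dotted with the unit vector (cos theta, sin theta).
double DirectionalDerivative(double x, double y, double theta)
{
    var (gx, gy) = Gradient(x, y);
    return gx * Math.Cos(theta) + gy * Math.Sin(theta);
}

var (px, py) = (1.0, 2.0);                 // sample point, chosen arbitrarily
var (gradX, gradY) = Gradient(px, py);
var gradAngle = Math.Atan2(gradY, gradX);  // direction in which the gradient points
var gradLength = Math.Sqrt(gradX * gradX + gradY * gradY);

// Scan eight directions around the point.
for (var deg = 0; deg < 360; deg += 45)
{
    var theta = deg * Math.PI / 180;
    Console.WriteLine($"{deg,3} deg: {DirectionalDerivative(px, py, theta):F4}");
}

Console.WriteLine($"along the gradient: {DirectionalDerivative(px, py, gradAngle):F4}, |grad| = {gradLength:F4}");
```

None of the scanned directions exceeds the value along the gradient, and that value equals the length of the gradient vector.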

Example: find the gradient of the function $z = \ln(x^2 + y^2)$.

Source : Workbook of Advanced Mathematics, Chen Zhaodou.
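As a sketch of how this example can be worked with the gradient formula above (a reconstruction from that formula, not a quotation from the workbook), first take the two partial derivatives with the chain rule and then collect them into a vector:

\[\frac{\partial z}{\partial x} = \frac{2x}{x^2+y^2},\frac{\partial z}{\partial y} = \frac{2y}{x^2+y^2} \]

\[gradz = (\frac{2x}{x^2+y^2}, \frac{2y}{x^2+y^2}) \]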

One more problem of practical significance.

Source : Workbook of Advanced Mathematics, Chen Zhaodou.

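Solving the problem in C#, we get:

// Assumes the TorchSharp package, with: using TorchSharp;
// Evaluate the gradient of u = x^2 + y^2 + z^2 at the point (2,1,-1)
var x = torch.tensor(2.0, requires_grad: true);
var y = torch.tensor(1.0, requires_grad: true);
var z = torch.tensor(-1.0, requires_grad: true);

// Define u = x^2 + y^2 + z^2
var u = x.pow(2) + y.pow(2) + z.pow(2);

// Differentiate
u.backward();

// The partial derivatives are stored on the input tensors
var ux = x.grad;
var uy = y.grad;
var uz = z.grad;

Console.WriteLine($"gradu(2,1,-1) = {{{ux.ToScalar().ToDouble()},{uy.ToScalar().ToDouble()},{uz.ToScalar().ToDouble()} }}");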

gradu(2,1,-1) = {4,2,-2 }

The basic formula for the gradient descent method

The reader is advised to read this article first, which makes it easy to understand what gradient descent is: /question/434600945

As mentioned earlier, the gradient points in the direction of fastest increase, so the direction opposite to the gradient is the direction of fastest decrease.

Gradient descent is the workhorse of neural networks, and it is probably the topic you run into most often when learning about deep learning, so this subsection explains some of its basics.

In the section on minimizing partial derivatives, we learned that the following conditions need to be met for a minimum value:

\[\frac{\partial z}{\partial x} = 0,\frac{\partial z}{\partial y} = 0 \]

If these equations can be solved directly from the partial derivatives, the problem is simple: just calculate the minimum value. In practical scenarios, however, and especially inside neural networks, solving them directly is usually far too hard. So practitioners use another method to find an approximation of the minimum, called gradient descent.

Draw a three-dimensional image as shown:

Suppose you are at the highest position, blindfolded, and have to move from the top to the bottom one grid cell at a time.

We want to reach the bottom as fast as possible, so we should choose the steepest path. Because of the blindfold, however, we cannot look past one cell to see what lies behind it, so we can only compare the nearby cells, find the steepest one, and then take the next step. It is also impossible to check every cell at once. Instead, we pick a few cells, judge which one is the steepest, take a step, pick a few more cells, and take the next step.

From our earlier study of the full differential and of gradients, we know that when the point moves by a small step $(\bigtriangleup x,\bigtriangleup y)$, the function value changes by approximately:

\[\bigtriangleup z = \frac{\partial z}{\partial x} \bigtriangleup x + \frac{\partial z}{\partial y} \bigtriangleup y \]

If we take this formula as the inner product of two vectors, we can derive:

\[\bigtriangleup z = (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y}) \cdot (\bigtriangleup x,\bigtriangleup y) \]

$\bigtriangleup z$ reaches its minimum value when these two vectors point in opposite directions.

Recall from our knowledge of vectors that the inner product of two vectors of fixed length is smallest when the vectors point in opposite directions, because:

\[a \cdot b = |a||b| \cos \theta \]
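The step behind this statement can be written out as follows. Since $-1 \le \cos \theta \le 1$, for fixed lengths $|a|$ and $|b|$:

\[a \cdot b = |a||b| \cos \theta \ge -|a||b| \]

with equality exactly when $\cos \theta = -1$, that is, when $\theta = \pi$ and the two vectors point in opposite directions.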

So the inner product is smallest when the vector b satisfies:

\[b = -ka \qquad \]

(k is a positive constant)

Let $b= (\bigtriangleup x,\bigtriangleup y)$, $a= (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$ and $k=\eta$. Then:

\[(\bigtriangleup x,\bigtriangleup y) = -\eta (\frac{\partial z}{\partial x},\frac{\partial z}{\partial y} ) \qquad (\eta \text{ is a small positive constant}) \]

This formula is known as the basic formula of the gradient descent method for a function of two variables. Generalized to n variables, it becomes:

\[(\bigtriangleup x_{1},\bigtriangleup x_{2},...,\bigtriangleup x_{n}) = -\eta (\frac{\partial z}{\partial x_{1}},\frac{\partial z}{\partial x_{2}},...,\frac{\partial z}{\partial x_{n}} ) \]

When we learned about directional derivatives and gradients earlier, we saw that the directional derivative is largest along the gradient $(\frac{\partial z}{\partial x},\frac{\partial z}{\partial y})$, that is, the gradient direction is the steepest way up.

Since $(\bigtriangleup x,\bigtriangleup y)$ is the vector opposite to the gradient, it points in the direction of fastest descent. The gradient descent method therefore amounts to finding the displacement vector that makes the function decrease the fastest.

Recall the function $z=x^2 + y^2$ from the section on finding minimum values with partial derivatives. Problem: when x changes from 1 to $1+\bigtriangleup x$ and y changes from 2 to $2 + \bigtriangleup y$, find the vector $(\bigtriangleup x,\bigtriangleup y)$ that makes the function decrease the fastest.

First find the partial derivatives:

\[\frac{\partial z}{\partial x} = 2x \\ \frac{\partial z}{\partial y} = 2y \]

From the basic formula of the gradient descent method:

\[(\bigtriangleup x,\bigtriangleup y) = -\eta (2x,2y ) \qquad (\eta \text{ is a small positive constant}) \]

Substituting $x=1$, $y=2$ from the problem gives:

\[(\bigtriangleup x,\bigtriangleup y) = -\eta (2,4 ) \qquad \]

($\eta$ is a small positive constant)
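The result can be checked numerically. The sketch below takes a step of fixed small length from $(1,2)$ in the direction $-(2,4)$ and compares the change in $z$ with steps of the same length in 360 other directions. The step length `0.01` and all names are assumptions for illustration; a fixed step length $s$ corresponds to $\eta = s / |(2,4)|$:

```csharp
using System;

// z = x^2 + y^2, the function from the problem.
static double F(double a, double b) => a * a + b * b;

const double StepLength = 0.01;   // assumed small step length
double x0 = 1.0, y0 = 2.0;        // the point from the problem

// Unit vector opposite to the gradient (2x, 2y) = (2, 4), scaled to StepLength.
double gradX = 2 * x0, gradY = 2 * y0;
double gradNorm = Math.Sqrt(gradX * gradX + gradY * gradY);
double downX = -gradX / gradNorm * StepLength;
double downY = -gradY / gradNorm * StepLength;
double changeAlongDescent = F(x0 + downX, y0 + downY) - F(x0, y0);

// Change in z for steps of the same length in 360 evenly spaced directions.
double smallestChange = double.PositiveInfinity;
for (int deg = 0; deg < 360; deg++)
{
    double t = deg * Math.PI / 180.0;
    double change = F(x0 + StepLength * Math.Cos(t), y0 + StepLength * Math.Sin(t)) - F(x0, y0);
    smallestChange = Math.Min(smallestChange, change);
}

Console.WriteLine($"change opposite to the gradient: {changeAlongDescent:E4}");
Console.WriteLine($"smallest change in the sweep:     {smallestChange:E4}");
```

None of the swept directions lowers $z$ more than the step opposite to the gradient.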

The $\eta$ in this subsection has not been explained yet. It is a small positive number that plays the role of the grid size in the downhill problem, i.e. the size of each step. When computing with a computer, a suitable value of $\eta$ must be chosen, because a value that is too small or too large causes problems: too small and the descent is slow, too large and a step may overshoot the minimum. In neural networks $\eta$ is called the learning rate. There is no definite way to derive its value; the right value can only be found through repeated experimentation.
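A small sketch of how the choice of $\eta$ plays out on $z = x^2 + y^2$. The starting point $(3,2)$, the 20 iterations and the three learning rates are assumptions chosen for illustration:

```csharp
using System;

// Runs `steps` iterations of gradient descent on z = x^2 + y^2 and returns the final z.
static double Descend(double startX, double startY, double learningRate, int steps)
{
    double px = startX, py = startY;
    for (int i = 0; i < steps; i++)
    {
        // (dx, dy) = -eta * (2x, 2y)
        px -= learningRate * 2 * px;
        py -= learningRate * 2 * py;
    }
    return px * px + py * py;
}

foreach (double rate in new[] { 0.001, 0.1, 1.1 })
{
    double finalZ = Descend(3.0, 2.0, rate, 20);
    Console.WriteLine($"eta = {rate}: z after 20 steps = {finalZ:E3}");
}
```

With the smallest rate $z$ has barely moved from its starting value of 13, with $\eta = 0.1$ it is close to 0, and with $\eta = 1.1$ every step overshoots the minimum so $z$ grows instead of shrinking.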

Hamiltonian operator $\bigtriangledown$

When the gradient descent method is generalized to many variables, the formula becomes cumbersome to write:

\[(\bigtriangleup x_{1},\bigtriangleup x_{2},...,\bigtriangleup x_{n}) = -\eta (\frac{\partial z}{\partial x_{1}},\frac{\partial z}{\partial x_{2}},...,\frac{\partial z}{\partial x_{n}} ) \]

This is why the $\bigtriangledown$ symbol is often used in mathematics to simplify formulas. For $z = f(x_{1},x_{2},...,x_{n})$:

\[\bigtriangledown f = (\frac{\partial z}{\partial x_{1}},\frac{\partial z}{\partial x_{2}},...,\frac{\partial z}{\partial x_{n}} ) \]

Substituting this into the gradient descent formula gives:

\[(\bigtriangleup x_{1},\bigtriangleup x_{2},...,\bigtriangleup x_{n}) = -\eta \bigtriangledown f \]
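In code, the compact form maps naturally onto arrays. The sketch below is one possible way to write a single update for n variables; the helper names `Displacement` and `GradF`, the example function $f = x_{1}^2 + x_{2}^2 + x_{3}^2$ and the value $\eta = 0.1$ are assumptions for illustration:

```csharp
using System;
using System.Linq;

// Displacement vector (dx_1, ..., dx_n) = -eta * grad f, with the gradient given as an array.
static double[] Displacement(double[] gradient, double eta) =>
    gradient.Select(g => -eta * g).ToArray();

// Hypothetical example: f = x_1^2 + x_2^2 + x_3^2, whose gradient is (2x_1, 2x_2, 2x_3).
static double[] GradF(double[] p) => p.Select(v => 2 * v).ToArray();

double[] point = { 2.0, 1.0, -1.0 };
double[] delta = Displacement(GradF(point), 0.1);
double[] next = point.Zip(delta, (coordinate, d) => coordinate + d).ToArray();

Console.WriteLine("displacement: " + string.Join(", ", delta.Select(v => v.ToString("F2"))));
Console.WriteLine("next point:   " + string.Join(", ", next.Select(v => v.ToString("F2"))));
```

The same two helpers work unchanged for any number of variables, which is the practical benefit of writing the update as $-\eta \bigtriangledown f$.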

Gradient descent method to find an approximation of the minimum value

When learning the basic formula of the gradient descent method, we introduced the learning rate $\eta$. Let's return to the problem of $z = x^2 + y^2$. If we set the learning rate $\eta = 0.1$, how do we use the gradient descent method to find the minimum? Suppose the initial point is $(3,2)$. According to the gradient:

\[(\bigtriangleup x,\bigtriangleup y) = -0.1 (2x,2y ) \\ \bigtriangleup x = -0.2x \\ \bigtriangleup y = -0.2y \]

Substituting $(3,2)$ gives:

Iteration	Current position	Current position	Gradient	Gradient	Displacement vector	Displacement vector	Function value
i	x	y	∂z/∂x	∂z/∂y	∆x	∆y	z
0	3.00	2.00	6.00	4.00	-0.60	-0.40	13.00

So the point $(3.00,2.00)$ moves to $(2.40,1.60)$:

Iteration	Current position	Current position	Gradient	Gradient	Displacement vector	Displacement vector	Function value
i	x	y	∂z/∂x	∂z/∂y	∆x	∆y	z
0	3.00	2.00	6.00	4.00	-0.60	-0.40	13.00
1	2.40	1.60

Recalculating the gradient, the displacement vector and the function value at the new point gives:

Iteration	Current position	Current position	Gradient	Gradient	Displacement vector	Displacement vector	Function value
i	x	y	∂z/∂x	∂z/∂y	∆x	∆y	z
0	3.00	2.00	6.00	4.00	-0.60	-0.40	13.00
1	2.40	1.60	4.80	3.20	-0.48	-0.32	8.32

Repeating this operation moves the point closer and closer to the minimum, so the minimum value can be approximated as accurately as needed. The fewer steps this takes, the faster the descent.
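The table can be continued with a short loop. This is a minimal sketch in plain C#, using the learning rate $\eta = 0.1$ and the initial point $(3,2)$ from the text; the limit of 30 iterations and the variable names are assumptions:

```csharp
using System;
using System.Collections.Generic;

const double Eta = 0.1;            // learning rate from the text
double px = 3.0, py = 2.0;         // initial point (3, 2)
var history = new List<double>();  // function value at each iteration

Console.WriteLine("i\tx\ty\tdz/dx\tdz/dy\tdx\tdy\tz");
for (int i = 0; i < 30; i++)
{
    // Gradient of z = x^2 + y^2 at the current position.
    double gradX = 2 * px, gradY = 2 * py;

    // Displacement vector (dx, dy) = -eta * (dz/dx, dz/dy).
    double stepX = -Eta * gradX, stepY = -Eta * gradY;

    double z = px * px + py * py;
    history.Add(z);
    Console.WriteLine($"{i}\t{px:F2}\t{py:F2}\t{gradX:F2}\t{gradY:F2}\t{stepX:F2}\t{stepY:F2}\t{z:F2}");

    // Move to the next position.
    px += stepX;
    py += stepY;
}

Console.WriteLine($"position after 30 steps: ({px:F4}, {py:F4})");
```

The first two printed rows correspond to rows 0 and 1 of the table, and each later row shrinks both coordinates by a factor of 0.8, so the point approaches the minimum at $(0,0)$.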

PyTorch provides many kinds of gradient descent algorithms. I won't go into them here, but interested readers can refer to this article: /p/619988672
