Preamble
Studying math is separate from computing math. If you go back now and look at college calculus and linear algebra purely as subjects to study, each is really a three-day course and you are done with it.
The reason the courses take so long in school is really the exams, which means practicing calculations by hand, and hand calculation does not make much sense now that we have computers.
So if you did poorly in college math, it only takes about a week to make up for it. Even if you never went to college, as long as you finished middle school, a week is again enough to learn calculus and linear algebra.
The problem is that no one gives lessons like this, and there are no materials for learning this way; at least there are none here, where learning is full of patterns that drag down our efficiency.
Gradient
The previous post briefly introduced gradients, forward propagation, and backward propagation. Here is a little more detail.
Don't be intimidated by these terms. The essence of a term is to summarize, and it is precisely these summarizing terms that hinder our learning the most; we should resent them, but we don't need to fear them.
Let's first look at how the parameter requires_grad is used. The code is as follows:
print("============ finding gradient 1 ==============")
a = (3) # Here is randn not rand : Generate random numbers that follow a standard normal distribution (mean 0, standard deviation 1). : Generate random numbers that follow a uniform distribution (in the interval [0, 1)).
print(a)
b=a+2
print(b) # Output tensor is the result of a+2 calculation
x = (3,requires_grad=True) # Here it is randn not rand
print(x)
y=x+2
print(y) # Output tensor is the result of x+2 calculation, also record the function, grad_fn=<AddBackward0> means it is an additive function grad=gradient fn=Function
Here a is created with requires_grad off and x with requires_grad turned on, respectively. As shown below:
The x created with requires_grad turned on has an extra attribute, requires_grad=True.
After computing y = x + 2, y has an extra attribute, grad_fn=<AddBackward0>, where grad = gradient and fn = Function, i.e., a gradient function; the Add in the name means addition.
And this y = x + 2 is the forward propagation: the forward propagation is simply this bunch of functions that we have defined.
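If you would rather check these attributes in code than read them off the printed tensors, a minimal sketch (reusing the a, x, and y defined above) looks like this:
print(a.requires_grad)  # False: a does not track gradients
print(x.requires_grad)  # True: x tracks gradients
print(a.grad_fn)        # None: no function was recorded for a
print(y.grad_fn)        # <AddBackward0 ...>: the addition in y = x + 2 was recorded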
Introduction to the Normal Distribution
The normal distribution was mentioned above and is briefly explained here.
A normal distribution is said to have mean 0 when it is centered on 0. If the mean is 0 and the standard deviation is 1 (the standard normal distribution), the data points spread out from 0 in both directions and follow the 68-95-99.7 rule:
68% of the data points lie within 1 standard deviation of the mean, i.e., in the interval [-1, 1];
95% lie within 2 standard deviations, i.e., in [-2, 2];
99.7% lie within 3 standard deviations, i.e., in [-3, 3].
If the mean is 0 and the standard deviation is 2, the same rule gives 68% in [-2, 2], 95% in [-4, 4], and 99.7% in [-6, 6].
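As a quick sanity check (not part of the original example), you can draw a large sample with torch.randn and count how many values fall in each interval:
import torch

# Empirically check the 68-95-99.7 rule for the standard normal distribution (mean 0, std 1)
samples = torch.randn(1_000_000)
for k in (1, 2, 3):
    fraction = ((samples >= -k) & (samples <= k)).float().mean().item()
    print(f"within {k} standard deviation(s), i.e. [-{k}, {k}]: {fraction:.3%}")  # roughly 68%, 95%, 99.7%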
Scalar functions and backpropagation
Above we used forward propagation and set up a function like y = x + 2. Now we add another forward propagation function, z = y * y * 2, then set up the scalar function, and finally perform the backpropagation.
Note 1: Before calling backward(), a scalar value (i.e., a constant C) must be produced, for example with z = z.mean(), or a weight tensor must be passed in; passing a scalar function is described first.
Note 2: The scalar function is the loss function that computes the loss of the forward propagation.
Note 3: A scalar function is really just a scalar, in other words a constant, a single value, a number. (Saying "a number was passed here" would sound low-brow, while saying "a scalar was passed" sounds significantly more impressive, which is a perfect illustration of how terminology gets in the way of our learning.)
x = torch.randn(3, requires_grad=True)
print(x)
y = x + 2
z = y * y * 2
print(z)  # Here the attribute grad_fn=<MulBackward0> is added, where Mul means multiplication
z = z.mean()  # Specify the scalar function
# A scalar function must be specified here; if you remove z = z.mean(), you get the message: grad can be implicitly created only for scalar outputs
print(z)  # Attribute grad_fn=<MeanBackward0>, Mean stands for the mean function
z.backward()  # Backward propagation. If requires_grad=False, calling backward() throws an exception because no grad_fn was recorded
print(x.grad)
The result of running it is shown in the figure below.
A brief description of the code is as follows.
x is a tensor with automatic differentiation enabled.
y = x + 2: y is still a tensor with automatic differentiation enabled.
z = y * y * 2: z is still a tensor with automatic differentiation enabled.
z = z.mean(): z becomes a scalar (mean() returns the average of the tensor's elements, so the result is a single value).
Calling z.backward() computes the gradient of z with respect to x and stores it in x.grad.
That is, x.grad holds dz/dx; the gradient has the same shape as x.
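You can check the result by hand. Since z = mean(2 * (x + 2)^2) over three elements, the gradient with respect to each element is dz/dx_i = 4 * (x_i + 2) / 3, where the 1/3 comes from the mean. A small verification sketch, continuing from the code above:
# Verify the analytic gradient: for z = (2 * (x + 2)**2).mean() over 3 elements,
# dz/dx_i = 4 * (x_i + 2) / 3
expected = (4 * (x + 2) / 3).detach()
print(torch.allclose(x.grad, expected))  # True, up to floating-point rounding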
Zeroing the gradient
The gradient must be cleared before it is computed a second time (i.e., before the second call to backward()).
If the gradient is not cleared before the second call to backward(), the gradient computed by the second call is added on top of the gradient computed by the first.
print("============ clearing grad ==============")
weights =(4,requires_grad=True)
for epoch in range(3).
model_output =(weights*3).sum()#Set scalar value, here is hyphenated, separate is a=weight*3 model_output=()
model_output.backward()
print(model_output.grad)
model_output.zero_()#You can comment this line, look at the effect of not clearing zero
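In real training you usually do not zero gradients by hand; an optimizer does it for you. A small sketch of the more common pattern, assuming torch.optim.SGD (not part of the original example):
import torch

# Let an optimizer clear the gradients between iterations (sketch, assumes torch.optim.SGD)
weights = torch.randn(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)
for epoch in range(3):
    model_output = (weights * 3).sum()
    model_output.backward()
    optimizer.step()       # update weights using the accumulated gradient
    optimizer.zero_grad()  # clear the gradient before the next iteration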
Weights
If neither a scalar nor a weight tensor is set before computing the gradient (calling backward()), an error is reported.
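For example, calling backward() on a non-scalar tensor without passing a weight tensor raises the error message quoted earlier; a minimal sketch:
import torch

# backward() on a non-scalar output without a weight tensor fails
x = torch.randn(3, requires_grad=True)
z = (x + 2) * (x + 2) * 2
try:
    z.backward()  # z has 3 elements and no weight tensor is given
except RuntimeError as e:
    print(e)  # grad can be implicitly created only for scalar outputs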
Having talked about scalars above, let's now introduce weighting.
The code is as follows:
x = torch.randn(3, requires_grad=True)
print(x)
y = x + 2
z = y * y * 2
v = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
z.backward(v)
print(x.grad)
Here, compared with before, a weight is assigned before the call: z.backward(v) is given an extra argument.
Assigning weights means passing a parameter when calling backward(); this parameter is a tensor, and it must have the same shape as x.
When the gradient is computed, each element of the gradient is multiplied by the corresponding element of the weight tensor.
Since the gradient, x, and the weight tensor all have the same shape, this element-wise multiplication should be easy to understand.
A concrete calculation
y = x + 2, so the derivative dy/dx = 1.
z = 2y², so the derivative dz/dy = 4y.
By the chain rule, dz/dx = (dz/dy)(dy/dx) = 4y · 1 = 4y.
Since y = x + 2, we have 4y = 4(x + 2).
With the weights applied, the three elements are 1·4(x1 + 2), 2·4(x2 + 2), 3·4(x3 + 2).
Substituting the values of x gives the gradient.
As shown below, 1·4·(0.8329 + 2) = 11.3316, while the figure shows 11.3317; the difference should be due to rounding.
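A small sketch that checks this calculation against autograd, continuing from the weighted-backward code above:
# Verify the weighted gradient: with z = 2 * (x + 2)**2 and z.backward(v),
# each element of the gradient is x.grad_i = v_i * 4 * (x_i + 2)
expected = (v * 4 * (x + 2)).detach()
print(torch.allclose(x.grad, expected))  # True, up to floating-point rounding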
Portal:
Learning Artificial Intelligence from Zero - Python-Pytorch Learning (I)
That's it for basic learning.
Note: This post is original, please contact the author for authorization and attribution for any form of reproduction!
If you think this article is still good, please click [Recommend] below, thank you very much!