Chapter Organization
- Background
- Mean squared error (MSE)
- Least squares
- Gradient descent
- Programming implementation
Background
Most systems in life have input-output relationships that are linear functions, or can be approximated as such within certain limits. In some cases it is difficult to infer the relationship between input and output directly, so we instead deduce the input-output relationship of the system from a large amount of sampled data. A typical single-input, single-output linear system can be written as
\[ y = f(x) = kx + b \]
where \(k\) is the slope, which reflects the ratio of the change in the output \(y\) to the change in the input \(x\), and \(b\) is the output of the system when there is no input (or the input is \(0\)).
The data is generally referred to as observational data or sampled data; the two terms have slightly different emphases. Observation tends to describe an objective system, e.g., the daily depth of water at high tide; sampling tends to describe a subjective experiment, e.g., applying a pressure of 10 N to a spring and observing the spring's deformation.
For a single-input, single-output system, the data can be represented as
\[ O = \{o_i\}_N = \{(x_i, y_i)\}_N \]
or
\[ S = \{s_i\}_N = \{(x_i, y_i)\}_N \]
where the symbol \(O\) corresponds to observation and the symbol \(S\) to sampling; \(o_i\) in \(\{o_i\}_N\) denotes an element of the sequence, \(N\) denotes the number of elements in the sequence, \(x_i\) denotes a system input, and \(y_i\) the corresponding system output.
When a system is derived from data, the result of the derivation is generally referred to as an estimate or approximation of the actual system and is denoted \(\hat{y}=\hat{f}(x)\). For a single sampling point, the error of the system is defined as the difference between the true output value and the predicted output value for that sampled input:
\[ \varepsilon_i = y_i - \hat{y}_i = y_i - \hat{f}(x_i) \]
For the overall sampling sequence, a classical error measure is the mean squared error (MSE), whose formula is
\[ MSE = \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2 \]
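As a concrete illustration, the following is a minimal sketch of how the MSE of a linear estimate \(\hat f(x)=kx+b\) could be computed over a set of samples; the sample type and function name here are purely illustrative and are not part of the implementation developed later in this chapter.

#include <vector>

// Illustrative sample type: one (x, y) pair per observation.
struct XY { float x; float y; };

// Mean squared error of the linear estimate y_hat = k * x + b over the samples.
float mse_of_line(const std::vector<XY>& samples, float k, float b) {
    float sum = 0.0f;
    for (const auto& s : samples) {
        float err = s.y - (k * s.x + b);  // epsilon_i = y_i - y_hat_i
        sum += err * err;
    }
    return samples.empty() ? 0.0f : sum / static_cast<float>(samples.size());
}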
When deriving the input-output relationship of a system, there are usually two approaches: one based on numerical derivation and one based on learning. This chapter explains the two approaches using least squares and gradient descent as examples, respectively.
MSE
For the case of a single sampling point, the MSE degenerates to the squared error, i.e.
\[ MSE = \varepsilon^2 = (y - \hat{y})^2 = (kx + b - y)^2 \]
Assume the parameter \(b\) is a constant and consider the MSE as a function of the parameter \(k\) only; then
\[ MSE(k) = (kx + b - y)^2 = x^2 k^2 + 2x(b-y)k + (b-y)^2 \]
It is easy to see that the MSE is a quadratic function of \(k\) with a unique zero \(k_0 = -(b-y)/x\).
For the case of multiple points, for each point \(s_i = (x_i, y_i)\), \(\varepsilon_i^2\) can be expressed as a quadratic function of \(k\):
\[ \varepsilon_i^2 = (k x_i + b - y_i)^2 = x_i^2 k^2 + 2x_i(b - y_i)k + (b - y_i)^2 \]
so that
\[ MSE(k) = \frac{1}{N}\sum_{i=1}^{N}\left( x_i^2 k^2 + 2x_i(b - y_i)k + (b - y_i)^2 \right) \]
That is, the MSE of the sequence is also a quadratic function of the parameter \(k\), and \(MSE \geq 0\), with equality if and only if \((b - y_i)/x_i\) is the same constant \(M\) for every point (in which case \(k = -M\)).
It can similarly be shown that the MSE is also a quadratic function of the parameter \(b\).
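As a quick check, expanding the squared error with \(k\) held fixed makes the quadratic dependence on \(b\) explicit:
\[
MSE(b) = \frac{1}{N}\sum_{i=1}^{N} \left(k x_i - y_i + b\right)^2
       = b^2 + \frac{2}{N}\left(\sum_{i=1}^{N} \left(k x_i - y_i\right)\right) b + \frac{1}{N}\sum_{i=1}^{N} \left(k x_i - y_i\right)^2
\]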
Quadratic functions that open upward have two important properties:
- The point where the derivative is \(0\) is the point of its minimum value.
- For any point, its distance from the minimum point is proportional to the magnitude of its derivative, and the minimum lies in the direction opposite to the derivative.
Properties 1 and 2 are the theoretical basis for the least squares method and the gradient descent method, respectively.
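Both properties can be read off directly from the vertex form of an upward-opening quadratic \(f(x) = a(x - x_0)^2 + c\) with \(a > 0\):
\[
f'(x) = 2a\left(x - x_0\right), \qquad f'(x) = 0 \iff x = x_0, \qquad x_0 - x = -\frac{f'(x)}{2a}
\]
The middle equivalence gives property 1; the last equation shows that the displacement from \(x\) to the minimum \(x_0\) is proportional to \(f'(x)\) and points in the direction opposite to it, which is property 2.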
Least squares
The least squares method is designed around the MSE. The idea is to find a set of parameters such that the partial derivative of the MSE with respect to each parameter is \(0\); for the single-input case this means
\[ \frac{\partial MSE}{\partial k} = \frac{2}{N}\sum_{i=1}^{N} x_i\left(k x_i + b - y_i\right) = 0 \tag{3.1} \]
\[ \frac{\partial MSE}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} \left(k x_i + b - y_i\right) = 0 \tag{3.2} \]
First, simplify formula \((3.2)\):
\[ \sum_{i=1}^{N} \left(k x_i + b - y_i\right) = k\sum_{i=1}^{N} x_i + Nb - \sum_{i=1}^{N} y_i = 0 \]
From formula \((3.2)\) we therefore obtain
\[ b = \frac{1}{N}\left(\sum_{i=1}^{N} y_i - k\sum_{i=1}^{N} x_i\right) \tag{3.3} \]
Next, simplify formula \((3.1)\):
\[ \sum_{i=1}^{N} x_i\left(k x_i + b - y_i\right) = k\sum_{i=1}^{N} x_i^2 + b\sum_{i=1}^{N} x_i - \sum_{i=1}^{N} x_i y_i = 0 \]
Substituting formula \((3.3)\) into formula \((3.1)\) gives
\[ k = \frac{N\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2} \tag{3.4} \]
Formulas \((3.3)\) and \((3.4)\) are the parameter formulas of the least squares method.
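As a quick sanity check, using three illustrative points that happen to lie exactly on \(y = 2x + 1\), the formulas recover that line exactly:
\[
\{(1,3),(2,5),(3,7)\}:\quad \sum x_i = 6,\ \sum y_i = 15,\ \sum x_i^2 = 14,\ \sum x_i y_i = 34
\]
\[
k = \frac{3\cdot 34 - 6\cdot 15}{3\cdot 14 - 6^2} = \frac{12}{6} = 2, \qquad b = \frac{15 - 2\cdot 6}{3} = 1
\]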
Gradient descent
For readers new to machine learning, we begin with the simplest case: learning based on a single sample point.
Recall the important property of quadratic functions: the distance of any point from the minimum point is proportional to the magnitude of its derivative, and the minimum lies in the direction opposite to the derivative.
Based on this property, we can design the parameter update rule. For a single sample point, the partial derivatives of the squared error with respect to the parameters are
\[ \frac{\partial \varepsilon^2}{\partial k} = 2\left(kx + b - y\right)x = 2\left(\hat{y} - y\right)x, \qquad \frac{\partial \varepsilon^2}{\partial b} = 2\left(kx + b - y\right) = 2\left(\hat{y} - y\right) \]
Therefore the parameter update equations are
\[ k_{t+1} = k_t - 2\lambda\left(\hat{y} - y\right)x, \qquad b_{t+1} = b_t - 2\lambda\left(\hat{y} - y\right) \]
where \(\lambda\) is the learning rate, generally taken in the range \(10^{-6}\sim 0.1\).
The constant \(2\) can be omitted; keeping it is equivalent to doubling the learning rate, so it is simply absorbed into \(\lambda\).
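A single update step with illustrative values (sample \((x, y) = (1, 3)\), initial \(k = b = 0\), \(\lambda = 0.1\), constant \(2\) omitted) looks like this:
\[
\hat{y} = 0\cdot 1 + 0 = 0, \quad \hat{y} - y = -3, \quad k \leftarrow 0 - 0.1\cdot(-3)\cdot 1 = 0.3, \quad b \leftarrow 0 - 0.1\cdot(-3) = 0.3
\]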
Programming implementation
The reader is advised to create the following header files and define the corresponding functions:
- data.h (name assumed here): variable type definitions
- random_point.h: random point generation
- least_square.h: least squares implementation
- gradient_descent.h: gradient descent implementation
Type definitions
First we need to define the sampling point type and the sampling sequence type.
A sampling point is a data type containing the two values \(x\) and \(y\); for ease of use, the alias Point is also defined.
A sequence of sampling points, i.e., the data, can be stored as a std::vector of Point.
#include <vector>

struct SamplePoint {
    float x;
    float y;
};

using Point = SamplePoint;
using Data = std::vector<Point>;
A straight line contains the two parameters \(k\) and \(b\); for convenience of invocation, the bracket operator () is also overloaded.
struct LinearFunc {
    float k;
    float b;
    // Evaluate y = k * x + b; const so it can be called through const references.
    float operator()(float x) const {
        return k * x + b;
    }
};

using Line = LinearFunc;
using Func = LinearFunc;
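A minimal usage sketch of these types (the numeric values and the data.h header name are illustrative assumptions):

#include <iostream>

#include "data.h"   // type definitions above (header name assumed)

int main() {
    Line line{2.0f, 1.0f};                                   // y = 2x + 1
    Data data{{0.0f, 1.1f}, {1.0f, 2.9f}, {2.0f, 5.2f}};     // hand-made noisy samples
    std::cout << "line(3) = " << line(3.0f) << "\n";         // prints 7
    std::cout << "first sample: (" << data[0].x << ", " << data[0].y << ")\n";
    return 0;
}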
Data generation
The data is generated using normal_distribution from the standard random library, together with the mt19937 random number engine.
#include <random>
#include <cmath>
#include "data.h"   // type definitions (header name assumed)

Data generatePoints(const Func& func, float sigma, float a, float b, int numPoints) {
    Data points;
    std::random_device rd;
    std::mt19937 gen(rd());
    // std::uniform_real_distribution<> distX(a, b); // uniform distribution
    std::normal_distribution<> distX((a + b) / 2, (b - a) / 2.8); // normal distribution over the x range
    std::normal_distribution<> distY(0, sigma);                    // observation noise on y
    for (int i = 0; i < numPoints; ++i) {
        float x = distX(gen);
        float y = func(x) + distY(gen);
        points.push_back({ x, y });
    }
    return points;
}
The method accepts five inputs:
- func: the function, i.e., the mapping from the independent variable \(x\) to the dependent variable \(y\)
- sigma: the standard deviation of the error between the observed and true values of \(y\)
- a, b: reference lower and upper bounds of the range of generated data; they determine the spread of the data, and the vast majority of the data will lie in this interval
- numPoints: the number of points to generate
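A possible call, annotating each parameter (the ground-truth line, noise level, and range below are illustrative choices):

#include "random_point.h"   // generatePoints above

int main() {
    Func truth{2.0f, 1.0f};                  // func: mapping from x to y, here y = 2x + 1
    Data data = generatePoints(truth,
                               0.5f,         // sigma: std. deviation of the y noise
                               0.0f,         // a: lower reference bound of x
                               10.0f,        // b: upper reference bound of x
                               100);         // numPoints: number of samples
    return 0;
}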
Least squares
The least squares method takes a single input, the data of type Data, and returns a Line (a straight line).
In the implementation, we iterate over the sampled data and accumulate \(\sum x_i\), \(\sum y_i\), \(\sum x_i^2\) and \(\sum x_i y_i\).
Line Least_Square(const Data& data) {
    Line line;
    float s_x = 0.0f;   // sum of x_i
    float s_y = 0.0f;   // sum of y_i
    float s_xx = 0.0f;  // sum of x_i^2
    float s_xy = 0.0f;  // sum of x_i * y_i
    float n = static_cast<float>(data.size());
    for (const auto& p : data) {
        s_x += p.x;
        s_y += p.y;
        s_xx += p.x * p.x;
        s_xy += p.x * p.y;
    }
    // Formulas (3.4) and (3.3)
    line.k = (n * s_xy - s_x * s_y) / (n * s_xx - s_x * s_x);
    line.b = (s_y - line.k * s_x) / n;
    return line;
}
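Putting the pieces together, here is a sketch of how the fit might be checked; the ground-truth parameters and noise level are illustrative:

#include <iostream>

#include "least_square.h"
#include "random_point.h"

int main() {
    Func truth{2.0f, 1.0f};                                      // ground truth y = 2x + 1
    Data data = generatePoints(truth, 0.5f, 0.0f, 10.0f, 100);   // noisy samples
    Line fit = Least_Square(data);
    std::cout << "k = " << fit.k << ", b = " << fit.b << "\n";   // expected close to 2 and 1
    return 0;
}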
Gradient descent
Gradient descent is a learning-based method: the parameter estimates are moved step by step toward the optimal estimates, which in this example shows up as a gradual decrease of the MSE.
A single iteration step is implemented first; it traverses all the sampled data and corrects the parameters according to the parameter update formula.
The gradient descent method requires initial values. For linear functions, besides manually chosen or randomized initial values, one option is to assume a purely proportional function to estimate \(k\) and a constant function to estimate \(b\):
\[ k_0 = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i}, \qquad b_0 = \frac{1}{N}\sum_{i=1}^{N} y_i \]
In this example the final estimate is obtained after 100 iterations from the initial values; the reader can adjust this number according to the actual situation. With a suitably chosen learning rate, the number of iterations is generally on the order of \(50\sim200\).
#include "data.h"   // type definitions (header name assumed)

constexpr float eps = 1e-1f;      // tolerance, not used in this example
constexpr float lambda = 1e-5f;   // learning rate

// One pass over the data: update k and b for every sample point.
void GD_step(Func& func, const Data& data) {
    for (const auto& p : data) {
        float error = func(p.x) - p.y;   // \hat{y} - y
        func.k -= lambda * error * p.x;
        func.b -= lambda * error;
    }
}

Func Gradient_Descent(const Data& data) {
    // Initial values: proportional function for k, constant function for b.
    float s_x = 0, s_y = 0;
    for (const auto& p : data) {
        s_x += p.x;
        s_y += p.y;
    }
    Line line;
    line.k = s_y / s_x;
    line.b = s_y / static_cast<float>(data.size());
    for (size_t _ = 0; _ < 100; _++) {
        GD_step(line, data);
    }
    return line;
}
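A sketch comparing the two estimators on the same generated data (ground truth and noise level are illustrative, and Gradient_Descent is called with the signature used above):

#include <iostream>

#include "gradient_descent.h"
#include "least_square.h"
#include "random_point.h"

int main() {
    Func truth{2.0f, 1.0f};
    Data data = generatePoints(truth, 0.5f, 0.0f, 10.0f, 100);
    Line ls = Least_Square(data);
    Line gd = Gradient_Descent(data);
    std::cout << "least squares:    k = " << ls.k << ", b = " << ls.b << "\n";
    std::cout << "gradient descent: k = " << gd.k << ", b = " << gd.b << "\n";
    return 0;
}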
Appendix
NaN issues
There are two common causes of this problem: a wrong sign in the parameter update and an excessively high learning rate.
Wrong sign in the parameter update
In the update formulas, if a plus sign is mistakenly used, or if \(\hat y - y\) is used as \(\varepsilon_i\) while keeping the original sign, the parameters are updated in the direction that increases the error; after a few iterations the distance from the true value grows larger and larger, eventually producing NaN.
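To see why the wrong sign diverges, consider a single sample point and the flipped update \(k \leftarrow k + \lambda(\hat y - y)x\), \(b \leftarrow b + \lambda(\hat y - y)\) (constant \(2\) omitted). Writing \(r = \hat y - y\) for the residual:
\[
\hat y' = \left(k + \lambda r x\right)x + b + \lambda r = \hat y + \lambda r\left(x^2 + 1\right)
\quad\Rightarrow\quad
r' = r\left(1 + \lambda\left(x^2 + 1\right)\right)
\]
The residual is multiplied by a factor greater than \(1\) at every step, so it grows geometrically until the floating-point values overflow and turn into inf/NaN.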
Excessive learning rates
As shown below, when the learning rate is set too high, the new parameters \(\{k_{t+1},b_{t+1}\}\) produce a larger estimation error than the old parameters \(\{k_{t},b_{t}\}\) (red arrow), whereas a good learning rate makes the estimation error decrease gradually.
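For a single sample point with \(b\) held fixed, the effect of the learning rate can be worked out exactly (keeping the constant \(2\)); this is a simplified model of the behaviour described above:
\[
r_{t} = k_t x + b - y, \qquad k_{t+1} = k_t - 2\lambda r_t x
\quad\Rightarrow\quad
r_{t+1} = r_t\left(1 - 2\lambda x^2\right)
\]
The error shrinks only when \(\left|1 - 2\lambda x^2\right| < 1\), i.e. \(0 < \lambda < 1/x^2\); a larger \(\lambda\) makes each step overshoot, and the error grows instead of decreasing.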