
Learning Machine Learning from Scratch - Linear and Polynomial Regression


First of all, I'd like to introduce you to a very useful study address:/columns

In our previous studies, we have gained an understanding of data preparation and data visualization. Today, we will dive into the concepts and applications of basic linear and polynomial regression.

If some mathematical knowledge comes up along the way, there is no need to feel intimidated: I will explain it step by step so that you can follow these ideas more easily. I hope today's study helps you build a clear understanding of these two regression methods and learn how to apply them to practical problems.

Linear and polynomial regression

Typically, there are two main types of regression analysis: linear regression and polynomial regression. Linear regression aims to describe the relationship between variables with a straight line, while polynomial regression uses polynomial functions to capture complex trends in the data more flexibly. To help you visualize the two types, here is an illustration.

image

In fact, the difference between linear and polynomial regression can be boiled down to simply the difference between a straight line and a curve.

Basic linear regression

The goal of the linear regression exercise is to be able to plot an ideal regression line, so what constitutes a "perfect line"? In short, the perfect line is the one that minimizes the distance from all scattered data points to the line. Usually, we use least squares to achieve this goal.

Ideally, this sum should be as small as possible to ensure that the regression line best represents the trend in the data. Next, let's start by looking at the relevant mathematical formulas for a deeper understanding of the process.

$\sum_{i=1}^{n} (y_i - f(x_i))^2$

We want to find a set of unknown parameters that minimize the total error (i.e., distance) between the sample points and the fitted line. The most intuitive sense of this is shown below:

image

Because the result of subtraction can be negative, the concept of squaring is introduced here in order to ensure that the value we calculate is always positive. Whether the result of the subtraction is positive or negative, after squaring, all results will be converted to positive numbers.

This green line is known as the line of best fit and can be represented by a mathematical equation:

Y = a + bX

X is the "explanatory variable" and Y is the "dependent variable". The slope of the line is b, and a is the y-axis intercept, which refers to the value of Y when X = 0.
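To make the least-squares idea concrete, here is a minimal sketch (using made-up toy data, not the pumpkin dataset) that computes the slope b and intercept a from the closed-form least-squares formulas and checks them against NumPy's polyfit:

import numpy as np

# toy data, invented purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form least-squares estimates for Y = a + bX
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

# the quantity being minimized: the sum of squared residuals
sse = np.sum((Y - (a + b * X)) ** 2)

print(a, b, sse)
print(np.polyfit(X, Y, 1))  # returns [b, a] and should match the values above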

A good linear regression model is one where the data, fitted with a straight line by least squares, have a high correlation coefficient (close to 1). Let me explain the correlation coefficient (also known as the Pearson correlation coefficient):

image

We can see that the correlation coefficient reflects the strength and direction of the linear relationship between the variables (first row), not the slope of that relationship (middle row) or various nonlinear relationships (third row). Note that in the center plot of the middle row the slope is 0, but the correlation coefficient is undefined there because the variance of Y is 0.
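To make the coefficient itself less abstract, here is a small sketch (again with invented toy data) that computes the Pearson correlation both from its definition and with pandas' built-in corr:

import numpy as np
import pandas as pd

# toy data: y rises roughly linearly with x, so r should be close to 1
x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

# Pearson r from its definition: covariance divided by the product of the standard deviations
r_manual = np.cov(x, y)[0, 1] / (x.std() * y.std())

# the same value via pandas
r_pandas = x.corr(y)

print(r_manual, r_pandas)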

Tool analysis

Of course, manual computation of metrics like correlation is not practical, especially when dealing with large-scale data, and manual processing is not only inefficient but also prone to errors. Therefore, utilizing existing frameworks, such as Scikit-learn, enables more efficient and accurate correlation analysis, allowing us to focus on more important tasks.

pip install scikit-learn

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
from sklearn.preprocessing import LabelEncoder

# point this at your local copy of the pumpkin-price CSV file
pumpkins = pd.read_csv('../data/')

pumpkins.head()

After obtaining the sample data and analyzing it, it is usually necessary to filter out the fields that we need, while those that are not needed should be discarded for further in-depth analysis.

pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]

columns_to_select = ['Package', 'Variety', 'City Name', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.loc[:, columns_to_select]

price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2

month = pd.DatetimeIndex(pumpkins['Date']).month
day_of_year = pd.to_datetime(pumpkins['Date']).apply(lambda dt: (dt - datetime(dt.year, 1, 1)).days)

new_pumpkins = pd.DataFrame(
    {'Month': month, 
     'DayOfYear' : day_of_year, 
     'Variety': pumpkins['Variety'], 
     'City': pumpkins['City Name'], 
     'Package': pumpkins['Package'], 
     'Low Price': pumpkins['Low Price'],
     'High Price': pumpkins['High Price'], 
     'Price': price})

new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/1.1
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price*2
new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)

new_pumpkins.head()

Although the code may be confusing, the core logic is actually quite simple: we just need to extract the required fields, including year, price, city and packaging information. If you're not familiar with programming, that's perfectly fine - there are many code assistants available, and it's easy to write the code you need with any of them. The key is to keep a clear logical train of thought and not get distracted by the complexity of the code.

import matplotlib.pyplot as plt
plt.scatter('City', 'Price', data=new_pumpkins)

Normally the City field would be text, i.e. the city name, but for computation the snippet new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform) converts the categorical variables in the dataset to numerical variables for the subsequent machine learning modeling.
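If you are curious what that line actually does to a text column, here is a small stand-alone sketch (the city names are hypothetical, used only to show the mapping) of how LabelEncoder turns each category into an integer:

from sklearn.preprocessing import LabelEncoder

# hypothetical city names used only to demonstrate the encoding
cities = ['BALTIMORE', 'BOSTON', 'BALTIMORE', 'NEW YORK', 'BOSTON']

le = LabelEncoder()
encoded = le.fit_transform(cities)

print(encoded)      # [0 1 0 2 1]
print(le.classes_)  # the categories, in the order they map to 0, 1, 2, ...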

image

The plot shown above presents the result of our initial analysis. Of course, you can plot price against any of the fields, but it is often difficult to spot the correlation between variables by visual inspection alone. However, this is not a problem, as there are many powerful tools available to help us perform in-depth data analysis.

print(new_pumpkins['City'].corr(new_pumpkins['Price']))
# 0.32363971816089226

Indeed, the whole process is straightforward and boils down to a single line; the remaining analytical work can be left to the framework. Here we observe that the correlation is only about 0.3, quite far from 1, which means the relationship between these two variables is not strong.

Therefore, we can try to replace another field to analyze the price to further explore the correlation between different variables.

print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
# 0.6061712937226021

image

As we are well aware, there are many factors that affect price, so it seems cumbersome and inefficient to individually try to replace the fields and run the code each time to determine which field has the highest correlation with price. In this case, we can use a more efficient method - constructing a heat map.

corr = poly_pumpkins.corr()
corr.style.background_gradient(cmap='coolwarm')

The code is actually very simple, requiring only two lines to accomplish our goal. Next, let's look at the results of the run together to better understand the effect and output of this code.

image

Based on these correlation coefficients, you can see at a glance that Package and Price are well correlated.

Regression model

So far, we can tentatively confirm that the packaging method is more strongly correlated with price than the city is. This finding sets the stage for our analysis, so next we will create a regression model centered on the Package field.

new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]

Again, we just need the package and price fields. Let's take the first column as X-axis data and the second column as Y-axis data.

Next, we begin constructing the regression model. Much like in Section 1, we still draw a training set and a test set from the overall sample, use Python's scikit-learn library to train a linear regression model, and then make predictions on the test set. Here is the code:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)

pred = lin_reg.predict(X_test)

accuracy_score = lin_reg.score(X_train,y_train)  # score() returns the coefficient of determination R^2 on the training data
print('Model Accuracy: ', accuracy_score)
# Model Accuracy:  0.3315342327998987

Let me explain briefly:

  • Divide the dataset into a training set and a test set
  • Create a linear regression model and train it, and yes, just two lines of code did the training, without even having to write anything ourselves.
  • Prediction of the test set using the trained model
  • Calculate and print model accuracy

The final score really isn't very high, and the correlation isn't very good after all.
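Since r2_score, mean_squared_error and mean_absolute_error were already imported above but not yet used, we can optionally also look at the error on the held-out test set. This is just a sketch that assumes the y_test and pred variables from the code above:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# evaluate the predictions on the test set with the metrics imported earlier
print('R2 on test set:  ', r2_score(y_test, pred))
print('MSE on test set: ', mean_squared_error(y_test, pred))
print('MAE on test set: ', mean_absolute_error(y_test, pred))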

We can visualize the trained model to show its performance and predictions more intuitively.

plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)

plt.xlabel('Package')
plt.ylabel('Price')

plt.show()

image

In this way, we have essentially succeeded in implementing a basic linear regression model.

Polynomial regression

Another important type of linear regression is polynomial regression. While in many cases we observe a direct linear relationship between the variables - for example, the larger the size of the pumpkin, the higher the price usually goes - in some cases the relationship may not be simply represented by a plane or a straight line.

As we can see in the illustration above, if our model uses curves rather than straight lines, this may more effectively fit the distribution of data points and thus more accurately capture the complex relationships between variables.

new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')

Let's take a fresh look at the data and then just pull out the Package and Price values.

X=poly_pumpkins.iloc[:,3:4].values
y=poly_pumpkins.iloc[:,4:5].values

Next, we will proceed directly to model training. In this process, we use another scikit-learn API to build a pipeline containing a polynomial feature transformation and a linear regression model. The pipeline is designed to simplify the data processing flow, allowing polynomial feature generation and model training to be chained together efficiently.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline.fit(np.array(X_train), y_train)

y_pred = pipeline.predict(X_test)

PolynomialFeatures

Here, I would like to briefly introduce the feature PolynomialFeatures, a tool in the scikit-learn library for generating polynomial features.

Its main purpose is to convert the input features into polynomial form, thus allowing us to take into account non-linear relationships between variables in our model. For example, when we have a single feature, using PolynomialFeatures allows us to create squares, cubes, or even higher powers of that feature.

The parameter in our code is set to 4, meaning that we want to perform a fourth-degree polynomial transformation of the input feature X. The transformation expands each original feature into all of its powers up to the fourth degree (plus a constant term of 1), thus enriching our feature set.

To help you understand the process more intuitively, I will provide a set of example data to show what kind of polynomial features the original X-feature data has been transformed into after PolynomialFeatures processing.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
X = X.reshape(-1, 1)
X

This is our current feature set X, containing a simple set of data points.

array([[1],
       [2],
       [3],
       [4],
       [5]])
pp = PolynomialFeatures(4)
X_poly = pp.fit_transform(X)
print(X_poly)

The conversion will generate the following:

[[  1.   1.   1.   1.   1.]
 [  1.   2.   4.   8.  16.]
 [  1.   3.   9.  27.  81.]
 [  1.   4.  16.  64. 256.]
 [  1.   5.  25. 125. 625.]]

The significance of this is that linear regression assumes that the relationship between the features and the target is linear. With polynomial feature transformation, more complex nonlinear relationships can be captured. This helps to improve the fit of the model.

During training, the model is actually learning a nonlinear function of the original feature; for example, a quadratic polynomial $y = ax^2 + bx + c$ represents a parabola.
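To see this with the simplest possible case, here is a minimal sketch (synthetic data, not the pumpkin set) that fits a noiseless parabola with PolynomialFeatures(2) plus LinearRegression, using the same pipeline pattern as above:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# synthetic data that follows y = 2x^2 - 3x + 1 exactly
x = np.arange(-5, 6).reshape(-1, 1)
y = 2 * x**2 - 3 * x + 1

# degree-2 polynomial features followed by ordinary linear regression
model = make_pipeline(PolynomialFeatures(2), LinearRegression())
model.fit(x, y)

print(model.predict(np.array([[2.5]])))  # close to 2*2.5^2 - 3*2.5 + 1 = 6.0
print(model.score(x, y))                 # R^2 is essentially 1 for a perfect fit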

Then this model is trained to produce the following:

df = pd.DataFrame({'x': X_test[:,0], 'y': y_pred[:,0]})
df.sort_values(by='x', inplace=True)
points = pd.DataFrame(df).to_numpy()

plt.plot(points[:, 0], points[:, 1], color="blue", linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.scatter(X, y, color="black")
plt.show()

image

Next, we can just make predictions based on this nonlinear model. A brief demonstration:

pipeline.predict( np.array([ [2.75] ]) )
# array([[46.34509342]])

Summary

In our journey exploring linear and polynomial regression, we have not only learned how to construct models but also come to understand the math behind them and their application scenarios. Through this step-by-step guide, I hope you have gained a deeper appreciation of the intricacies of data analysis. In the future, applying these regression techniques flexibly to real-world problems will give us an edge in data-driven decision-making.

Whether it is choosing the right regression method or making use of powerful tools, we can distill valuable insights from data more effectively. Next, we look forward to exploring further in practice and ultimately achieving a deep understanding and application of data.


I'm Rain, a Java server-side coder exploring the mysteries of AI technology. I love technical communication and sharing, and I am passionate about the open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.

💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.

🌟 Welcome to the effortless drizzle! 🌟