
Learning Machine Learning from Scratch - Logistic Regression


First of all, I'd like to recommend a very useful study address: /columns

In our previous posts, we learned about linear regression and polynomial regression. Today's topic is logistic regression. I remember explaining the difference between these regressions earlier, so today we'll focus on the main features of logistic regression that we need to recognize.

Logistic regression

Logistic regression is mainly used to solve binary classification problems: it helps us predict whether a certain event will happen or not, such as determining whether a certain candy is chocolate, whether a certain disease is contagious, or whether a certain customer will choose a specific product. It turns complex data into a simple "yes" or "no" result, which is applicable to many real-world scenarios. This is usually the kind of problem that logistic regression solves.

Unlike logistic regression, which focuses on predicting binary categorical outcomes, linear regression aims to predict continuous values. For example, with linear regression we can predict how much the price of a pumpkin might increase based on its origin, harvest time, and other relevant characteristics. However, this prediction is not absolutely certain, because it relies on trends and patterns in historical data: by analyzing past price changes and the factors that influenced them, linear regression can provide a reasonable estimate.

Other classifications

Of course, in addition to binary classification problems, logistic regression can be extended to multiclass classification problems. In multiclass classification, the target variable can take one of several possible fixed answers, each of which is a clear and recognizable option.

In addition to multiclass classification, there is a special type of classification problem called ordinal classification. In ordinal classification, we are not only concerned with which category a sample belongs to, but we also need a logical ordering of the categories, which can be very useful in certain situations.

For example, suppose we want to categorize pumpkins into different classes based on their size, such as "mini", "small" (sm), "medium" (med), "large" (lg), "extra large" (xl) and "extra extra large" (xxl). In this case, there is a clear ordinal relationship between these categories.
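As a minimal illustration (my own sketch, not code from the original lesson), pandas can encode exactly this kind of ordering with an ordered categorical type; the size codes are the ones listed above:

import pandas as pd

# Hypothetical example: declare pumpkin sizes as an ordered categorical variable
sizes = pd.Series(["med", "xl", "mini", "sm", "lg", "xxl"])
size_order = ["mini", "sm", "med", "lg", "xl", "xxl"]
ordered_sizes = pd.Categorical(sizes, categories=size_order, ordered=True)

print(ordered_sizes)            # keeps the declared order mini < sm < ... < xxl
print(ordered_sizes.codes)      # integer codes that respect the ordering: [2 4 0 1 3 5]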

image

Here is a separate note on ordinal logistic regression, which is a generalized linear model that applies when the dependent variable is an ordered categorical variable. Simply put: we set multiple thresholds \(T_1, T_2, \ldots, T_{k-1}\) that divide a continuous latent variable \(Y^*\) into k categories. Assuming that Y has k categories, the ordinal logistic regression model can be expressed as:

\(P(Y \le j \mid X) = \dfrac{1}{1 + e^{-(T_j - \beta X)}}, \qquad j = 1, 2, \ldots, k-1\)

If that is still abstract, here is an example: a field biologist wants to study the survival time of salamanders and to determine whether survival time is related to region and water toxicity levels. The biologist divides survival time into three categories: less than 10 days, 11 to 30 days, and more than 30 days. Since the response is an ordinal variable, the biologist uses ordinal logistic regression.
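For completeness, here is a minimal sketch of how such a model could be fitted in Python, assuming statsmodels (0.13 or newer) is installed; the salamander-style data below is entirely made up for illustration:

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Made-up data: water toxicity, region, and a latent survival score (illustration only)
rng = np.random.default_rng(0)
n = 200
toxicity = rng.uniform(0, 10, n)
region = rng.integers(0, 2, n)
latent = 2.0 - 0.4 * toxicity + 0.5 * region + rng.logistic(size=n)

# 0: <10 days, 1: 11-30 days, 2: >30 days
survival = np.digitize(latent, bins=[-1.0, 1.0])

X = pd.DataFrame({"toxicity": toxicity, "region": region})
model = OrderedModel(survival, X, distr="logit")   # cumulative logit model
result = model.fit(method="bfgs", disp=False)
print(result.summary())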

What are dependent and independent variables

Briefly again, in case anyone doesn't remember (I had actually forgotten myself): the dependent and independent variables are two basic concepts in statistics and regression analysis, and are often used in modeling to describe the relationship between variables.

  • The independent variable, also known as the predictor or explanatory variable, is usually the input data in regression analysis. It is usually indicated using X.
  • The dependent variable, also known as the response variable, is the outcome variable of interest to the researcher and is usually the variable that is affected by the independent variable. The change in the dependent variable is the goal of the study. It is usually indicated using Y.

It's still linear.

Why is logistic regression still "linear"? Mainly because logistic regression can be seen as an extension of linear regression. Although it is used to make category predictions, the underlying model still relies on a linear combination of the features, and the stronger that linear relationship is, the better the model tends to perform.

If there is a strong linear relationship between the independent and dependent variables, the model is able to capture this relationship more effectively, thus significantly improving the accuracy of the predictions. When the linear relationship is well defined, the model is able to more accurately reflect the link between the variables when fitting the data, thus reducing the prediction error.

When there is a clear linear relationship between the independent and dependent variables, the model is able to delineate the category boundaries more clearly. In two dimensions, for example, if data points of different categories are distributed on both sides of a straight line, the linear model is able to identify and determine this separating line more accurately.
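A small sketch (synthetic data, not the pumpkin dataset) that makes this concrete: after fitting a logistic regression on two features, the decision boundary is simply the straight line where the linear score \(w_1 x_1 + w_2 x_2 + b\) equals zero:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic, roughly linearly separable 2D data (illustration only)
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LogisticRegression().fit(X, y)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]

# The decision boundary is the straight line w1*x1 + w2*x2 + b = 0
print(f"boundary: {w1:.2f} * x1 + {w2:.2f} * x2 + {b:.2f} = 0")
print("training accuracy:", clf.score(X, y))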

Variables do not need to be correlated

Linear regression usually requires a degree of linear relationship between the independent and dependent variables, which is the basis of its validity. When this relationship is strong, the model can predict the value of the dependent variable more accurately. In addition, linear regression is very sensitive to correlations among the independent variables: when several independent variables are correlated with each other, multicollinearity problems can arise, affecting the stability and explanatory power of the model. There is a chart in our previous post illustrating this, so you can go back and take a look.
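As a side note, one common way to check for multicollinearity among predictors is the variance inflation factor (VIF); this is my own hypothetical helper using statsmodels, not something from the original lesson, and values above roughly 5-10 are usually taken as a warning sign:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    # One VIF value per predictor column; higher means more collinear with the others
    return pd.Series(
        [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
        index=df.columns,
    )

# e.g. print(vif_table(df_of_numeric_predictors))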

Logistic regression, on the other hand, is more flexible: the relationship between the independent variables and the dependent variable does not have to be strictly linear, and the independent variables may or may not be strongly correlated with each other while still influencing the classification result. This flexibility allows logistic regression to adapt to many different data distributions and category boundaries, capturing more complex patterns and trends. For example, a graph like the one below:

image

Start practicing

We also use the previous pumpkin dataset for model training. To ensure the quality and reliability of the data, the dataset first needs to be properly cleaned. Specifically, we remove all null values to avoid the negative impact of missing data on model training.

In addition, we will select only some specific columns that are relevant to our analytical goals to simplify the dataset and improve the efficiency of the model.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

pumpkins = pd.read_csv('../data/')
pumpkins.head()

new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
# Delete null values
new_pumpkins.dropna(inplace=True)
# Convert string-type variables to numeric variables
new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)

# Note: no parentheses here, so this prints the bound info method together with a preview of the DataFrame
new_pumpkins.info

Next, let's show the printout so that you can get a clear picture of the specifics and structure of that dataset.

<bound method DataFrame.info of       City Name  Package  Variety  Origin  Item Size  Color
2             1        3        4       3          3      0
3             1        3        4      17          3      0
4             1        3        4       5          2      0
5             1        3        4       5          2      0
6             1        4        4       5          3      0
...         ...      ...      ...     ...        ...    ...
1694         12        3        5       4          6      1
1695         12        3        5       4          6      1
1696         12        3        5       4          6      1
1697         12        3        5       4          6      1
1698         12        3        5       4          6      1

[991 rows x 6 columns]>

Visualization

As we did when learning linear regression, where we identified correlations among all the variables with a heat map and found the two most correlated fields, for logistic regression we also need tools and methods rather than manually checking each value one by one. Today, we will focus on the pair grid (Seaborn's PairGrid).

Seaborn provides a range of clever and powerful ways to visualize data, making data analysis more intuitive and easier to understand. For example, using a pair grid, we can compare the distribution of every pair of variables side by side.

import seaborn as sns

g = sns.PairGrid(new_pumpkins)
g.map(sns.scatterplot)

PairGrid: This is a tool provided by Seaborn to create scatterplot matrices of pairs between variables.

map: This method is used to apply a plotting function to each pair of variables in the PairGrid.

Then the final presentation is this:

image

You can see a lot in this picture. Let me point out the two most notable patterns:

Each variable plotted against itself forms a straight line: the value of each data point is exactly equal to itself, so any variable X is perfectly linearly related to itself, forming a diagonal with a slope of one.

In the scatterplots of the Color variable against the other variables (e.g. Origin, Item Size, Variety, City Name, Package), the points form two rows: this is because Color takes only a finite number of categorical values, i.e. it is a categorical variable (e.g., "red", "blue"), and the data points for each color form separate groups in the scatter plot. Different color values may also show noticeable differences in the other variables.

Use of violin plots

From the above analysis, we can conclude that the color variable is a binary category, specifically, it can be classified as "orange" or "non-orange". This type of variable is referred to as "categorical data" and therefore requires a more specialized and effective approach to visualization.

In addition, there are many other techniques and tools for showing the relationship between a category and other variables, such as box plots, bar charts, and violin plots.

(x="Color", y="Item Size",kind="violin", data=new_pumpkins)

image

For those who haven't seen a violin plot before, let me explain. Here is a standard violin plot:

image

The Violin Plot is used to display the data distribution and its probability density.

Violin plots can be an efficient and attractive way to display multiple data distributions at once, but keep in mind that the density estimate is affected by sample size: violins drawn from relatively small samples can look deceptively smooth, and this smoothness is misleading. If that doesn't make sense, take height as an example. Let's see what the height distributions of, say, people from the south and the north look like (the data is for learning purposes only).

image

At a glance you cannot really see the typical height of the South, and the North's curve is so smooth that it reveals almost nothing, so in cases like this you have to rely on other means of analysis; the violin plot alone cannot be trusted.
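To see this sample-size effect for yourself, here is a minimal made-up illustration (not the article's data): the same normal distribution drawn with 15 and with 500 samples produces violins that look equally smooth, which is exactly the misleading behaviour described above:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up heights drawn from the same distribution, with very different sample sizes
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": np.concatenate([rng.normal(170, 8, 15), rng.normal(170, 8, 500)]),
    "sample": ["n=15"] * 15 + ["n=500"] * 500,
})

sns.violinplot(x="sample", y="height", data=df)
plt.show()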

Okay, to conclude: now that we understand the relationship between the binary Color category and the other dimensions, let's explore logistic regression to determine the likely color of a given pumpkin.

A bit of math

Remember the least squares method from when we were learning linear regression? We used it to find the line that minimizes the distance to all the scattered data points. Logistic regression likewise relies on a key mathematical function, the sigmoid function, whose curve looks like an "S". It accepts any value and maps it to a number between 0 and 1. Its curve is also called a "logistic curve". Its formula is shown below:

\(f(x) = \dfrac{L}{1 + e^{-k(x - x_0)}}\)

Its graph looks like this:

image

where the midpoint \(x_0\) of the sigmoid is at x = 0, L is the maximum value of the curve, and k is the steepness of the curve. If the result of the function is greater than 0.5, the label in question is assigned the binary class "1"; otherwise, it is classified as "0".
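As a quick sketch of my own (with L = 1, k = 1 and the midpoint \(x_0 = 0\), i.e. the standard sigmoid), here is the function together with the 0.5 threshold rule:

import numpy as np

def sigmoid(x):
    # Standard logistic function: L = 1, k = 1, x0 = 0
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-3.0, -0.5, 0.0, 0.7, 4.0])
probs = sigmoid(scores)
labels = (probs > 0.5).astype(int)   # threshold at 0.5

print(probs)    # values squeezed into (0, 1)
print(labels)   # [0 0 0 1 1]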

Of course, there is a wealth of tools and instruments available nowadays, and we don't have to perform complex calculations ourselves at all, as various frameworks and libraries have already encapsulated these functions for us. However, it is still crucial to understand the principles behind them.

Build your model

As with linear regression, the first step in using Scikit-learn to build a model to find these binary classifications is always to split out some test and training sets from the overall data.

from sklearn.model_selection import train_test_split

Selected_features = ['Origin','Item Size','Variety','City Name','Package']

X = new_pumpkins[Selected_features]
y = new_pumpkins['Color']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Then it's time to train our model using the training set data

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))

Let me explain the new method that appears here, classification_report.

classification_report is a function in Scikit-learn for evaluating the performance of a classification model. It provides several key metrics to help you understand how the model performs on the test set. You can see that two parameters are passed: one is y_test, the actual results, and the other is the predictions our trained model inferred from the test set; the function compares the two. The results are displayed as follows:

              precision    recall  f1-score   support

           0       0.83      0.98      0.90       166
           1       0.00      0.00      0.00        33

    accuracy                           0.81       199
   macro avg       0.42      0.49      0.45       199
weighted avg       0.69      0.81      0.75       199

Of course there are a lot of metrics on display, so let's take our time and look at them:

Precision: of the samples the model labeled as positive, how many really are positive. For example: the model labels 10 photos as cats, 7 of them are real cats and 3 are dogs. Precision = real cats / (real cats + mislabeled dogs) = 7 / (7 + 3) = 0.7 (70%)

Recall: the proportion of actual positive samples that are predicted as positive. Continuing the cat example: there are 15 real cat photos in total, and the model only found 7 of them. Recall = found cats / (found cats + missed cats) = 7 / (7 + 8) = 7 / 15 ≈ 0.47 (47%)

F1 Score: the harmonic mean of precision and recall, combining the two into one number. The best value is 1 and the worst is 0.

Support: The number of true samples for each class.

Accuracy: the proportion of correctly predicted samples out of all samples. Not to be confused with precision. For example: suppose you have 100 photos and the model correctly labels 90 of them (cats and dogs alike). Accuracy = 90 / 100 = 0.9 (90%). Why count both cats and dogs? Because accuracy measures how well the model does across all classes, not just one.
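These formulas are easy to verify against Scikit-learn's own metric functions. Here is a small sketch using the made-up cat/dog numbers from the examples above (not the pumpkin data):

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# 1 = cat, 0 = dog. 15 real cats: the model finds 7 and misses 8.
# It also mislabels 3 of the 85 dogs as cats, so it labels 10 photos "cat" in total.
y_true = np.array([1] * 15 + [0] * 85)
y_pred = np.array([1] * 7 + [0] * 8 + [1] * 3 + [0] * 82)

print(precision_score(y_true, y_pred))  # 7 / (7 + 3)  = 0.70
print(recall_score(y_true, y_pred))     # 7 / (7 + 8)  ≈ 0.47
print(f1_score(y_true, y_pred))         # harmonic mean of the two
print(accuracy_score(y_true, y_pred))   # (7 + 82) / 100 = 0.89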

While we can work through the various metric scores one by one to understand the model, it is often easier to use a confusion matrix, which gives a direct picture of the model's performance.

Confusion matrix

The "confusion matrix" (or "error matrix") is a table that represents the true and false positives and true and false negatives of a model, thus measuring the accuracy of the predictions.

In a binary classification problem, the confusion matrix usually looks like this:

                    Predicted Positive      Predicted Negative
Actual Positive     True Positive (TP)      False Negative (FN)
Actual Negative     False Positive (FP)     True Negative (TN)

Look at the code results:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)

array([[162, 4],
[ 33, 0]])

For example, suppose our model is asked to categorize items between two binary categories, i.e., the category "pumpkin" and the category "non-pumpkin".

  • If your model predicts something to be a pumpkin and it actually belongs to the "pumpkin" category, we call this a true positive, shown by the number in the upper left corner.
  • If your model predicts that something is not a pumpkin, and it actually belongs to the "pumpkin" category, we call this a false negative, as shown by the number in the upper right corner.
  • If your model predicts something to be a pumpkin and it actually falls into the "not a pumpkin" category, we call this a false positive, shown by the number in the lower left corner.
  • If your model predicts that something is not a pumpkin, and it actually belongs to the "not a pumpkin" category, we call this a true negative, as indicated by the number in the lower right corner.

The more true positives and true negatives our model has, the better it performs. If you want to reduce false positives, keep an eye on the precision metric; if you want to reduce false negatives (missed positives), watch the recall metric. The sketch below shows how to pull these four counts out of the matrix directly.
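Here is a small sketch of my own showing how to unpack those four counts with Scikit-learn. Note that confusion_matrix orders rows and columns by sorted label value, so I pass labels= explicitly to match the actual-positive-row-on-top layout of the table above; treating label 0 as the "positive" color here is an assumption that matches the numbers shown:

from sklearn.metrics import confusion_matrix

positive_label = 0   # assumption: the code LabelEncoder gave to the color we treat as "positive"
negative_label = 1

cm = confusion_matrix(y_test, predictions, labels=[positive_label, negative_label])
tp, fn, fp, tn = cm.ravel()

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print("precision:", tp / (tp + fp))   # hurt by false positives
print("recall:   ", tp / (tp + fn))   # hurt by false negatives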

ROC curve

The ROC curve (Receiver Operating Characteristic curve) is used to evaluate the performance of a binary classification model. From the graph we can quickly read off the true positive rate on the Y-axis against the false positive rate on the X-axis. Let's take a look at the code and the result.

from sklearn.metrics import roc_curve, roc_auc_score
import seaborn as sns

y_scores = model.predict_proba(X_test)
# Calculate the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
sns.lineplot(x=[0, 1], y=[0, 1])
sns.lineplot(x=fpr, y=tpr)

image

Ideal curve: ideally, the curve should be as close to the upper left corner as possible, which indicates that the model maintains a high true positive rate at a low false positive rate.

Random guess line: the diagonal line (from (0,0) to (1,1)) represents random guesses. If the model's ROC curve overlaps with this line, it means that the model has no discriminatory ability.

Area Under the Curve (AUC): the area under the ROC curve, with a value between 0 and 1:

  • AUC = 1: perfect model.
  • AUC = 0.5: the model has no discriminatory power and amounts to a random guess.
  • AUC < 0.5: The model does not perform as well as a random guess.

Let's do the math:

auc = roc_auc_score(y_test,y_scores[:,1])
print(auc)

0.6976998904709749

By analyzing ROC curves and AUC values, you can better understand the model's classification performance, especially when faced with different prediction thresholds.

Why resort to so many visualization tools?

In fact, each tool fills a different role: confusion matrices visualize the predictions, classification reports provide comprehensive performance metrics, and ROC curves help select optimal thresholds and evaluate overall performance. The goal is to better understand the model, and then to improve it.

Of course, in this lesson we only looked at the current state of the model and did not optimize it based on these results; that is something we'll cover later. Let's leave it here for now.

Summary

After learning the basic concepts and applications of logistic regression, we can see that this method is not only capable of handling binary classification problems, but can also be extended to multivariate and ordered classification scenarios. The flexibility of logistic regression allows us to find appropriate solutions in different situations, whether it is a simple "yes" or "no" judgment, or more complex multi-level classification tasks.

It is worth noting that logistic regression is still based on linear relationships, but it has significantly different goals and application scenarios than traditional linear regression. The approach gives us an intuitive, probabilistic understanding of the results by mapping a continuous score to a value between 0 and 1 with the sigmoid function.

In addition, in the process of data processing and visualization, we used a variety of tools, such as pair grids and violin plots, which not only helped us analyze the relationships in the data, but also made the data more intuitive and comprehensible. Through these visualization tools, we were able to clearly identify the links between variables and the boundaries between classes.

In the hands-on session, we cleaned the data and selected features, and constructed a logistic regression model using Scikit-learn. Evaluating model performance through methods such as confusion matrices and ROC curves allowed us to gain a deeper understanding of the accuracy and predictive power of the model.

In future classification lessons, we'll learn how to iterate to improve our models' scores. But for now, that's a wrap, throw the confetti! We're done with these regression lessons!


I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.

💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.

🌟 Welcome to the effortless drizzle! 🌟