Logistic regression explained for complete beginners (with code)
What is logistic regression?
Logistic regression is a statistical model widely used for classification tasks, especially binary classification. Despite the word "regression" in its name, it is primarily used for classification. The algorithm has the following key components: linear regression and classification, the sigmoid function and decision boundaries, gradient descent and optimization, and regularization and overfitting mitigation.
Linear regression and classification
Classification and regression problems are similar in that both predict unknown outcomes by learning from a data set. The difference lies in the type of output value.
The output values of the classification problem are discrete values (e.g., spam and normal mail).
The output value of a regression problem is a continuous value (e.g., the price of a house).
Since there is some similarity between classification and regression problems, can we classify on the basis of regression?
A classification problem can be tackled with 'linear regression + thresholding': fit a regression function, then split its output in two at a threshold. However, classifying simply by comparing the output of a linear fit with a fixed threshold is very unstable.
When we use plain linear regression for classification, the output is a continuous value with no bounds. This causes several problems:
1. Unbounded range: The output of a linear regression can be any real number (anywhere from -∞ to +∞), which makes it hard to pick a reasonable threshold for deciding which category an output belongs to. For example, the model may produce an extremely large positive or negative value that is difficult to map directly to 0 or 1.
2. Difficulty in choosing a threshold: In order to map continuous values to discrete categories (e.g. 0 and 1), we usually need a decision threshold. If the output values of linear regression do not fall within a fixed range, choosing an appropriate threshold can become very difficult, especially when dealing with extreme values.
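The instability can be seen in a small numerical sketch. The toy data below is made up for illustration and `threshold_boundary` is a hypothetical helper, not part of any library. It fits a least-squares line with NumPy and reports the x value where the line crosses 0.5, first without and then with one extreme (but correctly labelled) point.

```python
import numpy as np

def threshold_boundary(x, y, threshold=0.5):
    """Fit y ~ slope * x + intercept by least squares and return
    the x value at which the fitted line crosses the threshold."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return (threshold - intercept) / slope

# Toy data (assumed): label 0 for x <= 3, label 1 for x >= 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
boundary_clean = threshold_boundary(x, y)

# Add a single extreme positive sample far to the right.
x_out = np.append(x, 30.0)
y_out = np.append(y, 1.0)
boundary_outlier = threshold_boundary(x_out, y_out)

print(f"boundary without the extreme point: {boundary_clean:.2f}")
print(f"boundary with the extreme point:    {boundary_outlier:.2f}")
```

In this toy setup the extreme point drags the fitted line down, so the crossing point moves to the right of x = 4 and that positive sample ends up on the wrong side of the threshold.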
Can we map this result to an interval of fixed size, such as \((0,1)\), and classify on that basis?
Sigmoid function and decision boundaries
Yes, we can, and that is exactly what logistic regression does. One function used to compress continuous values into a fixed interval is the sigmoid function (also known as the logistic function or S-shaped function).
\(S(x)=\frac{1}{1+e^{-x}}\), whose output always lies in \((0,1)\).
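As a quick check of this squashing behaviour, here is a minimal NumPy sketch of the sigmoid function. The input scores are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary scores, from strongly negative to strongly positive.
scores = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
probs = sigmoid(scores)

for s, p in zip(scores, probs):
    print(f"S({s:+.1f}) = {p:.5f}")
```

Large negative inputs land close to 0, large positive inputs land close to 1, and an input of 0 maps to exactly 0.5.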
Having seen the sigmoid function, let's look at how combining it with a linear fit solves the classification problem and yields a clear, interpretable decision boundary for the classifier. The decision boundary is the boundary along which the classifier separates samples into classes.
So how does logistic regression obtain a decision boundary, and how does that relate to the sigmoid function? There are two types of decision boundary: linear and nonlinear.
In both figures, the plot on the right is the sigmoid function, which maps the value computed by the function on the left into the interval \((0,1)\). For binary classification we look at the level where the sigmoid output equals 0.5: it corresponds to the curve in the original feature space that divides the data into two groups. In the linear example, the parameters \(\theta_0,\theta_1,\theta_2\) take the values -3, 1, 1, so the function built from them, \(-3+x_1+x_2=0\), is the decision boundary, and 0.5 is the decision threshold.
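The linear example can be sketched in code. The parameter values -3, 1, 1 come from the text; the feature names \(x_1, x_2\), the helper functions and the sample points are assumptions made for illustration.

```python
import numpy as np

# theta_0, theta_1, theta_2 from the linear example in the text.
theta = np.array([-3.0, 1.0, 1.0])

def predict_proba(points, theta):
    """Sigmoid of the linear score theta_0 + theta_1 * x1 + theta_2 * x2."""
    z = theta[0] + points @ theta[1:]
    return 1.0 / (1.0 + np.exp(-z))

def predict(points, theta, threshold=0.5):
    """Class 1 when the sigmoid output reaches the 0.5 threshold, else class 0."""
    return (predict_proba(points, theta) >= threshold).astype(int)

# Hypothetical sample points (x1, x2).
points = np.array([
    [1.0, 1.0],   # x1 + x2 = 2, below the line x1 + x2 = 3
    [2.0, 2.0],   # x1 + x2 = 4, above the line
    [1.5, 1.5],   # exactly on the line
])
labels = predict(points, theta)
print(labels)
```

A sigmoid output of 0.5 corresponds to a linear score of 0, so thresholding the probability at 0.5 is the same as asking on which side of the line \(x_1+x_2=3\) a point lies.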
Gradient descent and optimization
Loss function
In the previous example we chose the parameters \(\theta_0,\theta_1,\theta_2\) by hand and obtained a decision boundary. Clearly, different parameter values give different decision boundaries.
Which decision boundary is the best? We need a function that quantitatively measures how good the model is: a loss function (sometimes called an 'objective function' or 'cost function'). Our goal is to minimize the loss function, rather than keep picking parameters by hand.
How do we measure the difference between the predicted value and the true label? The simplest and most direct measure is the mean squared error: \(MSE=\frac{1}{m}\sum_{i=1}^{m} (f(x_i)-y_i)^2\)
Mean squared error (MSE) loss is widely used for regression problems, but it is not well suited to logistic regression. Because of the sigmoid transformation, the resulting loss curve, shown in the figure below, is bumpy and non-convex, so it can have several local minima.
We want the loss function to be convex, as shown below. In a convex optimization problem, any local optimum is also the global optimum.
For logistic regression we therefore switch to the logarithmic loss function (binary cross-entropy loss). It is also a good measure of how good the parameters are, and it keeps the loss function convex. The formula for the logarithmic loss function is as follows:
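The difference in convexity can be checked numerically on the smallest possible case. The sketch below assumes a single made-up sample with input 1 and label 1, so the prediction is simply the sigmoid of one parameter. It compares the discrete second derivative of the squared error with that of the log loss \(-\log(\text{prediction})\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One assumed sample with x = 1 and y = 1: the prediction is sigmoid(theta).
theta = np.linspace(-10.0, 10.0, 201)
pred = sigmoid(theta)

mse = (pred - 1.0) ** 2      # squared error for this sample
log_loss = -np.log(pred)     # log loss for a positive sample

# Discrete second derivative: never negative for a convex curve.
mse_curvature = np.diff(mse, n=2)
log_curvature = np.diff(log_loss, n=2)

print("MSE convex everywhere:     ", bool(np.all(mse_curvature >= 0)))
print("log loss convex everywhere:", bool(np.all(log_curvature >= 0)))
```

Even in this one-sample case the squared error curve bends downward for strongly negative parameter values, while the log loss curve bends upward everywhere.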
\(J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log{h_{\theta}(x^{(i)})}+(1-y^{(i)})\log{\left(1-h_{\theta}(x^{(i)})\right)}\right]\)
Here \(h_{\theta}(x^{(i)})\) is the model's predicted value for sample \(i\) (the output of the function \(h\)), and \(y^{(i)}\) is the sample's label: 1 for a positive sample and 0 for a negative sample. Positive and negative samples are the two categories mentioned earlier; for example, normal mail is a positive sample and spam is a negative sample. Let's look at the two cases:
\(y^{(i)}=0\): the sample is negative. If \(h_{\theta}(x^{(i)})\) is close to 1 (the prediction says positive), then \(-\log{(1-h_\theta (x^{(i)}))}\) is large, and so is the corresponding penalty.
\(y^{(i)}=1\): the sample is positive. If \(h_{\theta}(x^{(i)})\) is close to 0 (the prediction says negative), then \(-\log{(h_\theta (x^{(i)}))}\) is large, and so is the corresponding penalty.
The size of the penalty therefore tells us how good or bad the parameters are.
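The two cases can be tried out with a minimal NumPy sketch of the logarithmic loss. The function name `binary_log_loss`, the clipping constant `eps` and the example predictions are assumptions for illustration.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-12):
    """Mean of -[y * log(h) + (1 - y) * log(1 - h)] over all samples.
    Predictions are clipped away from 0 and 1 so that log() stays finite."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# One positive and one negative sample.
y_true = np.array([1.0, 0.0])

# Predictions that agree with the labels...
good = binary_log_loss(y_true, np.array([0.9, 0.1]))
# ...and predictions that confidently contradict them.
bad = binary_log_loss(y_true, np.array([0.1, 0.9]))

print(f"loss for good predictions: {good:.4f}")
print(f"loss for bad predictions:  {bad:.4f}")
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is the behaviour described in the two cases above.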
Gradient descent
The loss function measures how good the model parameters are, but we also need an optimization method to find the best parameters (those that minimize the loss). One of the most common algorithms is gradient descent, which reduces the loss step by step through iteration and works very well when the function is convex. It is like descending a mountain: find the direction of the slope and take one small step at a time until you reach the bottom.
Gradient descent is a first-order optimization algorithm, often also referred to as steepest descent. To find a local minimum of a function, we repeatedly move from the current point by a specified step size in the direction opposite to the gradient (or approximate gradient) at that point.
In the update rule \(\theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j}\), \(\alpha\) is known as the learning rate. Intuitively, it is the length of each step taken toward the minimum. A learning rate that is too large tends to overshoot the minimum, and one that is too small requires too many iterations.
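The iteration can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the toy data, the learning rate of 0.1 and the iteration count are assumptions, and the gradient of the logarithmic loss is taken as \(\frac{1}{m}X^{T}(h-y)\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    h = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat theta := theta - alpha * gradient, recording the loss after each step."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / len(y)   # gradient of the log loss
        theta -= alpha * grad           # step against the gradient
        history.append(log_loss(theta, X, y))
    return theta, history

# Toy data (assumed): one feature, labels that are not perfectly separable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])   # first column is the bias term

theta, history = gradient_descent(X, y)
print("theta:", theta)
print(f"loss after first step: {history[0]:.4f}, after last step: {history[-1]:.4f}")
```

With this small learning rate the loss shrinks at every step on the toy data; raising `alpha` far enough would make the updates overshoot, as described above.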
Regularization and overfitting mitigation
Curve 3 is overfitting: the model has fit the training data too rigidly. What can we do about it?
One way to deal with overfitting is regularization. By adding a regularization term to the loss function, we constrain the search space of the parameters, so that the fitted decision boundary does not wiggle excessively. The following figure shows the logarithmic loss function with a regularization term added (here, an L2 regularization term).
The term after the loss function is the L2 regularization term. L2 regularization can be thought of as a penalty on the Euclidean distance of the parameters from the origin. Geometrically, it reduces model complexity by shrinking all parameters toward the origin: larger parameter values are penalized, but they are not driven all the way to zero. This "smoothing" penalty encourages the model to spread weight more evenly across features, which reduces overfitting.
With the regularization term added, the loss function can still be optimized using gradient descent.
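A minimal sketch of gradient descent on the regularized loss follows. It assumes one common form of the penalty, \(\frac{\lambda}{2m}\sum_{j\ge 1}\theta_j^2\), which leaves the bias \(\theta_0\) unpenalized. The toy data, the function name `fit_l2` and the values of \(\lambda\) are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2(X, y, lam, alpha=0.1, n_iters=5000):
    """Gradient descent on log loss + (lam / (2m)) * sum(theta_j^2 for j >= 1)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += (lam / m) * theta[1:]   # L2 term; the bias theta_0 is not penalized
        theta -= alpha * grad
    return theta

# Toy data (assumed): one feature, labels that are not perfectly separable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

theta_plain = fit_l2(X, y, lam=0.0)   # no regularization
theta_reg = fit_l2(X, y, lam=10.0)    # strong L2 penalty

print("feature weight without L2:", theta_plain[1])
print("feature weight with L2:   ", theta_reg[1])
```

The penalized fit ends up with a feature weight closer to zero, which is the shrinking effect described above.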
Code implementation
The following example uses the breast cancer dataset provided by scikit-learn:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_breast_cancer
# Load Dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predict on the test set
y_pred = model.predict(X_test)
# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Print Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
# Print Classification Report
class_report = classification_report(y_test, y_pred)
print("Classification Report:")
print(class_report)
The results are as follows:
Accuracy: 0.98
Confusion Matrix:
[[ 62   1]
 [  2 106]]
Classification Report:
              precision    recall  f1-score   support
           0       0.97      0.98      0.98        63
           1       0.99      0.98      0.99       108
    accuracy                           0.98       171
   macro avg       0.98      0.98      0.98       171
weighted avg       0.98      0.98      0.98       171
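To connect the regularization discussion with the scikit-learn model, the sketch below varies the `C` parameter of `LogisticRegression`, which is the inverse of the regularization strength (a smaller `C` means a stronger penalty). The three values of `C` and the raised `max_iter` are arbitrary choices for illustration, and no particular accuracy figures are claimed here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation as in the example above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Smaller C = stronger regularization = smaller weights.
weight_norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_train, y_train)
    weight_norms[C] = float(np.linalg.norm(model.coef_))
    accuracy = model.score(X_test, y_test)
    print(f"C={C:<6} ||w||={weight_norms[C]:.2f} test accuracy={accuracy:.3f}")

# predict_proba returns the sigmoid output for each class instead of a hard label.
proba = model.predict_proba(X_test[:3])
print(proba)
```

Each row of `predict_proba` holds the estimated probabilities for class 0 and class 1, and `predict` picks the class whose probability is larger.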