In the previous section, we finished preparing the balanced, cleaned cuisine data. Next, we will consider using multiple algorithms, i.e., different classifiers, to build a model. With this dataset and various classifiers, we will predict which country's cuisine a dish belongs to based on its set of ingredients.
Along the way, you'll learn in-depth how to evaluate and weigh the strengths and weaknesses of different classification algorithms and how to choose the most appropriate model for the task at hand.
Choose your classifier
Scikit-learn groups classification under supervised learning, which offers a wide variety of algorithms and methods for the task. At first glance, the choices can seem dizzying. Here are some of the main algorithms available for classification (a short sketch of their shared API follows the list):
- Linear Models: classify via a linear combination of features, based on a linearity assumption.
- Support Vector Machines (SVMs): classify by finding the optimal separating hyperplane, maximizing the margin between classes.
- Stochastic Gradient Descent (SGD): an efficient optimization method for training various models, especially on large-scale datasets.
- Nearest Neighbors: an instance-based learning method that classifies samples by computing distances between them; simple and effective.
- Gaussian Processes: a flexible nonparametric Bayesian approach that captures the underlying distribution of the data for classification.
- Decision Trees: make decisions by building a tree structure that progressively splits the data according to feature values.
- Ensemble methods (e.g., voting classifiers): combine predictions from multiple classifiers to improve overall classification performance.
- Multiclass and multioutput algorithms: handle multi-class and multi-label classification problems, producing predictions for multiple outputs simultaneously.
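All of the above are exposed through scikit-learn's uniform estimator interface, so trying several side by side is cheap. Here is a minimal sketch (on synthetic data, not our cuisine dataset) that instantiates a few of the listed classifiers and cross-validates them interchangeably:

```python
# A sketch on synthetic data: every classifier below shares the same
# fit/predict estimator API, so they can be swapped and compared freely.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "SGD": SGDClassifier(),
    "nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```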
How to choose a classifier?
Instead of guessing aimlessly, download Microsoft's original machine learning algorithm cheat sheet. This checklist systematically compares and summarizes the various algorithms and can effectively guide us toward an informed choice of classification algorithm. Based on it, we can consider the following algorithm choices for the multi-class classification task covered in this chapter:
Below is my own Chinese translation of the cheat sheet; I hope it helps, as shown in the picture:
Analyzing the model choices
First, our targets are a fixed set of classes, so this is a multi-class classification problem and we can focus on multi-class models. Next, we will analyze the suitable algorithms one by one.
Neural networks (NNs), while powerful, seem overly complex for this particular task: our dataset is small and relatively clean.
We can consider two algorithms: decision trees and logistic regression. A decision tree is an intuitive, easily interpreted model for multi-class classification: it automatically selects important features, splits the data accordingly, and makes the decision process easy to visualize. Logistic regression is also a very practical choice, especially for multi-class data, since it effectively models linear relationships between the features and the classes.
Boosted decision trees, although very effective in some scenarios, are better suited to nonparametric tasks, for example building rankings or combining many models. For our current task they offer no direct advantage.
Weighing these factors, we decided on logistic regression as our modeling approach.
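To make this comparison concrete, here is a rough sketch that cross-validates both candidates on the cleaned cuisine data. The file path and the `Unnamed: 0` index column are assumptions carried over from the previous section; adjust them to your setup:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed layout from the previous section: a cleaned CSV with a
# 'cuisine' label column and one column per ingredient.
cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
y = cuisines_df['cuisine']
X = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)

for name, clf in [
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("logistic regression", LogisticRegression(max_iter=1000)),
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```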
Since we are already fairly familiar with Scikit-learn, today we will focus on the parameter settings of the LogisticRegression method, since different configurations directly change the underlying algorithm's behavior. With a deeper understanding of these parameters, we can better tune the model for higher classification accuracy.
LogisticRegression Explained
When we run logistic regression with Scikit-learn, two crucial parameters are `multi_class` and `solver`. They directly affect the model's performance and applicability, so they deserve special attention.
| Parameter | Description | Possible values | Notes |
| --- | --- | --- | --- |
| `multi_class` | Specifies the multi-class strategy. | `ovr` (one-vs-rest), `multinomial` | `ovr` is the default for binary and multi-class problems; `multinomial` suits multi-class problems and usually requires a softmax output. |
| `solver` | The optimization algorithm used to fit the model weights. | `liblinear`, `newton-cg`, `lbfgs`, `sag`, `saga` | `liblinear` only works with `ovr`; `newton-cg`, `lbfgs`, `sag`, and `saga` can be paired with `multinomial`. |
Specifically (a short sketch comparing the two pairings follows this list):

- `multi_class` parameter:
  - `ovr` (one-vs-rest): converts the multi-class problem into multiple binary classification problems. For each class, a classifier is trained to separate that class from all the others.
  - `multinomial`: handles the multi-class problem directly, estimating probabilities with the softmax function; it usually works better when the number of classes is large.
- `solver` parameter:
  - `liblinear`: suited to small datasets and binary problems; supports `ovr`.
  - `newton-cg`: suited to multi-class problems; supports `multinomial`.
  - `lbfgs`: suited to multi-class problems; supports `multinomial` and usually performs better on larger datasets.
  - `sag`: for large datasets; supports `multinomial` and converges quickly.
  - `saga`: for large datasets; supports `multinomial` as well as both L1 and L2 regularization.
- Compatibility:
  - `liblinear` is only compatible with `ovr`.
  - `newton-cg`, `lbfgs`, `sag`, and `saga` are all compatible with both `multinomial` and `ovr`.
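As a rough illustration of these pairings (a sketch on synthetic data, not the cuisine dataset), the two strategies can be trained side by side:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ovr: one binary classifier per class; liblinear is a valid pairing
ovr = LogisticRegression(multi_class='ovr', solver='liblinear')
# multinomial: a single softmax model; needs newton-cg/lbfgs/sag/saga
softmax = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                             max_iter=1000)

for name, clf in [("ovr + liblinear", ovr), ("multinomial + lbfgs", softmax)]:
    print(name, clf.fit(X_train, y_train).score(X_test, y_test))
```

On toy data like this the two accuracies are usually close; the deeper difference is that `multinomial` models all classes jointly through softmax, while `ovr` fits each class independently against the rest.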
We already know a little about multinomial logistic regression, so let's explain in more detail what "ovr" (one-vs-rest) means and how it works in multi-class classification.
OvR in detail
"ovr" originally means One-vs-Rest (OvR) strategy, which involves splitting a multi-class dataset into multiple binary classification problems. Specifically, this approach allows us to utilize existing binary classification algorithms to solve more complex multi-class classification problems by treating each class as a separate binary classification task. For each binary classification problem, we will train a binary classifier and use the most reliable model among these classifiers for final prediction.
For example, consider a multi-class classification problem with three categories: "red", "blue" and "green". In this case, we can split this multi-class classification problem into the following three binary categorization datasets:
- Binary classification problem 1: determine whether a sample is "red", distinguishing it from the other two classes ("blue" and "green").
- Binary classification problem 2: determine whether a sample is "blue", distinguishing it from the other two classes ("red" and "green").
- Binary classification problem 3: determine whether a sample is "green", distinguishing it from the other two classes ("red" and "blue").
This split lets each classifier focus on a single class, simplifying model training. Since logistic regression itself is designed for binary classification, it cannot be applied directly to multi-class tasks. With the OvR strategy, however, we can use this heuristic to decompose the multi-class problem into multiple binary classification datasets and train a separate binary model on each.
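Scikit-learn also exposes this strategy as a generic wrapper, `OneVsRestClassifier`. Here is a minimal sketch of the red/blue/green example above, using made-up toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: two features, three color classes
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0],
              [0.2, 0.9], [0.5, 0.5], [0.6, 0.4]])
y = np.array(["red", "red", "blue", "blue", "green", "green"])

# OneVsRestClassifier trains one binary logistic regression per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))         # 3 binary classifiers, one per color
print(ovr.predict([[0.95, 0.15]]))  # the most confident classifier wins
```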
Choosing the parameters
Following the analysis above, our strategy can be either "ovr" (one-vs-rest) or "multinomial". Since logistic regression was originally designed for binary classification, either parameter choice lets it perform well on multi-class classification tasks.
In our model, we set the `multi_class` parameter to "ovr" and the `solver` to "liblinear" for training. We chose "liblinear" as the solver because it performs well on small datasets, especially in binary classification scenarios, offering fast convergence and good accuracy.
Split data
Next we load the cleaned data and split it into training and test sets.
```python
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
from sklearn.svm import SVC

cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
cuisines_label_df = cuisines_df['cuisine']
# Feature matrix: everything except the label column (and the index
# column written out when the cleaned CSV was saved)
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
We already have a fairly clear picture of this code's overall structure and logic, so we won't explain its implementation in detail here.
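One optional habit worth adopting before training: a quick sanity check of the split. This short sketch continues from the variables above:

```python
# Quick sanity check: shapes and per-class counts after the split
print(X_train.shape, X_test.shape)
print(y_train.value_counts())

# If the counts looked skewed, passing stratify=cuisines_label_df to
# train_test_split would preserve the class proportions in both sets.
```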
Build a model
Create a logistic regression model
```python
lr = LogisticRegression(multi_class='ovr', solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))

accuracy = model.score(X_test, y_test)
print("Accuracy is {}".format(accuracy))
```

The output:

```
Accuracy is 0.8015012510425354
```
After training, we see the model reaches roughly 80% accuracy!
Next, we will examine the results of this round of predictions in detail to assess the specific sources of accuracy.
```python
# Take one test sample and reshape it into a single-row feature matrix
test = X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)

# Rank the cuisines by predicted probability, highest first
topPrediction = resultdf.T.sort_values(by=[0], ascending=[False])
topPrediction.head()
```
The output is shown below; as you can see, the model predicts that this dish is most likely Japanese.
```
japanese    0.935524
indian      0.040175
korean      0.016106
thai        0.005825
chinese     0.002370
```
As in the earlier regression section, we can likewise generate a classification report to get more detailed information about the model.
```python
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
```
A detailed report is provided below:
```
              precision    recall  f1-score   support

     chinese       0.73      0.71      0.72       238
      indian       0.92      0.87      0.89       243
    japanese       0.79      0.79      0.79       237
      korean       0.84      0.81      0.82       236
        thai       0.75      0.84      0.79       245

    accuracy                           0.80      1199
   macro avg       0.80      0.80      0.80      1199
weighted avg       0.80      0.80      0.80      1199
```
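Since `confusion_matrix` was already imported at the top, a quick sketch (continuing from the variables above) shows where the misclassifications concentrate:

```python
# Rows are true cuisines, columns are predicted cuisines
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))
```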
At this point, the process of constructing this model has been successfully completed!
Summary
Although today's analysis still used the familiar methods from before, we added two important new topics. First, we dug into classifier selection, analyzing the applicability and performance of different classifiers on a specific task. This not only helps us better understand the strengths and limitations of each family of algorithms, but also provides theoretical support for future model selection.
Second, we analyzed the parameter settings of logistic regression in detail, studying the "ovr" (one-vs-rest) strategy in particular. Understanding this strategy makes it clearer how to apply logistic regression effectively to multi-class classification problems and how to improve the model's accuracy and generalization.
Next, we will continue to learn more about the characteristics and applications of various classifiers to improve overall classification performance.
I'm Rain, a Java server-side developer exploring the mysteries of AI technology. I love technical exchange and sharing, and I'm passionate about the open-source community. I am also a Tencent Cloud Creative Star, an Alibaba Cloud Expert Blogger, a Huawei Cloud Sharing Expert, and a Nuggets Outstanding Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟