First of all, I'd like to introduce you to a very useful learning resource: /columns
Today we will take the cuisine data we cleaned in the first chapter and train a variety of classifiers on it, with the goal of building an effective model. Along the way, I will explain the principles behind each classifier and why they matter.
Strictly speaking, these concepts are not essential for practice, since third-party packages have already encapsulated most of the work for us and these functions can be called with a single line of code. Still, understanding the rationale behind them is crucial: it helps us grasp how a model behaves and where to improve it in practice.
Classification roadmap
In the previous section, we looked at Microsoft's cheat sheet and translated it into Chinese, which I hope was helpful. Today, we will explore a similar cheat sheet provided by Scikit-learn, which is much more granular and informative. This cheat sheet will not only help you find relevant information quickly, but will also provide practical guidance when tuning your estimators (another term for classifiers).
The original address of the quick reference table:/stable/machine_learning_map.html
The 😭 emoji should be read as "If this estimator does not achieve the expected result, follow the arrows to try the next one".
We then choose our classifier based on this roadmap:
- We have over 50 samples
- We want to predict a category
- We have labeled data.
- Our sample size is less than 100,000.
- ✨ We can choose a linear SVC
- If that doesn't work, since we have numeric data and no text data:
  - We can try a ✨ K-nearest neighbors classifier
  - If that doesn't work, try ✨ SVC and ✨ ensemble classifiers
Model building
The next step is to split the data into a training set and a test set.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, classification_report, precision_recall_curve
import numpy as np
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_label_df = cuisines_df['cuisine']
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
Linear SVC Classifier
According to our learning roadmap, as a first step we need to try the linear SVC classifier. Before we dive into SVC, it's necessary to understand a little bit about Support Vector Machines (SVMs). This is because when you search for SVC, you will usually find tons of information about SVMs, so it is important to understand the difference between the two.
Next, we will explore the differences between the two, as well as their respective characteristics and application scenarios, in order to lay a more solid foundation for subsequent learning.
Support Vector Machines (SVM)
Support Vector Machines are a powerful supervised learning algorithm for classification and regression problems. SVM is a broad concept that covers both classification and regression, while SVC is a specific application of SVM dedicated to classification tasks. In machine learning libraries such as Scikit-learn, SVC is usually the name of the classifier class that implements SVM.
There are several subclasses of Support Vector Machines (SVMs), the main ones being: Support Vector Classification (SVC), Support Vector Regression (SVR), One-Class SVM, Multi-Class SVM, and Probabilistic SVM. These subclasses allow support vector machines to excel in different types of tasks. Each subclass has its own unique implementation and optimization for a specific application scenario.
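To make this concrete, here is a minimal sketch (assuming Scikit-learn's sklearn.svm module, which is what we use in this article) showing that the classification and regression variants live side by side; the parameters are illustrative only.
from sklearn.svm import SVC, LinearSVC, SVR

# Classification members of the SVM family
clf_general = SVC(kernel='rbf')   # supports linear and non-linear kernels
clf_linear = LinearSVC()          # linear kernel only, optimized for large feature spaces

# Regression member of the SVM family
reg = SVR(kernel='rbf')           # predicts continuous values instead of classes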
An example
If understanding linear SVC is a bit difficult, let's take an example:
Imagine you are in a school with two classes: a math class and an art class. Each student has different characteristics, such as their math grades and their art grades. You want to assign students to these two classes based on these two grades.
- Data points:
  - Each student can be represented by a point on a graph: the X1 axis represents math scores and the X2 axis represents art scores.
  - Students in the math class typically have high math scores and relatively low art scores.
  - Students in the art class usually have high art scores and relatively low math scores.
- Classification objective: you want to find a line (a "dividing line") that separates the two classes.
- Maximum margin: SVC looks for the dividing line that maximizes the distance between the line and the nearest students (the support vectors). This way, even if new students fall somewhere in between on math and art scores, the line still classifies them reliably.
- Handling complex cases: suppose a student is average in both math and art. It may be impossible to cleanly separate this student with a single line, but SVC can handle this by "adding dimensions". Imagine a third dimension, such as athletic performance; in this three-dimensional space, SVC can find a more complex decision surface that separates the two classes better.
Through this analogy, the core of SVC is:
- It finds an optimal way to separate groups (e.g., classes).
- It makes the classification more robust by maximizing the distance to the nearest students (the support vectors).
- When the data is complex, it can use additional features (e.g., athletic performance) to classify more accurately.
To make this easier to understand, I found a picture to illustrate it: H3 is the best dividing line.
Let's take a look at the code directly. Despite implementing such complex functionality, the structure of the code is relatively straightforward.
C = 10
# Creating different classifiers
classifiers = {
'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)
}
n_classifiers = len(classifiers)
for index, (name, classifier) in enumerate(classifiers.items()):
    classifier.fit(X_train, np.ravel(y_train))
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
    print(classification_report(y_test, y_pred))
The results of the run are as follows. At first glance, the performance is quite good, with an accuracy approaching 80%.
Accuracy (train) for Linear SVC: 78.7%
precision recall f1-score support
chinese 0.69 0.80 0.74 242
indian 0.89 0.84 0.87 239
japanese 0.73 0.71 0.72 223
korean 0.90 0.75 0.82 250
thai 0.76 0.83 0.79 245
accuracy 0.79 1199
macro avg 0.79 0.79 0.79 1199
weighted avg 0.80 0.79 0.79 1199
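Since confusion_matrix is already imported above, one optional way to see which cuisines get confused with one another is to tabulate it for the same predictions. A minimal sketch, assuming the y_pred from the loop above:
# Optional: rows are true cuisines, columns are predicted cuisines.
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))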
K-Nearest Neighbor Classifier
K-Nearest Neighbors (KNN) is a simple and intuitive supervised learning algorithm used mainly for classification and regression tasks. Its basic idea is to assign a data point to the category that is most common among its K nearest neighbors, determined by computing the distances between points.
An example
In social networks, we can often see how certain users are connected to specific interest groups. For example, if we want to determine where a new user's interests lie, we can look at the friends around him.
When we analyze this new user's social network, we can look at the K friends closest to him. The common interests and activities of these friends can give us clues. If most of these K friends are into photography, then we can assume that the new user is probably interested in photography as well.
In this way, we use the influence of friends in social networks to determine a user's interests and hobbies to better recommend content and connections for him.
In other words, KNN infers which category the current data point belongs to from its nearest-neighbor samples.
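The choice of K (the number of neighbors) matters a lot for KNN. As a rough sketch, assuming the X_train / y_train split from earlier and the cross_val_score already imported, you could compare a few candidate values before settling on one (the candidate list here is arbitrary):
# Sketch: compare a few candidate K values with 5-fold cross-validation.
for k in [3, 5, 10, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print(f"K={k}: mean accuracy {scores.mean():.3f}")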
classifiers = {
'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0),
'KNN classifier': KNeighborsClassifier(C),
}
The rest of the code does not need to be modified; we just add a KNN classifier to the dictionary (note that KNeighborsClassifier(C) reuses C = 10 as the number of neighbors). Looking at the results of the run, this change did not have a significant effect: introducing the KNN classifier does not seem to have improved the overall performance much.
Accuracy (train) for KNN classifier: 74.1%
precision recall f1-score support
chinese 0.70 0.76 0.73 242
indian 0.88 0.78 0.83 239
japanese 0.64 0.82 0.72 223
korean 0.94 0.54 0.68 250
thai 0.68 0.82 0.74 245
accuracy 0.74 1199
macro avg 0.77 0.74 0.74 1199
weighted avg 0.77 0.74 0.74 1199
Support Vector Classifier (SVC)
Support Vector Classification (SVC) and Linear SVC are both implementations of the Support Vector Machine (SVM), but they differ in some key aspects:
- Kernel functions
  - SVC: can use a variety of kernel functions (e.g., linear kernel, polynomial kernel, radial basis function kernel) and handles both linearly and nonlinearly separable datasets.
  - Linear SVC: uses only a linear kernel and focuses on linearly separable data. It does not perform kernel transformations during optimization.
- Applicability
  - SVC: suitable for more complex datasets, capable of capturing nonlinear decision boundaries.
  - Linear SVC: suitable for data in high-dimensional feature spaces, especially when the feature dimension is larger than the number of samples (e.g., text classification tasks).
If you know that the data is linearly separable or the feature dimension is very high, Linear SVC may be more efficient. If the data has nonlinear relationships, SVC with an appropriate kernel function is the better choice.
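To see the kernel difference in isolation on this dataset, a small sketch (reusing the split from earlier; the C value is just the one we used above) is to fit the same SVC class with a linear and an RBF kernel and compare accuracies:
# Sketch: same estimator class, different kernels.
for kernel in ['linear', 'rbf']:
    model = SVC(kernel=kernel, C=10)
    model.fit(X_train, y_train)
    print(f"{kernel} kernel accuracy: {model.score(X_test, y_test):.3f}")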
classifiers = {
'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0),
'KNN classifier': KNeighborsClassifier(C),
'SVC': SVC(),
}
We will continue with the results of the SVC run, which performs noticeably better.
Accuracy (train) for SVC: 81.7%
precision recall f1-score support
chinese 0.73 0.82 0.77 242
indian 0.89 0.90 0.89 239
japanese 0.81 0.73 0.76 223
korean 0.90 0.77 0.83 250
thai 0.79 0.86 0.82 245
accuracy 0.82 1199
macro avg 0.82 0.82 0.82 1199
weighted avg 0.82 0.82 0.82 1199
Ensemble classifiers
Although the results so far have been quite satisfactory, we have decided to follow the roadmap to the end in order to evaluate the models fully. Therefore, we will try some ensemble classifiers, specifically Random Forest and AdaBoost.
Random forest
A random forest uses many decision trees to make a judgment, ensures diversity by randomly selecting samples and features, and finally combines the results of these trees to improve overall accuracy and stability. This approach makes the model more reliable and better able to adapt to complex data.
Imagine you want to decide what to have for lunch today, but you have a lot of friends, each with different tastes. Here are a few things you can do:
- Ask multiple friends: you ask each of your friends for advice instead of listening to just one person. This way, you get a variety of different options.
- Randomly select friends: instead of asking all your friends every time, you pick a few at random for their opinions. This avoids being dominated by the strong preferences of individual friends.
- Remember what they said: you note each friend's suggestion; for example, a couple of friends recommended pasta and a couple recommended sushi.
- Final decision: in the end, you choose the meal recommended by the most people.
Key knowledge points
- Diversity: random forests work by "consulting" many decision trees (like your different friends), each trained on different samples and features.
- Reduced error: with the advice of multiple "friends", the final choice (prediction) is usually more accurate and less susceptible to the error of a single decision.
- Voting mechanism: the predictions of all the decision trees are collected, and the result with the most votes is chosen, just as if your friends had voted.
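A practical bonus of random forests is that, after fitting, Scikit-learn exposes feature_importances_, which on this dataset would show which ingredients the "friends" rely on most. A minimal sketch, assuming the training split from earlier:
# Sketch: which ingredients does the forest consider most informative?
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))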
AdaBoost
The core idea of AdaBoost is to improve the overall predictive power by combining the power of multiple weak classifiers through continuous learning and tuning. Each iteration focuses on previous errors, leading to incremental improvements in the model.
Imagine you are competing in a debate competition at school, but you are a beginner and probably not very good at debating. You decide to ask some of your classmates for advice; they are at different levels of debating, some are very good and some are just starting out.
- Ask different classmates: you enlist the help of several classmates, each with a different point of view and approach. Each classmate is likely to be better than you at something.
- Learning and adaptation: in the first practice round, there may be some arguments you didn't make well enough, and these classmates will give you feedback on what to improve. You remember these mistakes and pay more attention the next time you practice.
- Emphasize feedback: the arguments you got wrong before are given more "weight". This means that in the next exercises, you pay special attention to these issues to make sure you do better.
- Combine the advice: at the end of each practice round, you combine the advice of all your classmates. Although each classmate has different abilities, by constantly adjusting and synthesizing their advice, you eventually become stronger.
Key knowledge points
- Progressive improvement: AdaBoost is just such a process. It works by combining multiple "weak classifiers" (like your different classmates), none of which may be particularly strong on its own, but through constant tuning and learning they together form a stronger "strong classifier".
- Emphasize mistakes: AdaBoost pays special attention to samples that were previously incorrectly predicted, by increasing their weights to allow subsequent classifiers to pay more attention to these hard-to-categorize samples.
- Weighted combination: in the end, the results of all the classifiers are combined, as if you had synthesized the opinions of all your classmates, and the final prediction is decided by voting or weighting.
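You can actually watch this incremental improvement: Scikit-learn's AdaBoostClassifier exposes staged_predict, which yields the ensemble's predictions after each boosting round. A minimal sketch, reusing the earlier split (the printing interval is arbitrary):
# Sketch: watch test accuracy evolve as boosting rounds are added.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
for i, y_stage in enumerate(ada.staged_predict(X_test), start=1):
    if i % 25 == 0:  # print every 25 rounds to keep the output short
        print(f"after {i} rounds: accuracy {accuracy_score(y_test, y_stage):.3f}")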
Differences and connections
Random forest: each tree is relatively independent, and the combined model is usually more robust. Because the correlation between trees is low, the variance is reduced. Random forests typically perform well when the number of features is very large or the dataset is complex, and they are suitable for both classification and regression tasks.
AdaBoost: the classifiers are built serially; each new classifier depends on the results of the previous one and builds on the previous model. Because it gives higher weight to misclassified samples, AdaBoost is more susceptible to noisy data. It is better suited to tasks that require higher accuracy on relatively clean data.
While Random Forest focuses more on diversity and independence, AdaBoost improves overall performance by focusing on the hard parts.
Implementing them takes just two more lines in our classifiers dictionary:
# Creating different classifiers
classifiers = {
'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0),
'KNN classifier': KNeighborsClassifier(C),
'SVC': SVC(),
'RFST': RandomForestClassifier(n_estimators=100),
'ADA': AdaBoostClassifier(n_estimators=100)
}
For the random forest, n_estimators means that 100 decision trees will be generated. Each tree is trained on randomly selected samples and features. Increasing the number of trees usually makes the model's performance more stable and less prone to overfitting.
For AdaBoost, n_estimators means that 100 weak classifiers (usually shallow decision trees) will be created. Each classifier is trained based on the performance of the previous one, focusing on the samples that were previously misclassified. Increasing the number of classifiers usually improves performance, but may also increase the risk of overfitting.
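If you want to check how sensitive the random forest is to n_estimators before committing to 100, one option (the candidate values here are arbitrary) is to score a few settings with the cross_val_score imported earlier:
# Sketch: rough sensitivity check for the number of trees.
for n in [10, 50, 100, 200]:
    scores = cross_val_score(RandomForestClassifier(n_estimators=n), X_train, y_train, cv=5)
    print(f"n_estimators={n}: mean accuracy {scores.mean():.3f}")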
The results of the run are as follows:
Accuracy (train) for RFST: 83.4%
precision recall f1-score support
chinese 0.78 0.80 0.79 242
indian 0.89 0.91 0.90 239
japanese 0.81 0.78 0.80 223
korean 0.89 0.81 0.85 250
thai 0.80 0.87 0.83 245
accuracy 0.83 1199
macro avg 0.83 0.83 0.83 1199
weighted avg 0.84 0.83 0.83 1199
Accuracy (train) for ADA: 69.5%
precision recall f1-score support
chinese 0.66 0.48 0.56 242
indian 0.88 0.79 0.84 239
japanese 0.66 0.64 0.65 223
korean 0.67 0.73 0.70 250
thai 0.64 0.82 0.72 245
accuracy 0.69 1199
macro avg 0.70 0.69 0.69 1199
weighted avg 0.70 0.69 0.69 1199
Up to this point, we have taken a comprehensive look at all the classifiers that can be applied throughout the roadmap. While in terms of code implementation this may only involve a single line of code or the tuning of a few parameters, behind the scenes we still need to have a general understanding of the underlying logic and principles of these models.
Summary
In this learning journey, we not only delved into the principles and applications of various types of classifiers, but also deepened our understanding of the model building process through practice. By using different algorithms such as linear SVC, K-nearest neighbor classifiers, support vector classifiers and integrated methods such as random forest and AdaBoost, we have seen the diversity of data processing and model training. At each step of the exploration, we not only focused on the accuracy, but also thought about the applicability scenarios of each algorithm and its advantages. While the use of machine learning libraries simplifies the complexity of the implementation, the logic and mechanisms behind each model are still important to understand in depth.
In addition, we learned that a model's performance depends not only on the chosen algorithm, but also on the characteristics of the data, the quality of the preprocessing, and the tuning of the parameters. When facing different datasets, the flexibility to choose suitable classifiers and tune their parameters is the key to improving a model's effectiveness. Scikit-learn's cheat sheet lets us quickly locate an appropriate algorithm, which provides effective guidance for practical applications.
And of course, I'd like to end by encouraging you to check out this learning platform, which is perfect for newbies and offers a wealth of resources and easy-to-follow tutorials.
I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟