Classification algorithms
Classification algorithms are an important branch of supervised learning, and they resemble regression algorithms in many ways. The core goal of supervised learning is to make predictions from existing data, either numerical or categorical. Specifically, classification algorithms assign input data to different classes, and the problems they solve usually fall into two main categories: binary classification and multiclass classification.
Understanding this process is not complicated. Consider, for example, how to separate a collection of emails into normal mail and spam. The same principle shows up in everyday life, for instance in the use of sorting bins. Whether we are sorting mail or filtering spam, the end result is a clear qualitative distinction between different types of content.
Although this problem seems relatively simple, one question is worth pausing on: logistic regression was covered in our earlier study of regression algorithms, so why is logistic regression categorized as a classification algorithm rather than a regression algorithm?
Logistic regression: regression vs. classification
Despite the word "regression" in its name, logistic regression actually performs classification tasks. Therefore, it is categorized as a classification algorithm.
When we explained logistic regression, we also covered one mathematical prerequisite: the Sigmoid function. To review, this function has a few key characteristics: it compresses any real-valued input into the interval [0, 1], and its curve passes through the important point (0, 0.5). Because the output is squeezed into [0, 1], we can use 0.5 as a threshold: outputs greater than 0.5 are assigned to one class, and outputs less than 0.5 to the other.
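As a quick illustration (a minimal sketch added here, not part of the original example), this is what the Sigmoid function and the 0.5 threshold rule look like in plain NumPy; the sample scores are made up:

import numpy as np

def sigmoid(z):
    # squeeze any real number into (0, 1); sigmoid(0) == 0.5
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.0, 1.5])    # hypothetical linear-model outputs
probs = sigmoid(scores)                # roughly [0.12, 0.50, 0.82]
labels = (probs > 0.5).astype(int)     # threshold at 0.5 -> [0, 0, 1]
print(probs, labels)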
Of course, logistic regression can also be extended to multiclass classification, but we won't go into that here.
Today, we are going to work through an official example that explains this in detail. The example analyzes a set of ingredients to determine which country's cuisine a dish belongs to. This is clearly a multiclass problem, since several different countries are involved.
Prepare data
Next, we turn to today's main task: data preparation. From the previous chapters on regression, we have already mastered the basic steps of data cleansing, which cover three main aspects: reading the file data, deleting redundant fields, and removing null values.
Today, we will still follow these steps, but our focus will be on the balance of the data. In real datasets, some categories may be over-represented while others are under-represented, and this imbalance can significantly bias subsequent classification predictions.
Therefore, we will pay special attention to how the data can be reasonably balanced to improve the predictive performance of the model.
Import data
We will continue to use the official example data provided for import.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE

# the path below is truncated in the original; point it at the cuisines CSV used in the example
df = pd.read_csv('../data/')
df.head()
Here, we show a few sample rows of data, where 0 and 1 indicate whether a given ingredient is used in a particular dish: 0 means the ingredient is not used, and 1 means it is. Since the dataset contains roughly 384 ingredient columns, the middle portion of the output is collapsed for brevity.
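As a small optional sanity check on this layout (not part of the original walk-through, and assuming the 'cuisine' and 'Unnamed: 0' columns referenced later in the code):

# everything except the label and index columns is a 0/1 ingredient flag
ingredient_cols = df.columns.drop(['cuisine', 'Unnamed: 0'])
print(df.shape)                          # rows are recipes, most columns are ingredient flags
print(df.loc[0, ingredient_cols].sum())  # number of ingredients used in the first recipe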
Next, we will demonstrate the imbalance that exists in the dataset. Since our main interest is predicting which country a dish comes from, it is crucial that the amount of data for each country be as consistent as possible. Let's view the current distribution of the data.
df.cuisine.value_counts().plot.barh()   # recipe counts per cuisine
Since the country data are stored in the "cuisine" column, we only need to count the values in this column, as shown in the figure:
Removing data
Since every row of data (every recipe) matters to our analysis, what we actually remove before balancing the data are the uninformative ingredient columns. Why are some ingredients useless?
Take rice as an example: rice appears as an ingredient in dishes from every region, so it does little to distinguish one cuisine from another and only makes it harder to predict which country a dish belongs to. Therefore, to improve the prediction accuracy of the model, we first remove these confusing ingredient columns.
After that, the data for each country is checked for balance, so that the model can better learn to differentiate between dishes from different countries.
def create_ingredient_df(df):
    # transpose so ingredients become rows, then drop the label and index rows
    ingredient_df = df.T.drop(['cuisine', 'Unnamed: 0']).sum(axis=1).to_frame('value')
    # keep only ingredients that actually appear at least once
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    # sort ingredients by how often they are used, most frequent first
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False, inplace=False)
    return ingredient_df
To this end, we encapsulate the separate function shown above; let me now explain its design and implementation in detail.
The main purpose of this function is to receive a DataFrame df as an input parameter and process that DataFrame to create a new DataFrame ingredient_df containing the frequency of occurrence of each type of ingredient.
In terms of concrete logic, the function works in a few steps. First, it transposes the input so that ingredients become rows, and drops the rows carrying the unneeded country and serial-number information. Next, it sums across each remaining ingredient row to compute the total frequency of occurrence of each ingredient. Finally, it filters out ingredients with zero frequency, sorts the rest by frequency in descending order, and returns the processed DataFrame ingredient_df.
Next, we examine the dishes of each country one by one, looking for ingredients that recur across cuisines and are used relatively frequently, since those contribute little to telling the cuisines apart.
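One detail the excerpt does not show is how the per-cuisine DataFrames (thai_df, japanese_df, chinese_df, indian_df, korean_df) used below are created. A minimal sketch, assuming they are simple filters on the cuisine column of df:

# hypothetical construction of the per-cuisine frames used below
thai_df = df[df.cuisine == 'thai']
japanese_df = df[df.cuisine == 'japanese']
chinese_df = df[df.cuisine == 'chinese']
indian_df = df[df.cuisine == 'indian']
korean_df = df[df.cuisine == 'korean']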
# plot the ten most frequently used ingredients for each cuisine
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
In this way, we can visualize and analyze the data. For ease of understanding and comparison, we will focus on two of the charts here; feel free to view and analyze the others on your own.
A ranking of the number of ingredients used in Chinese cuisine:
Ranking the number of ingredients used in Indian dishes:
During data processing, we noticed that "garlic" and "ginger" appear very frequently in the dataset. To keep these high-frequency ingredients from interfering with the analysis, we decided to remove their columns.
In addition, a comparative look at the other countries' dishes showed that "rice" is also a highly recurring ingredient. To keep the analysis accurate and the remaining features informative, we finally decided to drop the columns for these three ingredients (garlic, ginger, and rice) from the dataset.
# drop the label column, the index column, and the three uninformative ingredient columns
feature_df = df.drop(['cuisine', 'Unnamed: 0', 'rice', 'garlic', 'ginger'], axis=1)
# the cuisine column serves as the label
labels_df = df.cuisine
feature_df.head()
Next, we can look at the current distribution of data and see that it is still unbalanced.
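To put numbers on that claim, a quick check (a small addition here, using the labels_df defined above) could be:

# recipe counts per cuisine before balancing; the gap between the largest and smallest class is the problem
print(labels_df.value_counts())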
Balancing data
Here we introduce a new method: SMOTE (Synthetic Minority Over-sampling Technique).
SMOTE is a technique commonly used in machine learning, especially for classification problems in which one class has far fewer samples than the others. In plain terms, it synthesizes new minority-class samples based on the characteristics of the existing data, with the goal of solving the data-imbalance problem.
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
After generating the balanced data, we merge the cuisine label data with the ingredient data:
transformed_df = pd.concat([transformed_label_df, transformed_feature_df], axis=1, join='outer')
We end up with a much more balanced distribution of data, which benefits model training:
transformed_df.cuisine.value_counts().plot.barh()   # recipe counts per cuisine after SMOTE
The data preparation is all but complete, and we will then delve into how to use this data to build an effective model in the next section.
Summary
Classification algorithms play a crucial role in the fields of data science and machine learning, not only helping us to extract meaningful information from complex data, but also enabling us to make more accurate decisions in real-world applications.
Working through this article, we not only came to understand the core logic of classification algorithms, but also cleaned and balanced the data through a series of systematic steps, laying a solid foundation for the analysis that follows.
In the next step, we will build and optimize the model based on the cleaned data in this chapter.
I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟