In the previous section, we focused on visual analytics for clustering to help us better understand the relationships and structure of our data. Today we move on to the hands-on part: using k-means, a classic clustering algorithm, to train a model and assign clusters to our data. Let's get right to it.
Building the model
Before proceeding to data cleaning, let's briefly review the core ideas of the K-means clustering algorithm. K-means iteratively refines a set of centroids so that samples within the same cluster become more similar to each other while samples in different clusters become more clearly separated, which is what produces an effective clustering. The algorithm also has two well-known drawbacks: first, it is very sensitive to outliers, which can pull the computed centroids far away from where most of the data lies; second, the number of centroids K must be chosen in advance, and that choice directly affects the quality of the resulting clusters.
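To make the "iterate on the centroids" idea concrete, here is a bare-bones NumPy sketch of the two steps K-means alternates between. It is illustrative only (it assumes X is a plain 2-D NumPy array of samples), not the scikit-learn implementation we will actually use later:

```python
import numpy as np

# Minimal sketch of the K-means loop (illustration only, not sklearn's code):
# 1) assignment step: give every sample to its nearest centroid,
# 2) update step: move each centroid to the mean of its assigned samples.
def kmeans_sketch(X, k, n_iter=10, seed=42):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every sample to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid (keep the old one if a cluster ends up empty)
        centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
                              for c in range(k)])
    return labels, centroids
```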
Fortunately, despite these challenges, there are methods that help us analyze the data and select a suitable K. Next, we will clean the data to prepare it for the K-means clustering algorithm.
Data preparation
First, we need to clean the data by removing unnecessary fields and features that contain many outliers. During K-means training, useless features and outliers interfere with the model and reduce the accuracy and validity of the clustering. For this purpose, we will use box plots, an intuitive and effective tool for identifying and handling outliers.
Box plot
A box plot summarizes five key values: the minimum observed value (lower whisker), the 25th percentile (Q1), the median, the 75th percentile (Q3), and the maximum observed value (upper whisker). Together these statistics describe both the central tendency and the spread of the data.
When analyzing data, any outliers fall outside the range spanned by the minimum and maximum observed values. In a box plot they are drawn as individual dots, which makes them easy to spot and deal with visually. These points deserve special attention because they can distort the subsequent K-means clustering. The other values in the plot, such as the quartiles and the median, are less important for now; our focus is on identifying and handling the outliers to ensure data quality and a sound cluster analysis.
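As a concrete illustration (this snippet is my own addition, not part of the original walkthrough), the dots a box plot marks as outliers are simply the values lying outside Q1 - 1.5*IQR and Q3 + 1.5*IQR, which is easy to reproduce with pandas:

```python
import pandas as pd

# Compute the IQR-based bounds that a box plot's whiskers correspond to.
def iqr_bounds(series: pd.Series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Example usage once the DataFrame df below has been loaded
# (the 'popularity' column comes from the dataset used in this post):
# low, high = iqr_bounds(df['popularity'])
# outliers = df[(df['popularity'] < low) | (df['popularity'] > high)]
```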
Data Cleaning
Before you start cleaning the data, make sure all the relevant dependency packages are installed so that the following steps run smoothly.
pip install seaborn scikit-learn
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("../data/")
df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
This code follows directly from the analysis in the previous section, where we found that these three genres account for the largest share of the data. We therefore keep only the rows belonging to them and drop the data from all other genres, so the subsequent analysis and processing can stay focused.
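If you want to verify the result of this filter yourself, a quick sanity check (my own addition) is to confirm that only the three genres remain and that every retained track has a positive popularity:

```python
# Only 'afro dancehall', 'afropop' and 'nigerian pop' should appear here,
# and the minimum popularity should be greater than 0.
print(df['artist_top_genre'].value_counts())
print(df['popularity'].min())
```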
plt.figure(figsize=(20, 20), dpi=200)

plt.subplot(4, 3, 1)
sns.boxplot(x='popularity', data=df)
plt.subplot(4, 3, 2)
sns.boxplot(x='acousticness', data=df)
plt.subplot(4, 3, 3)
sns.boxplot(x='energy', data=df)
plt.subplot(4, 3, 4)
sns.boxplot(x='instrumentalness', data=df)
plt.subplot(4, 3, 5)
sns.boxplot(x='liveness', data=df)
plt.subplot(4, 3, 6)
sns.boxplot(x='loudness', data=df)
plt.subplot(4, 3, 7)
sns.boxplot(x='speechiness', data=df)
plt.subplot(4, 3, 8)
sns.boxplot(x='tempo', data=df)
plt.subplot(4, 3, 9)
sns.boxplot(x='time_signature', data=df)
plt.subplot(4, 3, 10)
sns.boxplot(x='danceability', data=df)
plt.subplot(4, 3, 11)
sns.boxplot(x='length', data=df)
plt.subplot(4, 3, 12)
sns.boxplot(x='release_date', data=df)
plt.show()
We can draw a box plot for each column directly; together they give a clear view of each feature's distribution and its outliers, as shown in the figure:
Next, we drop the features whose box plots show many outliers so that we can focus on the main trends and characteristics of the data. In the end, we keep only the features whose box plots are shown below:
Model training requires numerical features, so the remaining text feature (the genre) needs to be encoded as numbers. We covered encoding methods earlier, so we won't repeat the details here. Below is the code that performs the transformation:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
X = df.loc[:, ('artist_top_genre', 'popularity', 'danceability', 'acousticness', 'loudness', 'energy')]
y = df['artist_top_genre']
X['artist_top_genre'] = le.fit_transform(X['artist_top_genre'])
y = le.transform(y)
K-Means clustering
Although the observations in the previous section led us to keep three genres, the data itself does not tell us directly how many natural clusters it contains. Therefore, to determine the best number of centroids, we need to dig deeper with the help of an elbow diagram and find the most appropriate cluster setting.
Elbow Diagram
The Elbow Method is a commonly used technique for choosing the number of clusters (K) in K-Means. It works by examining how the clustering quality changes across different values of K, and it has the advantage of being intuitive, easy to understand, and effective at guiding the choice of K.
Typically, the within-cluster sum of squared errors (SSE) decreases as K increases, but beyond a certain K the reduction becomes much smaller. That turning point is known as the "elbow": it marks where the benefit of adding more clusters starts to diminish, and therefore suggests the optimal number of clusters. Next, we plot the elbow diagram to visualize this process.
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(10, 5))
sns.lineplot(x=list(range(1, 11)), y=wcss, marker='o', color='red')
plt.title('Elbow')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Let me briefly explain this code. Its main purpose is to iterate over different values of K, compute the Within-Cluster Sum of Squares (WCSS) for each, and store the results so that we can later draw the elbow diagram and pick the optimal number of clusters.
- Loop from 1 to 10 (range(1, 11) includes 1 but not 11), i.e. test 1 to 10 clusters.
- init = 'k-means++': use K-means++ initialization method to improve the quality of clustering results.
- random_state = 42: Set a random seed to ensure that the results are reproducible for each run. Otherwise, even with the same data, the results will be different for each run.
- Append the current model's inertia_ attribute to the wcss list. inertia_ is a KMeans attribute holding the sum of squared distances of samples to their closest centroid; smaller values mean tighter clusters.
After plotting the elbow diagram, as shown in the figure, we can clearly see how the WCSS changes with K. The curve shows that the drop in error slows markedly at K = 3, forming a clear turning point. This suggests that K = 3 is the best choice: beyond this point, adding more clusters no longer improves the clustering noticeably.
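If you prefer a numeric companion to the visual elbow (this check is my own addition and simply reuses the wcss list computed above), you can print how much each extra cluster reduces the WCSS; the percentage drop should shrink sharply after K = 3:

```python
# Percentage reduction in WCSS when going from k-1 clusters to k clusters.
for k, (prev, curr) in enumerate(zip(wcss, wcss[1:]), start=2):
    drop = (prev - curr) / prev * 100
    print(f"k={k}: WCSS drops by {drop:.1f}% compared to k={k-1}")
```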
Training the model
Next, we apply the K-Means clustering algorithm with the number of centroids set to 3 and evaluate how well the resulting clusters line up with the genre labels. Below is the code that implements this process:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
kmeans.fit(X)
labels = kmeans.labels_

correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))
Result: 105 out of 286 samples were correctly labeled.
Accuracy score: 0.37
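One caveat worth noting (my own observation, not part of the original walkthrough): K-Means assigns arbitrary cluster ids 0, 1, 2, which need not coincide with the label encoder's genre ids, so the direct comparison above can undercount matches even when the clusters themselves are sensible. A sketch of a fairer check is to remap each cluster to the majority genre inside it before scoring:

```python
import numpy as np

# Hypothetical helper: map every cluster id to the most frequent true class
# inside that cluster, then score the remapped labels against y.
def remap_clusters(cluster_labels, y_true, n_clusters=3):
    y_true = np.asarray(y_true)
    mapping = {}
    for c in range(n_clusters):
        mask = cluster_labels == c
        if mask.any():
            mapping[c] = np.bincount(y_true[mask]).argmax()
    return np.array([mapping[c] for c in cluster_labels])

remapped = remap_clusters(labels, y)
print('Accuracy after remapping clusters to genres: {0:0.2f}'.format((remapped == y).mean()))
```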
Disappointingly, the model is barely better than random guessing. This result sent us back to the box plots we analyzed earlier: besides the features we had already dropped as clearly abnormal, many of the retained features still contain outliers, and their value ranges differ widely. Both of these can skew K-Means, so we decided to standardize the features to improve the stability and accuracy of the model:
from sklearn.preprocessing import StandardScaler

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans.fit(X_scaled)
labels = kmeans.labels_

correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}'.format(correct_labels / float(y.size)))
Result: 163 out of 286 samples were correctly labeled.
Accuracy score: 0.57
When using K-Means clustering, standardizing the data with StandardScaler often improves the result noticeably. StandardScaler rescales every feature to mean 0 and standard deviation 1, so all features are compared on the same scale. This removes the effect of differing value ranges and prevents features with large ranges from dominating the distance calculation.
Through standardization, each feature contributes more fairly to the clustering result, which improves the overall performance and accuracy of the K-Means algorithm.
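To see the effect in numbers (again, my own illustrative snippet), you can compare the raw feature scales with the standardized ones; the raw columns span very different ranges, while after scaling every column sits at roughly mean 0 and standard deviation 1:

```python
import numpy as np

# Raw scales: 'loudness' is in (negative) decibels while 'danceability' lives in [0, 1].
print(X[['popularity', 'danceability', 'acousticness', 'loudness', 'energy']].describe().loc[['mean', 'std']])

# After StandardScaler, every column is centered and unit-scaled.
print(np.round(X_scaled.mean(axis=0), 3))  # approximately 0 for each column
print(np.round(X_scaled.std(axis=0), 3))   # approximately 1 for each column
```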
Summary
In this post, we worked through the K-means clustering algorithm and its application to data analysis, in particular how to clean and prepare data to improve the clustering result. Using box plots, we identified and removed outlier-heavy features, laying a solid foundation for the cluster analysis. To choose a suitable number of centroids, we applied the elbow method and found the optimal K.
Although the initial model's accuracy was not ideal, standardizing the data raised it to 57%. This exercise demonstrates the basic principles of K-means clustering and, just as importantly, the value of data preprocessing: clean, well-scaled data makes the model more reliable and yields more meaningful insights.
I'm Rain, a Java server-side developer exploring the mysteries of AI technology. I love technical communication and sharing, and I'm passionate about the open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
🌟 Welcome to the effortless drizzle! 🌟