Learning Machine Learning from Scratch - Clustering Visualization

The previous section introduced clustering and its theoretical foundation. Today's task is data visualization. We have already used visualization in regression analysis; here we focus on visualization for cluster analysis. We will use tools such as scatter plots and concentric-circle (density contour) plots to read the data more intuitively.

Data Visualization - Clustering

Today our goal is to read and analyze a data file of song information, with fields such as song title, music genre, artist, popularity, danceability, and release date. We first identify the three most prominent genres and extract their rows. We then examine how the other fields correlate within these three genres and how their values are distributed.

It is important to note that this chapter is not intended to discuss much about clustering algorithms and their specific roles; our focus will be on how to use visualization tools to present and understand this data. This will help us to capture trends and patterns in the data more visually, thus laying the groundwork for subsequent analysis.

Filtering data

First, we install and import the packages we need, then load the data:

!pip install seaborn

import matplotlib.pyplot as plt
import pandas as pd

# path as given in the source; append your CSV file name
df = pd.read_csv("../data/")
df.head()

Next, we will take an initial look at the dataset to understand its overall structure and content.

Using the following commands, we can get a comprehensive view of the general format of the data as well as key information such as the amount of data.

df.info()
df.isnull().sum()
df.describe()

df.info(): a quick look at the structure of the data and the types of the columns.

df.isnull().sum(): shows which columns have missing values and how many.

df.describe(): mainly for numerical data; gives the basic statistics of each column, which makes the distribution of the data easier to understand.

We can start with the output of the describe method, which gives the key statistics and a first impression of the data distribution. We have discussed the other outputs before; see the attached figure for details.
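As a minimal sketch of what these three calls return, the snippet below runs them on a small made-up table. The DataFrame `toy` and its values are assumptions for illustration only; the column names simply mirror fields mentioned in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the song file.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "artist_top_genre": ["afropop", "afropop", "nigerian pop", None],
    "popularity": [30, 0, 52, 41],
    "danceability": [0.71, 0.65, np.nan, 0.80],
})

toy.info()                    # prints column names, dtypes and non-null counts
missing = toy.isnull().sum()  # number of missing values per column
stats = toy.describe()        # count, mean, std, min, quartiles, max (numeric columns only)

print(missing)
print(stats)
```

Note that describe skips the text columns by default, so only popularity and danceability appear in its output.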

Data filtering

Next, we will filter the data, with the goal of extracting the three most popular music genres. To do this, we count the songs per artist_top_genre and plot the counts as a bar chart with the genre on the x-axis, which gives a clearer view of the distribution. Below is the corresponding code:

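import seaborn as sns

# count songs per genre, sorted from most to least frequent
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top[:5].index,y=top[:5].values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')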

The figure shows the five most frequent music genres. From these we select three: afro dancehall, afropop, and nigerian pop.
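The counting step can be tried in isolation. The sketch below uses a made-up list of genre labels (an assumption, not the real file) to show that value_counts returns the counts sorted in descending order, so taking the first entries gives the most frequent genres.

```python
import pandas as pd

# Hypothetical genre labels for illustration only.
genres = pd.Series(
    ["afropop"] * 4 + ["afro dancehall"] * 3
    + ["nigerian pop"] * 2 + ["alternative r&b"],
    name="artist_top_genre",
)

top = genres.value_counts()  # counts per genre, largest first
top3 = top.head(3)           # the three most frequent genres
print(top3)
```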

Note that we found no missing values (no null data) when examining the data, so we did not delete any rows and went straight to plotting. If your dataset does contain missing values, it is recommended to remove the affected rows before plotting. Otherwise the counts and charts may be biased and the analysis less reliable.
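If missing values were present, one simple option is pandas' dropna, which removes every row that contains at least one null. A minimal sketch on made-up data (the table `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing genre and one missing popularity.
toy = pd.DataFrame({
    "artist_top_genre": ["afropop", None, "nigerian pop"],
    "popularity": [30.0, 12.0, np.nan],
})

clean = toy.dropna()  # keep only rows without any missing value
print(clean)
```

Dropping rows is the simplest choice, not the only one; whether it is appropriate depends on how much data would be lost.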

df = df[(df['artist_top_genre'] == 'afro dancehall') | (df['artist_top_genre'] == 'afropop') | (df['artist_top_genre'] == 'nigerian pop')]
df = df[(df['popularity'] > 0)]
top = df['artist_top_genre'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(x=top.index,y=top.values)
plt.xticks(rotation=45)
plt.title('Top genres',color = 'blue')

The filtering is now complete. The data contains only the three selected genres, and songs with a popularity of 0 have been removed. The figure shows the resulting counts.
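The chained == comparisons above can also be written with isin, which some readers find easier to extend when the list of genres changes. This is an alternative sketch on made-up data, not the code used in this article; `toy` and `keep` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "artist_top_genre": ["afropop", "afro dancehall", "indie r&b",
                         "nigerian pop", "afropop"],
    "popularity": [30, 0, 55, 41, 12],
})

keep = ["afro dancehall", "afropop", "nigerian pop"]
# keep rows whose genre is in the list AND whose popularity is positive
filtered = toy[toy["artist_top_genre"].isin(keep) & (toy["popularity"] > 0)]
print(filtered)
```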

Strong correlation

Similarly, let's revisit the heat map. We have already covered this part in detail in the regression analysis, so here we will provide the relevant code directly. Below is the specific implementation code:

# pairwise correlation of the numeric columns
corrmat = df.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)

The heat map shows that the only pair of variables with a strong correlation is energy and loudness. This is not surprising, as loud music usually comes with high energy and a strong sense of rhythm.
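Reading the strongest pair off a heat map by eye works for a handful of columns. As a hedged sketch, the same answer can be pulled out of the correlation matrix in code. The data below is synthetic: loudness is deliberately constructed from energy, so the result is an assumption of the example and not a finding about the song file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": 10 * energy + rng.normal(0, 0.5, 200),  # built from energy on purpose
    "danceability": rng.uniform(0, 1, 200),             # independent noise
})

corr = toy.corr(numeric_only=True)
# Upper triangle without the diagonal, so each pair of columns appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
pairs = pairs.sort_values(key=abs, ascending=False)  # strongest first, by absolute value

strongest = pairs.index[0]
print(strongest, round(pairs.iloc[0], 2))
```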

Next, we will dive into a new visualization method to help us better understand the distribution of data in cluster analysis.

Data distribution

Concentric circles

Next, we will analyze the data based on the two metrics of popularity and danceability by drawing concentric circles and scatter plots. These charts will help us understand the distribution and trend of the data more intuitively. Of course, you can also choose other fields for comparative analysis, which can be completely adjusted according to your personal preferences and needs.

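from sklearn.preprocessing import LabelEncoder

# encode the columns at positions 6 and 7 as integer labels
df.iloc[:, 6:8] = df.iloc[:, 6:8].apply(LabelEncoder().fit_transform)

sns.set_theme(style="ticks")
# joint KDE plot: density contours per genre, with marginal distributions
g = sns.jointplot(
    data=df,
    x="popularity", y="danceability", hue="artist_top_genre",
    kind="kde",
)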

Because the columns differ in data type and distribution, I first convert the columns at positions 6 and 7 to integer labels, so that the values are handled consistently in the analysis. The result is shown in the figure:

The aim is to draw a joint distribution plot that shows the relationship between popularity and danceability in the dataset, with each music genre (artist_top_genre) drawn in a different color.
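The integer conversion mentioned above can be looked at on its own. The sketch below applies scikit-learn's LabelEncoder to a short made-up list of genre names; the list is an assumption for illustration. The encoder sorts the distinct values and assigns each one an integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only.
genres = ["nigerian pop", "afropop", "afro dancehall", "afropop"]

enc = LabelEncoder()
codes = enc.fit_transform(genres)      # one integer per input value
print(list(enc.classes_))              # distinct values in sorted order
print(list(codes))
print(list(enc.inverse_transform(codes)))  # back to the original labels
```

The codes carry no numeric meaning beyond identity, so they should not be read as an ordering of the genres.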

Scatterplot

sns.FacetGrid(df, hue="artist_top_genre").map(plt.scatter, "popularity", "danceability", s=5).add_legend()

A single line of code is enough to draw the scatter plot, as shown in the figure:

In general, scatter plots are a very effective way to show how data is clustered, so mastering this visualization technique is important for understanding the structure and patterns of the data. In the next lesson, we will apply the k-means clustering algorithm to the filtered data to explore and identify groups that overlap in interesting ways.
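The same kind of scatter plot can also be drawn with seaborn's scatterplot function, which takes the hue column directly. This is an alternative sketch on randomly generated data, not the article's own code; the table `toy` and its values are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical random data with the three genre names used in this article.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "popularity": rng.integers(1, 80, 90),
    "danceability": rng.uniform(0.3, 0.95, 90),
    "artist_top_genre": np.repeat(
        ["afro dancehall", "afropop", "nigerian pop"], 30),
})

fig, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(data=toy, x="popularity", y="danceability",
                hue="artist_top_genre", s=15, ax=ax)
ax.set_title("Popularity vs. danceability by genre")
# In a script, call plt.show() to display the figure.
```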

Summary

In this chapter, we looked at how data visualization is applied in cluster analysis. Working with the song information dataset, we identified three major genres and used scatter plots and concentric circles to show the distribution and trends of the data. The visualization improved our understanding of the data and laid the foundation for the subsequent cluster analysis.

In this way, we can not only recognize patterns in our data, but also provide strong support for decision-making. As we have seen, the process of visualizing data is an exploratory journey that helps us find hidden connections and meanings in complex data. Next, we will apply the k-means clustering algorithm to further dig into the stories behind these data.

I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.

💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.