Clustering is an unsupervised learning method based on the assumption that the dataset is unlabeled or that there is no direct correspondence between the input data and the predefined output. The main goal of clustering is to group data points with similar characteristics into the same set, which is often referred to as a "cluster".
The quality and validity of clustering results often depend on the distance measure used between data points, which in turn affects how accurate and sensible the grouping is. Through clustering analysis, researchers can discover the intrinsic structure of the data, which provides a valuable reference for subsequent data processing and decision-making.
Introduction to Clustering
In our daily lives, when we are faced with the need to organize a pile of our family's clothes, we are actually engaged in a clustering process 🧦👕👖🩲. We group similar clothes together based on their type, color, size, and other characteristics to make it easier to organize and store them. This simple life scenario is closely related to clustering methods in data science. In the field of data science, clustering is widely used to analyze user preferences, identify market trends, or determine potential features of any unlabeled data set.
By grouping similar data points, clustering helps us understand complex information more clearly, making otherwise cluttered data easier to work with.
Key Terminology
Clustering, in simple terms, is the process of classifying or categorizing data. However, how to perform this categorization effectively is the focus of various types of algorithms that need to be studied and considered in depth. In this process, we encounter many proper nouns, and the understanding of these terms is crucial to mastering clustering methods and algorithms.
"Transduction" and "induction".
Transductive reasoning maps directly from observed training cases to specific test cases. Inductive reasoning first derives general rules from training cases and then applies those rules to test cases. It's okay if that doesn't click yet; let me give you an example.
A brief example to illustrate the difference between transductive and inductive reasoning.
First, consider inductive reasoning:
- You observe three swans: one is white, another is white, and the third is white.
- You conclude, based on these observations, that "all swans are white."
In this process, you infer a general conclusion (all swans are white) from specific observations (particular white swans). However, this conclusion is not necessarily correct, because you have not observed all swans.
Next, transductive reasoning:
- You have observed three specific swans, and all of them are white.
- A new bird appears that closely resembles the swans you have already seen.
- Based directly on that resemblance, you conclude, "this particular bird is white."
In this process, you reason from specific known cases straight to a conclusion about a specific new case. No general rule such as "all swans are white" is ever formulated; the observed examples themselves drive the prediction.
So, simply put:
- Induction derives a general rule from multiple specific examples, and the rule may have exceptions.
- Transduction reasons directly from specific known cases to a specific new case, without formulating a general rule first.
"Non-planar" and "Planar" Geometry
Plane geometry is geometry on a two-dimensional plane (e.g., paper). Basic objects include points, lines, planes, angles, triangles, quadrilaterals, etc. For example, draw a triangle on a plane, measure side lengths and angles, and use the rules of Euclidean geometry, such as the Pythagorean theorem.
Non-planar geometry studies geometric properties on curved surfaces and in three or more dimensions. It includes the non-Euclidean geometries, such as elliptic geometry and hyperbolic geometry. An example is geometry on the surface of a sphere, where great circles play the role of "straight lines" and give the shortest paths, such as flight routes on the Earth.
Geometry is closely related to machine learning because data often has geometric structure. For example, point clouds, images, and videos can be treated as points in a high-dimensional space. When dealing with non-Euclidean data such as social networks and other graph-structured data, concepts from non-Euclidean geometry may be needed to analyze the structure and relationships in the data. Geometric measures of similarity between data points also underpin classification and clustering tasks.
Clustering and Distance Matrix
Clusters are characterized by the distances between their points, and this distance can be measured in several ways. In a Euclidean space, a cluster's center is its centroid, the average of the coordinates of its points; in a non-Euclidean space, where averaging points may not be meaningful, the "center of the cluster" is its clustroid, the actual data point closest to all the other points. Let me explain what this means with a few examples.
Distance matrix: a table that records the distance between each pair of points in a data set. The rows and columns represent the data points, and each element of the matrix is the distance between the corresponding pair.
Euclidean distance: the most commonly used distance measure, the straight-line distance between points in two, three, or more dimensions. In clustering, the "center of mass" of a Euclidean cluster is the average position of all its points; you can picture it as the "center" of the cluster.
Non-Euclidean distances: distance calculations that do not follow the rules of Euclidean geometry, such as the Manhattan distance and the Chebyshev distance. These distances reflect different spatial properties.
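To make these measures concrete, here is a minimal sketch in plain Python (standard library only) that computes all three distances for one pair of points:

```python
# A minimal sketch: Euclidean, Manhattan, and Chebyshev distances
# between two points, using only the Python standard library.
import math

def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # "City block" distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    # Maximum coordinate difference along any single dimension.
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (5, 8)
print(euclidean(p, q))   # about 7.211
print(manhattan(p, q))   # 10
print(chebyshev(p, q))   # 6
```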
To illustrate with an example, suppose we have the following data points:
- Point P1: (1, 2)
- Point P2: (2, 3)
- Point P3: (5, 8)
- Point P4: (7, 9)
Euclidean clustering:
- The centroid is the average of all the points: ((1+2+5+7)/4, (2+3+8+9)/4) = (3.75, 5.5). This is a computed point that does not necessarily exist in the dataset.
Non-Euclidean clustering:
- We choose the point P2 (2, 3) as the clustroid, because its total distance to the other points is the smallest. This center is an actual point in the dataset.
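A short sketch, assuming NumPy is available, reproduces both kinds of center for these four points:

```python
# A minimal NumPy sketch: centroid vs. clustroid for the four example points.
import numpy as np

points = np.array([[1, 2], [2, 3], [5, 8], [7, 9]])  # P1..P4

# Euclidean centroid: coordinate-wise mean; need not be a real data point.
centroid = points.mean(axis=0)
print(centroid)  # [3.75 5.5 ]

# Distance matrix: pairwise Euclidean distances between all points.
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Clustroid (medoid): the actual point whose total distance to the
# other points is smallest.
clustroid = points[dist.sum(axis=1).argmin()]
print(clustroid)  # [2 3]  -> P2
```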
Constrained Clustering
Constrained clustering sits between unsupervised and semi-supervised learning: it guides the clustering process by introducing additional constraints to improve the quality and effectiveness of the result. Here are some key points:
Types of constraints:
- Must-link: If two data points are labeled as "must-link", they must be placed in the same cluster when clustering.
- Cannot-link: If two data points are labeled as "cannot-link", they must be placed in different clusters when clustering.
For example, suppose we are clustering a set of items that includes:
- Round Plastic Toys
- Square Plastic Toys
- Triangle Metal Toys
- Cookies (possibly paper toys or other materials)
In the absence of constraints, the algorithm may group round and square toys into the same cluster and triangular toys and cookies into another cluster because the algorithm relies only on their features.
However, if we impose a constraint such as "items in this cluster must be made of plastic", the algorithm will only consider the round and square plastic toys together, so they will be clustered correctly, and the other items will not interfere with the process.
So constrained clustering improves the quality of the results by providing guiding information (must-link or cannot-link) that lets the clustering algorithm process the data more effectively and keeps irrelevant items from being grouped together.
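As a sketch of how such constraints enter an algorithm, the function below shows the kind of check a constrained K-Means variant (such as COP-KMeans) performs before assigning a point to a cluster. The function and variable names here are hypothetical, for illustration only:

```python
# A hypothetical sketch of the constraint check used by constrained
# K-Means variants (e.g., COP-KMeans): before assigning point `i` to a
# cluster, verify that no must-link or cannot-link pair would be broken.

def violates_constraints(i, cluster, assignments, must_link, cannot_link):
    # assignments: dict point_index -> cluster_id (points assigned so far)
    for a, b in must_link:
        # A must-link partner already placed in a *different* cluster.
        other = b if a == i else a if b == i else None
        if other is not None and other in assignments and assignments[other] != cluster:
            return True
    for a, b in cannot_link:
        # A cannot-link partner already placed in the *same* cluster.
        other = b if a == i else a if b == i else None
        if other is not None and other in assignments and assignments[other] == cluster:
            return True
    return False

# Usage: during the assignment step, try clusters from nearest to
# farthest and take the first one that violates no constraint.
assignments = {0: 0, 1: 0}   # points 0 and 1 already in cluster 0
must_link = [(0, 2)]         # point 2 must end up with point 0
cannot_link = [(1, 3)]       # point 3 may not join point 1
print(violates_constraints(2, 1, assignments, must_link, cannot_link))  # True
print(violates_constraints(3, 0, assignments, must_link, cannot_link))  # True
```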
Densities
When you inspect the data, the points within different clusters may be packed more or less densely ("crowded"), so the data needs to be analyzed with a clustering method suited to its density profile.
For example, K-Means suits datasets whose clusters have regular shapes and fairly uniform density. It is sensitive to noise, which can distort the clustering results, and it does not handle clusters of differing densities well. HDBSCAN, by contrast, is especially suitable for data with uneven density and can automatically identify and exclude noise.
Therefore, when choosing a clustering algorithm, it is important to select the appropriate method based on the characteristics and needs of the data.
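For a concrete comparison, here is a minimal sketch assuming scikit-learn 1.3 or later (which ships sklearn.cluster.HDBSCAN), run on a synthetic dataset with irregular shapes and uneven density:

```python
# A minimal sketch comparing K-Means and HDBSCAN on data with uneven
# density. Assumes scikit-learn >= 1.3, which provides sklearn.cluster.HDBSCAN.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, HDBSCAN

# Two interleaving half-moons: irregular shapes, non-uniform density.
X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
hdbscan_labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# K-Means cuts the moons with a straight boundary; HDBSCAN follows the
# density and marks outliers with the label -1.
print(np.unique(kmeans_labels))   # [0 1]
print(np.unique(hdbscan_labels))  # e.g. [-1  0  1]
```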
Common Clustering Algorithms
There is a wide variety of clustering algorithms, more than 100, and their applicability usually depends on the nature of the specific data and the goals of the analysis. To gain a deeper understanding of the characteristics of these algorithms, we will focus on two common clustering methods: hierarchical clustering and center-of-mass clustering.
Next, we will illustrate the characteristics of these two algorithms and their effectiveness in practical applications through specific examples.
Hierarchical Clustering
Hierarchical clustering is a method that clusters by establishing hierarchical relationships between data points. Its main idea is to gradually merge data points into clusters, forming a tree structure known as a dendrogram.
There are two main approaches:
- Bottom-up (agglomerative):
- Each data point starts as a separate cluster.
- In each step, find the two closest clusters and merge them into a new cluster.
- Repeat this process until all points are merged into one cluster.
- Top-down (divisive):
- All data points start in a single cluster.
- Gradually split the cluster into smaller clusters.
- Repeat this process until each data point becomes a separate cluster.
If this is still unclear, consider another example. Suppose we have a set of fruit data including apples, oranges, bananas, and strawberries. Using hierarchical clustering, each fruit is first treated as a separate cluster and then gradually merged based on the similarities between the fruits, producing a tree structure that shows the relationships between them. For example, apples and oranges might be merged first because they are similar in shape and size, while bananas and strawberries might join the tree in later merges.
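A minimal sketch of this fruit example, assuming SciPy and Matplotlib are available; the feature values below are made up purely for illustration:

```python
# A minimal sketch of agglomerative (bottom-up) hierarchical clustering
# using SciPy. The fruit feature values are invented for illustration:
# [roundness, size].
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

fruits = ["apple", "orange", "banana", "strawberry"]
features = np.array([
    [0.90, 0.6],  # apple: round, medium
    [0.95, 0.6],  # orange: round, medium
    [0.20, 0.7],  # banana: elongated, medium
    [0.70, 0.2],  # strawberry: roundish, small
])

# Repeatedly merge the two closest clusters ("average" linkage here).
Z = linkage(features, method="average")

# The dendrogram shows the order of merges: apple and orange join first.
dendrogram(Z, labels=fruits)
plt.show()
```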
Center-of-Mass Clustering
Center-of-mass clustering groups data based on the center of mass (the center point of a cluster); the most common example is the K-Means algorithm.
The steps of the K-Means algorithm are as follows:
- Choose the number of clusters, k.
- Randomly select k initial centers of mass.
- Assign each data point to the nearest center of mass, forming k clusters.
- Calculate the new center of mass for each cluster, i.e., the average position of all the points in the cluster.
- Repeat steps 3 and 4 until the centers of mass no longer change (convergence).
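Here is a minimal NumPy sketch of these five steps; it is a teaching aid, not a production implementation (real projects would typically use sklearn.cluster.KMeans):

```python
# A minimal NumPy sketch of the five K-Means steps above.
# A teaching aid only: empty clusters are not handled.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k data points as the initial centers of mass.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centers no longer move (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1, 2], [2, 3], [5, 8], [7, 9]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)   # e.g. [0 0 1 1]
print(centers)  # e.g. [[1.5 2.5], [6.  8.5]]
```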
As an example, suppose we have a set of customer data to be clustered by purchasing behavior. Using K-Means with k = 3, customers can be categorized into three groups: frequent buyers, occasional buyers, and inactive customers. The algorithm computes a center of mass from the purchasing habits of each category and adjusts each customer's assignment, ending up with three well-defined customer groups.
These two types of algorithms also have their own limitations. Hierarchical clustering is suitable for small-scale datasets, generates hierarchical relationships of data, and is easy to visualize, but has a high computational complexity. Center-of-mass clustering is computationally efficient and suitable for large-scale datasets, but requires the number of clusters to be specified in advance and is sensitive to outliers.
Summary
In this article, we explored the importance and applications of clustering as an unsupervised learning method. Clustering not only helps us identify the intrinsic structure in data, but also appears in everyday life, such as the way we sort clothes when tidying up. By understanding concepts such as the distance matrix, constrained clustering, and density, we can analyze data more accurately and choose the appropriate clustering algorithm. Whether hierarchical clustering or center-of-mass clustering, each method has its own strengths and suitable scenarios.
This concludes today's preliminaries. We hope these basic concepts give you a solid foundation for what follows. In the next section, we will look more deeply at how to perform cluster analysis on data, and we look forward to exploring this important topic with you.
I'm Rain, a Java server-side coder, studying the mysteries of AI technology. I love technical communication and sharing, and I am passionate about open source community. I am also a Tencent Cloud Creative Star, Ali Cloud Expert Blogger, Huawei Cloud Enjoyment Expert, and Nuggets Excellent Author.
💡 I won't be shy about sharing my personal explorations and experiences on the path of technology, in the hope that I can bring some inspiration and help to your learning and growth.
Welcome to the effortless drizzle!