K-means clustering: Explained with Jupyter Notebook – In this tutorial, we are going to introduce the K-means clustering algorithm. All right, let’s get started. K-means is a very simple, but also very effective, unsupervised clustering algorithm. It tries to partition the dataset into K pre-defined clusters, where each data point belongs to only one group. Cool.
Let’s go through a simple 2-dimensional example to get a better understanding of this idea. There are 3 major steps.
In step 1, we randomly choose K data points from the dataset as the initial centers. In this case, K is equal to three, and we will discuss how to choose K later. OK, once we have these centers, in step 2 we calculate the distance from each data point to these centers. Then we assign each data point to its closest center. In this example, for point A, we compute its distances to C1, C2 and C3 respectively.
After comparing the lengths of d1, d2 and d3, we find that d3 is the smallest. Therefore, we assign A to C3.
We then move to point B and follow the same procedure. Cool. We keep repeating this process until all the data points are assigned. Then, in step 3, we calculate each center’s new position by taking the average of all the points within its cluster. We keep repeating steps 2 and 3 until the centers stop moving.
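The three steps above can be sketched in plain numpy. This is a minimal illustration of the idea only (the function name and defaults are our own, and it assumes no cluster ever ends up empty), not the scikit-learn implementation we train later:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centers stop moving
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated blobs this converges in a handful of iterations.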
Now it’s coding time. We’re going to show you how to train and visualize the K-means algorithm in a Jupyter notebook. Let’s first import all the necessary libraries.
Here we are using numpy for data manipulation, make_blobs from sklearn.datasets for data preparation, KMeans from sklearn.cluster for model training, and matplotlib and seaborn for visualization. Cool. So the first step is to create and load the dataset. make_blobs from sklearn.datasets will help us generate some Gaussian blobs for clustering.
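The import cell would look something like this (the aliases are the usual conventions):

```python
import numpy as np                       # data manipulation
from sklearn.datasets import make_blobs  # data preparation
from sklearn.cluster import KMeans       # model training
import matplotlib.pyplot as plt          # visualization
import seaborn as sns                    # visualization styling
```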
Now let’s go through the parameters of this function. n_samples is the total number of points, equally divided among the clusters in this dataset. n_features is the number of features for each sample. In this case, we set n_features to 2, which means the dataset is 2-dimensional. We set centers equal to 3, which means there are 3 clusters generated in this dataset. cluster_std is the standard deviation of each Gaussian cluster, and random_state is just the random seed for the data preparation.
We can reproduce the data generation process with the same random_state setting. Great. Now let’s have a look at the generated X and y. X has 200 samples in it. There are 2 columns in the dataset, and each column represents a feature for that sample. y is just an array of cluster labels, one for each sample.
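Putting those parameters together, the data-generation cell might look like this (the specific seed 42 is our own choice for illustration):

```python
from sklearn.datasets import make_blobs

# 200 points, 2 features each, drawn from 3 Gaussian blobs with std 1
X, y = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

print(X.shape)  # (200, 2) -- 200 samples, 2 feature columns
print(y.shape)  # (200,)   -- one cluster label per sample
```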
As expected, both of them have 200 samples. Now let’s try to visualize the dataset in a plot. Here we are using the scatter function from matplotlib to visualize the dataset in two dimensions. If you have more than two features, you can use a dimensionality reduction technique like PCA to reduce the features to 2 dimensions and then visualize them in a 2-dimensional space.
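A minimal version of that scatter plot (the colormap and marker size are our own styling; in a notebook you would simply call plt.show()):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

# scatter the two feature columns, colored by the generated labels
sc = plt.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()
```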
From the plot, you may notice that there is a lot of overlap between the different clusters and no clear boundaries between them. The reason is that we set the Gaussian standard deviation to one, which is relatively large. Here, I encourage you to experiment with the cluster_std parameter offline. Try reducing it to a smaller value and you will see clearer boundaries between the clusters. Cool.
Let’s move on to how to train a K-means model with this dataset. Here we are using the KMeans class in sklearn. As usual, let’s have a closer look at the parameters of this class. First, we have n_clusters set to 3. n_clusters is the number of clusters to form as well as the number of centers to generate.
This is one of the most important hyperparameters for this model, and we will tell you how to choose this number soon. init defines the method for initialization. Here we set it to random, which means we choose three observations at random from the dataset as the initial centers. n_init is the number of times the K-means algorithm will be run with different initial centers.
The final result will be the best output of the n_init consecutive runs in terms of the evaluation metric. In the KMeans class, we can access this metric through the inertia_ attribute, which is defined as the sum of squared distances of samples to their closest cluster center. This metric can also be used with grid search to find the best n_clusters value. The max_iter parameter defines the maximum number of iterations of the K-means algorithm for a single run.
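As a quick illustration of that idea, one can scan inertia_ over several candidate values of n_clusters (the common "elbow" heuristic; the range 1–6 here is just an example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

# inertia_: sum of squared distances of samples to their closest center
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, model.inertia_)
# for well-separated blobs, inertia drops sharply up to k=3
# and then flattens out -- the "elbow"
```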
As we have shown before, K-means is an iterative method. The centers will gradually converge to some final position. So the tol parameter defines how we declare convergence, based on the change in the metric between iterations. As usual, we set random_state for reproducibility.
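The parameters discussed above translate into the following constructor call (the max_iter and tol values shown are the sklearn defaults, and the seed 42 is our own choice):

```python
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters=3,     # number of clusters / centers to form
    init="random",    # pick 3 random observations as initial centers
    n_init=10,        # run 10 times, keep the best result by inertia
    max_iter=300,     # cap on iterations for a single run
    tol=1e-4,         # convergence threshold between iterations
    random_state=42,  # for reproducibility
)
```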
The next step, from the code perspective, is very straightforward. It has only one line of code, which does the model training and prediction at the same time. The output is just a list of integers representing which cluster each sample belongs to. Cool, the final step is to plot the final cluster predictions.
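That one line is fit_predict, and the final plot colors the points by the predicted labels, with the learned centers drawn as crosses (the styling is our own choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3,
                  cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=42)
# train the model and predict a cluster label for every sample in one call
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)  # final cluster centers
plt.show()
```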
Finally, let’s have a look at an animation of the K-means clustering algorithm starting from iteration 1. For this particular example, the model converges at around 7 iterations. So this is a brief introduction to how K-means works.