Clustering with K-Means and Hierarchical Agglomerative Clustering (HAC)
Clustering is a technique that groups similar observations together while keeping the groups distinct from one another. Because the data carry no labels to learn from, clustering is referred to as an “unsupervised learning” technique, and the number of groups is not defined in advance. Data points can be grouped in either a top-down or a bottom-up fashion; the two approaches describe how the groups are built. A top-down approach starts from a presumed number of groups and assigns each observation to one of them. A bottom-up approach treats each data point as its own group and continually merges groups until the assigned number of clusters remains.
The difference between the two approaches determines which algorithm to use. A commonly used top-down method is K-Means, and a commonly used bottom-up method is Hierarchical Agglomerative Clustering (HAC).
K-Means Clustering
The presumed number of groups (k) must be assigned before clustering begins. The algorithm initializes k centroids at random positions and assigns each observation to a cluster based on closeness to a centroid. Each centroid is then moved to the center of its cluster, and the data points are re-assigned to the cluster (sometimes a new one) whose moved centroid is most similar. The loop is summarized below, with a short code sketch after the list.
- Initialize k centroids at random positions.
- Assign each data point to a cluster based on similarity to the nearest centroid.
- Move each centroid to the center of its cluster.
- Repeat Steps 2–3 until the centroids no longer move.
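Below is a minimal sketch of this loop using scikit-learn’s KMeans; the toy dataset from make_blobs and the choice of k = 3 are illustrative assumptions, not values from the text.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the presumed number of groups (k); n_init repeats the
# random centroid initialization several times and keeps the best run.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # final centroid positions
print(labels[:10])              # cluster assignments of the first 10 points
```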
It is important to note that data points can change cluster membership after the centroids are re-centered. This is one of K-Means’ advantages over a bottom-up approach, where data points cannot be re-assigned.
In unsupervised learning, the number of clusters k is not known in advance, so it must be chosen by the user.
Users can evaluate candidate numbers of clusters with metrics such as the Calinski-Harabasz score, also known as the “variance ratio.” The ratio compares intercluster distance against intracluster variance: a good clustering has low intracluster variance and high intercluster distance. The score has no upper bound, and a higher value suggests a better fit.
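As a hedged sketch (the toy data and the candidate range of k are assumptions), the score can be compared across several values of k with scikit-learn’s calinski_harabasz_score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-Means for several candidate k and compare the variance ratio.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: Calinski-Harabasz score = {calinski_harabasz_score(X, labels):.1f}")
```

The k with the highest score is the best candidate under this metric.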
Hierarchical Agglomerative Clustering
Another popular algorithm is HAC, which uses a bottom-up approach. Instead of first assigning the data points to k clusters, a bottom-up approach begins by treating every observation as its own cluster. Clusters are then merged based on “similarity” until k clusters remain (a short sketch follows the list below). Similarity can be evaluated through scikit-learn’s AgglomerativeClustering linkage methods: ward, average, complete, single. This differs from K-Means, where similarity is measured against the cluster centroid.
- ‘ward’ minimizes the variance of the clusters being merged.
- ‘average’ uses the average of the distances of each observation of the two sets.
- ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.
- ‘single’ uses the minimum of the distances between all observations of the two sets.
— https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
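A minimal sketch of HAC with scikit-learn’s AgglomerativeClustering follows; the dataset and the choice of linkage='ward' are illustrative assumptions.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Merge clusters bottom-up until n_clusters remain; linkage controls how
# the "similarity" between two clusters is measured.
hac = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = hac.fit_predict(X)
print(labels[:10])  # cluster assignments of the first 10 points
```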
It’s important to note that in HAC, data points are not reassigned to a new cluster once merged. In ‘single’ linkage, for example, clusters are merged based on the closest distance between two points, one from each cluster. HAC can also help users understand a dataset by showing when and where two clusters were merged, which a dendrogram visualizes.
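The sketch below visualizes the merge order with SciPy’s linkage and dendrogram rather than scikit-learn (an assumption, since the text does not name a plotting tool); each row of the returned matrix records which two clusters merged and at what distance.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# A small toy dataset keeps the dendrogram readable.
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# 'single' linkage merges the pair of clusters with the closest two points.
Z = linkage(X, method="single")
dendrogram(Z)
plt.title("Single-linkage dendrogram")
plt.show()
```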