Hierarchical clustering is a popular method in unsupervised machine learning and data analysis that groups similar data points into clusters. It builds a hierarchy of clusters, which can be represented as a tree-like structure called a dendrogram. Hierarchical clustering is used in a variety of applications, such as image segmentation, document classification, and bioinformatics (for example, grouping genes with similar expression patterns).
There are two main approaches to hierarchical clustering: agglomerative and divisive clustering.
1. Agglomerative Hierarchical Clustering: Agglomerative clustering starts with each data point as its own cluster and iteratively merges the most similar clusters until only one cluster remains. The process is as follows (a minimal code sketch appears after the list):
- Initialization: Start with each data point as a single cluster, so we have as many clusters as data points.
- Merge Closest Clusters: Find the two closest clusters and merge them into one. How "closest" is measured is defined by a linkage criterion such as single linkage, complete linkage, average linkage, or Ward's method.
- Repeat: We continue merging clusters until all data points belong to a single cluster or until a predefined number of clusters is reached.
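This whole loop is available in SciPy as a single call. Below is a minimal sketch on toy 2-D data (the array X is made up for the demo), using Ward's method as the linkage criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: three visually separated pairs of points.
X = np.array([[1.0, 1.0], [1.5, 1.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 9.1], [8.8, 9.3]])

# Agglomerative clustering with Ward's method (other options include
# "single", "complete", "average"). Each row of Z records one merge:
# the two cluster indices joined, the merge distance, and the size
# of the newly formed cluster.
Z = linkage(X, method="ward")
print(Z)
```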
2. Divisive Hierarchical Clustering: Divisive hierarchical clustering takes the opposite approach, starting with all data points in a single cluster and recursively dividing them into smaller clusters. This approach is less common than agglomerative clustering.

In both agglomerative and divisive clustering, the choice of linkage criterion and the distance metric used to measure similarity or dissimilarity between data points plays a crucial role in determining the final clusters.
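On the divisive side, neither SciPy nor scikit-learn implements the classic DIANA algorithm, but the top-down idea can be sketched by recursively bisecting the data with 2-means. This is an illustrative stand-in rather than DIANA itself; the divisive_split helper, its depth parameter, and the random data are all made up for the demo:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, depth):
    """Recursively split X top-down with 2-means (illustrative only)."""
    if depth == 0 or len(X) < 2:
        return [X]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    halves = [X[labels == 0], X[labels == 1]]
    return [leaf for half in halves for leaf in divisive_split(half, depth - 1)]

X = np.random.RandomState(0).rand(20, 2)   # made-up data for the demo
clusters = divisive_split(X, depth=2)      # start from one cluster, split twice
print([len(c) for c in clusters])          # sizes of the resulting leaf clusters
```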
Distance Metrics: The choice of distance metric depends on the nature of the data and the problem. Common choices include Euclidean distance (straight-line distance, suited to continuous features on comparable scales) and Manhattan distance (the sum of absolute coordinate differences, somewhat less sensitive to outliers).
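As a quick illustration of how the metric changes the numbers, here is a sketch using SciPy's pdist on two contrived points chosen so the results are easy to verify:

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])

# Pairwise distance between the two points under each metric.
print(pdist(X, metric="euclidean"))  # [5.]  sqrt(3**2 + 4**2)
print(pdist(X, metric="cityblock"))  # [7.]  |3| + |4|  (Manhattan)
```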
Dendrogram: A dendrogram is a tree-like diagram that illustrates the hierarchical structure of the clusters. It shows the order in which clusters were merged and the distance at which each merge occurred; cutting the tree at a chosen height yields a flat clustering, which helps us choose an appropriate number of clusters for our specific application.
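Continuing the toy example from above, a minimal sketch of drawing the dendrogram and then cutting the tree into a fixed number of flat clusters (matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.array([[1.0, 1.0], [1.5, 1.2],
              [5.0, 5.0], [5.2, 4.8],
              [9.0, 9.1], [8.8, 9.3]])
Z = linkage(X, method="ward")

# Leaves are data points; the height of each junction is the
# distance at which the two clusters were merged.
dendrogram(Z)
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()

# Cut the tree into (at most) 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 3 3]
```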