Why hierarchical clustering?

Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups.
For example, if several patients visit your clinic, you can probably group them based on their symptoms, so that the patients in each group have broadly similar symptoms. There are a number of different approaches to generating clusters, such as k-means clustering, density-based clustering, Gaussian mixture model clustering, and hierarchical clustering.

Each method has its own advantages and disadvantages. Several clustering algorithms only assign the objects to clusters, and the resulting groups can be viewed on a cluster plot. A hierarchical clustering algorithm, in addition to breaking the objects up into clusters, also shows the hierarchy of distances between clusters, that is, how dissimilar one cluster is from another.

In this article we will focus only on hierarchical clustering. Suppose, for instance, that you would like to group your customers into two or more groups based on the distances between them, such that all customers in one group can be served from a warehouse located close to that group. How does it work? Hierarchical clusters can be built in either an agglomerative or a divisive fashion.

In the agglomerative approach, the two closest nodes are combined into one node, and this process is repeated until all the nodes have been merged into a single cluster. In the divisive approach, all the nodes start together in one cluster, which is broken up into two groups, and so on, until only individual nodes remain.

In most cases, we will be using the agglomerative approach to create hierarchical clusters. In this example, nodes 4 and 10 are the closest together, so they are combined into one cluster. Next, nodes 2 and 5 are grouped together. This process is repeated until only one node is left. The process of combining nodes is shown in Figure 2.

Once all the groups have been determined, we get a tree-like structure with the leaves being the individual nodes. An example of the dendrogram is shown in Figure 3 (Cluster dendrogram for the given problem with two clusters).
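To make the process concrete, here is a minimal sketch using SciPy that builds an agglomerative hierarchy over ten made-up points and plots the resulting dendrogram. The coordinates are placeholders, not the data behind Figures 2 and 3.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Ten made-up points with two features each (placeholder data).
rng = np.random.default_rng(0)
points = rng.random((10, 2))

# Agglomerative clustering: "average" linkage repeatedly merges the two
# clusters with the smallest average pairwise distance between them.
Z = linkage(points, method="average", metric="euclidean")

# The dendrogram shows the order and the height (distance) of each merge.
dendrogram(Z, labels=list(range(1, 11)))
plt.ylabel("Height (distance between merged clusters)")
plt.show()
```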

Note that the Y-axis on this plot, labelled Height, is the average distance between the nodes. For example, the distance between nodes 4 and 10 is 0. If we want to break the nodes up into two groups, we determine the appropriate height at which to split the tree. For example, if we draw a horizontal line at a height of 8 units, we get two groups: the first group contains only nodes 4 and 10, and the second group contains the remaining nodes (1, 2, 3, 5, 6, 7, 8, and 9).

If we wanted to break the data up into three groups, we could instead draw a horizontal line at a height of 7, in which case the three groups would be Group 1 (4, 10), Group 2 (2, 3, 5, 8, 9), and Group 3 (1, 6, 7). Hence, the final number of clusters can be decided later on and does not need to be chosen before building the hierarchy.
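The same cuts can be made programmatically. Assuming the linkage matrix Z from the sketch above, SciPy's fcluster can either ask for a fixed number of groups or cut at an explicit height:

```python
from scipy.cluster.hierarchy import fcluster

# Z is the linkage matrix built in the previous sketch.
two_groups = fcluster(Z, t=2, criterion="maxclust")    # exactly two groups
three_groups = fcluster(Z, t=3, criterion="maxclust")  # exactly three groups

# Alternatively, cut at an explicit height, like the horizontal line at 8:
groups_at_8 = fcluster(Z, t=8, criterion="distance")

print(two_groups)  # one cluster label per node
```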

Algorithm Options

There are two main options available when running a hierarchical cluster analysis: the metric used to determine the distance between two nodes, and the formula (linkage) used to calculate the distance between clusters that contain multiple nodes.

Distance Calculation

There are several ways to calculate the distance between two points. In the above example, we used the Euclidean distance, which is the default and the most commonly used metric.
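As a sketch of the options, SciPy's pdist supports many metrics besides the default Euclidean, and the condensed distance matrix it returns can be fed straight into linkage:

```python
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

# points is the array from the earlier sketch.
# Pairwise distances under different metrics (Euclidean is the default).
d_euclidean = pdist(points)                      # sqrt(sum((x - y) ** 2))
d_manhattan = pdist(points, metric="cityblock")  # sum(|x - y|)
d_cosine = pdist(points, metric="cosine")        # 1 - cosine similarity

# A precomputed condensed distance matrix can be passed to linkage() directly.
Z_manhattan = linkage(d_manhattan, method="average")
```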


How do we measure the quality of the resulting clusters? There are many measures for this; perhaps the most popular is the Dunn index.

The Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthermost points.
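A direct implementation of this definition might look like the following sketch. Note that dunn_index is a hypothetical helper, not a library function, and it assumes labels contains at least two distinct clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(points, labels):
    """Hypothetical helper: minimum inter-cluster distance divided by
    the maximum intra-cluster diameter (assumes >= 2 clusters)."""
    clusters = [points[labels == k] for k in np.unique(labels)]

    # Diameter of a cluster: distance between its two furthermost points.
    max_diameter = max(cdist(c, c).max() for c in clusters)

    # Smallest distance between points belonging to two different clusters.
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter
```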

To obtain well-separated and compact clusters, you should aim for a higher Dunn index. Now let's implement a use-case scenario using the agglomerative hierarchical clustering algorithm. The data set consists of the customer details of a particular shopping mall, along with each customer's spending score.

You can download the dataset from here. Of all the features, CustomerID and Genre are irrelevant, so we drop them and create a matrix of independent variables by selecting only Age and Annual Income. As discussed earlier, to choose the number of clusters we draw a horizontal line through the longest vertical line in the dendrogram, that is, the line that spans the greatest distance up and down without intersecting any merge points. A sketch of this workflow is given below.
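The sketch below uses pandas and scikit-learn. The file name Mall_Customers.csv and the column names are assumptions based on the common version of this dataset, and five clusters is an illustrative choice rather than the number dictated by the dendrogram:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Assumed file and column names for the mall-customers dataset.
df = pd.read_csv("Mall_Customers.csv")

# Drop the irrelevant fields; keep Age and Annual Income as the
# matrix of independent variables.
X = df[["Age", "Annual Income (k$)"]].values

# Ward linkage merges the pair of clusters that minimises the
# increase in within-cluster variance at each step.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
df["Cluster"] = model.fit_predict(X)

print(df["Cluster"].value_counts())
```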

Hierarchical clustering is a very useful approach to segmentation, and not having to pre-define the number of clusters gives it quite an edge over k-means. However, it does not work well with huge amounts of data. Well, this brings us to the end of the article. I hope you have enjoyed reading it. You can reach out to me on LinkedIn with any queries.

He has over 4 years of working experience in sectors such as Telecom, Analytics, Sales, and Data Science, specialising in various Big Data components. Reposted with permission.

Tags: Clustering, Machine Learning, Python


