Interpretable K-Means: Clusters Feature Importances

Model Interpretability

Understand your K-Means clusters by extracting each cluster’s most important features.


Machine learning models go through many stages before they can be considered production-ready. One critical stage is the moment of truth where the model is given a scientific green light: model evaluation. Many evaluation metrics are designed for different purposes and problem specifications, but none of them is flawless. Hence, data scientists carry a heavy burden when choosing the evaluation metric that best fits the problem at hand and serves as a functional, decision-inducing value for a machine learning project.

However, some evaluation metric values can seem unintuitive and cannot be explained in layman’s terms, especially when evaluating unsupervised clustering algorithms using internal indexes (where no external information, such as true labels, is available). Although these measurements are of considerable benefit for comparative benchmarking, combining them with an interpretation is crucial for independent, intuitive validation and for communicating results more easily to stakeholders.

When it comes to internal validation indexes, data scientists tend to lose a focal point of the evaluation process: the intuitive, “human” understanding of the model’s performance and its explanation. To illustrate by contrast, supervised classification evaluation metrics such as Accuracy, Precision, or Recall have very intuitive explanations for both laymen and experts. For instance:

  • Accuracy: “We have correctly classified this percentage of samples out of all samples we have.”
  • Precision: “We have correctly classified this percentage of Positives out of all predicted positives.”
  • Recall: “We have correctly classified this percentage of Positives out of all actual positives.”

On the other hand, internal validation indexes such as the widely used Silhouette Coefficient, Calinski-Harabasz Index, Davies-Bouldin Index, and many others are, more often than not, used comparatively rather than as an inherent and independent performance evaluation. The values of these measurements can be broken down into how compact each cluster is (how similar the cluster’s data points are to each other), how well-separated the clusters are (how dissimilar each cluster’s data points are from other clusters’ data points), or both. But these measurements are not as easy to explain and connect to the clustering problem you are solving.

Notwithstanding how valuable these measures are, obtaining an interpretation of a model alongside an internal evaluation index is crucial, both when trying to understand how good your model is as a data scientist and when attempting to communicate your results in depth to stakeholders, since most internal validation indexes do not pass the ELI5 (Explain It Like I’m 5) test :). This article focuses on one of the most popular unsupervised clustering algorithms, K-Means, and presents two possible techniques to extract the most important features for each cluster. The article’s outline is as follows:

  1. K-Means Business Value
  2. How K-Means Works
  3. Interpretation Techniques
  4. Real-Life Application

If you already know how K-Means works, jump to the Interpretation Techniques section. If you would like to visit the repository for this article and use the code directly, see kmeans-feature-importance.

K-Means Business Value

Say that you are running a business with thousands of customers, and you want to know more about them, however many you have. You cannot study each customer and cater a marketing campaign specifically to them, for example. Yet, you know that each customer will likely have different demographic (e.g., age), geographic (e.g., region), psychographic (e.g., spending habits), and behavioral (e.g., engagement) properties. Segmenting customers into multiple similar groups simplifies understanding who they are from the most prevalent properties of each group. The most popular algorithm to solve this problem is, I am sure you guessed it, K-Means: “K” refers to the number of possible customer segments/groups/clusters, and “Means” can be thought of as imaginary persons who are the center (most similar to their group members) of each of the K groups.

How K-Means Works

K-Means is an unsupervised clustering algorithm that groups similar data samples together, away from dissimilar data samples. Precisely, it aims to minimize the Within-Cluster Sum of Squares (WCSS) and consequently maximize the Between-Cluster Sum of Squares (BCSS). The K-Means algorithm has different implementations and conceptual variations, but for the sake of this article, we will focus on the most common method, namely Lloyd’s algorithm (naive K-Means), which follows an iterative approach to find a sub-optimal solution. Before starting, we will prepare a simple dataset for explanation and see what it looks like.

[Figure: the example dataset as a table and as a scatter plot]
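The article’s original figures and values are not reproduced here; as a minimal sketch, a comparable toy dataset with two features D1 and D2 (the exact values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# A hypothetical stand-in for the article's toy dataset: two features (D1, D2)
# with a few loose groups of points, similar in spirit to the original figures.
rng = np.random.default_rng(42)
X = pd.DataFrame(
    np.vstack([
        rng.normal(loc=[1, 1], scale=0.2, size=(5, 2)),  # group around (1, 1)
        rng.normal(loc=[5, 5], scale=0.2, size=(5, 2)),  # group around (5, 5)
        rng.normal(loc=[9, 1], scale=0.2, size=(5, 2)),  # group around (9, 1)
    ]),
    columns=["D1", "D2"],
)
print(X.head())
```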

The steps needed to cluster the data points above into K groups using K-Means are:

Step 1 — Choosing Initial Number of Groups/Clusters (K)

Each cluster is represented by a centroid: the mean of all data points assigned to that cluster. Choosing an initial number of groups is synonymous with choosing an initial number of centroids K. We can decide on K = 3 and choose 3 distinct data points randomly from the dataset as our initial centroids (there are better methods, such as k-means++, to choose the initial coordinates of each centroid).

[Figure: the three randomly chosen initial centroids on the scatter plot]
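Continuing the toy dataset above, a minimal sketch of this naive random initialization:

```python
# Randomly pick K = 3 distinct data points from the dataset as initial centroids.
K = 3
centroids = X.sample(n=K, random_state=0).to_numpy()
print(centroids)
```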

Step 2 — Assigning Data Points to Cluster Centroids

We have to assign each data point in our dataset to the closest centroid. We will use the most common distance metric, the Euclidean distance, as our definition of “close”, through the following steps:

1. Calculate the Euclidean distance between each centroid and all data points using the equation below, for each j-th cluster centroid C_j and one point p in a dataset that has d dimensions. (My usage of C_j = μ_j (μ = mean), i.e., writing the cluster centroid directly as the cluster mean, is only a simplification, since the equations I list are at the single-cluster level.)

$$d(p, C_j) = \sqrt{\sum_{i=1}^{d} \left(p_i - C_{j,i}\right)^2}$$

2. To minimize the WCSS, we assign each data point to its closest centroid (most similar / least distant). The reason this is a WCSS minimization step follows from the equation below for one cluster’s WCSS, with n_j points p_m assigned to the cluster centroid C_j: the shorter the distances of the points assigned to the cluster centroid, the lower its WCSS.

$$\mathrm{WCSS}_j = \sum_{m=1}^{n_j} \left\lVert p_m - C_j \right\rVert^2 \qquad\qquad \mathrm{WCSS} = \sum_{j=1}^{K} \mathrm{WCSS}_j$$
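A minimal sketch of the assignment step, continuing the toy example (the vectorized distance computation is just one of several equivalent ways to implement the equations above):

```python
# Step 2 sketch: compute the Euclidean distance from every point to every
# centroid, then assign each point to its closest centroid.
points = X.to_numpy()
# distances[i, j] = Euclidean distance between point i and centroid j
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)  # closest centroid per point (minimizes WCSS)

# WCSS for one cluster j: sum of squared distances of its points to C_j.
wcss_per_cluster = [
    ((points[labels == j] - centroids[j]) ** 2).sum() for j in range(K)
]
print(labels, wcss_per_cluster)
```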

Step 3 — Updating Cluster Centroids’ Positions

The last step is to update the cluster centroids’ positions, since different data points have been assigned to each cluster. We have to move each centroid to the mean of all data points assigned to it by:

  1. Calculating the mean of each cluster’s data points
  2. Setting each cluster centroid to its cluster’s new mean
  3. Repeating Step 2 and Step 3 until the cluster centroids (the new means) no longer change (see the sketch below)
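Putting Steps 2 and 3 together, a minimal sketch of the full iterative loop (assuming no cluster ever becomes empty, which the naive algorithm does not guard against):

```python
# Naive Lloyd's algorithm: alternate assignment and update until convergence.
for _ in range(100):
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)                  # Step 2: assign points
    new_centroids = np.array(
        [points[labels == j].mean(axis=0) for j in range(K)]
    )                                                  # Step 3: move to means
    if np.allclose(new_centroids, centroids):          # centroids stopped moving
        break
    centroids = new_centroids
```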

Using sklearn.cluster.KMeans, we end up with the following, where you can take, for example, the features of Sample 0 and Sample 1, calculate their mean, and then check that it equals Cluster Centroid D1 and Cluster Centroid D2:

[Figure: table of samples, their cluster assignments, and the fitted centroid coordinates]
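A sketch of the equivalent scikit-learn call and the centroid-equals-mean sanity check described above:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Verify that each centroid is the mean of the points assigned to it,
# e.g. the mean of the samples in cluster 0 equals cluster_centers_[0].
cluster0_mean = X[km.labels_ == 0].mean(axis=0).to_numpy()
print(cluster0_mean, km.cluster_centers_[0])
```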

The plot below shows how K-Means clustered the dataset, where each centroid is positioned at the mean of the points assigned to it:

[Figure: the clustered dataset with final centroid positions]

Interpretation Techniques

Approach 1: WCSS Minimizers

Disclaimer: the author developed this method. Please do not hesitate to point out any existing references for it or to criticize the methodology.

This approach is a direct analysis of each centroid’s sub-optimal position. Since K-Means aims to minimize the Within-Cluster Sum of Squares, and assuming that the distance metric used is the Euclidean distance, we can find the dimensions responsible for the highest amount of WCSS minimization (the sum of squares of each data point’s distance to its cluster centroid) for each cluster by finding the maximum absolute centroid movement along each dimension.

Using the same explanation example above, we can access cluster_centers_ from a fitted sklearn.cluster.KMeans model to get the final cluster centroids’ positions, then show the feature names (the dimensions d):
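A minimal sketch, reusing the fitted model from the sketch above:

```python
# Final centroid positions: one row per cluster, one column per feature.
print(km.cluster_centers_)   # shape (n_clusters, d)
print(list(X.columns))       # the feature names (dimensions d)
```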

Plotting the clusters along with the centroid positions would yield the following:

[Figure: the clusters with their final centroid positions]

Now let’s take each centroid and sort its coordinates along the dimensions axis; remember to take the absolute values if you have features that can be negative.

After that, we can get the first centroid’s dimensions with sorted weights (the distance the centroid traveled along each feature axis) and map them to feature names, which gives us the hypothetical feature importances below:
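A sketch of this sorting-and-mapping step; treating the sorted absolute centroid coordinates as the per-cluster feature weights is my reading of the method described above:

```python
# For each centroid, rank dimensions by absolute coordinate value and map
# the sorted weights back to feature names.
feature_names = list(X.columns)
for j, center in enumerate(km.cluster_centers_):
    order = np.argsort(np.abs(center))[::-1]  # most important dimension first
    importances = [(feature_names[i], abs(center[i])) for i in order]
    print(f"Cluster {j}: {importances}")
```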

Since the centroid has moved equal distances along both dimensions, the features are of equal importance. Let’s give the last two samples (belonging to cluster 0) a nudge along the first dimension, repeat the same process, and plot the new centroids:

[Figure: the new centroid positions after nudging cluster 0 along the first dimension]

We now apply the same approach again and extract the feature importances. Notice that cluster 0’s centroid has moved much more along feature 1 than feature 2, so feature 1 has had a higher impact on WCSS minimization.

Approach 2: Unsupervised to Supervised

This approach is model-agnostic, not exclusive to K-Means: we convert the unsupervised clustering problem into a One-vs-All supervised classification problem using an easily interpretable classifier, such as a tree-based model. The steps are as follows:

  1. Change the cluster labels into One-vs-All binary labels for each cluster
  2. Train a classifier to discriminate between each cluster and all other clusters
  3. Extract the feature importances from the model (we will be using sklearn.ensemble.RandomForestClassifier)

[Figure: converting cluster labels into One-vs-All binary labels]

Following step 1, we set the cluster 0 label to 1 and all other cluster labels to 0.

We have now converted the problem into a binary classification problem. What is left is to train a classifier and use its feature_importances_ attribute, implemented in scikit-learn, to get the features with the most discriminatory power between the targeted cluster and all other clusters. We also need to map the importances to their feature names, sorted by weight:
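A minimal sketch of the three steps for one targeted cluster (cluster 0), reusing the toy X and fitted km from earlier; the classifier hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

# One-vs-All for cluster 0: label its points 1, everything else 0.
binary_labels = (km.labels_ == 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, binary_labels)

# Map impurity-based importances back to feature names, sorted by weight.
importances = sorted(
    zip(X.columns, clf.feature_importances_), key=lambda t: t[1], reverse=True
)
print(importances)
```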

One important note is that this approach finds what discriminates the targeted cluster from the others, which is not necessarily inherent to the targeted cluster itself. Furthermore, there are many other, more sophisticated methods to extract the most important features from tree-based models, as well as model-agnostic approaches, which you can try.

Real-Life Application

I have chosen to apply the interpretation techniques to an NLP problem, since we can easily relate to the feature importances (English words). This can be considered a group-based keyword extraction technique, where we aim to cluster similar documents together using K-Means and then apply the techniques above. The dataset I will be using can be found here: Kaggle BBC-News, which presents a classification problem. We will exclude the category column at first (Sport, Business, Politics, Entertainment, and Technology news articles) and use it later as a proof of concept.

The distribution of news categories in the dataset is:

[Figure: distribution of news categories in the dataset]

Since this article is not NLP-specific (Natural Language Processing), I won’t go through NLP-related tasks in depth. So, moving quickly to preprocessing the text to prepare the features: the code below normalizes the words and filters out tokens (words and digits) that carry no discriminatory power and are thus redundant, then calculates the most frequently mentioned words in the whole dataset and plots them (you can skip the code):
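The article’s original preprocessing code is not reproduced here; below is a minimal, assumption-laden sketch of comparable preprocessing, where df["text"] (a hypothetical column name) holds the raw article texts and NLTK’s stop-word list stands in for whatever filtering the original used:

```python
import re
from collections import Counter

from nltk.corpus import stopwords  # assumes nltk.download("stopwords") was run

stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())  # discards digits and punctuation
    return [t for t in tokens if t not in stop_words and len(t) > 2]

df["tokens"] = df["text"].apply(preprocess)

# Most frequently mentioned words across the whole dataset.
word_counts = Counter(token for tokens in df["tokens"] for token in tokens)
print(word_counts.most_common(20))
```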

[Figure: the most frequently mentioned words in the dataset]

Using TF-IDF (a text representation technique), we can convert the categorical variables (words) into numeric representations. We do not need to scale the features here, as TF-IDF normalizes features within its equation, and its output should be used in its raw form. Remember to scale the features if you apply K-Means to a dataset with features of different units or ranges where this difference is not relevant to the problem.
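A minimal sketch of this vectorization step, reusing the hypothetical df from the sketch above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF turns each document into a numeric vector; rows are L2-normalized
# by default, so no extra feature scaling is needed before K-Means.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df["tokens"].str.join(" "))
feature_names = vectorizer.get_feature_names_out()  # the TF-IDF vocabulary
```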

Finally, now comes the step we care about the most. I have wrapped up the methods above in a class that inherits from sklearn.cluster.KMeans and can be used in the same way, with the difference of having a feature_importances_ property added. When initializing the modified class, you will have to provide the feature names to the ordered_feature_names parameter, ordered in the same way as the X parameter of the fit method.

You can find the code here: kmeans-feature-importance. Simply clone it in your favorite CLI, or follow along by accessing the Colab example in the repository:

git clone https://github.com/YousefGh/kmeans-feature-importance.git

Then run the modified KMeans class with K set to the number of news dataset categories, so that we can compare the results later with the actual categories.
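A hedged usage sketch based on the interface described above (the import path assumes the module layout of the cloned repository; adjust it to wherever you place the code):

```python
from kmeans_interp.kmeans_feature_imp import KMeansInterp  # from the cloned repo

kms = KMeansInterp(
    n_clusters=5,                                # one cluster per news category
    ordered_feature_names=list(feature_names),   # same order as the columns of X
    feature_importance_method="wcss_min",
).fit(X_tfidf.toarray())                         # dense input for simplicity
```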

Let’s check if K-Means has produced a cluster distribution similar to the category distribution in the news dataset.

[Figure: K-Means cluster distribution compared with the news category distribution]

That’s a close-enough category-distribution similarity to get us going. Don’t forget to validate that K-Means has produced accurate results by using different internal validation indexes (I won’t go through them, as that would be out of scope). In practice, you will probably not have true labels, so you will need to choose the best K in K-Means if K is not known from the problem’s domain knowledge.

Moving on to interpretation, we can access the feature_importances_ for the second cluster, cluster 1, like this (the example is for WCSS Minimizers):
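A minimal sketch, assuming feature_importances_ maps each cluster label to its (feature, weight) pairs sorted by weight:

```python
# Top terms driving cluster 1 under the WCSS Minimizers method
# (assumes feature_importances_[1] is a weight-sorted list of pairs).
print(kms.feature_importances_[1][:10])
```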

Approaches Comparison

We will compare both the WCSS Minimizers method and the Unsupervised to Supervised problem-conversion method using the feature_importance_method parameter of the KMeansInterp class. The flow will be as follows:

  • Plot the categories distribution for comparison, with unique colors
  • Set the feature_importance_method parameter to wcss_min and plot feature importances
  • Set the feature_importance_method parameter to unsup2sup and plot feature importances
  • Infer the category of each cluster from its most important features

[Figure: the categories distribution with unique colors]

WCSS Minimizers

[Figures: most important features per cluster using WCSS Minimizers]

Unsupervised to Supervised

[Figures: most important features per cluster using Unsupervised to Supervised]

Clustering interpretability becomes crucial when true labels are not available at development time. The absence of labels not only prevents data scientists from directly evaluating clustering validity, given the nature of internal validation indexes, but also obstructs a simple and intuitive explanation of clustering performance to stakeholders. We have presented two possible approaches that tackle this by extracting cluster-based feature importances, which lets us know why the K-Means algorithm has shaped each cluster as it has. The approach extends itself to stakeholder communication, simple and intuitive evaluation, cluster-based keyword extraction in NLP, and a general feature selection technique.

The notebook for this article and the KMeansInterp class, along with a direct usage example on Colab, can be found here. Happy Interpretation!
