Can you explain how k-means clustering works?

By: Bart Baesens, Seppe vanden Broucke

This Q&A first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience”, where we answer a data science or analytics question of 140 characters maximum. Want to submit your own question? Just tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.


You asked: Can you explain how k-means clustering works?

Our answer:

K-means clustering is a non-hierarchical clustering procedure that proceeds through the following steps:

  1. Select K observations as initial cluster centroids (seeds).
  2. Assign each observation to the cluster that has the closest centroid (for example, in the Euclidean sense).
  3. When all observations have been assigned, recalculate the position of each of the K centroids as the mean of the observations assigned to that cluster.
  4. Repeat steps 2 and 3 until the cluster centroids no longer change.
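
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions rather than a reference implementation: it relies on NumPy, the function name `k_means` and its parameters `seed` and `max_iter` are hypothetical, the toy data is made up, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np


def k_means(X, k, seed=0, max_iter=100):
    """Basic k-means on an (n_samples, n_features) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each observation to the closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned observations.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # The labels come from the most recent assignment step.
    return centroids, labels


# Made-up toy data: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = k_means(X, k=2)
print(labels)     # cluster index assigned to each observation
print(centroids)  # final centroid positions
```

Which group ends up with label 0 or 1 depends on the random seeds, so only the grouping itself is meaningful, not the label values.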

A key requirement here is that the number of clusters, K, needs to be specified before the start of the analysis. This choice can be based on expert input or on the result of another (e.g., hierarchical) clustering procedure. Typically, multiple values of K are tried out and the resulting clusters are evaluated in terms of their statistical characteristics and interpretation. It is also advised to try out different seeds to verify the stability of the clustering solution.
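
As a rough sketch of that advice, the snippet below runs a compact k-means for several values of K and several seeds and reports the best and worst within-cluster sum of squares per K. The choice of this statistic is our own assumption (the answer above does not prescribe one), as are the helper name `within_cluster_ss`, the fixed number of iterations, and the synthetic data.

```python
import numpy as np


def within_cluster_ss(X, k, seed, n_iter=50):
    """Run a basic k-means from one random seed; return its within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):  # assumption: a fixed iteration budget instead of a convergence check
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = X[labels == j].mean(axis=0)
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return float(((X - centroids[labels]) ** 2).sum())


# Hypothetical toy data: three groups of 30 points around (0, 0), (5, 5) and (0, 5).
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(center, 0.5, size=(30, 2))
               for center in [(0, 0), (5, 5), (0, 5)]])

results = {}
for k in range(1, 6):
    scores = [within_cluster_ss(X, k, seed) for seed in range(10)]
    results[k] = (min(scores), max(scores))
    print(f"K={k}: best={min(scores):.1f}, worst={max(scores):.1f}")
```

A large gap between the best and worst score for a given K suggests that the solution depends on the seeds, and a K beyond which the best score stops dropping sharply is a reasonable candidate to inspect further. The final choice still rests on whether the resulting clusters can be interpreted.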