Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Clustering with k-means — the party-planner of unlabeled data
You already know how to make predictions when labels exist (kNN, SVM, Naive Bayes). Now imagine you walk into a party where no one told you who’s friends with whom — welcome to unsupervised learning.
We're picking up where you left off: you've seen supervised classifiers (kNN, SVM, Naive Bayes) and built statistical intuition for variance, means, and experimental uncertainty. Clustering with k-means is the most common entry point into unsupervised machine learning — it groups data by similarity without any labels. It's fast, intuitive, and occasionally dramatically wrong in entertaining ways.
What is k-means clustering? (Short, bold answer)
k-means partitions data into k groups by minimizing the within-cluster squared distances to the cluster centroids. Think: pick k table centers at a wedding and seat each guest at the nearest table until everyone’s relatively happy.
Micro explanation
- Centroid = mean of points assigned to a cluster (so yes — your stats background helps: centroid = sample mean).
- Objective: minimize sum of squared Euclidean distances from each point to its cluster centroid.
- Unsupervised: no labels used during training.
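In symbols, the objective from the bullets above is the within-cluster sum of squares, where C_j is the set of points assigned to cluster j and μ_j is its centroid (the sample mean):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```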
How k-means works — step-by-step (algorithm digest)
1. Choose k (number of clusters).
2. Initialize k centroids (randomly or with k-means++).
3. Assign each point to the nearest centroid (Euclidean distance).
4. Update each centroid to be the mean of the points assigned to it.
5. Repeat steps 3–4 until assignments stop changing or a maximum number of iterations is reached.
This is known as Lloyd’s algorithm. It optimizes a non-convex objective, so different initializations can lead to different local minima.
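The steps above fit in a few lines of NumPy. This is a minimal sketch of Lloyd's algorithm (plain random initialization rather than k-means++, so it can land in a bad local minimum), not a replacement for scikit-learn's implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (hence assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```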
Why k-means behaves like the mean (statistics connection)
Because centroids are means. If you remember from statistics that the sample mean minimizes squared error (sum of squared deviations), then you already know why k-means uses the mean to represent a cluster — it’s the best single-point summary under an L2 loss.
This ties neatly back to your probability & stats course: k-means assumes clusters are compact and roughly spherical in the Euclidean sense, and it optimizes a variance-type objective within clusters.
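A quick numeric check of that claim: for a one-dimensional sample, no candidate summary point beats the sample mean under squared-error loss. This is an illustrative sketch, not part of the lesson's main example:

```python
import numpy as np

def sse(data, c):
    # Sum of squared deviations of `data` from a candidate summary point c
    return np.sum((data - c) ** 2)

data = np.array([1.0, 2.0, 4.0, 7.0])
mean = data.mean()  # 3.5

# Scan a grid of candidate points: the minimizer lands at the mean
candidates = np.linspace(0, 8, 801)
best = candidates[np.argmin([sse(data, c) for c in candidates])]
print('sample mean:', mean, '| grid minimizer:', best)
```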
Practical scikit-learn example (copy-paste ready)
```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Make synthetic data
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit k-means
k = 4
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X)

# Evaluate
print('Inertia (sum of squared distances):', km.inertia_)
print('Silhouette score:', silhouette_score(X, labels))

# Quick plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='black', s=100, marker='X')
plt.title(f'k-means (k={k})')
plt.show()
```
Micro notes:
- Use StandardScaler if features are on different scales — k-means is scale-sensitive.
- init='k-means++' is recommended: smarter initialization that reduces the chance of bad local minima.
- inertia_ is the k-means objective (lower is better for a fixed k), but it decreases monotonically as k grows, so don't use it alone to pick k.
Choosing k: elbow, silhouette, and sanity checks
- Elbow method: plot inertia versus k and look for a bend (the ‘elbow’). It’s heuristic and sometimes subtle.
- Silhouette score: ranges from -1 to 1. Higher is better; it measures how similar a point is to its own cluster compared with the nearest other cluster.
- Practical: also inspect cluster sizes, stability across different runs, and whether clusters make domain sense.
Why this matters: k is not discovered by k-means; it’s a hyperparameter you must propose. Use domain knowledge + diagnostics.
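The elbow and silhouette diagnostics can be computed in one loop. A sketch, reusing the same synthetic blobs as the earlier example (so the data truly has 4 clusters):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_                      # always shrinks as k grows
    silhouettes[k] = silhouette_score(X, labels)   # peaks near a 'good' k

# Silhouette gives a concrete suggestion; the elbow in `inertias` is read by eye
best_k = max(silhouettes, key=silhouettes.get)
print('Best k by silhouette:', best_k)
```

Plotting `inertias` versus k shows the elbow; here the silhouette peak and the elbow agree on k = 4, but on real data the two diagnostics can disagree, which is exactly when domain sense matters.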
Strengths, weaknesses, and gotchas
Strengths
- Fast and scalable for large datasets.
- Simple, interpretable centroids.
- Integrates well into pipelines for feature engineering (e.g., cluster id as a feature).
Weaknesses
- Assumes spherical clusters of similar size — not great for elongated or density-varying clusters.
- Sensitive to feature scaling and outliers.
- Requires k up front.
Common mistakes
- Forgetting to scale features (results skewed to large-magnitude features).
- Using k-means on categorical features without embedding/encoding.
- Trusting inertia to choose k blindly.
When to use k-means vs. other methods (practical decisions)
- Use k-means when clusters are roughly spherical, numeric features are scaled, and you want a fast, baseline clustering.
- Choose DBSCAN if you need to find arbitrarily shaped clusters and handle noise; it discovers the number of clusters by density.
- Choose Gaussian Mixture Models when clusters overlap and you want soft (probabilistic) assignments.
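To see that decision in action, here is a sketch contrasting k-means and DBSCAN on two interlocking half-moons, a shape k-means handles poorly. The DBSCAN parameters (eps=0.3, min_samples=5) are illustrative values you would normally tune:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters: non-spherical by construction
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # eps is an assumed value

# Agreement with the true moons (1.0 = perfect, ~0 = no better than chance)
print('k-means ARI:', adjusted_rand_score(y_true, km_labels))
print('DBSCAN  ARI:', adjusted_rand_score(y_true, db_labels))
```

k-means slices the moons with a straight boundary between two centroids, while DBSCAN follows the density and recovers each crescent.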
Pro tip: if you used kNN and SVM earlier to classify, you can now try using labels from clustering as a feature or to perform semi-supervised learning. For instance, cluster IDs can capture coarse structure that a classifier can refine.
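One hedged sketch of that idea: inside a scikit-learn Pipeline, KMeans acts as a transformer whose output is each point's distance to the k centroids, giving a downstream classifier a cluster-aware feature map. The dataset and parameter choices here are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KMeans.transform() maps each sample to its distances from the k centroids,
# so the classifier sees coarse cluster structure instead of raw coordinates.
clf = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=8, n_init=10, random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```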
Real-world uses (so you don’t think this is just academic)
- Customer segmentation (grouping customers by purchasing behavior)
- Image compression (replace every pixel color with one of k palette colors)
- Anomaly detection (tiny clusters or outliers indicate anomalies)
- Preprocessing for supervised models (add cluster ID as a categorical feature)
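As a sketch of the image-compression idea: treat every pixel as a 3-D RGB point, cluster into k colors, and snap each pixel to its centroid color. Since no image ships with this lesson, the "image" below is synthetic pixels drawn around a few dominant colors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Fake 'image': 1000 RGB pixels drawn from three dominant colors plus noise
base_colors = np.array([[220, 40, 40], [40, 220, 40], [40, 40, 220]])
pixels = (base_colors[rng.integers(0, 3, 1000)]
          + rng.normal(0, 10, (1000, 3))).clip(0, 255)

k = 3  # palette size: the whole image gets rebuilt from k colors
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_     # the k representative colors
compressed = palette[km.labels_]  # every pixel replaced by its palette color

print('Unique colors before:', len(np.unique(pixels, axis=0)))
print('Unique colors after:', len(np.unique(compressed, axis=0)))
```

For a real image you would reshape a (height, width, 3) array to (height * width, 3) before clustering, then reshape back.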
Quick checklist before running k-means
- Numeric features? Scale them (StandardScaler or MinMax).
- Try k-means++ initialization and several restarts (n_init > 1).
- Use elbow + silhouette + domain sense to pick k.
- Visualize clusters whenever possible.
- Consider alternatives if clusters are non-spherical or noisy.
Takeaways (the bits you'll tell your future self)
- k-means clusters by minimizing intra-cluster squared distances; centroids are means.
- It’s fast and useful but makes strong geometric assumptions (spherical, equal-size clusters, numeric scales).
- Your stats intuition helps: the centroid is the L2-optimal point — that’s why k-means is linked to variance minimization.
This is the moment where the concept finally clicks: k-means is just repeatedly asking, “who’s nearest to my mean?” until the room stops rearranging itself.
If you want, I can: show a live notebook with elbow/silhouette plots, demonstrate k-means failure cases (elongated clusters, different densities), or walk through using cluster labels as features in a classification pipeline (bridging your kNN/SVM knowledge). Which would help you most next?