Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Clustering with k-means — the party-planner of unlabeled data
You already know how to make predictions when labels exist (kNN, SVM, Naive Bayes). Now imagine you walk into a party where no one told you who’s friends with whom — welcome to unsupervised learning.
We're picking up where you left off: you've seen supervised classifiers (kNN, SVM, Naive Bayes) and built statistical intuition for variance, means, and experimental uncertainty. Clustering with k-means is the most common entry point into unsupervised machine learning — it groups data by similarity without any labels. It's fast, intuitive, and occasionally dramatically wrong in entertaining ways.
What is k-means clustering? (Short, bold answer)
k-means partitions data into k groups by minimizing the within-cluster squared distances to the cluster centroids. Think: pick k table centers at a wedding and seat each guest at the nearest table until everyone’s relatively happy.
Micro explanation
- Centroid = mean of points assigned to a cluster (so yes — your stats background helps: centroid = sample mean).
- Objective: minimize sum of squared Euclidean distances from each point to its cluster centroid.
- Unsupervised: no labels used during training.
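In symbols, the objective from the bullets above is the within-cluster sum of squares, where C_j is the set of points assigned to cluster j and μ_j is its centroid (the sample mean):

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```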
How k-means works — step-by-step (algorithm digest)
1. Choose k (number of clusters).
2. Initialize k centroids (randomly or with k-means++).
3. Assign each point to the nearest centroid (Euclidean distance).
4. Update each centroid to be the mean of the points assigned to it.
5. Repeat steps 3–4 until assignments stop changing or a maximum number of iterations is reached.
This is known as Lloyd’s algorithm. It optimizes a non-convex objective, so different initializations can lead to different local minima.
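The steps above fit in a few lines of NumPy. This is a minimal sketch of Lloyd's algorithm (plain random initialization rather than k-means++, so it can land in a bad local minimum), not a replacement for scikit-learn's implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (hence assignments) settle
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```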
Why k-means behaves like the mean (statistics connection)
Because centroids are means. If you remember from statistics that the sample mean minimizes squared error (sum of squared deviations), then you already know why k-means uses the mean to represent a cluster — it’s the best single-point summary under an L2 loss.
This ties neatly back to your probability & stats course: k-means assumes clusters are compact and roughly spherical in the Euclidean sense, and it optimizes a variance-type objective within clusters.
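A quick numeric check of that claim: for a one-dimensional sample, no candidate summary point beats the sample mean under squared-error loss. This is an illustrative sketch, not part of the lesson's main example:

```python
import numpy as np

def sse(data, c):
    # Sum of squared deviations of `data` from a candidate summary point c
    return np.sum((data - c) ** 2)

data = np.array([1.0, 2.0, 4.0, 7.0])
mean = data.mean()  # 3.5

# Scan a grid of candidate points: the minimizer lands at the mean
candidates = np.linspace(0, 8, 801)
best = candidates[np.argmin([sse(data, c) for c in candidates])]
print('sample mean:', mean, '| grid minimizer:', best)
```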
Practical scikit-learn example (copy-paste ready)
```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Make synthetic data
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit k-means
k = 4
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X)

# Evaluate
print('Inertia (sum of squared distances):', km.inertia_)
print('Silhouette score:', silhouette_score(X, labels))

# Quick plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='black', s=100, marker='X')
plt.title(f'k-means (k={k})')
plt.show()
```
Micro notes:
- Use StandardScaler if features are on different scales — k-means is scale-sensitive.
- init='k-means++' is recommended: smarter initialization that reduces the chance of bad local minima.
- inertia_ is the k-means objective (lower is better for a fixed k), but it decreases monotonically as k grows, so don't use it alone to pick k.
Choosing k: elbow, silhouette, and sanity checks
- Elbow method: plot inertia versus k and look for a bend (the ‘elbow’). It’s heuristic and sometimes subtle.
- Silhouette score: ranges from -1 to 1. Higher is better; it measures how similar a point is to its own cluster compared with the nearest other cluster.
- Practical: also inspect cluster sizes, stability across different runs, and whether clusters make domain sense.
Why this matters: k is not discovered by k-means; it’s a hyperparameter you must propose. Use domain knowledge + diagnostics.
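The elbow and silhouette diagnostics can be computed in one loop. A sketch, reusing the same synthetic blobs as the earlier example (so the data truly has 4 clusters):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_                      # always shrinks as k grows
    silhouettes[k] = silhouette_score(X, labels)   # peaks near a 'good' k

# Silhouette gives a concrete suggestion; the elbow in `inertias` is read by eye
best_k = max(silhouettes, key=silhouettes.get)
print('Best k by silhouette:', best_k)
```

Plotting `inertias` versus k shows the elbow; here the silhouette peak and the elbow agree on k = 4, but on real data the two diagnostics can disagree, which is exactly when domain sense matters.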
Strengths, weaknesses, and gotchas
Strengths
- Fast and scalable for large datasets.
- Simple, interpretable centroids.
- Integrates well into pipelines for feature engineering (e.g., cluster id as a feature).
Weaknesses
- Assumes spherical clusters of similar size — not great for elongated or density-varying clusters.
- Sensitive to feature scaling and outliers.
- Requires k up front.
Common mistakes
- Forgetting to scale features (results skewed to large-magnitude features).
- Using k-means on categorical features without embedding/encoding.
- Trusting inertia to choose k blindly.
When to use k-means vs. other methods (practical decisions)
- Use k-means when clusters are roughly spherical, numeric features are scaled, and you want a fast, baseline clustering.
- Choose DBSCAN if you need to find arbitrarily shaped clusters and handle noise; it discovers the number of clusters by density.
- Choose Gaussian Mixture Models when clusters overlap and you want soft (probabilistic) assignments.
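To see that decision in action, here is a sketch contrasting k-means and DBSCAN on two interlocking half-moons, a shape k-means handles poorly. The DBSCAN parameters (eps=0.3, min_samples=5) are illustrative values you would normally tune:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped clusters: non-spherical by construction
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # eps is an assumed value

# Agreement with the true moons (1.0 = perfect, ~0 = no better than chance)
print('k-means ARI:', adjusted_rand_score(y_true, km_labels))
print('DBSCAN  ARI:', adjusted_rand_score(y_true, db_labels))
```

k-means slices the moons with a straight boundary between two centroids, while DBSCAN follows the density and recovers each crescent.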
Pro tip: if you used kNN and SVM earlier to classify, you can now try using labels from clustering as a feature or to perform semi-supervised learning. For instance, cluster IDs can capture coarse structure that a classifier can refine.
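One hedged sketch of that idea: inside a scikit-learn Pipeline, KMeans acts as a transformer whose output is each point's distance to the k centroids, giving a downstream classifier a cluster-aware feature map. The dataset and parameter choices here are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KMeans.transform() maps each sample to its distances from the k centroids,
# so the classifier sees coarse cluster structure instead of raw coordinates.
clf = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=8, n_init=10, random_state=0),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))
```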
Real-world uses (so you don’t think this is just academic)
- Customer segmentation (grouping customers by purchasing behavior)
- Image compression (replace every pixel color with one of k palette colors)
- Anomaly detection (tiny clusters or outliers indicate anomalies)
- Preprocessing for supervised models (add cluster ID as a categorical feature)
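As a sketch of the image-compression idea: treat every pixel as a 3-D RGB point, cluster into k colors, and snap each pixel to its centroid color. Since no image ships with this lesson, the "image" below is synthetic pixels drawn around a few dominant colors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Fake 'image': 1000 RGB pixels drawn from three dominant colors plus noise
base_colors = np.array([[220, 40, 40], [40, 220, 40], [40, 40, 220]])
pixels = (base_colors[rng.integers(0, 3, 1000)]
          + rng.normal(0, 10, (1000, 3))).clip(0, 255)

k = 3  # palette size: the whole image gets rebuilt from k colors
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_     # the k representative colors
compressed = palette[km.labels_]  # every pixel replaced by its palette color

print('Unique colors before:', len(np.unique(pixels, axis=0)))
print('Unique colors after:', len(np.unique(compressed, axis=0)))
```

For a real image you would reshape a (height, width, 3) array to (height * width, 3) before clustering, then reshape back.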
Quick checklist before running k-means
- Numeric features? Scale them (StandardScaler or MinMax).
- Try k-means++ initialization and several restarts (n_init > 1).
- Use elbow + silhouette + domain sense to pick k.
- Visualize clusters whenever possible.
- Consider alternatives if clusters are non-spherical or noisy.
Takeaways (the bits you'll tell your future self)
- k-means clusters by minimizing intra-cluster squared distances; centroids are means.
- It’s fast and useful but makes strong geometric assumptions (spherical, equal-size clusters, numeric scales).
- Your stats intuition helps: the centroid is the L2-optimal point — that’s why k-means is linked to variance minimization.
This is the moment where the concept finally clicks: k-means is just repeatedly asking, “who’s nearest to my mean?” until the room stops rearranging itself.
If you want, I can: show a live notebook with elbow/silhouette plots, demonstrate k-means failure cases (elongated clusters, different densities), or walk through using cluster labels as features in a classification pipeline (bridging your kNN/SVM knowledge). Which would help you most next?