
Python for Data Science, AI & Development
Machine Learning with scikit-learn

Clustering with k-means in scikit-learn: Practical Guide

Clustering with k-means — the party-planner of unlabeled data

You already know how to make predictions when labels exist (kNN, SVM, Naive Bayes). Now imagine you walk into a party where no one told you who’s friends with whom — welcome to unsupervised learning.

We're picking up where you left off: you've seen supervised classifiers (kNN, SVM, Naive Bayes) and built statistical intuition for variance, means, and experimental uncertainty. Clustering with k-means is the most common entry point into unsupervised machine learning — it groups data by similarity without any labels. It's fast, intuitive, and occasionally dramatically wrong in entertaining ways.


What is k-means clustering? (Short, bold answer)

k-means partitions data into k groups by minimizing the within-cluster squared distances to the cluster centroids. Think: pick k table centers at a wedding and seat each guest at the nearest table until everyone’s relatively happy.

Micro explanation

  • Centroid = mean of points assigned to a cluster (so yes — your stats background helps: centroid = sample mean).
  • Objective: minimize sum of squared Euclidean distances from each point to its cluster centroid.
  • Unsupervised: no labels used during training.

How k-means works — step-by-step (algorithm digest)

  1. Choose k (number of clusters).
  2. Initialize k centroids (randomly or with k-means++).
  3. Assign each point to the nearest centroid (Euclidean distance).
  4. Update each centroid to be the mean of points assigned to it.
  5. Repeat steps 3–4 until assignments stop changing or max iterations reached.

This is known as Lloyd’s algorithm. It optimizes a non-convex objective, so different initializations can lead to different local minima.
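To make those five steps concrete, here is a minimal plain-NumPy sketch of Lloyd's algorithm — random initialization rather than the k-means++ default scikit-learn uses, and no restarts, so treat it as a teaching toy:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: random init, Euclidean assignment, mean update."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize k centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids (and hence assignments) stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On well-separated data this converges in a handful of iterations; in practice you would reach for sklearn.cluster.KMeans, which adds k-means++ initialization and multiple restarts on top of exactly this loop.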


Why k-means behaves like the mean (statistics connection)

Because centroids are means. If you remember from statistics that the sample mean minimizes squared error (sum of squared deviations), then you already know why k-means uses the mean to represent a cluster — it’s the best single-point summary under an L2 loss.

This ties neatly back to your probability & stats course: k-means assumes clusters are compact and roughly spherical in the Euclidean sense, and it optimizes a variance-type objective within clusters.
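That claim is easy to check numerically: grid-search candidate centers for a toy sample, and the minimizer of the squared-error objective lands on the sample mean.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])

def sse(c):
    """Sum of squared deviations of x around a candidate center c."""
    return float(np.sum((x - c) ** 2))

# Scan candidate centers on a fine grid; round away float grid noise
candidates = np.linspace(0, 10, 1001)
best = round(candidates[np.argmin([sse(c) for c in candidates])], 6)
print(best, x.mean())  # both are 4.0: the SSE minimizer is the sample mean
```

This is exactly why k-means' update step uses the mean: within each cluster it is the L2-optimal single-point summary.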


Practical scikit-learn example (copy-paste ready)

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Make synthetic data
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X = StandardScaler().fit_transform(X)

# Fit k-means
k = 4
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X)

# Evaluate
print('Inertia (sum of squared distances):', km.inertia_)
print('Silhouette score:', silhouette_score(X, labels))

# Quick plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='black', s=100, marker='X')
plt.title(f'k-means (k={k})')
plt.show()
```

Micro notes:

  • Use StandardScaler if features are on different scales — k-means is scale-sensitive.
  • init='k-means++' is recommended: smarter initialization to reduce bad local minima.
  • inertia_ is the k-means objective (lower is better for fixed k), but it decreases with k so don’t use it alone to pick k.

Choosing k: elbow, silhouette, and sanity checks

  • Elbow method: plot inertia versus k and look for a bend (the ‘elbow’). It’s a heuristic, and the bend is sometimes subtle.
  • Silhouette score: ranges from -1 to 1; higher is better. It measures how similar a point is to its own cluster versus the next-best cluster.
  • Practical: also inspect cluster sizes, stability across runs, and whether the clusters make domain sense.

Why this matters: k is not discovered by k-means; it’s a hyperparameter you must propose. Use domain knowledge + diagnostics.
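Here is one possible diagnostic scan on synthetic blobs (so the "true" k, 4, is known in advance): inertia falls as k grows no matter what, while the silhouette score typically peaks near the true k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known number of clusters (4)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

# Scan candidate k values: inertia always decreases, silhouette peaks
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={km.inertia_:8.2f}  silhouette={sil:.3f}")
```

Read the two columns together: the elbow in inertia and the silhouette peak should point at the same neighborhood of k, and domain knowledge breaks any remaining tie.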


Strengths, weaknesses, and gotchas

Strengths

  • Fast and scalable for large datasets.
  • Simple, interpretable centroids.
  • Integrates well into pipelines for feature engineering (e.g., cluster id as a feature).

Weaknesses

  • Assumes spherical clusters of similar size — not great for elongated or density-varying clusters.
  • Sensitive to feature scaling and outliers.
  • Requires k up front.

Common mistakes

  • Forgetting to scale features (results skewed to large-magnitude features).
  • Using k-means on categorical features without embedding/encoding.
  • Trusting inertia to choose k blindly.

When to use k-means vs. other methods (practical decisions)

  • Use k-means when clusters are roughly spherical, numeric features are scaled, and you want a fast, baseline clustering.
  • Choose DBSCAN if you need to find arbitrarily shaped clusters and handle noise; it discovers the number of clusters by density.
  • Choose Gaussian Mixture Models when clusters overlap and you want soft (probabilistic) assignments.
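To see the trade-off concretely, here is a small sketch on scikit-learn's make_moons data, where the two clusters are crescents; the eps and min_samples values are hand-tuned for this toy set, not defaults to copy blindly:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a shape k-means' spherical assumption can't handle
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN follows the density and recovers both moons;
# k-means slices the plane with a straight boundary instead
print("k-means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```

The adjusted Rand index compares each clustering against the true moon membership; DBSCAN scores near 1 here while k-means does not, which is the elongated-cluster failure mode in action.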

Pro tip: if you used kNN and SVM earlier to classify, you can now try using labels from clustering as a feature or to perform semi-supervised learning. For instance, cluster IDs can capture coarse structure that a classifier can refine.
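A minimal sketch of that idea on synthetic blob data with a logistic-regression classifier — the add_cluster_feature helper is illustrative, not a scikit-learn API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit k-means on the training features only (no labels, no test leakage)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)

def add_cluster_feature(X):
    # Append the one-hot cluster assignment as extra feature columns
    ids = km.predict(X)
    return np.hstack([X, np.eye(km.n_clusters)[ids]])

clf = LogisticRegression(max_iter=1000).fit(add_cluster_feature(X_train), y_train)
print("test accuracy:", clf.score(add_cluster_feature(X_test), y_test))
```

On clean blobs the extra columns are redundant, but on messier real data the cluster IDs can hand a linear classifier coarse nonlinear structure it could not carve out on its own.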


Real-world uses (so you don’t think this is just academic)

  • Customer segmentation (market baskets into groups)
  • Image compression (k-means colors = palette of k colors)
  • Anomaly detection (tiny clusters or outliers indicate anomalies)
  • Preprocessing for supervised models (add cluster ID as a categorical feature)
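The image-compression idea in miniature, using a synthetic array of noisy RGB pixels in place of a real image:

```python
import numpy as np
from sklearn.cluster import KMeans

# A toy "image": 5000 pixels drawn from four base colors plus channel noise
rng = np.random.default_rng(0)
base = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]], dtype=float)
pixels = base[rng.integers(0, 4, 5000)] + rng.normal(0, 10, (5000, 3))
pixels = np.clip(pixels, 0, 255)

# Compress to a k-color palette: each pixel is replaced by its centroid color
k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_          # shape (4, 3): four RGB palette colors
compressed = palette[km.labels_]       # every pixel snapped to its palette color
print("mean abs reconstruction error:", np.abs(pixels - compressed).mean())
```

Storing one palette index per pixel plus k palette colors is far smaller than three channels per pixel, and the centroids-are-means property keeps the reconstruction error low.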

Quick checklist before running k-means

  • Numeric features? Scale them (StandardScaler or MinMax).
  • Try k-means++ initialization and multiple n_init.
  • Use elbow + silhouette + domain sense to pick k.
  • Visualize clusters whenever possible.
  • Consider alternatives if clusters are non-spherical or noisy.

Takeaways (the bits you'll tell your future self)

  • k-means clusters by minimizing intra-cluster squared distances; centroids are means.
  • It’s fast and useful but makes strong geometric assumptions (spherical, equal-size clusters, numeric scales).
  • Your stats intuition helps: the centroid is the L2-optimal point — that’s why k-means is linked to variance minimization.

This is the moment where the concept finally clicks: k-means is just repeatedly asking, “who’s nearest to my mean?” until the room stops rearranging itself.


Good next steps: plot elbow and silhouette curves in a notebook, probe k-means failure cases (elongated clusters, varying densities), and try using cluster labels as features in a classification pipeline, bridging your kNN/SVM knowledge.
