Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Scaling and Normalization
Scaling and Normalization — Making Features Play Nicely Together
You already dealt with missing values (imputation) and smacked down the outliers. Nice. Now imagine you’ve invited features to a party: some show up wearing stilts, others are toddlers. Scaling is the bouncer who makes everyone appear the same height so they stop stealing the limelight.
"Scaling doesn't make features better — it makes them comparable."
Why scaling/normalization matters (quick reminder)
- Many ML algorithms assume features are on similar scales: KNN, K-means, SVM, logistic regression with gradient descent, neural networks, PCA. If one feature ranges 0–1 and another 0–1,000,000, the latter dominates distances and gradients.
- Tree-based models (random forest, XGBoost) are mostly scale-invariant — they split on thresholds, so scaling rarely changes performance.
- Always scale after imputation and outlier handling, and remember: fit only on training data to avoid data leakage.
Common scalers and when to use them
1) Standardization (Z-score)
What: subtract mean and divide by standard deviation → mean≈0, std≈1.
Use when: features are roughly Gaussian or you want zero-centered data. Good for algorithms assuming normality or using dot-products (SVMs, logistic regression, neural nets).
sklearn: StandardScaler
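A minimal sketch of standardization on a toy column (illustrative values only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column: mean 3, population std sqrt(2)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - mean) / std, column-wise

print(X_scaled.ravel())  # roughly symmetric around 0
```

Note that StandardScaler uses the population standard deviation, so the transformed column has mean 0 and std 1 exactly.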
2) Min–Max Scaling (Normalization)
What: scales to [0,1] (or other range) using (x - min) / (max - min).
Use when: you need bounded features (e.g., image pixels), or algorithms that require positive inputs. Be careful with outliers — they compress the remaining data.
sklearn: MinMaxScaler
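The outlier-compression effect is easy to see on a toy column (illustrative values only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One outlier (1000) compresses the rest of the column
X = np.array([[10.0], [20.0], [30.0], [1000.0]])

X_scaled = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)

print(X_scaled.ravel())  # outlier lands at 1.0, the rest squeeze near 0
```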
3) Robust Scaling
What: subtract median and divide by IQR (interquartile range).
Use when: your data contains outliers you already detected but prefer a method resilient to them. Great if outliers weren't removed but you don't want them to dominate.
sklearn: RobustScaler
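A quick sketch showing how the median/IQR keep the typical points well-behaved even with an extreme outlier (toy data):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Median = 3, IQR = 4 - 2 = 2; the outlier barely affects those statistics
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR

print(X_scaled.ravel())  # typical points stay in a small range; outlier stays large
```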
4) Max-Abs Scaling
What: divides by maximum absolute value, scales to [-1,1]. Preserves sparsity.
Use when: data is sparse (e.g., TF-IDF). Useful for linear models working with sparse matrices.
sklearn: MaxAbsScaler
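A small sketch on a sparse matrix; note that MaxAbsScaler never centers the data, so zeros stay zero and sparsity is preserved:

```python
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

# Max absolute values per column are 4 and 2
X = csr_matrix([[0.0, 2.0], [-4.0, 0.0], [0.0, 1.0]])

X_scaled = MaxAbsScaler().fit_transform(X)  # divide each column by max|x|

print(X_scaled.toarray())  # values in [-1, 1], zeros untouched
```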
5) Unit Vector Scaling (Normalizer)
What: scales each sample (row) to unit norm (L1 or L2). Not feature-wise; it operates on rows.
Use when: you care about direction in feature space (cosine similarity), e.g., text embeddings.
sklearn: Normalizer
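A two-row sketch makes the "direction, not length" idea concrete:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two rows pointing in the same direction but with different lengths
X = np.array([[3.0, 4.0], [30.0, 40.0]])

X_norm = Normalizer(norm='l2').fit_transform(X)  # each row scaled to unit L2 norm

print(X_norm)  # both rows become [0.6, 0.8]
```

After normalization the two rows are identical, so cosine similarity sees them as the same sample.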
6) Power & Quantile Transforms
PowerTransformer (Box-Cox or Yeo-Johnson): makes distributions more Gaussian. Useful before scaling if features are skewed.
QuantileTransformer: maps data to a uniform or normal distribution using ranks; robust to outliers and nonlinear.
Use when: you need to stabilize variance or make feature distributions approximately Gaussian (helpful for models sensitive to normality).
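A sketch of de-skewing with PowerTransformer on synthetic lognormal data (the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.lognormal(size=(1000, 1))  # heavily right-skewed feature

pt = PowerTransformer(method='yeo-johnson')  # also standardizes output by default
X_t = pt.fit_transform(X)

print(f"skew before: {skew(X.ravel()):.2f}, after: {skew(X_t.ravel()):.2f}")
```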
Quick analogies (so it sticks)
- Min–Max: "Squeeze into the same onesie." Everyone ends at the same endpoints.
- StandardScaler: "Center on the stage and adjust volume." Mean zero, comparable amplitude.
- RobustScaler: "Ignore the loudest friend and talk about the group median." Outlier-resistant.
- Normalizer: "Normalize each sentence to its direction — we care about wording pattern, not length."
Practical: Implementing scalers with sklearn and pandas
Assume you have a DataFrame df whose numerical columns (numeric_cols) and categorical columns (cat_cols) have already been imputed and cleaned.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_cols = ['age', 'income', 'num_items']
cat_cols = ['country', 'device']

numeric_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

# Then wrap the preprocessor and model in one pipeline
model = Pipeline([
    ('preproc', preprocessor),
    ('clf', LogisticRegression())
])

# X_train, y_train come from your earlier train/test split
model.fit(X_train, y_train)
Key points:
- Fit scalers only on training data (Pipeline/ColumnTransformer ensures that during cross-validation).
- Use appropriate scaler for sparse data (MaxAbsScaler or avoid dense conversions).
Pitfalls & best practices
- Data leakage: never fit a scaler on the full dataset. Always fit on train and transform test/validation.
- Order matters: impute missing values and handle outliers before scaling. For example, if you replace missing with median, do that first, then scale.
- Polynomials & interaction terms: if you create polynomial features, scale after generating them (or use a pipeline step to do polynomial feature generation then scaling). Otherwise, feature magnitudes explode.
- Interpretability: scaling changes feature units. Keep track (save your scaler) and use inverse_transform for interpreting coefficients or predictions.
- Sparse features: many scalers convert data to dense arrays. Use MaxAbsScaler or sparse-aware implementations when working with large sparse matrices.
- Tree-based models: usually don't require scaling, but if you plan to use the same pipeline for many models (some requiring scaling), include it conditionally.
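The first and fourth pitfalls above can be sketched in a few lines: fit on the training split only, reuse those statistics everywhere else, and keep the fitted scaler around so inverse_transform can recover original units (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only
X_test_s = scaler.transform(X_test)        # reuse training mean/std, never refit

# inverse_transform recovers original units for interpretation
X_back = scaler.inverse_transform(X_test_s)
```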
Example: Avoiding leakage in cross-validation
Bad: fit scaler on entire dataset, then CV → inflated performance.
Good: include scaler inside Pipeline. Scikit-learn's cross_val_score will call fit on the training fold only, preventing leakage.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())])
cross_val_score(pipeline, X, y, cv=5)  # scaler is refit inside each training fold
When scaling won't help much
- Decision trees and ensembles of trees (unless you combine them with algorithms that do care about scale).
- When a feature's absolute scale is semantically meaningful and you want to preserve it; scale only if your model requires it.
Quick decision cheat-sheet
- Skewed numeric features → consider PowerTransformer or log transform then StandardScaler.
- Outliers present → RobustScaler.
- Sparse features → MaxAbsScaler.
- Need bounded 0–1 → MinMaxScaler (watch out for outliers).
- Row-wise similarity (cosine) → Normalizer.
Closing: Summary & takeaways
- Scaling makes feature magnitudes comparable; it's essential for distance-based and gradient-based algorithms.
- Fit scalers only on training data and include them inside Pipelines to prevent leakage.
- Choose the scaler to match your data: use RobustScaler for outliers, StandardScaler for roughly normal data, MinMax for bounded ranges, MaxAbs for sparse data, and Power/Quantile transforms for heavy skew.
- Keep a saved scaler for inverse_transform so your results remain interpretable.
"Scale wisely. Fit on training. Transform everywhere else. Then go build something that actually amazes people."
Suggested next steps in this course
- Apply StandardScaler vs RobustScaler to a dataset you cleaned earlier (after imputation & outlier steps) and compare model performance for KNN and RandomForest.
- Practice building ColumnTransformer pipelines combining numeric scalers and OneHotEncoder for categorical features.