Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Scaling and Normalization
Scaling and Normalization — Making Features Play Nicely Together
You already dealt with missing values (imputation) and smacked down the outliers. Nice. Now imagine you’ve invited features to a party: some show up wearing stilts, others are toddlers. Scaling is the bouncer who makes everyone appear the same height so they stop stealing the limelight.
"Scaling doesn't make features better — it makes them comparable."
Why scaling/normalization matters (quick reminder)
- Many ML algorithms assume features are on similar scales: KNN, K-means, SVM, logistic regression with gradient descent, neural networks, PCA. If one feature ranges 0–1 and another 0–1,000,000, the latter dominates distances and gradients.
- Tree-based models (random forest, XGBoost) are mostly scale-invariant — they split on thresholds, so scaling rarely changes performance.
- Always scale after imputation and outlier handling, and remember: fit only on training data to avoid data leakage.
Common scalers and when to use them
1) Standardization (Z-score)
What: subtract mean and divide by standard deviation → mean≈0, std≈1.
Use when: features are roughly Gaussian or you want zero-centered data. Good for algorithms assuming normality or using dot-products (SVMs, logistic regression, neural nets).
sklearn: StandardScaler
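A minimal sketch of standardization on a toy column (illustrative values only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column: mean 3, population std sqrt(2)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - mean) / std, column-wise

print(X_scaled.ravel())  # roughly symmetric around 0
```

Note that StandardScaler uses the population standard deviation, so the transformed column has mean 0 and std 1 exactly.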
2) Min–Max Scaling (Normalization)
What: scales to [0,1] (or other range) using (x - min) / (max - min).
Use when: you need bounded features (e.g., image pixels), or algorithms that require positive inputs. Be careful with outliers — they compress the remaining data.
sklearn: MinMaxScaler
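The outlier-compression effect is easy to see on a toy column (illustrative values only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One outlier (1000) compresses the rest of the column
X = np.array([[10.0], [20.0], [30.0], [1000.0]])

X_scaled = MinMaxScaler().fit_transform(X)  # (x - min) / (max - min)

print(X_scaled.ravel())  # outlier lands at 1.0, the rest squeeze near 0
```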
3) Robust Scaling
What: subtract median and divide by IQR (interquartile range).
Use when: your data contains outliers you already detected but prefer a method resilient to them. Great if outliers weren't removed but you don't want them to dominate.
sklearn: RobustScaler
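A quick sketch showing how the median/IQR keep the typical points well-behaved even with an extreme outlier (toy data):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Median = 3, IQR = 4 - 2 = 2; the outlier barely affects those statistics
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR

print(X_scaled.ravel())  # typical points stay in a small range; outlier stays large
```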
4) Max-Abs Scaling
What: divides by maximum absolute value, scales to [-1,1]. Preserves sparsity.
Use when: data is sparse (e.g., TF-IDF). Useful for linear models working with sparse matrices.
sklearn: MaxAbsScaler
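A small sketch on a sparse matrix; note that MaxAbsScaler never centers the data, so zeros stay zero and sparsity is preserved:

```python
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

# Max absolute values per column are 4 and 2
X = csr_matrix([[0.0, 2.0], [-4.0, 0.0], [0.0, 1.0]])

X_scaled = MaxAbsScaler().fit_transform(X)  # divide each column by max|x|

print(X_scaled.toarray())  # values in [-1, 1], zeros untouched
```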
5) Unit Vector Scaling (Normalizer)
What: scales each sample (row) to unit norm (L1 or L2). Not feature-wise; it operates on rows.
Use when: you care about direction in feature space (cosine similarity), e.g., text embeddings.
sklearn: Normalizer
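A two-row sketch makes the "direction, not length" idea concrete:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two rows pointing in the same direction but with different lengths
X = np.array([[3.0, 4.0], [30.0, 40.0]])

X_norm = Normalizer(norm='l2').fit_transform(X)  # each row scaled to unit L2 norm

print(X_norm)  # both rows become [0.6, 0.8]
```

After normalization the two rows are identical, so cosine similarity sees them as the same sample.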
6) Power & Quantile Transforms
PowerTransformer (Box-Cox or Yeo-Johnson): makes distributions more Gaussian. Useful before scaling if features are skewed.
QuantileTransformer: maps data to a uniform or normal distribution using ranks; robust to outliers and nonlinear.
Use when: you need to stabilize variance or make feature distributions approximately Gaussian (helpful for models sensitive to normality).
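A sketch of de-skewing with PowerTransformer on synthetic lognormal data (the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.lognormal(size=(1000, 1))  # heavily right-skewed feature

pt = PowerTransformer(method='yeo-johnson')  # also standardizes output by default
X_t = pt.fit_transform(X)

print(f"skew before: {skew(X.ravel()):.2f}, after: {skew(X_t.ravel()):.2f}")
```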
Quick analogies (so it sticks)
- Min–Max: "Squeeze into the same onesie." Everyone ends at the same endpoints.
- StandardScaler: "Center on the stage and adjust volume." Mean zero, comparable amplitude.
- RobustScaler: "Ignore the loudest friend and talk about the group median." Outlier-resistant.
- Normalizer: "Normalize each sentence to its direction — we care about wording pattern, not length."
Practical: Implementing scalers with sklearn and pandas
Assume you have a DataFrame df whose numerical columns (numeric_cols) and categorical columns (cat_cols) have already been imputed and cleaned.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric_cols = ['age', 'income', 'num_items']
cat_cols = ['country', 'device']

numeric_pipeline = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

# Then wrap the preprocessor and model in one pipeline
model = Pipeline([
    ('preproc', preprocessor),
    ('clf', LogisticRegression())
])

# X_train, y_train come from your earlier train/test split
model.fit(X_train, y_train)
Key points:
- Fit scalers only on training data (Pipeline/ColumnTransformer ensures that during cross-validation).
- Use appropriate scaler for sparse data (MaxAbsScaler or avoid dense conversions).
Pitfalls & best practices
- Data leakage: never fit a scaler on the full dataset. Always fit on train and transform test/validation.
- Order matters: impute missing values and handle outliers before scaling. For example, if you replace missing with median, do that first, then scale.
- Polynomials & interaction terms: if you create polynomial features, scale after generating them (or use a pipeline step to do polynomial feature generation then scaling). Otherwise, feature magnitudes explode.
- Interpretability: scaling changes feature units. Keep track (save your scaler) and use inverse_transform for interpreting coefficients or predictions.
- Sparse features: many scalers convert data to dense arrays. Use MaxAbsScaler or sparse-aware implementations when working with large sparse matrices.
- Tree-based models: usually don't require scaling, but if you plan to use the same pipeline for many models (some requiring scaling), include it conditionally.
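The first and fourth pitfalls above can be sketched in a few lines: fit on the training split only, reuse those statistics everywhere else, and keep the fitted scaler around so inverse_transform can recover original units (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on train only
X_test_s = scaler.transform(X_test)        # reuse training mean/std, never refit

# inverse_transform recovers original units for interpretation
X_back = scaler.inverse_transform(X_test_s)
```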
Example: Avoiding leakage in cross-validation
Bad: fit scaler on entire dataset, then CV → inflated performance.
Good: include scaler inside Pipeline. Scikit-learn's cross_val_score will call fit on the training fold only, preventing leakage.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())])
cross_val_score(pipeline, X, y, cv=5)  # scaler is refit inside each training fold
When scaling won't help much
- Decision trees and ensembles of trees (unless you combine them with algorithms that do care about scale).
- When a feature's absolute scale is semantically meaningful and you want to preserve it; scale only if your model requires it.
Quick decision cheat-sheet
- Skewed numeric features → consider PowerTransformer or log transform then StandardScaler.
- Outliers present → RobustScaler.
- Sparse features → MaxAbsScaler.
- Need bounded 0–1 → MinMaxScaler (watch out for outliers).
- Row-wise similarity (cosine) → Normalizer.
Closing: Summary & takeaways
- Scaling makes feature magnitudes comparable; it's essential for distance-based and gradient-based algorithms.
- Fit scalers only on training data and include them inside Pipelines to prevent leakage.
- Choose the scaler to match your data: use RobustScaler for outliers, StandardScaler for roughly normal data, MinMax for bounded ranges, MaxAbs for sparse data, and Power/Quantile transforms for heavy skew.
- Keep a saved scaler for inverse_transform so your results remain interpretable.
"Scale wisely. Fit on training. Transform everywhere else. Then go build something that actually amazes people."
Suggested next steps in this course
- Apply StandardScaler vs RobustScaler to a dataset you cleaned earlier (after imputation & outlier steps) and compare model performance for KNN and RandomForest.
- Practice building ColumnTransformer pipelines combining numeric scalers and OneHotEncoder for categorical features.