Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Feature Interactions and Polynomials — When Features Date and Have Nonlinear Babies
"If a model could gossip, it would tell you: ‘Don't underestimate chemistry.’"
You already know how to clean data with pandas, bin continuous variables, and encode categories. Now we’re going to play matchmaker: we make features interact, let them multiply, square, and generally evolve into higher-order relationships that let models learn nonlinear effects without switching to a black-box neural net. This is Feature Interactions and Polynomial Features — the polite (or chaotic) way to capture relationships that aren’t strictly additive.
What this is, succinctly
- Feature interaction: create a new feature that is the product (or other combination) of two or more features — e.g., area × bedrooms to capture how extra bedrooms matter more in larger homes.
- Polynomial features: include powers like x^2, x^3, or cross-terms to let linear models represent curved relationships.
Why care? Because simple linear sums assume each feature acts independently. Real life rarely cooperates. Interactions let pairs/triples of features have their own effect.
When to use interactions and polynomials
- You suspect non-linear relationships (price accelerating with size).
- Domain knowledge suggests synergy (dose × enzyme concentration, marketing spend × seasonality).
- You want to keep a linear model but capture curvature.
When not to use: when you have many features but few samples (the curse of dimensionality), or when interpretability and parsimony matter more than the extra fit from additional terms.
Quick examples with pandas and scikit-learn
Imagine a small housing dataset you loaded and cleaned with pandas (yes, keep that DataFrame hygiene from earlier lessons):
```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# toy df
df = pd.DataFrame({
    'sqft': [800, 1200, 1500, 2000],
    'bedrooms': [1, 2, 3, 4],
    'age': [10, 5, 20, 2]
})

# manual interaction
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# polynomial (simple) by hand
df['sqft_sq'] = df['sqft'] ** 2
print(df)
```
Or use sklearn to generate all polynomial and interaction terms up to degree 2:
```python
poly = PolynomialFeatures(degree=2, include_bias=False)
X = df[['sqft', 'bedrooms', 'age']]
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(['sqft', 'bedrooms', 'age']))
```
The output will be nine features: sqft, bedrooms, age, sqft^2, sqft·bedrooms, sqft·age, bedrooms^2, bedrooms·age, and age^2.
Categorical × Numeric interactions (reference: encoding categories)
You learned encoding categorical variables earlier. Interactions between encoded dummies and numeric features are gold:
```python
# suppose 'neighborhood' was one-hot encoded earlier
df = pd.get_dummies(df.assign(neighborhood=['A', 'B', 'A', 'B']),
                    columns=['neighborhood'])

# multiply numeric by a dummy to get neighborhood-specific slopes
df['sqft_x_neigh_A'] = df['sqft'] * df['neighborhood_A']
```
Better: use sklearn's ColumnTransformer + Pipeline to keep this clean in a modeling workflow.
Practical pipeline: scaling → polynomial → regularize
Why scaling? Polynomial features blow up magnitudes and can cause numerical instability or multicollinearity. Centering (subtract mean) reduces correlation between x and x^2.
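A quick sketch of that centering claim, using the toy sqft numbers from above and nothing but NumPy: compare how strongly x correlates with x^2 before and after subtracting the mean.

```python
import numpy as np

x = np.array([800.0, 1200.0, 1500.0, 2000.0])  # toy sqft values

# raw: x and x^2 move almost in lockstep
raw_corr = np.corrcoef(x, x ** 2)[0, 1]

# centered: subtract the mean first, then square
xc = x - x.mean()
centered_corr = np.corrcoef(xc, xc ** 2)[0, 1]

print(f'raw:      {raw_corr:.3f}')       # close to 1
print(f'centered: {centered_corr:.3f}')  # much smaller in magnitude
```

The raw correlation is nearly 1, which is exactly the multicollinearity that destabilizes linear-model coefficients; after centering it drops sharply.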
Example pipeline:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.compose import ColumnTransformer

num_cols = ['sqft', 'age']
pre = ColumnTransformer([
    ('num', Pipeline([('scaler', StandardScaler()),
                      ('poly', PolynomialFeatures(degree=2, include_bias=False))]),
     num_cols),
    ('pass', 'passthrough', ['bedrooms'])
])

model = Pipeline([('pre', pre), ('ridge', Ridge(alpha=1.0))])

X = df[['sqft', 'bedrooms', 'age']]
y = [100_000, 180_000, 240_000, 330_000]  # toy sale prices
model.fit(X, y)
```
Ridge or Lasso help tame coefficients when polynomial/interactions inflate model complexity.
Pitfalls & how to avoid them
- Combinatorial explosion: degree=3 on 20 features? Danger. Use domain knowledge to choose candidate interactions, or use interaction_only=True.
- Multicollinearity: x and x^2 correlate. Center features or regularize (Ridge, ElasticNet).
- Overfitting: validate with cross-validation. Use Lasso or tree-based models (which learn interactions implicitly) for selection.
- Interpretability: interactions complicate coefficient stories. Use partial dependence plots or SHAP for model explanations.
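To put numbers on the combinatorial explosion, the feature counts can be computed directly with a standard counting formula; this sketch (function names are mine) matches what PolynomialFeatures with include_bias=False would generate:

```python
from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Count monomials of degree 1..degree in n_features variables,
    i.e. the columns PolynomialFeatures(include_bias=False) produces."""
    return comb(n_features + degree, degree) - 1

def n_interaction_features(n_features: int, degree: int) -> int:
    """Cross-terms only (interaction_only=True): products of
    1..degree *distinct* features, no pure powers."""
    return sum(comb(n_features, k) for k in range(1, degree + 1))

print(n_poly_features(3, 2))          # 9, as in the sqft/bedrooms/age example
print(n_poly_features(20, 3))         # 1770 -- the "danger" case above
print(n_interaction_features(20, 3))  # 1350 -- smaller, but still a lot
```

Three features at degree 2 is harmless; twenty features at degree 3 is over a thousand columns, which is why candidate selection matters.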
How to pick which interactions to try (practical checklist)
- Start with domain knowledge — physics, economics, human intuition.
- Visualize: scatter plots colored by categories; residual plots vs features.
- Test few candidate interactions in CV and compare metric.
- If exploring many, use automatic selection: Lasso, forward selection, or tree ensembles to rank interactions.
Short comparison table
| Method | Good for | Downsides |
|---|---|---|
| Manual interactions (pandas) | Few, interpretable combinations | Labor & error-prone if many |
| PolynomialFeatures (sklearn) | Auto-generate many combos | Explosion in feature count |
| Tree-based models | Learn interactions automatically | Harder to interpret; might need more data |
Quick heuristics (thumb rules)
- If n_samples is small vs features, avoid high-degree polynomials.
- Center numeric features before raising to powers.
- Use interaction_only=True to limit to cross-terms if you don't need pure powers.
- Regularize aggressively if you add many terms.
Final, memorable insight
Think of original features as actors. Polynomial features let an actor deliver soliloquies (x^2), while interactions stage a duet where chemistry matters (x*y). A great script (domain knowledge + careful selection + regularization) keeps the play engaging instead of turning it into an expensive, incoherent Broadway flop.
Key takeaways
- Interactions capture synergy between features; polynomials capture curvature.
- Build interactions manually with pandas for targeted combos or use sklearn's PolynomialFeatures for broader coverage.
- Mitigate multicollinearity by centering/scaling and use regularization.
- Validate interactions via cross-validation and prefer parsimony: fewer, meaningful terms win.
Go forth and pair your features wisely — but don't forget to test whether their romance actually improves your model.