Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Linear and Logistic Regression — scikit-learn Practical Guide
You're already comfortable with regression/classification metrics and the stats that make them meaningful. Now let's turn that intuition into models you can fit, interpret, and actually trust.
Why these two matter (fast)
- Linear Regression predicts a continuous outcome (house price, temperature). It's the classic: fit a line and measure how well it hugs the data (remember R², RMSE from Regression Metrics?).
- Logistic Regression predicts probabilities for a binary or multiclass outcome (spam/not spam). It gives you calibrated probabilities that feed right into metrics like ROC AUC and precision/recall (you've seen these in Classification Metrics).
Both are the foundation: simple, interpretable, and surprisingly powerful when used correctly.
Quick conceptual refresher (stat-tinged)
- Linear regression models the conditional mean: Y = Xβ + ε, so E[Y | X] = Xβ. Think: "How does the average outcome shift when I change this feature by one unit?" — this ties directly back to statistical intuition about estimators and uncertainty.
- Logistic regression models the log-odds: logit(P(Y=1|X)) = Xβ. Coefficients change the log-odds, which you can exponentiate to get odds ratios: exp(β) — a friendlier effect size.
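The logit link above is easy to sanity-check numerically: the sigmoid (inverse logit) turns log-odds back into probabilities. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Inverse logit: maps log-odds z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds of 0 corresponds to even odds, i.e. probability 0.5
print(sigmoid(0.0))

# A coefficient of 0.7 shifts the odds by exp(0.7) ≈ 2.01 per unit of X
print(np.exp(0.7))
```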
If you liked hypothesis testing and confidence intervals in the Stats module, note: scikit-learn focuses on prediction and regularization, not p-values. For inferential p-values use statsmodels. But scikit-learn gives robust tools for predictive modeling and cross-validated regularization.
Practical scikit-learn recipes (with code snippets)
1) Linear Regression (predict a continuous value)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)) # regularized linear model
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)  # RMSE = sqrt(MSE); works across sklearn versions
print('R2:', r2_score(y_test, y_pred))
Tips:
- Use Ridge/Lasso instead of plain LinearRegression when multicollinearity or overfitting is a concern.
- Standardize features when applying regularization (it makes coefficients comparable).
- To get coefficient estimates, access pipe.named_steps['ridge'].coef_.
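That last tip works because make_pipeline names each step after its lowercased class name. A minimal sketch (feature names here are just illustrative labels):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=42)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)

# make_pipeline auto-names steps: 'standardscaler', 'ridge'
coefs = pipe.named_steps['ridge'].coef_
for i, c in enumerate(coefs):
    print(f'feature_{i}: {c:.2f}')   # coefficients on the standardized feature scale
```

Note the coefficients are on the standardized scale (per 1-SD change in each feature), since the scaler runs before the model.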
2) Logistic Regression (class probabilities and decision boundary)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, penalty='l2', solver='liblinear'))
clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:,1]
print('ROC AUC:', roc_auc_score(y_test, y_prob))
print(classification_report(y_test, clf.predict(X_test)))
Notes:
- LogisticRegression in scikit-learn uses L2 regularization by default (penalty='l2'). The parameter C is the inverse regularization strength (smaller C → stronger penalty).
- For multiclass, scikit-learn uses one-vs-rest or softmax (multinomial) depending on the multi_class and solver settings.
- Coefficients are interpretable as log-odds effects. Convert to odds ratios with np.exp(coef).
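Pulling the two notes together, here is a sketch of extracting odds ratios from a fitted pipeline (synthetic data; remember that with a scaler in front, each ratio is per 1-SD change in a feature):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

log_odds = clf.named_steps['logisticregression'].coef_.ravel()
odds_ratios = np.exp(log_odds)   # multiplicative effect on the odds per 1-SD change
print(np.round(odds_ratios, 2))
```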
Interpreting coefficients — short, human-friendly
- Linear: β_j ≈ change in Y for a one-unit increase in X_j (holding others constant). If features are standardized, β magnitudes reflect relative importance.
- Logistic: β_j ≈ change in log-odds; exp(β_j) = multiplicative change in odds. Example: exp(0.7) ≈ 2.01 → the odds roughly double.
Remember: correlation ≠ causation — coefficients reflect associations given the data and model assumptions.
Practical gotchas & remedies
- Scaling: Always scale before Ridge/Lasso/Logistic! (StandardScaler in a pipeline is your friend.)
- Multicollinearity: Inflates coefficient variance. Use Ridge or drop/recombine features. Use VIF from statsmodels if inference matters.
- Class imbalance: For logistic, consider
class_weight='balanced', resampling, or use precision-recall metrics rather than accuracy. - Calibration: Probabilities can be miscalibrated. Use
CalibratedClassifierCVif you need trustworthy probabilities for decision-making. - Overfitting: Use cross-validation for alpha/C. grid-search with pipelines.
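That last remedy — cross-validating C with a grid search over a pipeline — looks like this in sketch form (the grid values and step names 'scale'/'clf' are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Pipeline parameters are addressed as '<step name>__<param name>'
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1.0, 10.0]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)
print('best C:', grid.best_params_['clf__C'])
print('best CV ROC AUC:', round(grid.best_score_, 3))
```

Because scaling happens inside the pipeline, each CV fold fits the scaler on its own training split — no leakage from the validation fold.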
Connecting this to metrics & statistics you already know
- Regression metrics (RMSE, MAE, R²) tell you how well your linear model predicts continuous targets — check them after model selection and cross-validation.
- Classification metrics (accuracy, precision, recall, ROC AUC) evaluate logistic models; when classes are imbalanced prefer ROC AUC and precision-recall curves.
- From the stats module: the notion of uncertainty (standard errors, CI) applies — scikit-learn doesn't give p-values, but bootstrapping or statsmodels can help if you need inference.
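One way to get that uncertainty without leaving the scikit-learn/numpy world is a simple bootstrap over rows — a sketch on synthetic data (500 resamples is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=3, noise=10, random_state=1)
rng = np.random.default_rng(1)

boot_coefs = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

boot_coefs = np.array(boot_coefs)
lo, hi = np.percentile(boot_coefs[:, 0], [2.5, 97.5])  # 95% CI for the first coefficient
print(f'95% CI for beta_0: [{lo:.1f}, {hi:.1f}]')
```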
Quick checklist before you ship a model
- Feature scaling (yes/no?) — yes for regularization.
- Cross-validate hyperparameters (alpha / C).
- Check residuals (linear) for heteroscedasticity / nonlinearity.
- Check calibration & discriminative metrics (logistic): ROC AUC, precision/recall.
- Interpret coefficients sensibly; convert to odds ratios for logistic.
- If you need inference (CI/p-values), use statsmodels or bootstrap.
Key takeaways (stick this on your forehead)
- Linear = predict numbers, interpret β as unit-change; use Ridge/Lasso to tame overfitting.
- Logistic = predict probabilities, interpret β via log-odds → odds ratios; regularization helps generalization.
- Always scale, cross-validate, and match your evaluation metrics to the problem (regression vs classification). Use your stats intuition about uncertainty when making decisions.
"Simple models + rigorous evaluation beat complicated models + wishful thinking."
Now go fit a model, check your metrics from the previous lessons, and come back with the weirdest coefficient you found. I promise we'll roast it together (kindly).