Machine Learning with scikit-learn
Build, tune, and evaluate models using scikit-learn pipelines with reproducible ML workflows.
Linear and Logistic Regression — scikit-learn Practical Guide
You're already comfortable with regression/classification metrics and the stats that make them meaningful. Now let's turn that intuition into models you can fit, interpret, and actually trust.
Why these two matter (fast)
- Linear Regression predicts a continuous outcome (house price, temperature). It's the classic: fit a line and measure how well it hugs the data (remember R², RMSE from Regression Metrics?).
- Logistic Regression predicts probabilities for a binary or multiclass outcome (spam/not spam). It gives you calibrated probabilities that feed right into metrics like ROC AUC and precision/recall (you've seen these in Classification Metrics).
Both are the foundation: simple, interpretable, and surprisingly powerful when used correctly.
Quick conceptual refresher (stat-tinged)
- Linear regression models the conditional mean: Y = Xβ + ε, so E[Y | X] = Xβ. Think: "How does the average outcome shift when I change this feature by one unit?" — this ties directly back to statistical intuition about estimators and uncertainty.
- Logistic regression models the log-odds: logit(P(Y=1|X)) = Xβ. Coefficients change the log-odds, which you can exponentiate to get odds ratios: exp(β) — a friendlier effect size.
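The logit link above is easy to sanity-check numerically: the sigmoid (inverse logit) turns log-odds back into probabilities. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Inverse logit: maps log-odds z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Log-odds of 0 corresponds to even odds, i.e. probability 0.5
print(sigmoid(0.0))

# A coefficient of 0.7 shifts the odds by exp(0.7) ≈ 2.01 per unit of X
print(np.exp(0.7))
```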
If you liked hypothesis testing and confidence intervals in the Stats module, note: scikit-learn focuses on prediction and regularization, not p-values. For inferential p-values use statsmodels. But scikit-learn gives robust tools for predictive modeling and cross-validated regularization.
Practical scikit-learn recipes (with code snippets)
1) Linear Regression (predict a continuous value)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
X, y = make_regression(n_samples=500, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)) # regularized linear model
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)  # RMSE = sqrt(MSE); works across sklearn versions
print('R2:', r2_score(y_test, y_pred))
Tips:
- Use Ridge/Lasso instead of plain LinearRegression when multicollinearity or overfitting is a concern.
- Standardize features when applying regularization (it makes coefficients comparable).
- To get coefficient estimates, access pipe.named_steps['ridge'].coef_.
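That last tip works because make_pipeline names each step after its lowercased class name. A minimal sketch (feature names here are just illustrative labels):

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=5, noise=20, random_state=42)
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X, y)

# make_pipeline auto-names steps: 'standardscaler', 'ridge'
coefs = pipe.named_steps['ridge'].coef_
for i, c in enumerate(coefs):
    print(f'feature_{i}: {c:.2f}')   # coefficients on the standardized feature scale
```

Note the coefficients are on the standardized scale (per 1-SD change in each feature), since the scaler runs before the model.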
2) Logistic Regression (class probabilities and decision boundary)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, penalty='l2', solver='liblinear'))
clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)[:,1]
print('ROC AUC:', roc_auc_score(y_test, y_prob))
print(classification_report(y_test, clf.predict(X_test)))
Notes:
- LogisticRegression in scikit-learn uses L2 regularization by default (penalty='l2'). The parameter C is the inverse regularization strength (smaller C → stronger penalty).
- For multiclass, scikit-learn uses one-vs-rest or softmax (multinomial) depending on the multi_class and solver settings.
- Coefficients are interpretable as log-odds effects. Convert to odds ratios with np.exp(coef).
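Pulling the two notes together, here is a sketch of extracting odds ratios from a fitted pipeline (synthetic data; remember that with a scaler in front, each ratio is per 1-SD change in a feature):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, random_state=42)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

log_odds = clf.named_steps['logisticregression'].coef_.ravel()
odds_ratios = np.exp(log_odds)   # multiplicative effect on the odds per 1-SD change
print(np.round(odds_ratios, 2))
```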
Interpreting coefficients — short, human-friendly
- Linear: β_j ≈ change in Y for a one-unit increase in X_j (holding others constant). If features are standardized, β magnitudes reflect relative importance.
- Logistic: β_j ≈ change in log-odds; exp(β_j) = multiplicative change in odds. Example: exp(0.7) ≈ 2.01 → the odds roughly double.
Remember: correlation ≠ causation — coefficients reflect associations given the data and model assumptions.
Practical gotchas & remedies
- Scaling: Always scale before Ridge/Lasso/Logistic! (StandardScaler in a pipeline is your friend.)
- Multicollinearity: Inflates coefficient variance. Use Ridge or drop/recombine features. Use VIF from statsmodels if inference matters.
- Class imbalance: For logistic, consider
class_weight='balanced', resampling, or use precision-recall metrics rather than accuracy. - Calibration: Probabilities can be miscalibrated. Use
CalibratedClassifierCVif you need trustworthy probabilities for decision-making. - Overfitting: Use cross-validation for alpha/C. grid-search with pipelines.
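That last remedy — cross-validating C with a grid search over a pipeline — looks like this in sketch form (the grid values and step names 'scale'/'clf' are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Pipeline parameters are addressed as '<step name>__<param name>'
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1.0, 10.0]},
                    scoring='roc_auc', cv=5)
grid.fit(X, y)
print('best C:', grid.best_params_['clf__C'])
print('best CV ROC AUC:', round(grid.best_score_, 3))
```

Because scaling happens inside the pipeline, each CV fold fits the scaler on its own training split — no leakage from the validation fold.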
Connecting this to metrics & statistics you already know
- Regression metrics (RMSE, MAE, R²) tell you how well your linear model predicts continuous targets — check them after model selection and cross-validation.
- Classification metrics (accuracy, precision, recall, ROC AUC) evaluate logistic models; when classes are imbalanced prefer ROC AUC and precision-recall curves.
- From the stats module: the notion of uncertainty (standard errors, CI) applies — scikit-learn doesn't give p-values, but bootstrapping or statsmodels can help if you need inference.
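One way to get that uncertainty without leaving the scikit-learn/numpy world is a simple bootstrap over rows — a sketch on synthetic data (500 resamples is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=3, noise=10, random_state=1)
rng = np.random.default_rng(1)

boot_coefs = []
for _ in range(500):
    idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
    boot_coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)

boot_coefs = np.array(boot_coefs)
lo, hi = np.percentile(boot_coefs[:, 0], [2.5, 97.5])  # 95% CI for the first coefficient
print(f'95% CI for beta_0: [{lo:.1f}, {hi:.1f}]')
```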
Quick checklist before you ship a model
- Feature scaling (yes/no?) — yes for regularization.
- Cross-validate hyperparameters (alpha / C).
- Check residuals (linear) for heteroscedasticity / nonlinearity.
- Check calibration & discriminative metrics (logistic): ROC AUC, precision/recall.
- Interpret coefficients sensibly; convert to odds ratios for logistic.
- If you need inference (CI/p-values), use statsmodels or bootstrap.
Key takeaways (stick this on your forehead)
- Linear = predict numbers, interpret β as unit-change; use Ridge/Lasso to tame overfitting.
- Logistic = predict probabilities, interpret β via log-odds → odds ratios; regularization helps generalization.
- Always scale, cross-validate, and match your evaluation metrics to the problem (regression vs classification). Use your stats intuition about uncertainty when making decisions.
"Simple models + rigorous evaluation beat complicated models + wishful thinking."
Now go fit a model, check your metrics from the previous lessons, and come back with the weirdest coefficient you found. I promise we'll roast it together (kindly).