Statistics and Probability for Data Science
Develop statistical intuition for inference, experimentation, and uncertainty-aware decisions.
Regression Fundamentals — The Best Line That Tells a Story
You already learned how covariance and correlation whisper about relationships between variables. Now regression barges in and yells, "Hold my beer — I can predict that."
Regression is where statistics stops gossiping and starts making forecasts. If correlation told you that ice cream sales and temperature move together, regression gives you the actual equation to predict sales from temperature (and maybe blame for your sunburn).
What regression actually is
Regression estimates a functional relationship between an outcome variable (dependent variable, y) and one or more predictors (independent variables, X). The simplest and most common is linear regression:
y = beta_0 + beta_1 * x + epsilon
- beta_0 is the intercept
- beta_1 is the slope (effect size)
- epsilon is the residual or noise
Think of linear regression as finding the best straight line through a cloud of points — best in the least-squares sense, meaning it minimizes the sum of squared residuals.
Why this matters for data science
- Prediction: estimate future values or missing values
- Inference: quantify relationships and test hypotheses
- Feature understanding: which variables move y the most?
This builds on what you saw with correlation and covariance: correlation measures strength and direction, but regression gives a quantitative relationship and allows prediction.
Quick analogy: Regression as a barista
Imagine y is your coffee level and x is the size of your order. Correlation tells you whether bigger orders mean more coffee. Regression gives the recipe: how many ounces per order size, plus the starting base pour. Residuals are underfilled cups and bad barista days.
Linear regression details (the stuff you actually need)
Estimating coefficients
Least squares solution for simple linear regression has closed form:
beta_1 = covariance(x, y) / variance(x)
beta_0 = mean(y) - beta_1 * mean(x)
So yes, the correlation and covariance you already learned directly show up here.
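The closed-form estimates above can be checked in a few lines of NumPy. This is a minimal sketch on synthetic data (the true intercept 2 and slope 3 are illustrative choices, not from any real dataset):

```python
import numpy as np

# synthetic data with a known true line: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 100)

# closed-form least-squares estimates from covariance and variance
# (ddof=1 in both so the sample statistics match)
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = y.mean() - beta_1 * x.mean()

# sanity check against numpy's own least-squares fit
slope_check, intercept_check = np.polyfit(x, y, 1)
print(beta_0, beta_1)
```

The covariance/variance formula and `np.polyfit` agree to floating-point precision, which is the point: the least-squares line is built directly from the quantities you met in the covariance lesson.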
Interpreting coefficients
- beta_1 = expected change in y per unit increase in x; in multiple regression, this means holding the other predictors constant
- Sign tells direction; magnitude shows strength
Goodness of fit
- R-squared: proportion of variance in y explained by the model
- Adjusted R-squared: penalizes adding useless predictors
Remember: high R-squared is not magically correct. It can be high for the wrong reasons (overfitting, nonlinearity, or leakage).
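To make R-squared concrete, here is a sketch that computes it by hand from the residuals (synthetic data; the specific slope and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)

# fit, then compute R^2 = 1 - SS_res / SS_tot by hand
beta_1, beta_0 = np.polyfit(x, y, 1)   # polyfit returns highest degree first
y_hat = beta_0 + beta_1 * x
ss_res = np.sum((y - y_hat) ** 2)       # variance the model fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)    # total variance around the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

For simple linear regression, this R-squared equals the squared correlation between x and y, which is exactly the "correlation shows up in regression" connection from the previous lesson.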
Assumptions of ordinary least squares (OLS)
To trust coefficient estimates and standard errors, the usual assumptions are:
- Linearity: relationship between X and y is linear
- Independence: observations are independent
- Homoscedasticity: residuals have constant variance
- Normality: residuals are approximately normal (for inference)
- No perfect multicollinearity: predictors are not exact linear combos
If these sound like a checklist from a math cult, that is because they are. Violations mean different tools or diagnostics are needed.
When assumptions break
- Nonlinear patterns -> try polynomial regression, splines, or tree-based models
- Heteroscedasticity -> use weighted least squares or robust standard errors
- Non-normal residuals -> bootstrapping for inference
- Multicollinearity -> check VIF, remove or combine correlated features
- The nonparametric tests you studied earlier apply when parametric assumptions fail; in the same spirit, nonparametric or robust regression steps in when the OLS assumptions do not hold
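As one concrete fallback from the list above, here is a minimal sketch of bootstrapping the slope when residuals are non-normal. The data are synthetic (heavy-tailed t-distributed noise stands in for "non-normal residuals"), and the resample count is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 150)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=150)  # heavy-tailed noise

# bootstrap the slope: resample (x, y) pairs with replacement, refit each time
n = len(x)
boot_slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    slope, _ = np.polyfit(x[idx], y[idx], 1)
    boot_slopes.append(slope)

# percentile confidence interval that does not rely on normal residuals
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(lo, hi)
```

The spread of `boot_slopes` gives you uncertainty for beta_1 without leaning on the normality assumption that the t-distributed noise just violated.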
Diagnostics and visualization (builds on Data Visualization and Storytelling)
You covered Matplotlib, Seaborn, and Plotly earlier. Use them here to avoid disaster:
- Scatterplot with regression line to check linearity
- Residuals vs fitted values to check homoscedasticity
- Q-Q plot of residuals to check normality
- Leverage and Cook's distance to find influential points
Quote for the ages:
"If your residual plot looks like a banana, your model is wrong."
Example: draw a scatterplot with Seaborn lmplot or regplot for quick checks, then inspect residuals with plt.scatter(fitted, residuals).
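As a minimal sketch of that residuals-vs-fitted check (synthetic data; the filename and variable names are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# residuals vs fitted: a healthy plot is a flat, structureless band around zero
fig, ax = plt.subplots()
ax.scatter(fitted, residuals)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
fig.savefig("residuals_vs_fitted.png")
```

If this plot fans out (funnel shape) you are looking at heteroscedasticity; if it curves (the banana), the linearity assumption is in trouble.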
Multivariable regression and multicollinearity
When you have multiple predictors, the model is:
y = beta_0 + beta_1 x1 + beta_2 x2 + ... + epsilon
Interpretation is conditional: beta_1 is the effect of x1 holding x2, x3 constant.
Multicollinearity occurs when predictors correlate strongly with each other. It inflates standard errors and makes coefficients unstable.
Quick fix checklist:
- Compute Variance Inflation Factor (VIF)
- Drop or combine correlated features
- Use regularization (Ridge, Lasso)
Overfitting, underfitting, and the bias-variance tradeoff
- Underfitting: model too simple, high bias
- Overfitting: model too complex, high variance
Use cross-validation to estimate out-of-sample performance. Regularization methods (Ridge, Lasso) are your friends when you have many noisy predictors.
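A minimal sketch of that advice, using synthetic data rigged so that only one of many predictors matters (the dimensions and the Ridge penalty `alpha=10.0` are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# many noisy predictors relative to the sample size; only the first matters
rng = np.random.default_rng(5)
n, p = 80, 50
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=3.0, size=n)

# 5-fold cross-validated R^2: plain OLS overfits, Ridge shrinks the noise
ols_r2 = cross_val_score(LinearRegression(), X, y, cv=5).mean()
ridge_r2 = cross_val_score(Ridge(alpha=10.0), X, y, cv=5).mean()
print(ols_r2, ridge_r2)
```

With 50 predictors and only 80 observations, OLS fits the training folds almost perfectly and generalizes badly, while the regularized model trades a little bias for much lower variance.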
Quick Python example (scikit-learn)
# simple linear regression with scikit-learn on synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# toy data so the snippet runs on its own: y = 2 + 3x + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
# evaluate on held-out data
r2 = model.score(X_test, y_test)
print('R-squared on test set:', r2)
For detailed inference (p-values, standard errors), use statsmodels OLS which provides a full summary table.
When to use regression vs other tools
- Use regression when you want interpretable relationships and decent predictive performance for roughly linear relationships.
- Use tree-based models or neural nets for complex nonlinear patterns or interactions that are hard to specify.
- Use nonparametric or robust regression when assumptions fail — this connects to the nonparametric tests topic you already covered.
Key takeaways
- Regression moves from correlation to prediction and quantification: it gives you an equation, not just a wink.
- Always visualize residuals and diagnostics — your plots will catch problems before your metrics lie to you.
- Check assumptions and use alternatives when they are violated: robust methods, nonparametric approaches, or regularized models.
- Interpret coefficients as conditional effects and be wary of multicollinearity.
Final memorable image:
Think of regression as hiring a valet to park your data in a straight line. Sometimes the valet is perfect; sometimes the car is a clown car (outliers), and sometimes the parking lot is curved (nonlinearity). Your job is to inspect, test, and pick the right valet or tool.