Statistics and Probability for Data Science
Develop statistical intuition for inference, experimentation, and uncertainty-aware decisions.
Probability Distributions — The Data Scientist’s Map of Uncertainty
"If descriptive statistics are the snapshot, probability distributions are the movie script of how the data could behave."
From the Descriptive Statistics module, you already know how to compute a mean, median, and standard deviation. You’ve also learned to show uncertainty with visual tools in Data Visualization and Storytelling (remember shaded confidence bands, error bars, and exporting publication-ready figures?). Now let's connect those skills to the backbone of probabilistic thinking: probability distributions.
What is a probability distribution? (Short, practical definition)
- Probability distribution: a function that tells you how probability mass or density is assigned across possible outcomes.
- Why it matters: it lets you answer questions like "What’s the chance of getting at least 10 successes?" or "How likely is an observation to be 2 standard deviations above the mean?" — foundational when you communicate uncertainty.
Quick taxonomy
- Discrete distributions (e.g., Binomial, Poisson): probabilities for countable outcomes.
- Continuous distributions (e.g., Normal, Exponential): densities; probabilities are areas under the curve.
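The discrete/continuous split can be checked directly in code. This is a minimal sketch (using SciPy, with arbitrary illustrative parameters): a PMF's probabilities sum to 1, while a PDF's probability comes from area under the curve.

```python
import numpy as np
from scipy.stats import binom, norm

# Discrete (Binomial): a PMF assigns a probability to each countable outcome,
# and those probabilities sum to 1.
n, p = 10, 0.5
pmf_total = binom.pmf(np.arange(n + 1), n, p).sum()

# Continuous (Normal): a PDF is a density, so probability is area under the curve.
# Approximate the area under the standard normal density over (-4, 4):
x = np.linspace(-4, 4, 10_001)
dx = x[1] - x[0]
area = (norm.pdf(x) * dx).sum()  # very close to 1

print(round(pmf_total, 6), round(area, 4))
```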
Why data scientists care (real-world reasons)
- Predictive models often assume an error distribution (e.g., normality of residuals). If assumptions break, inference and predictions mislead.
- Choosing a likelihood (distribution) is central to Bayesian modeling and simulation.
- Simulating data for experiments, A/B tests, or uncertainty quantification requires sampling from appropriate distributions.
Imagine you're a weather modeler: descriptive stats give you yesterday’s average temperature; distributions let you say "there's a 30% chance of rain between 3–6pm" and plot that shaded probability so stakeholders actually pay attention.
Core distributions you will meet (and when to use them)
1) Normal (Gaussian)
- Use when outcomes cluster around a mean with symmetric variability.
- Characterized by mean μ and variance σ².
- PDF (intuitively): bell-shaped curve; area under the curve between two points = probability.
When to use: residuals in many regression models, measurement errors.
2) Binomial
- Discrete: number of successes in n independent trials with success prob p.
- Use for yes/no repeated experiments (A/B tests, click/no-click, defective/non-defective).
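The "at least 10 successes" question from the opening can be answered with the Binomial survival function. The n and p below are hypothetical, just to make the sketch concrete:

```python
from scipy.stats import binom

# Hypothetical A/B-style setup: n = 50 independent trials, success prob p = 0.15
n, p = 50, 0.15

# P(X >= 10) = 1 - P(X <= 9); binom.sf(9, ...) is the survival function
p_at_least_10 = binom.sf(9, n, p)
print(round(p_at_least_10, 4))
```

Using `sf` instead of `1 - cdf` avoids loss of precision when the tail probability is tiny.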
3) Poisson
- Discrete: counts of rare events per fixed interval (calls per hour, arrivals at a queue).
- Parameter λ = expected count per interval.
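A quick Poisson sketch (with a made-up rate of 4 calls per hour) shows how λ drives everything, including the fact that mean and variance are both λ:

```python
from scipy.stats import poisson

lam = 4  # hypothetical: 4 support calls expected per hour

p_exactly_4 = poisson.pmf(4, lam)   # about 0.195
p_more_than_8 = poisson.sf(8, lam)  # P(X > 8), a busy hour

# For a Poisson, mean and variance are both lambda
mean, var = poisson.stats(lam, moments='mv')
print(round(p_exactly_4, 4), round(p_more_than_8, 4), float(mean), float(var))
```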
4) Exponential
- Continuous: time between independent events (memoryless property).
- Use for modeling waiting times (e.g., time until next failure).
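Memorylessness is easy to verify numerically: having already waited s hours does not change the distribution of the remaining wait. A sketch with an assumed rate of 0.5 events per hour:

```python
from scipy.stats import expon

# Hypothetical rate lambda = 0.5 per hour; SciPy parameterizes by scale = 1/lambda
scale = 2.0
s, t = 3.0, 1.5

# Memorylessness: P(X > s + t | X > s) == P(X > t)
lhs = expon.sf(s + t, scale=scale) / expon.sf(s, scale=scale)
rhs = expon.sf(t, scale=scale)
print(abs(lhs - rhs) < 1e-12)  # True
```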
5) Uniform
- All outcomes in an interval are equally likely. Useful for priors and baseline simulations.
(There are many more — Student’s t, Beta, Gamma — each with its role; t is robust for small-sample inference, Beta is great for probabilities between 0 and 1.)
From descriptive stats to distributions: an example
Recall how you computed mean and variance. Now imagine you have sample mean x̄ = 50 and sample std s = 5. If the data are roughly symmetric, you might model the population with a Normal(μ=50, σ=5). That immediately lets you compute probabilities and plot uncertainty.
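For instance, with that Normal(μ=50, σ=5) model in hand, the questions from the opening (an observation 2 standard deviations above the mean, an observation within 1 standard deviation) become one-liners:

```python
from scipy.stats import norm

mu, sigma = 50, 5  # sample estimates from the text, used as model parameters

# Chance an observation exceeds 60, i.e. lies 2 standard deviations above the mean
p_above_60 = norm.sf(60, mu, sigma)  # about 0.0228

# Chance of landing within one standard deviation of the mean
p_within_1sd = norm.cdf(55, mu, sigma) - norm.cdf(45, mu, sigma)  # about 0.683

print(round(p_above_60, 4), round(p_within_1sd, 3))
```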
Python snippet — visualize PDF and CDF (build on your plotting skills)
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Normal example: mean 50, standard deviation 5 (from the descriptive stats above)
mu, sigma = 50, 5
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
pdf = norm.pdf(x, mu, sigma)
cdf = norm.cdf(x, mu, sigma)

plt.figure(figsize=(10, 4))

# Left panel: PDF with a shaded region showing P(55 < X < 60)
plt.subplot(1, 2, 1)
plt.plot(x, pdf, label='PDF')
plt.fill_between(x, pdf, where=(x > 55) & (x < 60), color='C0', alpha=0.3,
                 label='P(55<x<60)')
plt.legend(); plt.title('Normal PDF')

# Right panel: CDF; the dashed line marks x = 55
plt.subplot(1, 2, 2)
plt.plot(x, cdf, label='CDF')
plt.axvline(55, color='C1', linestyle='--'); plt.title('Normal CDF')

plt.tight_layout()
plt.savefig('normal_distribution.png', dpi=150)
```
Notes: use the PDF to visualize density (great for publication plots). Use the CDF to answer direct probability questions like P(X <= x). Don’t forget plt.savefig when exporting figures for reports — you’ve done that before.
Practice interpretation: shaded areas are your confidence friends
When you plot a normal curve and shade the area between two points, you are literally showing probability mass (or, in frequentist terms, how likely a randomly drawn observation is to land there). This is the same visual language you used when communicating uncertainty — shaded confidence bands, posterior predictive intervals, etc.
Tip: Annotate the shaded area with the numeric probability (e.g., 0.27) so viewers who skip reading the caption still get the message.
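One way to follow that tip, sketched with Matplotlib's `annotate` (the region 55–60 and the label position are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 50, 5
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
lo, hi = 55, 60

# Compute the exact probability mass of the shaded region
prob = norm.cdf(hi, mu, sigma) - norm.cdf(lo, mu, sigma)

fig, ax = plt.subplots()
ax.plot(x, norm.pdf(x, mu, sigma))
mask = (x > lo) & (x < hi)
ax.fill_between(x[mask], norm.pdf(x[mask], mu, sigma), alpha=0.3)

# Put the number on the plot so it survives caption-skipping readers
ax.annotate(f'P = {prob:.2f}', xy=(57.5, norm.pdf(57.5, mu, sigma) / 2),
            ha='center')
fig.savefig('annotated_shaded_area.png', dpi=150)
```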
When assumptions break — quick checks and alternatives
- Normality fails with heavy tails → consider Student’s t.
- Count data with overdispersion (variance > mean) → prefer Negative Binomial over Poisson.
- Probabilities bounded [0,1] with lots of zeros/ones → consider Beta or zero-inflated models.
Always visualize residuals (histogram + Q-Q plot) and compare the empirical distribution to the fitted theoretical one; your visualization skills will spare you a lot of statistical grief.
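A minimal sketch of that check, using simulated heavy-tailed "residuals" (a Student's t with 3 degrees of freedom stands in for real model residuals here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical residuals: heavy-tailed data a Normal model would mis-describe
residuals = rng.standard_t(df=3, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with the fitted Normal density overlaid
ax1.hist(residuals, bins=30, density=True, alpha=0.6)
mu, sigma = residuals.mean(), residuals.std(ddof=1)
x = np.linspace(residuals.min(), residuals.max(), 200)
ax1.plot(x, stats.norm.pdf(x, mu, sigma))
ax1.set_title('Residual histogram vs fitted Normal')

# Q-Q plot against the Normal: heavy tails bend away from the line at the ends
stats.probplot(residuals, dist='norm', plot=ax2)

fig.tight_layout()
fig.savefig('residual_checks.png', dpi=150)
```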
Quick glossary (micro explanations)
- PDF: Probability Density Function — density for continuous variables; area = probability.
- PMF: Probability Mass Function — probabilities for discrete outcomes.
- CDF: Cumulative Distribution Function — probability X ≤ x.
- Parameter: number like μ, σ, λ that defines the distribution.
Closing: key takeaways (what you should remember)
- Distributions are models of uncertainty. They let you quantify and visualize how likely outcomes are.
- Match distribution to problem type: counts → Poisson/Binomial, symmetric continuous data → Normal, waiting times → Exponential.
- Use plots to communicate probabilities. Shade areas, add numerical annotations, and export clean figures (you’ve practiced that).
- Check assumptions visually and statistically. When they fail, pick alternatives rather than forcing a Normal where it doesn’t belong.
"Probability distributions turn vague intuition into numbers and pictures people can argue about — which is progress."
Next step (logical progression): practice sampling and bootstrapping from these distributions in Python, then overlay empirical histograms and theoretical curves — that's where descriptive statistics and visualization converge into powerful model checking.
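As a starting point for that next step, here is a sketch of both pieces: overlay an empirical histogram on the theoretical curve, then bootstrap the sample mean (the Normal(50, 5) parameters continue the running example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 50, 5
draws = rng.normal(mu, sigma, size=10_000)

# Overlay the empirical histogram with the theoretical PDF
fig, ax = plt.subplots()
ax.hist(draws, bins=40, density=True, alpha=0.5, label='10,000 draws')
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 400)
ax.plot(x, norm.pdf(x, mu, sigma), label='Normal(50, 5) PDF')
ax.legend()
fig.savefig('empirical_vs_theoretical.png', dpi=150)

# Bootstrap the sample mean: resample with replacement, recompute, repeat
boot_means = np.array([rng.choice(draws, size=draws.size, replace=True).mean()
                       for _ in range(1_000)])
# The spread of bootstrap means approximates the standard error sigma/sqrt(n)
print(round(boot_means.std(), 3))
```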
Further reading / exercises
- Simulate 10,000 draws from a Poisson(λ=3) and compare histogram to PMF.
- Fit a Normal vs Student’s t to a heavy-tailed dataset; compare Q-Q plots.
- Create a publication-ready figure showing a CDF with shaded tail probability and save it as SVG and PNG.