Statistics and Probability for Data Science
Develop statistical intuition for inference, experimentation, and uncertainty-aware decisions.
Confidence Intervals
Confidence Intervals — What They Are, How to Compute Them, and How to Tell If Your Results Aren't Lying to You
"Confidence intervals are like zone defense for your estimate: they say where the true value probably hangs out — not with absolute swag, but with quantified humility."
You've just come off Sampling and the Central Limit Theorem and peeked at Hypothesis Testing, and you've been plotting everything with Matplotlib/Seaborn. Good — you're primed. Confidence intervals (CIs) are the natural next step: they take the sampling distribution ideas from CLT, the decision framing from hypothesis testing, and the visual flair from your plots, and turn a single-point estimate into a full story.
What is a Confidence Interval (briefly)?
- Definition (intuitively): A confidence interval gives a range of plausible values for a population parameter (mean, proportion, etc.) based on sample data.
- Formal-ish: A 95% CI for a parameter means that if you repeated the sampling process many times and built a CI from each sample in the same way, about 95% of those intervals would contain the true parameter.
Crucial nuance: A 95% CI doesn't mean there's a 95% probability that the parameter is in this one interval — the parameter is fixed; the interval is random.
Why CIs matter for data science
- They show uncertainty, not just a point estimate (mean, proportion). This helps avoid overconfident claims.
- They directly tie to hypothesis testing: if your null value (e.g., μ0) lies outside a 95% CI, you would reject a two-sided test at α = 0.05.
- They’re essential for communicating results visually: error bars, forest plots, and dashboards become effective storytelling tools.
Basic formulas (quick reference)
For a population mean (known σ — theoretical):
x̄ ± z* * (σ / √n)
For a population mean (σ unknown — practical):
x̄ ± t* * (s / √n)
- z* is the critical value from the standard normal (e.g., 1.96 for 95%).
- t* is from Student’s t-distribution with df = n − 1.
For a population proportion:
p̂ ± z* * √(p̂(1 − p̂) / n)
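The proportion formula translates to a few lines of code. This is a quick sketch; the counts (120 successes out of 400 trials) are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 120 successes in 400 trials
successes, n = 120, 400
p_hat = successes / n

z_crit = stats.norm.ppf(0.975)  # z* for a 95% CI, about 1.96
margin = z_crit * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"p̂={p_hat:.3f}, 95% CI=({p_hat - margin:.3f}, {p_hat + margin:.3f})")
```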
When to use z vs t
- Use z when the population standard deviation σ is known (rare in practice).
- Use t when σ is unknown and you estimate it with sample s — especially for small n. As n grows, the t-distribution approaches the normal distribution.
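You can watch that convergence directly by comparing t* against z* as n grows; this loop is just illustrative:

```python
from scipy import stats

z_crit = stats.norm.ppf(0.975)  # z* for 95%, about 1.96
for n in (5, 15, 30, 100, 1000):
    t_crit = stats.t.ppf(0.975, df=n - 1)  # t* shrinks toward z* as df grows
    print(f"n={n:>4}: t*={t_crit:.3f}  (z*={z_crit:.3f})")
```

For small n the t critical value is noticeably larger (wider intervals, reflecting the extra uncertainty from estimating σ), but by n ≈ 1000 it is essentially indistinguishable from 1.96.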
Hands-on Python examples (compute and plot)
- A single-sample CI for a mean (t-based):
import numpy as np
from scipy import stats
# Simulate sample
np.random.seed(0)
sample = np.random.normal(loc=5.0, scale=2.0, size=30)
n = len(sample)
xbar = sample.mean()
s = sample.std(ddof=1)
alpha = 0.05
# t critical
t_crit = stats.t.ppf(1 - alpha/2, df=n-1)
margin = t_crit * s / np.sqrt(n)
ci_lower = xbar - margin
ci_upper = xbar + margin
print(f"Mean={xbar:.3f}, 95% CI=({ci_lower:.3f}, {ci_upper:.3f})")
- Visual intuition: draw many samples, plot their 95% CIs and show coverage
import matplotlib.pyplot as plt
np.random.seed(1)
true_mu = 5.0
n = 25
trials = 40
cis = []
contains = []
for i in range(trials):
    s = np.random.normal(loc=true_mu, scale=2.0, size=n)
    xbar = s.mean(); sd = s.std(ddof=1)
    t_crit = stats.t.ppf(0.975, df=n-1)
    m = t_crit * sd / np.sqrt(n)
    cis.append((xbar - m, xbar + m))
    contains.append((xbar - m) <= true_mu <= (xbar + m))
# plot
plt.figure(figsize=(8,6))
for i, (ci, ok) in enumerate(zip(cis, contains)):
    color = 'green' if ok else 'red'
    plt.plot(ci, [i, i], color=color, lw=2)
    plt.plot([(ci[0] + ci[1]) / 2], [i], 'o', color='black')
plt.axvline(true_mu, color='blue', linestyle='--', label='True mean')
plt.xlabel('Value'); plt.ylabel('Sample index')
plt.title('Many 95% CIs — green contains true mean, red does not')
plt.legend()
plt.show()
This visualization is gold: it uses your plotting skills and gives a visceral sense of coverage probability — the frequency with which CIs actually contain the true parameter.
Interpreting CIs — common pitfalls
- Wrong: "There is a 95% probability the true mean is in this interval." (No — you either hit it or you didn't; the probability language applies before you collect data.)
- Right: "This method produces intervals that contain the true mean 95% of the time in repeated sampling."
- Beware of overlapping CIs to claim "no significant difference" — that rule of thumb can be conservative or misleading; use proper hypothesis tests for comparisons.
Link to Hypothesis Testing (your previous stop)
- A two-sided test at level α corresponds directly to whether the null value lies inside the (1 − α) CI.
- CIs provide more information than a binary reject/fail-to-reject: they show effect size and precision.
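You can verify this duality numerically. The sketch below regenerates the simulated sample from the first example and tests it against an arbitrary null value μ0 = 4.0:

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=5.0, scale=2.0, size=30)
mu0 = 4.0  # arbitrary null value, chosen for illustration

# 95% t-based CI for the mean
n, xbar, s = len(sample), sample.mean(), sample.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)
lo, hi = xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n)

# Two-sided one-sample t-test against mu0
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# p < 0.05 exactly when mu0 falls outside the 95% CI
print(f"CI=({lo:.3f}, {hi:.3f}), p={p_value:.4f}")
print((p_value < 0.05) == (mu0 < lo or mu0 > hi))
```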
Practical tips for data scientists
- Always report both the point estimate and CI (e.g., mean = 5.1, 95% CI [4.6, 5.6]).
- Use bootstrap CIs when assumptions (normality, sample size) are questionable. Bootstrapping pairs well with your plotting pipeline.
- Visualize: error bars, violin + points + CI, or the multi-interval coverage plot above — visuals make your uncertainty persuasive.
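A minimal percentile-bootstrap sketch, assuming skewed example data where a t-interval would be shaky (the exponential sample and 5,000 resamples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=3.0, size=50)  # skewed data: normality is doubtful

# Percentile bootstrap: resample with replacement, collect the statistic
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean: ({lo:.3f}, {hi:.3f})")
```

Recent SciPy versions also ship `scipy.stats.bootstrap`, which handles the resampling (and more refined interval methods) for you.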
Quick summary / Takeaways
- Confidence intervals quantify estimation uncertainty using information from the sample and sampling distribution concepts (remember the CLT?).
- Use t-based intervals for means when σ is unknown; z for proportions or known σ.
- CIs and hypothesis tests are siblings: a CI that excludes a null value implies a significant two-sided test.
- Visualize them — seeing many CIs is the best teacher for understanding coverage and real-world variability.
"If your reporting has numbers but no intervals, it's like handing someone a map with a single dot and saying, 'good luck.' CIs put a 'here's likely territory' halo around your point estimate."
Tags: beginner, data-science, python, statistics