Statistics and Probability for Data Science
Develop statistical intuition for inference, experimentation, and uncertainty-aware decisions.
Descriptive Statistics — The TL;DR of Your Data's Personality
(Continuing from our Data Visualization and Storytelling modules — you already know how to make a plot sing, annotate uncertainty, and export figures. Now let's give those plots something meaningful to sing about.)
Hook: Why descriptive stats are your data's elevator pitch
Imagine handing a stranger a 1,000-row CSV and asking them to describe the dataset in 30 seconds. They'd stare. But a good set of descriptive statistics? That's the one-sentence bio: mean, median, spread, shape. It tells you where the data hangs out, how wild it is, and whether it's politely symmetric or angrily skewed.
"This is the moment where the concept finally clicks: visualizations show the shape; descriptive statistics give the summary you can put in a dashboard KPI."
What are Descriptive Statistics and why they matter
- Descriptive statistics = simple numeric summaries of data.
- They don't draw conclusions about a wider population (that's inferential statistics); they describe the data you actually have.
Why it matters:
- Quick sanity checks (is this column even numeric?)
- Compare groups (mean revenue by region)
- Feed dashboards (median delivery time as a KPI)
- Annotate plots (add a mean line to a histogram — you learned how to annotate in the visualization module)
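To make the "compare groups" bullet concrete, here is a minimal sketch of mean revenue by region. The `orders` table, its column names, and every number in it are invented for illustration; only the pandas calls are real.

```python
import pandas as pd

# Made-up order data, purely for illustration
orders = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "revenue": [100.0, 120.0, 980.0, 90.0, 110.0, 130.0],
})

# One row per region: how many orders, plus mean and median revenue
by_region = orders.groupby("region")["revenue"].agg(["count", "mean", "median"])
print(by_region)
```

The single 980 order drags the north mean to 400 while its median stays at 120, which is a good reason to show both columns side by side.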
Core concepts (with tiny metaphors)
Measures of central tendency
- Mean (average) — the balancing point of the data. Great for symmetric data, fragile with outliers.
- Median — the middle seat on the bus; robust to outliers.
- Mode — the most popular value (useful for categorical or discrete numeric data).
Imagine a party: mean is the center of the dance floor, median is the person who can say "I am exactly in the middle," and mode is the person everyone keeps bumping into.
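A quick way to feel the difference is to drop one extreme value and watch which summary moves. This sketch uses only Python's standard `statistics` module and the same toy score list as this lesson's cheatsheet; treating the score of 2 as "the outlier" is an assumption made for the demo.

```python
import statistics

scores = [55, 70, 88, 90, 95, 100, 100, 2]
without_low = [s for s in scores if s != 2]   # same scores minus the extreme value

mean_all = statistics.mean(scores)
median_all = statistics.median(scores)
mode_all = statistics.mode(scores)

mean_trim = statistics.mean(without_low)
median_trim = statistics.median(without_low)

print(f"with the 2:    mean={mean_all:.2f}  median={median_all:.2f}  mode={mode_all}")
print(f"without the 2: mean={mean_trim:.2f}  median={median_trim:.2f}")
```

Removing one value shifts the mean by more than 10 points (75 to about 85.4) but the median by only 1 (89 to 90). That is what "robust to outliers" means in practice.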
Measures of spread
- Range = max − min (quick to read, but set entirely by the two most extreme values)
- Interquartile Range (IQR) = Q3 − Q1 (robust spread: the width of the middle 50%)
- Variance = average squared deviation from the mean; Standard Deviation = its square root, which puts the spread back in the same units as the data.
Quick formula (population variance):
sigma^2 = (1/N) * sum((x_i - mu)^2)
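Sample variance divides by (N-1) instead of N (Bessel's correction). Deviations measured around the sample mean run slightly small, and the smaller divisor offsets that bias.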
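Here is the formula worked by hand on the lesson's toy scores, then checked against the standard library. It is a sketch for building intuition, not a replacement for `pandas` or `numpy`; the variable names are my own.

```python
import math
import statistics

scores = [55, 70, 88, 90, 95, 100, 100, 2]
n = len(scores)
mu = sum(scores) / n                          # 75.0
sum_sq = sum((x - mu) ** 2 for x in scores)   # 7798.0

pop_var = sum_sq / n            # population variance, divide by N      -> 974.75
sample_var = sum_sq / (n - 1)   # sample variance, divide by N - 1      -> 1114.0

print(f"population: var={pop_var:.2f}  std={math.sqrt(pop_var):.2f}")
print(f"sample:     var={sample_var:.2f}  std={math.sqrt(sample_var):.2f}")

# The standard library agrees with the hand calculation
print(math.isclose(pop_var, statistics.pvariance(scores)))
print(math.isclose(sample_var, statistics.variance(scores)))
```

With only 8 values the two definitions differ by about 14%, so it is worth saying which one a report uses. For large N the gap shrinks toward nothing.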
Shape and outliers
- Skewness — is the tail longer on the right or left? Positive skew means a right tail.
- Kurtosis — how heavy are the tails (not "peakedness" as often misstated).
- Outliers: extreme points. Two common detection rules are the IQR rule (flag values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR) and z-scores (flag |z| > 3).
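To see the sign convention for skewness in action, this sketch computes the plain moment-based coefficient (third central moment divided by the 1.5 power of the second) on one left-tailed and one right-tailed list. The helper name `moment_skewness` and the second list are made up for the example. Note that pandas' `.skew()` applies a small-sample bias correction, so its number will differ somewhat from this one even though the sign agrees.

```python
def moment_skewness(values):
    """Uncorrected moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

left_tailed = [55, 70, 88, 90, 95, 100, 100, 2]   # one very low score
right_tailed = [1, 2, 2, 3, 3, 3, 20]             # one very high value

skew_left = moment_skewness(left_tailed)     # about -1.45
skew_right = moment_skewness(right_tailed)

print(f"left-tailed:  {skew_left:.2f}  (negative)")
print(f"right-tailed: {skew_right:.2f}  (positive)")
```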
Practical Python cheatsheet (pandas + numpy + scipy)
import pandas as pd
import numpy as np
from scipy import stats
# example DataFrame
df = pd.DataFrame({'score': [55, 70, 88, 90, 95, 100, 100, 2]})
# quick summary
df['score'].describe()
# explicit
mean = df['score'].mean()
median = df['score'].median()
std = df['score'].std(ddof=1) # sample std
iqr = df['score'].quantile(0.75) - df['score'].quantile(0.25)
skewness = df['score'].skew()  # bias-corrected sample skewness
kurt = df['score'].kurtosis()  # excess kurtosis: a normal distribution scores 0
# detect outliers via IQR
q1, q3 = df['score'].quantile([0.25, 0.75])
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = df[(df['score'] < lower) | (df['score'] > upper)]
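# z-scores (scipy's zscore uses the population std, ddof=0, by default)
# caveat: with n points, |z| can never exceed sqrt(n - 1), about 2.65 for these 8 rows,
# so the |z| > 3 rule flags nothing here even though the IQR rule catches the 2
z = np.abs(stats.zscore(df['score']))
outliers_z = df[z > 3]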
Tip: df.describe() is your Swiss Army knife for a quick overview; then dig deeper for robust measures.
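One thing to watch on small samples: an extreme value inflates the very standard deviation that the z-score divides by, so it can hide itself. The sketch below shows this on the toy scores, then tries a median-based alternative. To be clear, the modified z-score (0.6745 × deviation from the median ÷ MAD, with a cutoff of 3.5) is a technique I am swapping in for illustration; it is not part of the cheatsheet above, and both the constant and the cutoff are conventions rather than laws.

```python
import math
import statistics

scores = [55, 70, 88, 90, 95, 100, 100, 2]
n = len(scores)

# Classic z-scores with the population std (same default as scipy.stats.zscore)
mu = statistics.fmean(scores)
sigma = statistics.pstdev(scores)
max_abs_z = max(abs((x - mu) / sigma) for x in scores)
z_ceiling = math.sqrt(n - 1)   # largest |z| any point can reach with n points

# Modified z-scores built on the median and the median absolute deviation (MAD)
median = statistics.median(scores)
mad = statistics.median([abs(x - median) for x in scores])
modified_z = [0.6745 * (x - median) / mad for x in scores]
flagged = [x for x, mz in zip(scores, modified_z) if abs(mz) > 3.5]

print(f"largest |z| = {max_abs_z:.2f}, ceiling for n={n} is {z_ceiling:.2f}")
print(f"flagged by modified z-score: {flagged}")
```

The score of 2 reaches a classic |z| of only about 2.34, well under 3, while the median-based version flags it clearly. If the MAD is zero (more than half the values identical) this sketch would divide by zero, so guard for that in real use.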
Visuals + Descriptive Stats = Superpowered insights
You already learned histograms, boxplots, violin plots and how to communicate uncertainty. Use descriptive stats to:
- Annotate a histogram with a vertical line for the mean and median so viewers instantly see skew.
- Read the IQR (the box) and the whiskers straight off a boxplot; it is literally built to show them.
- Put summary numbers (mean, median, sample size, missing%) in the corner of a figure before exporting to the report.
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(df['score'], kde=False)
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.1f}')
plt.axvline(median, color='green', linestyle=':', label=f'Median: {median:.1f}')
plt.legend()
plt.title('Score distribution (annotated)')
plt.savefig('score_dist.png') # you remember exporting from previous module
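For the "summary numbers in the corner" idea, it helps to build the label text separately from the plotting. A minimal sketch follows; the helper name `summary_label`, the label layout, and the use of `None` to mark a missing value are all assumptions of this example.

```python
import statistics

def summary_label(values):
    """Build a one-line figure annotation; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_pct = 100 * (len(values) - len(present)) / len(values)
    return (
        f"N={len(present)} | missing {missing_pct:.1f}% | "
        f"mean {statistics.fmean(present):.1f} | "
        f"median {statistics.median(present):.1f}"
    )

# The lesson's toy scores plus two missing entries
label = summary_label([55, 70, 88, 90, 95, 100, 100, 2, None, None])
print(label)   # N=8 | missing 20.0% | mean 75.0 | median 89.0
```

To place it on the figure, call something like `plt.gca().text(0.02, 0.98, label, transform=plt.gca().transAxes, ha='left', va='top')` before `plt.savefig(...)`. The `transAxes` transform pins the text to the axes corner regardless of the data range.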
Why this matters for dashboards: a KPI should be one robust number (for example, median response time) with the full distribution a hover or drilldown away. Don't just show the mean and hope for the best.
Robustness & pitfalls (aka why people keep misunderstanding this)
- The mean is pulled by outliers. With extreme values (e.g., incomes), it can land far from what a typical row looks like, so report the median alongside it.
- Small samples cause unreliable estimates; always show sample size.
- Missing data can mask patterns — report missing counts and consider imputation carefully.
Why do people misunderstand this? Because a single number feels decisive. It isn't. Always pair a central tendency with a spread and a visualization.
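The missing-data pitfall is easy to trip over because pandas skips NaN by default without saying so. A small sketch, with invented delivery times:

```python
import pandas as pd

# Invented values; the third delivery time is missing
delivery_minutes = pd.Series([10.0, 20.0, None, 40.0])

mean_default = delivery_minutes.mean()              # NaN skipped (skipna=True is the default)
mean_strict = delivery_minutes.mean(skipna=False)   # NaN, because one value is missing

n_valid = delivery_minutes.count()                  # non-missing values only
missing_pct = delivery_minutes.isna().mean() * 100  # share of missing values

print(f"mean (NaN skipped): {mean_default:.2f}")
print(f"mean (strict):      {mean_strict}")
print(f"N = {n_valid}, missing = {missing_pct:.0f}%")
```

The default mean looks perfectly healthy, which is exactly the problem: nothing in the number tells the reader that a quarter of the rows were ignored. Reporting N and the missing percentage next to it closes that gap.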
A short workflow to follow (practical steps)
- Run df.describe() and check dtype sanity.
- Plot histogram + boxplot for the variable.
- Compute mean, median, std, IQR, skewness.
- Check for outliers (IQR rule or z-scores). Decide: remove, winsorize, or keep and explain.
- Annotate figures and export them for reports/dashboards; include the stats as hover info or KPI cards.
- When summarizing, always include N and missing%.
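The numeric steps of this workflow can be bundled into one helper. This is a sketch under stated assumptions, not a canonical implementation: the function name `summarize`, the dictionary keys, and the choice of the 1.5 × IQR rule for counting outliers are all mine.

```python
import pandas as pd

def summarize(series: pd.Series) -> dict:
    """Return the descriptive numbers worth showing next to a plot."""
    valid = series.dropna()
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    iqr = q3 - q1
    # IQR rule: anything beyond 1.5 * IQR from the quartiles counts as an outlier
    outside = (valid < q1 - 1.5 * iqr) | (valid > q3 + 1.5 * iqr)
    return {
        "n": int(valid.count()),
        "missing_pct": float(series.isna().mean() * 100),
        "mean": float(valid.mean()),
        "median": float(valid.median()),
        "std_sample": float(valid.std(ddof=1)),   # sample std, stated explicitly
        "iqr": float(iqr),
        "skew": float(valid.skew()),
        "n_iqr_outliers": int(outside.sum()),
    }

# The lesson's toy scores plus one missing entry
report = summarize(pd.Series([55, 70, 88, 90, 95, 100, 100, 2, None]))
for name, value in report.items():
    print(f"{name:>15}: {value:.2f}")
```

The helper only counts outliers and leaves the remove, winsorize, or keep decision to you, since that call depends on what the data means rather than on a formula.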
Closing: Key takeaways (so you remember at 3 AM)
- Descriptive statistics summarize — they don't infer. Use them to understand and communicate your data quickly.
- Pair numbers with visuals. A mean without a histogram is like a punchline without the setup.
- Be transparent. Always report sample size, missingness, and which definition of std/variance you used.
Memorable insight: If your dashboard shows a single number without a distribution or sample size, it’s doing too much pretending.
Quick reference (what to show in reports/dashboards)
- N (count), missing%
- Mean and median
- Std (or IQR) and range
- Skewness (if relevant)
- Visual: histogram or boxplot + annotated lines
Happy summarizing. Go annotate a plot, put the median in the title, export the figure, and then sleep well knowing your data finally has manners.