Statistics and Probability for Data Science

Develop statistical intuition for inference, experimentation, and uncertainty-aware decisions.

Descriptive Statistics for Data Science — Practical & Visual Guide

Descriptive Statistics — The TL;DR of Your Data's Personality

(Continuing from our Data Visualization and Storytelling modules — you already know how to make a plot sing, annotate uncertainty, and export figures. Now let's give those plots something meaningful to sing about.)

Hook: Why descriptive stats are your data's elevator pitch

Imagine handing a stranger a 1,000-row CSV and asking them to describe the dataset in 30 seconds. They'd stare. But a good set of descriptive statistics? That's the one-sentence bio: mean, median, spread, shape. It tells you where the data hangs out, how wild it is, and whether it's politely symmetric or angrily skewed.

"This is the moment where the concept finally clicks: visualizations show the shape; descriptive statistics give the summary you can put in a dashboard KPI."

What are Descriptive Statistics and why they matter

Descriptive statistics = simple numeric summaries of data.
They don't infer about populations (that's inferential statistics) — they describe the data you have.

Why it matters:

Quick sanity checks (is this column even numeric?)
Compare groups (mean revenue by region)
Feed dashboards (median delivery time as a KPI)
Annotate plots (add a mean line to a histogram — you learned how to annotate in the visualization module)
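
To make the "compare groups" use case concrete, here is a minimal sketch with pandas. The `sales` DataFrame, its `region` and `revenue` columns, and every number in it are invented for illustration.

```python
import pandas as pd

# Hypothetical data: column names and values are made up for this example
sales = pd.DataFrame({
    'region': ['north', 'north', 'north', 'south', 'south', 'south'],
    'revenue': [100, 120, 500, 90, 110, 130],
})

# One row per region: sample size, mean, and median side by side
by_region = sales.groupby('region')['revenue'].agg(['count', 'mean', 'median'])
print(by_region)
```

In this toy data the north mean (240) sits far above the north median (120) because of one large order, while the south mean and median agree. Showing both columns makes that difference visible at a glance.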

Core concepts (with tiny metaphors)

Measures of central tendency

Mean (average) — the balancing point of the data. Great for symmetric data, fragile with outliers.
Median — the middle seat on the bus; robust to outliers.
Mode — the most popular value (useful for categorical or discrete numeric data).

Imagine a party: mean is the center of the dance floor, median is the person who can say "I am exactly in the middle," and mode is the person everyone keeps bumping into.
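
The party metaphor can be checked in a few lines. This is a small sketch with made-up guest ages (the `ages` Series is hypothetical), where one much older guest plays the outlier.

```python
import pandas as pd

# Hypothetical ages at a party; the last guest is much older than the rest
ages = pd.Series([21, 22, 22, 23, 24, 25, 80])

mean_age = ages.mean()            # balancing point, dragged upward by the 80
median_age = ages.median()        # middle value of the sorted data
mode_ages = ages.mode().tolist()  # most frequent value(s); there can be several

print(mean_age, median_age, mode_ages)
```

Here the mean is 31, older than every guest but one, while the median stays at 23 and the mode at 22. Note that `mode()` returns a Series because ties are possible.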

Measures of spread

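Range = max − min (quick to compute, but determined entirely by the two most extreme values)
Interquartile Range (IQR) = Q3 − Q1 (robust spread: middle 50%)
Variance and Standard Deviation = the average squared deviation from the mean, and its square root. The standard deviation is in the same units as the data, which makes it the easier of the two to interpret.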

Quick formula (population variance):

sigma^2 = (1/N) * sum((x_i - mu)^2)

Sample variance divides by N − 1 instead of N (Bessel's correction), which makes it an unbiased estimate of the population variance.
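
As a quick check of the two definitions, the sketch below computes both variances with NumPy and again by hand from the formula. The array `x` is an arbitrary example, not data from the course.

```python
import numpy as np

# Arbitrary example values
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(x)             # divides by N (NumPy default, ddof=0)
sample_var = np.var(x, ddof=1)  # divides by N - 1 (Bessel's correction)

# The same results written out from the formula
mu = x.mean()
squared_dev = (x - mu) ** 2
manual_pop = squared_dev.sum() / len(x)
manual_sample = squared_dev.sum() / (len(x) - 1)

print(pop_var, sample_var)
```

One thing to watch: NumPy's `var` and `std` default to `ddof=0`, while pandas' `Series.var` and `Series.std` default to `ddof=1`, so the same column can give two different answers depending on which library you call.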

Shape and outliers

Skewness — is the tail longer on the right or left? Positive skew means a right tail.
Kurtosis — how heavy are the tails (not "peakedness" as often misstated).
Outliers — extreme points. Use IQR or z-scores to detect.
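
To see the sign convention for skewness in action, here is a small sketch with hypothetical delivery times (the `times` Series is invented) that have one long right-tail value.

```python
import pandas as pd

# Hypothetical delivery times in minutes, with a long right tail
times = pd.Series([10, 11, 12, 12, 13, 14, 15, 60])

skewness = times.skew()           # positive for a right tail
mirrored_skew = (-times).skew()   # flipping the data flips the sign

print(skewness > 0)                   # right tail
print(mirrored_skew < 0)              # left tail after mirroring
print(times.mean() > times.median())  # the tail pulls the mean above the median here
```

A mean above the median is a common companion of right skew, but treat it as a hint rather than a guarantee, especially in small or multimodal samples.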

Practical Python cheatsheet (pandas + numpy + scipy)

import pandas as pd
import numpy as np
from scipy import stats

# example DataFrame
df = pd.DataFrame({'score': [55, 70, 88, 90, 95, 100, 100, 2]})

# quick summary
df['score'].describe()

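# explicit summary statistics
mean = df['score'].mean()
median = df['score'].median()
std = df['score'].std(ddof=1)  # sample std (ddof=1 is also the pandas default)
q1, q3 = df['score'].quantile([0.25, 0.75])
iqr = q3 - q1
skewness = df['score'].skew()
kurt = df['score'].kurtosis()  # excess kurtosis: a normal distribution scores about 0

# detect outliers via the IQR rule (more than 1.5 * IQR beyond the quartiles)
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
outliers = df[(df['score'] < lower) | (df['score'] > upper)]

# z-scores (scipy's zscore defaults to ddof=0, i.e. the population std)
# note: with only 8 points no |z| can reach 3, so this flags nothing here
z = np.abs(stats.zscore(df['score']))
outliers_z = df[z > 3]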

Tip: df.describe() is your Swiss Army knife for a quick overview; then dig deeper for robust measures.

Visuals + Descriptive Stats = Superpowered insights

You already learned histograms, boxplots, violin plots and how to communicate uncertainty. Use descriptive stats to:

Annotate a histogram with a vertical line for the mean and median so viewers instantly see skew.
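Read the IQR and whiskers straight off a boxplot (they're literally built for it): the box spans Q1 to Q3, and by default the whiskers reach the most extreme points within 1.5 × IQR of the box.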
Put summary numbers (mean, median, sample size, missing%) in the corner of a figure before exporting to the report.

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(df['score'], kde=False)
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.1f}')
plt.axvline(median, color='green', linestyle=':', label=f'Median: {median:.1f}')
plt.legend()
plt.title('Score distribution (annotated)')
plt.savefig('score_dist.png')  # you remember exporting from previous module
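
For the "summary numbers in the corner" idea, one possible sketch uses `ax.text` with axes coordinates. The extra missing value, the label layout, and the output filename `score_dist_summary.png` are assumptions made for this example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same scores as the cheatsheet, plus one missing value added for illustration
scores = pd.Series([55, 70, 88, 90, 95, 100, 100, 2, np.nan])

n = int(scores.count())                   # non-missing observations
missing_pct = scores.isna().mean() * 100  # share of missing values in percent
label = (f'N = {n}\n'
         f'Missing = {missing_pct:.1f}%\n'
         f'Mean = {scores.mean():.1f}\n'
         f'Median = {scores.median():.1f}')

fig, ax = plt.subplots()
ax.hist(scores.dropna(), bins=8)
# transform=ax.transAxes places the text in axes coordinates (0 to 1),
# so the box stays in the top-left corner whatever the data range is
ax.text(0.02, 0.98, label, transform=ax.transAxes, va='top', ha='left')
ax.set_title('Score distribution with summary box')
fig.savefig('score_dist_summary.png')
```

Drop the `matplotlib.use('Agg')` line in a notebook, where the default backend already displays figures inline.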

Why this matters for dashboards: a KPI should be a single compact number (for example, median response time), with the full distribution available behind a hover or drilldown. Don't just show the mean and hope for the best.

Robustness & pitfalls (aka why people keep misunderstanding this)

The mean is pulled by outliers. If you have extreme values (e.g., incomes), the mean lies to you.
Small samples cause unreliable estimates; always show sample size.
Missing data can mask patterns — report missing counts and consider imputation carefully.

Why do people misunderstand this? Because a single number feels decisive. It isn't. Always pair a central tendency with a spread and a visualization.
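
Reporting sample size and missingness takes only a few lines. The sketch below uses a hypothetical `resp` Series of response times, with `NaN` standing in for missing measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical response times in seconds; NaN marks a missing measurement
resp = pd.Series([1.2, 0.9, np.nan, 1.1, np.nan, 1.4, 1.0, 35.0])

n_total = len(resp)
n_valid = int(resp.count())         # count() ignores NaN
n_missing = int(resp.isna().sum())
missing_pct = 100 * n_missing / n_total

# pandas skips NaN by default, so these use only the valid values
print(f'N = {n_valid} of {n_total} ({missing_pct:.0f}% missing)')
print(f'mean = {resp.mean():.2f}, median = {resp.median():.2f}')
```

With six valid values, a single 35-second response lifts the mean to about 6.8 while the median stays at 1.15. Small N plus one extreme value is exactly the case where the mean alone misleads.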

A short workflow to follow (practical steps)

Run df.describe() and check df.dtypes to confirm each column has the type you expect.
Plot histogram + boxplot for the variable.
Compute mean, median, std, IQR, skewness.
Check for outliers (IQR rule or z-scores). Decide: remove, winsorize, or keep and explain.
Annotate figures and export them for reports/dashboards; include the stats as hover info or KPI cards.
When summarizing, always include N and missing%.
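
The numeric steps of this workflow can be bundled into a helper. This is a sketch under assumptions: `summarize` and `clip_to_iqr_fences` are hypothetical names, and clipping to the 1.5 × IQR fences is just one simple winsorize-style option (classic winsorizing clips at chosen percentiles instead).

```python
import pandas as pd

def summarize(s: pd.Series) -> dict:
    """Return a small dictionary of descriptive statistics for a numeric Series."""
    valid = s.dropna()
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        'n': int(valid.size),
        'missing_pct': 100 * s.isna().mean(),
        'mean': valid.mean(),
        'median': valid.median(),
        'std_sample': valid.std(ddof=1),
        'iqr': iqr,
        'skew': valid.skew(),
        'n_outliers_iqr': int(((valid < lower) | (valid > upper)).sum()),
    }

def clip_to_iqr_fences(s: pd.Series) -> pd.Series:
    """Pull values outside the 1.5 * IQR fences back to the fences."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Same scores as the cheatsheet, stored as floats
scores = pd.Series([55, 70, 88, 90, 95, 100, 100, 2], dtype=float)
report = summarize(scores)
clipped = clip_to_iqr_fences(scores)

print(report)
print(clipped.tolist())
```

On the cheatsheet scores this reports N = 8, mean 75, median 89, IQR 30, and one IQR outlier (the 2), which the clipping step raises to the lower fence of 21.25. Whether you clip, remove, or keep that point is a judgment call to document, not something the helper decides for you.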

Closing: Key takeaways (so you remember at 3 AM)

Descriptive statistics summarize — they don't infer. Use them to understand and communicate your data quickly.
Pair numbers with visuals. A mean without a histogram is like a punchline without the joke setup.
Be transparent. Always report sample size, missingness, and which definition of std/variance you used.