Data Visualization and Storytelling
Explore and communicate insights with clear, accessible visuals using Matplotlib, Seaborn, and Plotly.
Histograms and Density Plots
Histograms and Density Plots — Make Your Data Speak (Without Whispering)
You're past cleaning your data and engineering clever features. Now it's time to see the distribution — to hear the shape of your data sing (or scream).
Why histograms and density plots matter (and why your dataset thanks you)
You already learned to clean data and avoid leakage. That careful prep pays off here: bad bins are unforgiving. Histograms and density plots answer core questions quickly:
- What's the central tendency and spread?
- Are there multiple modes (clusters) hiding?
- Are outliers real or data-entry gremlins?
These plots are the first line of defense for feature engineering decisions (e.g., log transforms, binning) and model choices (e.g., linear vs tree-based).
"If you can’t visualize the distribution, you’re probably engineering features blindfolded." — probably a TA
Quick refresher: histogram vs density plot
- Histogram: Discrete bins counting observations. Great for raw counts and seeing gaps.
- Density plot (KDE): A smooth estimate of the underlying distribution, built by centering a kernel (usually a Gaussian bump) on every data point and averaging the bumps. Great for detecting modes without being distracted by arbitrary bin edges.
Use histograms to answer "how many?" and KDEs to answer "what shape?". Often you use them together: histogram for solidity, KDE for nuance.
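For readers who like to see the two estimators written down, here is a compact statement in LaTeX. The symbols (n samples x_i, bin count c_k, bin width w_k, kernel K, bandwidth h) are notation chosen for this sketch, not something defined earlier in the lesson.

```latex
% Histogram density in bin k (c_k of the n samples fall in a bin of width w_k)
\hat{f}_{\text{hist}}(x) = \frac{c_k}{n\, w_k} \quad \text{for } x \text{ in bin } k

% Kernel density estimate with kernel K and bandwidth h
\hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \ \text{(Gaussian kernel)}
```

Both expressions integrate to 1 over x, which is why a density-scaled histogram and a KDE can share one y-axis.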
Practical issues and what you should tune
1) Bin width (or number of bins)
- Too few bins = oversmoothing (missed structure)
- Too many bins = noise and overfitting to randomness
Common automatic rules:
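- Sturges: bin count grows with log2(n); good for small, close-to-normal samples
- Scott: bin width that minimizes the integrated mean squared error when the data are roughly Gaussian; uses the standard deviation
- Freedman–Diaconis: robust to outliers; uses the IQR instead of the standard deviation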
In seaborn/matplotlib you can pass bins='fd' or bins='auto' — try several.
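To see how much the rules can disagree, here is a minimal sketch that asks NumPy for the bin edges each rule would pick on the same data. The lognormal sample is a hypothetical stand-in for df['income'], and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed sample standing in for df['income']
income = rng.lognormal(mean=10.5, sigma=0.6, size=2000)

# Number of bins each named rule chooses for the same data
bin_counts = {}
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(income, bins=rule)
    bin_counts[rule] = len(edges) - 1
    width = edges[1] - edges[0]
    print(f"{rule:>8}: {bin_counts[rule]:3d} bins, width about {width:,.0f}")

# Freedman-Diaconis target width by hand: 2 * IQR / n^(1/3)
q25, q75 = np.percentile(income, [25, 75])
fd_width = 2 * (q75 - q25) / len(income) ** (1 / 3)
print(f"FD target width: {fd_width:,.0f}")
```

On skewed data like this, Sturges tends to pick far fewer bins than Freedman–Diaconis, which is a good reason to plot more than one rule before settling.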
2) KDE bandwidth
- Bandwidth controls smoothness. Small = spiky, Large = oversmooth.
- Methods: Silverman, Scott, or manual selection. Visualize a few bandwidths.
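The effect of bandwidth is easiest to feel with a hand-rolled KDE. The sketch below is NumPy only and uses the common textbook form of Silverman's rule of thumb; libraries may implement slightly different variants. The bimodal sample and all function names are hypothetical, written just for this illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid`: the average of one bump per sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

def silverman_bandwidth(samples):
    """Textbook rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    q25, q75 = np.percentile(samples, [25, 75])
    spread = min(samples.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * len(samples) ** (-0.2)

def count_peaks(density):
    """Count strict interior local maxima of a sampled curve."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))

rng = np.random.default_rng(1)
# Hypothetical bimodal feature: two well-separated groups
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
grid = np.linspace(-6, 6, 601)

h = silverman_bandwidth(samples)
peaks = {}
for label, bw in [("small", 0.1 * h), ("rule of thumb", h), ("large", 5 * h)]:
    density = gaussian_kde_1d(samples, grid, bw)
    peaks[label] = count_peaks(density)
    print(f"{label:>13}: bandwidth={bw:.3f}, peaks={peaks[label]}")
```

A very small bandwidth reports many spurious peaks, while a very large one merges the two real groups into a single hump. That is the "spiky vs oversmooth" trade-off in numbers.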
3) Normalization and density scaling
- Decide what the y-axis shows: raw counts, probability density, or percentage.
- For comparing distributions of different sample sizes, use density (area integrates to 1).
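A quick numeric check makes the point about sample sizes concrete. This sketch assumes two hypothetical groups drawn from the same distribution but with very different sizes, and compares raw counts with density-scaled heights on shared bins.

```python
import numpy as np

rng = np.random.default_rng(2)
group_a = rng.normal(50, 10, size=5000)  # hypothetical large group
group_b = rng.normal(50, 10, size=200)   # hypothetical small group, same distribution
edges = np.linspace(0, 100, 21)          # shared bins, each 5 units wide
widths = np.diff(edges)

counts_a, _ = np.histogram(group_a, bins=edges)
counts_b, _ = np.histogram(group_b, bins=edges)
dens_a, _ = np.histogram(group_a, bins=edges, density=True)
dens_b, _ = np.histogram(group_b, bins=edges, density=True)

# Raw counts differ by roughly the ratio of the sample sizes
print("tallest bar (counts): ", counts_a.max(), "vs", counts_b.max())
# Density heights are on the same scale, and each histogram has total area 1
print("tallest bar (density):", round(dens_a.max(), 4), "vs", round(dens_b.max(), 4))
area_a = (dens_a * widths).sum()
area_b = (dens_b * widths).sum()
print("areas:", area_a, area_b)
```

With counts, the small group looks flat next to the large one even though the shapes match. With density scaling, the two outlines can be overlaid directly.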
4) Log transforms and outliers
- Log/Box–Cox transforms can reveal multiplicative structure.
- Plot on original and log scale for interpretability.
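One way to check whether a log transform is worth it is to compare skewness before and after. The sketch below assumes a strictly positive, hypothetical income sample (use np.log1p or a shift if zeros are present) and a small helper function written for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment over variance^(3/2)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

rng = np.random.default_rng(3)
# Hypothetical heavy-tailed, strictly positive feature
income = rng.lognormal(mean=10.5, sigma=0.8, size=5000)

raw_skew = skewness(income)
log_skew = skewness(np.log(income))
print(f"skewness on original scale: {raw_skew:.2f}")
print(f"skewness after log:         {log_skew:.2f}")
```

A large positive value on the original scale that drops close to zero after the log suggests multiplicative structure, so the log-scale histogram is probably the more readable one.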
5) Categorical bins and stacked histograms
- For categorical groupings, use hue in seaborn to overlay KDEs or to draw stacked/normalized histograms.
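Under the hood, a stacked or normalized histogram is just per-group counts on shared bin edges. Seaborn's histplot exposes this through its multiple parameter ('stack' or 'fill'); the NumPy sketch below reproduces the arithmetic on hypothetical groups so you can see what each bar encodes.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical incomes for two education levels of very different size
groups = {
    "bachelor": rng.lognormal(10.8, 0.5, size=3000),
    "high_school": rng.lognormal(10.3, 0.5, size=600),
}
all_values = np.concatenate(list(groups.values()))
edges = np.histogram_bin_edges(all_values, bins=15)  # shared edges for every group

# One row of counts per group, all on the same bins
counts = np.array([np.histogram(v, bins=edges)[0] for v in groups.values()])
totals = counts.sum(axis=0)  # height of each stacked bar

# Per-bin share of each group ("fill" style); empty bins stay at 0
shares = np.divide(counts, totals, out=np.zeros(counts.shape), where=totals > 0)
for name, row in zip(groups, shares):
    print(f"{name:>12}: {np.round(row, 2)}")
```

Stacked bars answer "how many in total, and who contributes", while the per-bin shares answer "which group dominates at this income level" regardless of group size.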
Code: Seaborn (statistical clarity) and Plotly (interactive finesse)
We used Seaborn earlier for statistical plots — now we put that knowledge to work, and show how Plotly makes the same insights interactive.
Example dataset: df['income'] (post-cleaning: no nulls, handled outliers)
Seaborn: histogram + KDE overlay
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='whitegrid')
plt.figure(figsize=(10,5))
# histogram with KDE overlay — good default
sns.histplot(df['income'], bins='fd', kde=True, stat='density', color='C0', edgecolor='black')
plt.title('Income Distribution — histogram + KDE')
plt.xlabel('Income')
plt.ylabel('Density')
plt.show()
Seaborn: compare two groups with KDEs
plt.figure(figsize=(10,5))
# hue draws one KDE per group; common_norm=False normalizes each curve on its own, so every curve integrates to 1
sns.kdeplot(data=df, x='income', hue='education_level', common_norm=False, fill=True, alpha=0.3)
plt.title('Income by Education Level — KDE Comparison')
plt.show()
Plotly: interactive histogram + marginal density (great for dashboards)
import plotly.express as px
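# histnorm='probability density' makes the bar areas sum to 1; Plotly's plain 'density' only divides counts by bin width
fig = px.histogram(df, x='income', nbins=50, marginal='violin', histnorm='probability density', title='Interactive Income Histogram')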
fig.update_layout(bargap=0.05)
fig.show()
Tip: interactive hover is invaluable when digging into suspicious spikes you saw after cleaning.
Real-world analogies to remember
- Histogram = Lego tower: each bin stacks the bricks (counts).
- KDE = fog machine smoothing the skyline: you get a continuous silhouette of height.
Imagine a concert: a histogram counts ticket holders row by row, while a KDE shows where the crowd is thickest without caring where one row ends and the next begins.
When to prefer which plot (cheat sheet)
- Want raw counts, bins as categories → Histogram
- Want smooth modality / number of peaks → KDE
- Compare many groups (overlap) → Facet histograms or KDEs with transparency/hue
- Different sample sizes → Normalize to density
- Dashboard / user interaction → Plotly + hover + selection
Common mistakes (and how to avoid them)
- Using default bins blindly: always check multiple bin widths (or bins='fd')
- Overlaying too many KDEs without transparency: use facets or reduce opacity
- Comparing raw counts across unequal group sizes: use density normalization
- Forgetting transformations: if the distribution is heavy-tailed, plot on a log scale or transform first
Quick workflow (step-by-step)
- Clean and deduplicate the feature (you've done this in Data Cleaning)
- Plot histogram with automatic bin rule (FD/Scott)
- Overlay KDE to inspect modes
- Try log transform if right-skewed; replot
- Compare subgroups (hue / facet) and ensure normalization
- If building an interactive report, replicate in Plotly for exploration
Key takeaways
- Histograms reveal counts and gaps; KDEs reveal smooth shape and modes.
- Tune bins and bandwidth — defaults are a start, not gospel.
- Normalize when comparing groups of different sizes.
- Use both static (Seaborn) and interactive (Plotly) tools depending on audience — you explored both earlier in the course.
Final thought: a distribution plot is like a doctor’s stethoscope — it won’t diagnose everything, but it tells you whether to run more tests.
Next steps
Try: pick one numeric feature from your cleaned dataset, plot histogram + KDE, then experiment with at least three bin rules and two bandwidths. Note how feature-engineering decisions change (e.g., binning, log transform). Save your favorite plot as an interactive Plotly figure for later exploration.