Data Visualization and Storytelling
Explore and communicate insights with clear, accessible visuals using Matplotlib, Seaborn, and Plotly.
Scatterplots and Pair Plots — Seeing Relationships Like a Data Detective
"If histograms tell you what lives in a single column, scatterplots show you who’s gossiping with whom." — Your slightly obsessed data TA
You've already learned how to inspect single-variable shapes with histograms and density plots, and got hands-on with interactive visuals using Plotly. Earlier, in Data Cleaning and Feature Engineering, you prepared polished features (no leakage, please). Now we'll take those cleaned features out on a date: scatterplots and pair plots. These are the charts that reveal relationships, clusters, and the awkward correlations you didn't want to find.
Why scatterplots and pair plots matter (and when to use them)
- Scatterplots are the go-to for visualizing relationships between two continuous variables. They answer: Does X change when Y changes? Are they correlated? Linear? Noisy?
- Pair plots (scatterplot matrices) let you inspect many pairwise relationships at once — ideal after feature engineering to quickly sanity-check multiple variables.
Real-life appearances:
- Exploring whether advertising spend (X) relates to sales (Y).
- Checking feature redundancy before model training (are two features nearly identical?).
- Spotting non-linearities that suggest feature transforms (log, sqrt) or interactions.
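To make the last point concrete, here is a minimal sketch on synthetic data (the exponential relationship and noise level are assumptions chosen for illustration, not taken from any real dataset). It compares Pearson's r before and after a log transform of the curved variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y grows exponentially with x, with multiplicative noise.
x = rng.uniform(0, 5, size=500)
y = np.exp(x) * rng.lognormal(mean=0.0, sigma=0.3, size=500)

r_raw = np.corrcoef(x, y)[0, 1]          # Pearson r on the raw values
r_log = np.corrcoef(x, np.log(y))[0, 1]  # Pearson r after log-transforming y

print(f"raw r = {r_raw:.2f}, log r = {r_log:.2f}")
```

If the scatterplot of the raw values bends like a hockey stick and the transformed one looks like a straight band, the transform is probably worth keeping as a feature.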
The basics — how to read a scatterplot
- Positive slope → variables increase together.
- Negative slope → one increases as the other decreases.
- No pattern → likely no linear relationship (but might be non-linear).
- Clusters → subgroups or segmentation.
- Outliers → data points shouting "look at me"; investigate!
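Micro explanation: Points hugging a straight line mean a strong linear correlation; a fuzzy cloud means a weak one. A clear curve can still score close to zero, because correlation only measures linear association. And correlation ≠ causation: the internet already knows this phrase, you should too.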
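The reading rules above can be checked numerically. The sketch below uses made-up synthetic data (the slopes and noise levels are assumptions) to compare Pearson's r for a tight line, a fuzzy cloud, and a U-shaped pattern that is strongly related but not linear.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=1000)

patterns = {
    "tight": 2 * x + rng.normal(0, 0.2, size=x.size),   # points hug a line
    "fuzzy": 2 * x + rng.normal(0, 10, size=x.size),    # same slope, far more noise
    "u_shape": x**2 + rng.normal(0, 0.2, size=x.size),  # strong but non-linear
}

# Pearson r for each pattern against the same x values.
r_values = {name: float(np.corrcoef(x, y)[0, 1]) for name, y in patterns.items()}

for name, r in r_values.items():
    print(f"{name:8s} r = {r:+.2f}")
```

The U-shape is the reason to look at the plot and not only at the coefficient: a near-zero r there says "not linear", not "unrelated".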
Practical plotting: quick Python recipes
Seaborn scatterplot (annotated)
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style='whitegrid')

# df is your cleaned DataFrame from feature engineering
# hue colors points by category; size scales markers by a numeric column
sns.scatterplot(data=df, x='feature_a', y='feature_b',
                hue='category', size='importance', alpha=0.7)
plt.title('Feature A vs Feature B by Category')
plt.show()
Tips:
- hue adds a categorical split (color). Great for storytelling: "Here's where each class tends to live."
- size can map importance or another continuous value. Use sparingly.
- alpha combats overplotting.
Add a regression line (Seaborn lmplot)
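# lmplot builds its own figure and returns a FacetGrid;
# with hue set, it fits one regression line per category
g = sns.lmplot(data=df, x='feature_a', y='target', hue='category',
               scatter_kws={'alpha': 0.5}, ci=95)
g.ax.set_title('Trend of Target vs Feature A')
plt.show()
Use this to show the direction of a linear trend; the spread of points around the line and the width of the confidence band hint at its strength. If the line looks flat, there is little linear signal, so check the scatter for curvature or subgroups and consider transformations or interactions.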
Pair plots: compare everything at once
The quickest sanity check after feature engineering:
sns.pairplot(df[['feature_a', 'feature_b', 'feature_c', 'category']],
             hue='category', diag_kind='kde', corner=True)
plt.suptitle('Pair Plot of Key Features', y=1.02)
plt.show()
What to watch for in pair plots:
- Diagonal: histograms or KDEs for single variables — remember those from earlier!
- Lower triangle (with corner=True): scatter relationships for each pair.
- Color clusters: well-separated colors indicate class separation, which is good news for classification.
When a pair plot exposes nearly-identical variable pairs, consider removing redundancy or applying dimensionality reduction (PCA) before modeling.
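A pair plot shows redundancy visually; a correlation matrix can confirm it. The sketch below is one possible way to list suspicious pairs. The helper name `redundant_pairs`, the 0.95 threshold, and the demo data are all assumptions for illustration, not a standard API.

```python
import numpy as np
import pandas as pd


def redundant_pairs(df, threshold=0.95):
    """Return (col_a, col_b, r) for numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Hypothetical demo data: feature_b is a rescaled, slightly noisy copy of feature_a.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "feature_a": a,
    "feature_b": 3 * a + rng.normal(scale=0.05, size=300),
    "feature_c": rng.normal(size=300),  # unrelated feature
})

pairs = redundant_pairs(demo)
print(pairs)
```

The threshold is a judgment call. Pearson's r only catches linear redundancy, so a pair that passes this check can still be redundant in a curved way that the pair plot would reveal.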
Interactive pair plots — bring the charts to life (Plotly)
If you loved Plotly for interactivity: use plotly.express.scatter_matrix. Hover to inspect points (great for storytelling slides).
import plotly.express as px
# 'id_col' stands in for whatever identifier column you want shown on hover
fig = px.scatter_matrix(df, dimensions=['feature_a', 'feature_b', 'feature_c'],
                        color='category', hover_data=['id_col'])
fig.update_layout(width=900, height=900)
fig.show()
Why interactive? Because you can point at a curious cluster in a presentation and say, "Click that, and here's the outlier's case study." Audiences love clicking.
Practical caveats — don't let pretty plots lie to you
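- Scale matters: each scatterplot axis is scaled on its own, so the usual trouble is skew (a few huge values squash everything else into a corner) or features forced onto a shared axis. Consider a log transform for skewed features and standardization when features share an axis.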
- Overplotting: For millions of points, use alpha, hexbin, or 2D KDEs instead of raw scatter.
- Feature leakage: Avoid plotting features that contain future information about the target. That gorgeous tight correlation could be a data crime scene.
- Categorical fuzz: Beware encoding tricks — plotting label-encoded categories as numeric can imply order that doesn't exist.
Quick fixes:
- Hexbin: plt.hexbin(x, y, gridsize=50)
- 2D KDE: sns.kdeplot(x=..., y=..., fill=True)
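As a fuller version of the hexbin quick fix, here is a small sketch on synthetic data (the point count, slope, and noise are assumptions). It bins 200,000 points into hexagons and colors each hexagon by how many points fall inside it.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Hypothetical large dataset: far too many points for a raw scatter to be readable.
x = rng.normal(size=200_000)
y = 0.6 * x + rng.normal(scale=0.8, size=x.size)

fig, ax = plt.subplots(figsize=(6, 5))
# mincnt=1 leaves empty hexagons blank instead of painting them the lowest color.
hb = ax.hexbin(x, y, gridsize=50, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="points per bin")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Hexbin of 200,000 points")
```

Call plt.show() (or fig.savefig) to render it. A larger gridsize gives finer bins at the cost of noisier counts.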
Storytelling with scatterplots — make a point, don't just show data
- Start with a claim: "Increasing X tends to increase Y — here's the evidence." Then show the plot.
- Use annotations to highlight an outlier or inflection point.
- Bring in color deliberately: use color for variables you want the audience to compare.
- Show the pair plot first to justify choosing two or three features for a deeper dive.
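For the annotation tip, one possible approach is sketched below. The data is synthetic with a single planted outlier, and "largest residual from a least-squares line" is just one assumed way to pick the point worth labeling.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Hypothetical data: a clean linear trend plus one planted outlier at (2, 25).
x = rng.uniform(0, 10, size=100)
y = 1.5 * x + rng.normal(scale=1.0, size=100)
x = np.append(x, 2.0)
y = np.append(y, 25.0)

# Fit a straight line and flag the point that sits farthest from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
idx = int(np.argmax(np.abs(residuals)))

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.7)
xs = np.sort(x)
ax.plot(xs, slope * xs + intercept, color="black", linewidth=1)
ax.annotate(
    "Outlier: investigate",
    xy=(x[idx], y[idx]),                 # the point being labeled
    xytext=(x[idx] + 1.5, y[idx] - 3),   # where the label text sits
    arrowprops={"arrowstyle": "->"},
)
ax.set_xlabel("x")
ax.set_ylabel("y")
```

One labeled point with a short caption usually carries the story better than several competing callouts.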
Example narrative arc:
- Show pair plot to identify the strongest relationship.
- Zoom into the specific scatterplot, add a regression line and annotations.
- Explain possible reasons and next analysis steps (transformations, segmentation).
Quick checklist before you present scatterplots
- Data cleaned and transformed (no leakage) — thanks, feature engineering.
- Scales handled (log/standardize) if needed.
- Overplotting mitigated.
- Color choices accessible (check colorblind-friendly palettes).
- Annotations ready to tell the story, not just decorate.
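For the accessibility item, a minimal sketch follows, assuming Seaborn's built-in "colorblind" palette suits your categories; the demo DataFrame and its column names are hypothetical. It also maps category to marker shape so that color is not the only cue.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(5)

# Hypothetical demo data with three categories.
demo = pd.DataFrame({
    "feature_a": rng.normal(size=150),
    "feature_b": rng.normal(size=150),
    "category": np.repeat(["A", "B", "C"], 50),
})

palette = sns.color_palette("colorblind", n_colors=3)

# hue and style both encode category: color plus marker shape.
ax = sns.scatterplot(data=demo, x="feature_a", y="feature_b",
                     hue="category", style="category",
                     palette=palette, alpha=0.8)
ax.set_title("Colorblind-friendly palette with marker shapes")
```

Encoding the same variable twice looks redundant on screen, but it keeps the groups distinguishable in grayscale printouts and for readers with color vision deficiencies.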
Key takeaways
- Scatterplots reveal pairwise relationships; pair plots reveal the social network of your features.
- Always inspect pair plots after feature engineering to spot redundancy, clusters, or bad surprises.
- Use color, size, and interactivity to tell a story — but control for scale, overplotting, and leakage.
Final thought: a scatterplot isn't just a cloud of points — it's a conversation starter. Make sure your plot says something worth hearing.