Data Cleaning and Feature Engineering
Prepare high-quality datasets and craft informative features using robust, repeatable pipelines.
Data Quality Assessment — The Pre-Flight Checklist for Your Dataset
"Trust, but verify — especially when your dataset looks too good to be true."
You already learned how to slice, dice, and visualize data with NumPy, Pandas, Matplotlib, and Seaborn. Great — now imagine that gorgeous dataset is a pizza. Data Quality Assessment is the moment you look under the cheese to see whether it has mold, missing toppings, or a sock baked in. It's the difference between feeding your model nutritious features and feeding it garbage that gives biased predictions and weird confidence intervals.
This guide picks up where performance tuning and visualization left off and gives you a no-nonsense, practical workflow to assess data quality before you start feature engineering.
Why this matters (short version)
- Garbage in → garbage out. Bad data breaks models, silently and painfully.
- Early checks save time. Fix problems now instead of debugging a stubborn model for days.
- Feature engineering depends on quality. You can't engineer a "good" feature from broken primitives.
A practical checklist (what to do, in order)
- Quick overview: shape, dtypes, memory
- Completeness: missing values and their patterns
- Uniqueness & duplication: duplicate rows and ID problems
- Correctness: valid ranges, types, and parsing
- Consistency & integrity: cross-field rules and referential checks
- Distributional sanity: outliers, skew, and class balance
- Temporal and freshness checks: timeliness and leakage risk
I'll show code snippets using Pandas where relevant and point out where the Seaborn/Matplotlib visuals you built earlier come in handy.
1) Quick overview — the 30-second health check
Run these to get the lay of the land:
# assume df is your DataFrame
df.shape # rows, cols
df.info() # dtypes and non-null counts
df.memory_usage(deep=True) # know when to optimize
df.describe(include='all').T # summary stats
Look for suspicious dtypes (numbers stored as objects), huge memory use (strings!), or columns with almost all missing values.
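One concrete check for the first of these: try parsing each object column as numeric and see how much survives. A minimal sketch (the 90% threshold is an arbitrary assumption, tune it to your data):
import pandas as pd

# flag object columns whose values are mostly numbers stored as text
obj_cols = df.select_dtypes(include='object').columns
for col in obj_cols:
    parsed = pd.to_numeric(df[col], errors='coerce')
    if parsed.notna().mean() > 0.9:  # arbitrary threshold
        print(f"{col}: mostly numeric values stored as text - consider converting")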
2) Completeness — what is missing, and why?
Compute missingness per column and per row:
missing_col = df.isnull().mean().sort_values(ascending=False)
missing_row = df.isnull().sum(axis=1)
# columns with >50% missing
high_missing = missing_col[missing_col > 0.5]
Visual tactics: use a heatmap of missingness (Seaborn) or the missingno library. Ask: Is the missingness completely random (MCAR), dependent on other observed variables (MAR), or dependent on the missing values themselves (MNAR)? The answer matters for your imputation strategy.
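A minimal sketch of the heatmap idea (missingno's msno.matrix(df) gives a similar view):
import matplotlib.pyplot as plt
import seaborn as sns

# each cell is True where a value is missing; visible blocks or stripes suggest non-random missingness
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missingness pattern by row and column')
plt.show()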
Quick heuristics:
- If a column is >80% missing, consider dropping it unless it's critical.
- If missingness correlates with the target, it may carry signal — treat it carefully (see the sketch after this list).
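One way to probe that second heuristic is to compare the target across rows where a feature is missing versus present. A sketch, assuming hypothetical columns 'income' (a patchy feature) and 'target' (a numeric label):
# group the target by whether 'income' is missing; a large gap hints that missingness carries signal
miss_flag = df['income'].isnull()
print(df.groupby(miss_flag)['target'].mean())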
3) Uniqueness & duplicates
Find duplicates and check ID columns:
dupes = df[df.duplicated(keep=False)]
id_issues = df['customer_id'].duplicated().sum()
If an ID column isn't unique, ask whether you expect repeats (time series) or not (transaction-level). Merging later will blow up if keys are wrong.
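A minimal sketch for digging into ID repeats — it reuses the 'customer_id' key from above and assumes a 'join_date' column:
# how many rows per customer_id, and are repeated rows distinct events or exact copies?
per_id = df.groupby('customer_id').size().sort_values(ascending=False)
print(per_id.head())

exact_copies = df[df.duplicated(subset=['customer_id', 'join_date'], keep=False)]
print(len(exact_copies), "rows share both customer_id and join_date")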
4) Correctness: types, parsing, and suspicious values
Common problems: numeric values as text, dates as strings, currency with commas, stray whitespace, inconsistent capitalization.
Fix examples:
# convert types
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
# strip whitespace in categorical features
df['city'] = df['city'].str.strip().str.title()
Sanity checks (a sketch of these follows the list):
- Age < 0 or > 120? Suspicious.
- Negative prices or impossible timestamps? Check the source.
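A sketch of those range checks — the column names and bounds are assumptions, so adapt them to your schema:
import pandas as pd

# rows that violate simple domain bounds; assumes 'join_date' was already parsed to datetime above
suspicious_age = df[(df['age'] < 0) | (df['age'] > 120)]
negative_price = df[df['price'] < 0]
future_dates = df[df['join_date'] > pd.Timestamp.now()]
print(len(suspicious_age), "ages,", len(negative_price), "prices,", len(future_dates), "dates look wrong")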
5) Consistency & integrity — cross-field rules
Check domain-specific rules. Examples:
- start_date <= end_date
- shipped_date >= order_date
- payment_amount == sum(line_items)
bad_orders = df[df['start_date'] > df['end_date']]
These are the kinds of bugs that unit tests for your data pipeline should catch.
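One lightweight way to express such rules is a small helper that reports violations. A sketch using the hypothetical columns from the examples above:
def check_rule(df, mask, name):
    # report how many rows violate a rule and return them for inspection
    bad = df[mask]
    print(f"{name}: {len(bad)} violations")
    return bad

check_rule(df, df['start_date'] > df['end_date'], 'start after end')
check_rule(df, df['shipped_date'] < df['order_date'], 'shipped before ordered')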
6) Distributional sanity: outliers, skew, and class balance
Use histograms and boxplots (you already know Seaborn!). Detect outliers with IQR or z-score:
# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['income'] < Q1 - 1.5*IQR) | (df['income'] > Q3 + 1.5*IQR)]
# z-score (computed on the non-null values, then mapped back to the original rows)
import numpy as np
from scipy import stats
income = df['income'].dropna()
z = np.abs(stats.zscore(income))
z_outliers = df.loc[income.index[z > 3]]
But be careful: outliers might be real and important. Ask: Is that income huge because of a data-entry error, or because wealthy clients exist? Visualize by segment (a sketch follows).
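A minimal sketch of that per-segment view, assuming a categorical 'segment' column (swap in whatever grouping fits your data):
import matplotlib.pyplot as plt
import seaborn as sns

# side-by-side boxplots: a value that is normal for one segment may be an error in another
sns.boxplot(data=df, x='segment', y='income')
plt.show()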
Class imbalance: for classification targets, check value counts and consider stratified sampling for train/test.
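For example, a quick balance check plus a stratified split — a sketch that assumes a binary 'churned' target and scikit-learn installed:
from sklearn.model_selection import train_test_split

# class proportions; a heavy skew calls for stratification (and possibly resampling or class weights)
print(df['churned'].value_counts(normalize=True))

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['churned'], random_state=42
)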
7) Temporal checks and leakage risk
If your model uses time, ensure training data precedes test data. Check for data leakage — features derived from the future give unrealistically good performance.
Examples:
- A feature like 'days_since_last_purchase' computed over the entire dataset leaks future information unless it is rolled forward properly (see the sketch after this list).
- Ensure target timestamps are after predictor timestamps.
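One leakage-safe way to build a feature like 'days_since_last_purchase' is to let each row see only strictly earlier purchases. A sketch, assuming hypothetical 'customer_id' and 'purchase_date' columns:
# sort by time, then shift within each customer so each row only sees its previous purchase
df = df.sort_values('purchase_date')
prev_purchase = df.groupby('customer_id')['purchase_date'].shift(1)
df['days_since_last_purchase'] = (df['purchase_date'] - prev_purchase).dt.days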
Quick reference table: problem → detection → action
| Problem | How to detect | Typical action |
|---|---|---|
| Missingness | df.isnull().mean() | impute / drop / model missingness |
| Wrong dtype | df.dtypes / df.info() | parse/astype, coerce errors |
| Duplicates | df.duplicated() | dedupe or aggregate |
| Outliers | boxplot, IQR, z-score | investigate, cap, transform |
| Inconsistent categories | df['cat'].value_counts() | normalize, map typos |
| Time leakage | timestamp checks | recreate features with proper windows |
Practical questions to ask (like a detective)
- Where did the data come from, and could the source introduce systematic bias?
- What business rules must hold? Are they enforced?
- Could missingness carry signal? (e.g., 'no response' means 'disengaged')
- Is this data stable over time, or is it drifting? (A sketch of a simple drift check follows.)
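A crude but useful drift check is to compare summary statistics across time windows. A sketch assuming the 'join_date' timestamp and 'income' column used earlier:
# monthly mean/std/count for one column; sudden jumps suggest drift or an upstream change
monthly = df.set_index('join_date')['income'].resample('M').agg(['mean', 'std', 'count'])
print(monthly.tail())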
Closing: key takeaways and a slightly dramatic mic drop
- Run a structured assessment before feature engineering. Treat it like a pre-flight checklist: skip it and you might crash.
- Use quick Pandas commands for broad strokes, and visualize with Seaborn/Matplotlib for nuance. (You already know these tools.)
- Don’t blindly eliminate outliers or impute everything — investigate and document.
Clean data is quiet. Messy data screams for attention.
Go forth and audit your datasets like the paranoid, brilliant TA you are. Your models will thank you with better calibration, fewer bias scandals, and less time spent debugging improbable errors at 2 a.m.
Version notes: This lesson builds on your previous work with array/tabular manipulation and visualization and focuses on detecting problems before you engineer features.