Data Cleaning and Feature Engineering
Prepare high-quality datasets and craft informative features using robust, repeatable pipelines.
Data Quality Assessment — The Pre-Flight Checklist for Your Dataset
"Trust, but verify — especially when your dataset looks too good to be true."
You already learned how to slice, dice, and visualize data with NumPy, Pandas, Matplotlib, and Seaborn. Great — now imagine that gorgeous dataset is a pizza. Data Quality Assessment is the moment you look under the cheese to see whether it has mold, missing toppings, or a sock baked in. It's the difference between feeding your model nutritious features and feeding it garbage that gives biased predictions and weird confidence intervals.
This guide picks up where performance tuning and visualization left off and gives you a no-nonsense, practical workflow to assess data quality before you start feature engineering.
Why this matters (short version)
- Garbage in → garbage out. Bad data breaks models, silently and painfully.
- Early checks save time. Fix problems now instead of debugging a stubborn model for days.
- Feature engineering depends on quality. You can't engineer a "good" feature from broken primitives.
A practical checklist (what to do, in order)
- Quick overview: shape, dtypes, memory
- Completeness: missing values and their patterns
- Uniqueness & duplication: duplicate rows and ID problems
- Correctness: valid ranges, types, and parsing
- Consistency & integrity: cross-field rules and referential checks
- Distributional sanity: outliers, skew, and class balance
- Temporal and freshness checks: timeliness and leakage risk
I'll show code snippets using Pandas where relevant and point out where the Seaborn/Matplotlib visuals you built earlier come in handy.
1) Quick overview — the 30-second health check
Run these to get the lay of the land:
# assume df is your DataFrame
df.shape # rows, cols
df.info() # dtypes and non-null counts
df.memory_usage(deep=True) # know when to optimize
df.describe(include='all').T # summary stats
Look for suspicious dtypes (numbers stored as objects), huge memory use (strings!), or columns with almost all missing values.
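One concrete check for the first of these: try parsing each object column as numeric and see how much survives. A minimal sketch (the 90% threshold is an arbitrary assumption, tune it to your data):
import pandas as pd

# flag object columns whose values are mostly numbers stored as text
obj_cols = df.select_dtypes(include='object').columns
for col in obj_cols:
    parsed = pd.to_numeric(df[col], errors='coerce')
    if parsed.notna().mean() > 0.9:  # arbitrary threshold
        print(f"{col}: mostly numeric values stored as text - consider converting")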
2) Completeness — what is missing, and why?
Compute missingness per column and per row:
missing_col = df.isnull().mean().sort_values(ascending=False)
missing_row = df.isnull().sum(axis=1)
# columns with >50% missing
high_missing = missing_col[missing_col > 0.5]
Visual tactics: use a heatmap of missingness (Seaborn) or the missingno library. Ask: Is the missingness completely random (MCAR), dependent on other observed variables (MAR), or dependent on the missing values themselves (MNAR)? The answer matters for your imputation strategy.
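A minimal sketch of the heatmap idea (missingno's msno.matrix(df) gives a similar view):
import matplotlib.pyplot as plt
import seaborn as sns

# each cell is True where a value is missing; visible blocks or stripes suggest non-random missingness
sns.heatmap(df.isnull(), cbar=False)
plt.title('Missingness pattern by row and column')
plt.show()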
Quick heuristics:
- If a column is >80% missing, consider dropping it unless it's critical.
- If missingness correlates with the target, it may carry signal — treat it carefully (see the sketch after this list).
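One way to probe that second heuristic is to compare the target across rows where a feature is missing versus present. A sketch, assuming hypothetical columns 'income' (a patchy feature) and 'target' (a numeric label):
# group the target by whether 'income' is missing; a large gap hints that missingness carries signal
miss_flag = df['income'].isnull()
print(df.groupby(miss_flag)['target'].mean())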
3) Uniqueness & duplicates
Find duplicates and check ID columns:
dupes = df[df.duplicated(keep=False)]
id_issues = df['customer_id'].duplicated().sum()
If an ID column isn't unique, ask whether you expect repeats (time series) or not (transaction-level). Merging later will blow up if keys are wrong.
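A minimal sketch for digging into ID repeats — it reuses the 'customer_id' key from above and assumes a 'join_date' column:
# how many rows per customer_id, and are repeated rows distinct events or exact copies?
per_id = df.groupby('customer_id').size().sort_values(ascending=False)
print(per_id.head())

exact_copies = df[df.duplicated(subset=['customer_id', 'join_date'], keep=False)]
print(len(exact_copies), "rows share both customer_id and join_date")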
4) Correctness: types, parsing, and suspicious values
Common problems: numeric values as text, dates as strings, currency with commas, stray whitespace, inconsistent capitalization.
Fix examples:
# convert types
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
# strip whitespace in categorical features
df['city'] = df['city'].str.strip().str.title()
Sanity checks (a sketch of these follows the list):
- Age < 0 or > 120? Suspicious.
- Negative prices or impossible timestamps? Check the source.
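A sketch of those range checks — the column names and bounds are assumptions, so adapt them to your schema:
import pandas as pd

# rows that violate simple domain bounds; assumes 'join_date' was already parsed to datetime above
suspicious_age = df[(df['age'] < 0) | (df['age'] > 120)]
negative_price = df[df['price'] < 0]
future_dates = df[df['join_date'] > pd.Timestamp.now()]
print(len(suspicious_age), "ages,", len(negative_price), "prices,", len(future_dates), "dates look wrong")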
5) Consistency & integrity — cross-field rules
Check domain-specific rules. Examples:
- start_date <= end_date
- shipped_date >= order_date
- payment_amount == sum(line_items)
bad_orders = df[df['start_date'] > df['end_date']]
These are the kinds of bugs that unit tests for your data pipeline should catch.
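One lightweight way to express such rules is a small helper that reports violations. A sketch using the hypothetical columns from the examples above:
def check_rule(df, mask, name):
    # report how many rows violate a rule and return them for inspection
    bad = df[mask]
    print(f"{name}: {len(bad)} violations")
    return bad

check_rule(df, df['start_date'] > df['end_date'], 'start after end')
check_rule(df, df['shipped_date'] < df['order_date'], 'shipped before ordered')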
6) Distributional sanity: outliers, skew, and class balance
Use histograms and boxplots (you already know Seaborn!). Detect outliers with IQR or z-score:
# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['income'] < Q1 - 1.5*IQR) | (df['income'] > Q3 + 1.5*IQR)]
# z-score (computed on the non-null values, then mapped back to the original rows)
import numpy as np
from scipy import stats
income = df['income'].dropna()
z = np.abs(stats.zscore(income))
z_outliers = df.loc[income.index[z > 3]]
But be careful: outliers might be real and important. Ask: Is that income huge because of a data-entry error, or because wealthy clients exist? Visualize by segment (a sketch follows).
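A minimal sketch of that per-segment view, assuming a categorical 'segment' column (swap in whatever grouping fits your data):
import matplotlib.pyplot as plt
import seaborn as sns

# side-by-side boxplots: a value that is normal for one segment may be an error in another
sns.boxplot(data=df, x='segment', y='income')
plt.show()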
Class imbalance: for classification targets, check value counts and consider stratified sampling for train/test.
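For example, a quick balance check plus a stratified split — a sketch that assumes a binary 'churned' target and scikit-learn installed:
from sklearn.model_selection import train_test_split

# class proportions; a heavy skew calls for stratification (and possibly resampling or class weights)
print(df['churned'].value_counts(normalize=True))

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['churned'], random_state=42
)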
7) Temporal checks and leakage risk
If your model uses time, ensure training data precedes test data. Check for data leakage — features derived from the future give unrealistically good performance.
Examples:
- A feature like 'days_since_last_purchase' computed over the entire dataset leaks future information unless it is rolled forward properly (see the sketch after this list).
- Ensure target timestamps are after predictor timestamps.
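One leakage-safe way to build a feature like 'days_since_last_purchase' is to let each row see only strictly earlier purchases. A sketch, assuming hypothetical 'customer_id' and 'purchase_date' columns:
# sort by time, then shift within each customer so each row only sees its previous purchase
df = df.sort_values('purchase_date')
prev_purchase = df.groupby('customer_id')['purchase_date'].shift(1)
df['days_since_last_purchase'] = (df['purchase_date'] - prev_purchase).dt.days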
Quick reference table: problem → detection → action
| Problem | How to detect | Typical action |
|---|---|---|
| Missingness | df.isnull().mean() | impute / drop / model missingness |
| Wrong dtype | df.dtypes / df.info() | parse/astype, coerce errors |
| Duplicates | df.duplicated() | dedupe or aggregate |
| Outliers | boxplot, IQR, z-score | investigate, cap, transform |
| Inconsistent categories | df['cat'].value_counts() | normalize, map typos |
| Time leakage | timestamp checks | recreate features with proper windows |
Practical questions to ask (like a detective)
- Where did the data come from, and could the source introduce systematic bias?
- What business rules must hold? Are they enforced?
- Could missingness carry signal? (e.g., 'no response' means 'disengaged')
- Is this data stable over time, or is it drifting? (A sketch of a simple drift check follows.)
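A crude but useful drift check is to compare summary statistics across time windows. A sketch assuming the 'join_date' timestamp and 'income' column used earlier:
# monthly mean/std/count for one column; sudden jumps suggest drift or an upstream change
monthly = df.set_index('join_date')['income'].resample('M').agg(['mean', 'std', 'count'])
print(monthly.tail())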
Closing: key takeaways and a slightly dramatic mic drop
- Run a structured assessment before feature engineering. Treat it like a pre-flight checklist: skip it and you might crash.
- Use quick Pandas commands for broad strokes, and visualize with Seaborn/Matplotlib for nuance. (You already know these tools.)
- Don’t blindly eliminate outliers or impute everything — investigate and document.
Clean data is quiet. Messy data screams for attention.
Go forth and audit your datasets like the paranoid, brilliant TA you are. Your models will thank you with better calibration, fewer bias scandals, and less time spent debugging improbable errors at 2 a.m.
Version notes: This lesson builds on your previous work with array/tabular manipulation and visualization and focuses on detecting problems before you engineer features.