
Data Cleaning and Feature Engineering

Prepare high-quality datasets and craft informative features using robust, repeatable pipelines.

Data Quality Assessment — The Pre-Flight Checklist for Your Dataset

"Trust, but verify — especially when your dataset looks too good to be true."

You already learned how to slice, dice, and visualize data with NumPy, Pandas, Matplotlib, and Seaborn. Great — now imagine that gorgeous dataset is a pizza. Data Quality Assessment is the moment you look under the cheese to see whether it has mold, missing toppings, or a sock baked in. It's the difference between feeding your model nutritious features and feeding it garbage that gives biased predictions and weird confidence intervals.

This guide picks up where performance tuning and visualization left off and gives you a no-nonsense, practical workflow to assess data quality before you start feature engineering.


Why this matters (short version)

  • Garbage in → garbage out. Bad data breaks models, silently and painfully.
  • Early checks save time. Fix problems now instead of debugging a stubborn model for days.
  • Feature engineering depends on quality. You can't engineer a "good" feature from broken primitives.

A practical checklist (what to do, in order)

  1. Quick overview: shape, dtypes, memory
  2. Completeness: missing values and their patterns
  3. Uniqueness & duplication: duplicate rows and ID problems
  4. Correctness: valid ranges, types, and parsing
  5. Consistency & integrity: cross-field rules and referential checks
  6. Distributional sanity: outliers, skew, and class balance
  7. Temporal and freshness checks: timeliness and leakage risk

I'll show code snippets using Pandas where relevant, and point out where the Seaborn/Matplotlib visualization skills you built earlier pay off.


1) Quick overview — the 30-second health check

Run these to get the lay of the land:

# assume df is your DataFrame
df.shape          # rows, cols
df.info()         # dtypes and non-null counts
df.memory_usage(deep=True)  # know when to optimize
df.describe(include='all').T  # summary stats

Look for suspicious dtypes (numbers stored as objects), huge memory use (strings!), or columns with almost all missing values.
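
If you'd rather automate that eyeball pass, here's a minimal sketch (assuming pandas is imported as pd and df is your DataFrame, as above) that flags object columns which are probably numbers in disguise, and columns that are nearly empty:

import pandas as pd

# object columns that mostly parse as numbers are probably mis-typed
for col in df.select_dtypes(include='object'):
    parsed = pd.to_numeric(df[col], errors='coerce')
    if parsed.notna().mean() > 0.9:   # >90% of values parse as numeric
        print(f"{col}: stored as object but looks numeric")

# columns that are almost entirely missing
nearly_empty = df.columns[df.isnull().mean() > 0.95]
print("nearly empty:", list(nearly_empty))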


2) Completeness — what is missing, and why?

Compute missingness per column and per row:

missing_col = df.isnull().mean().sort_values(ascending=False)
missing_row = df.isnull().sum(axis=1)

# columns with >50% missing
high_missing = missing_col[missing_col > 0.5]

Visual tactics: use a heatmap of missingness (Seaborn) or the missingno library. Ask: Is missingness random (MCAR), dependent on other variables (MAR), or not at random (MNAR)? That matters for imputation strategy.

Quick heuristics:

  • If a column is >80% missing, consider dropping it unless it's critical.
  • If missingness correlates with the target, it may carry signal — treat it carefully (one quick check is sketched below).
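
One way to act on both points, using the Seaborn skills from earlier (the 'income' and 'target' column names here are illustrative, not from any particular dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# each light/dark stripe is a null pattern across rows
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# does missingness in 'income' move with the target? (hypothetical columns)
print(df.groupby(df['income'].isnull())['target'].mean())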

3) Uniqueness & duplicates

Find duplicates and check ID columns:

dupes = df[df.duplicated(keep=False)]
id_issues = df['customer_id'].duplicated().sum()

If an ID column isn't unique, ask whether you expect repeats (time series) or not (transaction-level). Merging later will blow up if keys are wrong.
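
A cheap guardrail before any merge is to assert the key property you expect ('customer_id' here is just an example key):

# fail fast if a key you expect to be unique isn't
assert df['customer_id'].is_unique, "customer_id has duplicates"

# or inspect the offenders instead of asserting
repeats = df[df['customer_id'].duplicated(keep=False)].sort_values('customer_id')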


4) Correctness: types, parsing, and suspicious values

Common problems: numeric values as text, dates as strings, currency with commas, stray whitespace, inconsistent capitalization.

Fix examples:

import pandas as pd

# convert types
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
# strip whitespace in categorical features
df['city'] = df['city'].str.strip().str.title()

Sanity checks:

  • Age < 0 or > 120? Suspicious.
  • Negative prices or impossible timestamps? Check the source (both checks are sketched below).
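
Both checks reduce to boolean masks you can count and inspect. A minimal sketch, assuming illustrative 'age' and 'price' columns:

# flag values outside plausible ranges (NaNs are handled separately)
bad_age = df['age'].notna() & ~df['age'].between(0, 120)
bad_price = df['price'].notna() & (df['price'] < 0)   # 'price' is illustrative
print(bad_age.sum(), "suspicious ages;", bad_price.sum(), "negative prices")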

5) Consistency & integrity — cross-field rules

Check domain-specific rules. Examples:

  • start_date <= end_date
  • shipped_date >= order_date
  • payment_amount == sum(line_items)

bad_orders = df[df['start_date'] > df['end_date']]

These are the kind of bugs that unit tests for your data pipeline should catch.
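
Here's a minimal sketch of what such a test could look like as a plain function (the rule set and column names are examples, not a standard API):

def check_order_integrity(df):
    """Return rule name -> number of violating rows."""
    rules = {
        'start_before_end': df['start_date'] > df['end_date'],
        'shipped_after_order': df['shipped_date'] < df['order_date'],
    }
    return {name: int(mask.sum()) for name, mask in rules.items()}

violations = check_order_integrity(df)
assert all(v == 0 for v in violations.values()), violations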


6) Distributional sanity: outliers, skew, and class balance

Use histograms and boxplots (you already know Seaborn!). Detect outliers with IQR or z-score:

# IQR method
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['income'] < Q1 - 1.5*IQR) | (df['income'] > Q3 + 1.5*IQR)]

# z-score (compute on the non-null values, then map back to row labels)
import numpy as np
from scipy import stats

income = df['income'].dropna()
z = np.abs(stats.zscore(income))
z_outliers = df.loc[income.index[z > 3]]

But caution: outliers might be real and important. Ask: Is income huge because of data entry error or because rich clients exist? Visualize by segment.
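
For the visualize-by-segment step, one Seaborn line goes a long way (the 'segment' column here is hypothetical):

import seaborn as sns
import matplotlib.pyplot as plt

# compare income distributions across segments before deleting anything
sns.boxplot(data=df, x='segment', y='income')
plt.show()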

Class imbalance: for classification targets, check value counts and consider stratified sampling for train/test.
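
Checking balance and stratifying the split takes two lines with pandas and scikit-learn. A sketch, assuming a classification column named 'target':

from sklearn.model_selection import train_test_split

print(df['target'].value_counts(normalize=True))   # class proportions
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df['target'], random_state=42
)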


7) Temporal checks and leakage risk

If your model uses time, ensure training data precedes test data. Check for data leakage — features derived from the future give unrealistically good performance.

Examples:

  • A feature like 'days_since_last_purchase' computed over the entire dataset leaks future information unless it is rolled forward properly (a leakage-safe version is sketched below).
  • Ensure target timestamps are after predictor timestamps.
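
A minimal leakage-safe sketch, assuming an illustrative datetime column 'event_time': split by time, and compute backward-looking features only.

# time-based split: everything before the cutoff trains, the rest tests
cutoff = df['event_time'].quantile(0.8)
train_df = df[df['event_time'] <= cutoff]
test_df = df[df['event_time'] > cutoff]

# a 'days since last purchase' that only looks backwards per customer
df = df.sort_values(['customer_id', 'event_time'])
df['days_since_last_purchase'] = (
    df.groupby('customer_id')['event_time'].diff().dt.days
)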

Quick reference table: problem → detection → action

Problem                   How to detect                 Typical action
Missingness               df.isnull().mean()            impute / drop / model missingness
Wrong dtype               df.dtypes / df.info()         parse / astype, coerce errors
Duplicates                df.duplicated()               dedupe or aggregate
Outliers                  boxplot, IQR, z-score         investigate, cap, transform
Inconsistent categories   df['cat'].value_counts()      normalize, map typos
Time leakage              timestamp checks              recreate features with proper windows

Practical questions to ask (like a detective)

  • Where did the data come from, and could the source introduce systematic bias?
  • What business rules must hold? Are they enforced?
  • Could missingness carry signal? (e.g., 'no response' means 'disengaged')
  • Is this data stable over time or drifting?

Closing: key takeaways and a slightly dramatic mic drop

  • Run a structured assessment before feature engineering. Treat it like a pre-flight checklist: skip it and you might crash.
  • Use quick Pandas commands for broad strokes, and visualize with Seaborn/Matplotlib for nuance. (You already know these tools.)
  • Don’t blindly eliminate outliers or impute everything — investigate and document.

Clean data is quiet. Messy data screams for attention.

Go forth and audit your datasets like the paranoid, brilliant TA you are. Your models will thank you with better calibration, fewer bias scandals, and less time spent debugging improbable errors at 2 a.m.


Note: This lesson builds on your previous work with array/tabular manipulation and visualization, and focuses on detecting problems before you engineer features.
