Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Text Cleaning Basics — Make Your Text Not Garbage
Ever tried building a model on raw text and felt like it learned to predict typos, emojis, and your dataset's weird formatting rather than anything useful? Welcome to the club. We're cleaning that club.
You're already comfortable with numeric feature tricks (feature interactions, polynomials) and with discretizing continuous variables into bins. Text is its own beast — noisy, high-cardinality, and dramatic — but the same pipeline thinking applies: clean → transform → featurize → (maybe) interact. And since you've been using pandas for tabular wrangling, we'll re-use those skills heavily here.
Why this matters (quick and spicy)
- Garbage in → garbage features → garbage predictions. Clean text produces stable, meaningful features (n-grams, TF-IDF, embeddings).
- Cleaning reduces dimensionality and noise before vectorization, which helps regularization and downstream interactions (yes, you can combine text-derived features with numeric polynomials or binned lengths).
- Text length and token counts are useful engineered features — remember discretization? You can bin text length the same way.
The text-cleaning recipe (high level)
- Normalize casing and Unicode
- Remove or normalize noise (URLs, emails, punctuation, emojis) as needed
- Tokenize (split into words or subwords)
- Remove stopwords / rare tokens (or keep depending on task)
- Lemmatize or stem to reduce inflections
- Optionally encode feature-level things (length, entropy, counts)
Think of it as giving text a spa day: exfoliation (punctuation), moisturizing (normalizing case), a haircut (lemmatize/stem), and a new outfit (vectorize).
Practical steps with pandas (because you already know it)
pandas' .str accessor is your best friend for column-wise string ops. You also know joins/time-series/I/O — so load, clean, and merge cleaned text back into your DataFrame with ease.
Example: a small pipeline for a DataFrame column df['text'].
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (safe to re-run)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Sample
df = pd.DataFrame({'text': [
    "I LOVE cats!!! Visit http://cats.example 🐱",
    "Discount at abc@shop.com - 50% OFF!!",
    "I'm going to the store... won't be long",
]})

# 1) Lowercase & normalize whitespace
df['clean'] = df['text'].str.lower().str.strip()

# 2) Remove URLs and emails
url_re = r'https?://\S+|www\.\S+'
email_re = r'\S+@\S+'
df['clean'] = df['clean'].str.replace(url_re, ' ', regex=True)
df['clean'] = df['clean'].str.replace(email_re, ' ', regex=True)

# 3) Remove punctuation (keep apostrophes so contractions survive)
df['clean'] = df['clean'].str.replace(r"[^\w\s']", ' ', regex=True)

# 4) Tokenize, remove stopwords, lemmatize
stop = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(s):
    tokens = s.split()
    tokens = [t for t in tokens if t not in stop]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

df['clean'] = df['clean'].apply(tokenize_and_lemmatize)
print(df)
Micro explanation: use regex to strip known noise (URLs, emails). Keep contractions if they carry sentiment ("don't" vs "do not") — that's a modeling choice.
Common cleaning tasks, when to do them, and why
- Lowercasing: usually safe for English-like tasks, keeps vocabulary smaller.
- Unicode normalization (e.g., NFKC): folds compatibility characters — ligatures, full-width letters, non-breaking spaces — into canonical forms; pair NFKD with combining-mark removal to strip accents. Use unicodedata.normalize.
- Strip HTML: when text comes from web scraping. Use BeautifulSoup or regex carefully.
- URLs / emails / handles: often collapsed to a single placeholder token (e.g., a "URL" marker), removed outright, or replaced with a presence flag. Keeping presence information is useful for spam detection.
- Numbers: convert to a placeholder token, remove, or keep depending on domain (financial data vs casual text).
- Punctuation: remove for bag-of-words; keep or transform for models that use syntax (language models, transformers).
- Stopwords: remove to shrink vocabulary and focus on content words — but keep them for sentiment or stylistic tasks.
- Tokenization: whitespace split is simple; use nltk/spacy/transformers tokenizers for better language handling.
- Stemming vs Lemmatization: stemming is fast and rough (PorterStemmer), lemmatization is linguistically better but needs POS for best results.
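To make the Unicode point above concrete, here is a minimal sketch using only the standard library's unicodedata module. The helper name strip_accents is my own for illustration; NFKC folds compatibility characters, while NFKD plus dropping combining marks strips accents:

```python
import unicodedata

def strip_accents(s: str) -> str:
    # NFKD splits accented characters into base letter + combining mark;
    # dropping the combining marks leaves the plain base letters
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# NFKC folds the 'fi' ligature and full-width letters into plain ASCII
print(unicodedata.normalize('NFKC', 'ﬁnance ＡＢＣ'))  # finance ABC
print(strip_accents('café résumé'))                    # cafe resume
```

Note that NFKC alone does not strip accents — "café" stays "café" — which is why the two-step NFKD approach exists.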
Small but powerful engineered features from text
- Text length (chars) — you can bin this like we discussed in Feature Binning and Discretization
- Token count, average token length
- Fraction of punctuation, fraction of uppercase characters
- Number of URLs / mentions / emojis
- TF-IDF or count-based n-grams — these are ready for interaction with numeric features (remember Feature Interactions and Polynomials?)
Quick pandas example for length-based binning:
df['char_len'] = df['text'].str.len()
# bin into short/medium/long like discretization
df['len_bin'] = pd.qcut(df['char_len'], q=3, labels=['short','med','long'])
This is the same discretization idea you used for numeric features — now applied to text length.
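The other engineered features in the list above come out of the same vectorized .str toolkit. A minimal sketch (column names are my own choices):

```python
import pandas as pd

df = pd.DataFrame({'text': [
    "I LOVE cats!!! Visit http://cats.example",
    "Discount at abc@shop.com - 50% OFF!!",
]})

# Simple numeric features derived from raw (uncleaned!) text
df['char_len'] = df['text'].str.len()
df['token_count'] = df['text'].str.split().str.len()
df['punct_frac'] = df['text'].str.count(r'[^\w\s]') / df['char_len']
df['upper_frac'] = df['text'].str.count(r'[A-Z]') / df['char_len']
df['n_urls'] = df['text'].str.count(r'https?://\S+')

print(df[['token_count', 'n_urls']])
```

Compute these on the raw text, before cleaning — uppercase ratio and punctuation density are exactly the signal cleaning would erase.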
Tools & libraries cheat-sheet
- pandas .str (fast, vectorized) — use for quick cleaning
- re (regex) — for precise patterns (URLs, emojis, custom rules)
- nltk — tokenization, stopwords, stemming, lemmatization
- spaCy — production-grade tokenization, POS, lemmatization
- transformers / Sentence-BERT — for embeddings (post-cleaning decisions vary)
Rule of thumb: prefer vectorized pandas operations when cleaning large DataFrames for speed; fall back to apply() for complex token-level logic, or use libraries with batch APIs.
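A quick sketch of that trade-off — both lines below produce identical output, but the .str version dispatches to optimized internals while apply() calls Python once per row:

```python
import pandas as pd

df = pd.DataFrame({'text': ['Hello, WORLD!!', '  extra  spaces  ']})

# Vectorized: chainable, no per-row Python overhead
vec = df['text'].str.lower().str.strip()

# apply(): flexible, but invokes a Python callable per row --
# reserve it for logic that .str can't express
app = df['text'].apply(lambda s: s.lower().strip())

assert vec.equals(app)
```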
"This is the moment where the concept finally clicks."
When you realize cleaning is not about making text 'pure' — it's about making text useful for the next step in your pipeline.
Tips, gotchas, and decisions you must make
- Keep copies of raw text. Always. The cleaned version is an experiment.
- Be consistent between training and inference. Cleaning must be identical in both phases (consider a reusable function or transformer).
- Over-cleaning can remove signal (e.g., emojis for sentiment). Choose cleaning based on downstream task.
- For deep learning/transformers, less aggressive cleaning is often better — they handle tokens differently.
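One way to enforce the train/inference consistency mentioned above is a single cleaning function that both phases import — a minimal sketch (the name clean_text and the exact regexes are illustrative, not prescriptive):

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')

def clean_text(s: str) -> str:
    """Single source of truth: call this in training AND at inference."""
    s = s.lower().strip()
    s = URL_RE.sub(' ', s)            # drop URLs
    s = re.sub(r"[^\w\s']", ' ', s)   # drop punctuation, keep apostrophes
    s = re.sub(r'\s+', ' ', s).strip()  # collapse whitespace
    return s

# Training:  df['clean'] = df['text'].map(clean_text)
# Inference: clean_text(incoming_request_text)
print(clean_text("I LOVE cats!!! Visit http://cats.example"))
```

Version this function alongside your model artifacts; a silent mismatch between training-time and serving-time cleaning is a classic source of production skew.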
Key takeaways
- Clean with intention: know your model and task before you scrub everything away.
- Use pandas .str for efficient column operations and leverage previous pandas mastery for I/O and merges.
- Engineer simple numeric text features (length, counts) and consider binning them as you already do for numeric variables.
- After cleaning comes feature extraction: count vectors, TF-IDF, embeddings — which you can then interact with other features (yes, combine n-grams with polynomials!).
Next stop: turn these clean tokens into features — TF-IDF, n-grams, and embeddings — and then watch how your model starts learning actual patterns instead of punctuation drama.