Natural Language Processing
Explore the field of natural language processing (NLP) and how AI can understand and generate human language.
Content
Text Preprocessing
Text Preprocessing — The Great Soap-and-Polish of NLP
"If data is the new oil, text preprocessing is what turns crude sludge into something your model won’t explode on." — Probably me, 2 minutes ago
Hook: Why you should care (and fast)
Imagine you fed a deep learning model a dataset where "U.S.A.", "usa", "Usa!" and "United States" are all treated like four different planets. That model will learn useless distinctions and cry during training. Text preprocessing is the set of tidy-up rituals that turn messy, human-written language into a format that models — especially the fancy pretrained ones we love in transfer learning — can actually learn from.
You’ve already met neural nets and transfer learning in "Deep Learning Essentials." Great! Now we’re taking the next logical step: how to prepare raw text so those networks don’t waste their brainpower on punctuation and terrible casing choices.
What is Text Preprocessing (Short, Sassy Definition)
Text preprocessing is the pipeline of cleaning, normalizing, and structuring raw text so downstream models (from classical TF-IDF to Transformer-based giants) can do useful work. Think of it as putting your text through a spa day: exfoliate, hydrate, style.
Why preprocessing matters (practical reasons)
- Reduces noise: Removes artifacts (HTML, emojis, weird encoding) that mislead models.
- Improves generalization: Normalized forms let models learn patterns instead of idiosyncrasies.
- Enables efficient vocabularies: Subword methods and tokenization keep vocab size manageable.
- Works with transfer learning: Some pretrained models expect specific tokenization/casing — break that, and performance drops.
The Core Steps (A Practical Pipeline)
- Normalization & cleaning
- Unicode normalization (NFC/NFKC)
- Remove or fix weird encodings, HTML tags, and control characters
- Handle emojis and non-text symbols (remove them, map them to placeholder tokens such as `<EMOJI>`, or keep them)
- Tokenization
- Word-based (split on whitespace/punctuation)
- Subword (BPE, WordPiece — what modern models use)
- Character-level (for noisy languages or misspellings)
- Lowercasing / Casing decisions
- Lowercase when model expects it; don’t when using a cased pretrained model (e.g., 'bert-base-cased')
- Normalization of numbers, dates, URLs
- Replace them with placeholders such as `<NUM>`, `<DATE>`, or `<URL>` if those semantics suffice
- Stop words / Rare words (optional)
- For classical ML (TF-IDF) remove stop words; for deep learning, often keep them
- Stemming / Lemmatization (optional)
- Useful for search or vocabulary reduction; usually skipped with dense embeddings
- Subword tokenization / Vocab building
- Padding, truncation, attention masks (for batching into models)
- Data augmentation / balancing (if needed)
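The first two steps of the pipeline above can be sketched with nothing but the standard library. The regexes and the `<URL>`/`<NUM>` placeholder tokens here are illustrative assumptions, not a canonical spec — tune them to your corpus:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Minimal normalization-and-cleaning pass (illustrative sketch)."""
    text = unicodedata.normalize("NFKC", raw)             # unify unicode forms
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)  # drop control chars
    text = re.sub(r"https?://\S+", "<URL>", text)         # tag URLs
    text = re.sub(r"\d+", "<NUM>", text)                  # tag numbers
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

print(clean_text("Visit <b>https://example.com</b> now, it costs 42\u00a0USD!"))
# "Visit <URL> now, it costs <NUM> USD!"
```

Note that order matters: strip HTML before tagging URLs, or the tag-stripping regex will eat your `<URL>` placeholders.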
Tokenization: The Real MVP
Tokenization defines the unit a model sees. Bad tokenization = bad understanding.
- Word-level: simple, intuitive, but vocab explodes and OOVs hurt.
- Subword (BPE, WordPiece): splits rare words into common subunits. This is the default for Transformers. It balances vocab size and OOV handling.
- Char-level: robust to misspellings but sequences get long.
Example: "unbelievable"
- Word-level => ["unbelievable"]
- Subword => ["un", "##believable"] (WordPiece style)
- Char-level => ["u","n","b","e",...]
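To make the subword case concrete, here is a toy WordPiece-style splitter: greedy longest-match-first against a vocabulary, with `##` marking word-internal pieces. The tiny hand-picked vocab is an assumption for illustration only — real models ship vocabularies of ~30k entries:

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style (toy sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                      # try the longest substring first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand              # mark word-internal continuation
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                    # nothing matched: unknown token
        pieces.append(piece)
        start = end
    return pieces

# Tiny hand-picked vocab -- an assumption for illustration only
vocab = {"un", "##believable", "##able", "able"}
print(wordpiece_split("unbelievable", vocab))   # ["un", "##believable"]
```

Shrink the vocab and the same word fractures into more, smaller pieces — that graceful degradation is exactly how subwords sidestep the OOV problem.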
Code snippet (Python) for padding/truncation:

```python
def pad_and_truncate(tokens, max_len, pad_token='<PAD>'):
    # Truncate sequences that are too long...
    if len(tokens) > max_len:
        return tokens[:max_len]
    # ...and right-pad sequences that are too short
    return tokens + [pad_token] * (max_len - len(tokens))
```
When using Hugging Face tokenizers, they handle this for you, but you still must set max_length and truncation behavior.
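Attention masks pair naturally with padding: 1 for real tokens, 0 for padding, so the model knows which positions to ignore. A minimal sketch of batching with masks (the function name and return shape are assumptions, not any library's API):

```python
def batch_encode(sequences, max_len, pad_token="<PAD>"):
    """Pad/truncate each sequence and build matching attention masks (sketch)."""
    batch, masks = [], []
    for tokens in sequences:
        kept = tokens[:max_len]                      # truncate
        n_pad = max_len - len(kept)
        batch.append(kept + [pad_token] * n_pad)     # right-pad
        masks.append([1] * len(kept) + [0] * n_pad)  # 1 = real token, 0 = pad
    return batch, masks

batch, masks = batch_encode([["hello", "world"], ["hi"]], max_len=3)
# masks -> [[1, 1, 0], [1, 0, 0]]
```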
Stemming vs Lemmatization vs Subwords (Quick Table)
| Method | What it does | Pros | Cons | When to use |
|---|---|---|---|---|
| Stemming | Heuristic chopping (e.g., 'running' -> 'run') | Fast, reduces vocab | Crude, may cut words badly | IR, quick baselines |
| Lemmatization | Dictionary-based, grammar-aware ('running' -> 'run') | Linguistically correct | Slower, needs POS sometimes | Linguistic tasks, small corpora |
| Subword (BPE/WordPiece) | Split words into frequent subunits | Handles OOVs, compact vocab | Less interpretable | Deep learning, pretrained models |
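The "crude, may cut words badly" con in the stemming row is easy to demonstrate. Here is a deliberately naive suffix-stripping stemmer (my own toy heuristic, not Porter or any real algorithm):

```python
def naive_stem(word):
    """Crude suffix-stripping stemmer -- illustrates why stemming is heuristic."""
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]             # chop the suffix off
    return word

print([naive_stem(w) for w in ["running", "quickly", "cats", "caring"]])
# ["runn", "quick", "cat", "car"]
```

"running" becomes "runn" and "caring" becomes "car" — fast vocabulary reduction, but a lemmatizer (which knows the dictionary forms "run" and "care") would get these right.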
Transfer Learning Gotchas (from Deep Learning Essentials)
You remember transfer learning: load a pretrained model, fine-tune. But beware: if you preprocess text differently from how the model's pretraining data was processed, you sabotage the fine-tuning.
- Pretrained model is cased? Keep case. Lowercasing will mismatch tokenization and embeddings.
- Pretrained tokenizer uses subwords — don’t replace it with naive whitespace tokenization.
- Some models expect special tokens like [CLS], [SEP] — include them or use the tokenizer's encode method.
Pro tip: always use the tokenizer packaged with your pretrained model. It avoids subtle, performance-killing mismatches.
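To see what the tokenizer's encode method does for you, here is a simplified sketch of the special-token wrapping a BERT-style encoder expects. This is a mock for intuition only — in practice, the pretrained tokenizer inserts these for you:

```python
def encode_with_special_tokens(tokens, max_len=8):
    """Wrap a token list the way BERT-style models expect (simplified mock)."""
    body = tokens[: max_len - 2]          # reserve two slots for special tokens
    return ["[CLS]"] + body + ["[SEP]"]   # classification start + separator end

print(encode_with_special_tokens(["un", "##believable"]))
# ["[CLS]", "un", "##believable", "[SEP]"]
```

Forget the `[CLS]`/`[SEP]` wrapping (or the length budget it consumes) and the model receives inputs shaped unlike anything it saw during pretraining.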
Common Pitfalls & How to Avoid Them
- Removing punctuation blindly — loses sentiment or abbreviations (e.g., "U.S.")
- Lowercasing everything for a cased model — you'll confuse token IDs
- Over-aggressively removing stop words when semantic nuance matters
- Not handling class imbalance — preprocessing can't fix labels
Ask yourself: "Will this transform change the meaning I care about?" If yes, be cautious.
Quick Example: Minimal Python-ish Pipeline
```python
# Rough sketch -- helper functions stand in for real implementations
raw = load_text()
clean = unicode_normalize(raw)
clean = remove_html(clean)
if use_pretrained_tokenizer:
    tokens = pretrained_tokenizer.tokenize(clean)
else:
    tokens = basic_tokenize(clean)
tokens = pad_and_truncate(tokens, max_len=128)
```
For production, prefer robust libraries: Hugging Face tokenizers, spaCy, NLTK, or sacremoses for tokenization.
Closing: Your Preprocessing Checklist (Actionable)
- Decide whether to preserve casing (depends on pretrained model)
- Choose tokenization strategy (use model's tokenizer if using transfer learning)
- Normalize unicode and fix encodings
- Replace or tag URLs, emails, and numbers if needed
- Implement padding/truncation and attention masks for batching
- Validate with a small model: if performance is poor, inspect tokenization artifacts
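The last checklist item — inspecting tokenization artifacts — can be partly automated. A quick report on average sequence length and unknown-token rate catches many broken pipelines early (the `[UNK]` token name and the metric choices here are assumptions; adapt them to your tokenizer):

```python
def tokenization_report(tokenized_docs, unk_token="[UNK]"):
    """Quick sanity check: average length and unknown-token rate (sketch)."""
    total = sum(len(t) for t in tokenized_docs)
    unks = sum(tok == unk_token for t in tokenized_docs for tok in t)
    return {
        "avg_len": total / len(tokenized_docs),     # mean tokens per document
        "unk_rate": unks / total if total else 0.0, # share of unknown tokens
    }

report = tokenization_report([["hello", "[UNK]"], ["hi", "there"]])
# {"avg_len": 2.0, "unk_rate": 0.25}
```

A spiking `unk_rate` usually means your cleaning step and your tokenizer disagree about what the text should look like.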
Final thought: Preprocessing isn't glamorous, but it's where many wins happen. If your model's training looks like chaos, your preprocessing probably is, too. Tune the pipeline before you tune the learning rate.
Versioning
- Version note: This lesson builds on Deep Learning Essentials (transfer learning and challenges) by showing how the input side must be prepared for those models to shine.
Key Takeaways
- Preprocessing shapes everything: it affects vocab, embeddings, and downstream performance.
- Match your preprocessing to your model: use the tokenizer the model was trained with.
- Subwords are queen for modern NLP; stemming/lemmatization are older but still useful in niche cases.
- Test and inspect: always look at tokenized examples and sanity-check outputs.
Go on — tame that raw text. Your models will thank you with fewer exploding gradients and more useful predictions.