Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Text Features: Bag-of-Words and TF-IDF — The Text-to-Model Magic Trick
"Words are features. Features are numbers. Numbers soothe the model."
You already learned how to handle categorical features: ordinal vs nominal encodings and the whole buffet of categorical encoding schemes. Text data is basically categorical on steroids — variable length, massive vocabulary, and a flair for drama. In short: we need ways to convert raw text into numeric features that models actually understand. Enter Bag-of-Words and TF-IDF: the classic dynamic duo for turning sentences into feature vectors.
Why this matters (and why your model cares)
- Models need numeric input. Text is not numeric. We must encode it.
- Unlike simple category labels, text has order, frequency, and context — but early transforms usually ignore order and focus on presence and importance.
- High-cardinality risk: a huge vocabulary leads to very high-dimensional, sparse matrices. That's a practical problem for storage, speed, and overfitting.
Think back to categorical encoding: one-hot exploded dimensionally; target encoding leaked information unless done carefully. Text is similar, except the cardinality is often orders of magnitude larger and the semantics are sneakier.
Bag-of-Words (BoW): The Basic Translation
What it does: It treats each document as a bag (unordered) of words and counts how often each word appears.
- Representation: vector of word counts for a fixed vocabulary.
- Intuition: a document is summarized by how often words appear; grammar and order are ignored.
Example (micro-drama)
Document A: I love cats
Document B: Cats love me
Over the shared vocabulary ['cats', 'i', 'love', 'me'], A becomes [1, 1, 1, 0] and B becomes [1, 0, 1, 1]. The counts for 'love' and 'cats' are identical, and the word order that distinguishes the two sentences is thrown away. To a bag-of-words model the documents look nearly the same: sometimes useful, sometimes tragic.
Quick scikit-learn snippet
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['i love cats', 'cats love me']
# the default token pattern drops 1-letter tokens; widen it to keep 'i'
cv = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = cv.fit_transform(corpus)  # X is a sparse matrix of counts
# cv.get_feature_names_out() -> ['cats', 'i', 'love', 'me']
```
Pros and cons
- Pros: simple, fast, interpretable
- Cons: treats all words equally; common words (stop words) dominate; ignores importance and semantics
TF-IDF: Bag-of-Words with Common-Sense Weighting
TF-IDF stands for term frequency–inverse document frequency. It keeps the term frequency idea but down-weights words that are frequent across many documents.
- Term frequency (tf): how often term t appears in document d (often normalized).
- Inverse document frequency (idf): how rare a term is across the corpus. Rare terms get higher idf.
A common formula:
tf(t,d) = count of t in d / total terms in d
idf(t) = log( (N + 1) / (df(t) + 1) ) + 1
tfidf(t,d) = tf(t,d) * idf(t)
Where N is total documents and df(t) is the number of documents containing term t.
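To make the formula concrete, here is a minimal hand computation of tf and the smoothed idf above (a sketch; the three-document corpus is invented for illustration):

```python
import math

def tf(term, doc):
    # term frequency: count of term / total terms in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # smoothed inverse document frequency, as in the formula above
    N = len(corpus)
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log((N + 1) / (df + 1)) + 1

corpus = ['i love cats', 'cats love me', 'dogs love walks']
# 'love' appears in every document, so its idf bottoms out:
# log(4 / 4) + 1 = 1.0
print(idf('love', corpus))
# 'dogs' appears in one document, so it gets a higher idf:
# log(4 / 2) + 1 ≈ 1.693
print(idf('dogs', corpus))
# full tf-idf score for 'cats' in the first document
print(tf('cats', 'i love cats') * idf('cats', corpus))
```

Note how the common word ends up with the minimum weight while the rare one is boosted, which is exactly the behavior motivating the next section.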
Why tf-idf helps
- It reduces the influence of stop words like 'the', 'and', 'is'.
- Elevates words that are distinctive for specific documents, which can help classification.
Quick scikit-learn snippet
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['i love cats', 'cats love me']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # X contains tf-idf scores instead of raw counts
```
When to use which
| Aspect | Bag-of-Words (Count) | TF-IDF |
|---|---|---|
| Keeps raw frequency | Yes | Scaled (normalized) |
| Penalizes common words | No | Yes |
| Good for | Models that consume raw counts (e.g., multinomial naive Bayes, topic models), simple baselines | Classification, retrieval, when rare but important words matter |
| Interpretability | Very interpretable | Interpretable with weighting nuance |
| Sensitivity to document length | High | Lower (if normalized) |
Real-world analogies (because metaphors stick like gum)
- Bag-of-Words is like counting the ingredients in a soup. You know how many carrots and potatoes, but not the recipe or order.
- TF-IDF is like judging which spices make the soup unique. Salt appears in every kitchen; saffron in only a few — saffron tells you more about the soup.
Practical wrinkles and engineering tips
Stop words and preprocessing: Remove obvious stop words, punctuation, and do basic normalization (lowercasing, maybe stemming/lemmatization). But beware: sometimes stop words carry signal (think sentiment: "not").
N-grams: If word order matters a little, include n-grams (bigrams/trigrams). This raises dimensionality fast — apply frequency thresholds.
Vocabulary size: Limit vocab to top-k frequent terms or terms with df thresholds to control sparsity.
Normalization: For counts, consider length normalization. For tf-idf, scikit-learn normalizes by default (L2) which helps for classifiers that care about direction more than magnitude.
Sparse matrices: Always work with sparse representations to save memory. Most linear models accept sparse input.
Feature selection: After conversion, you can apply chi-squared selection, mutual information, or L1 regularization to reduce dimensionality.
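A minimal sketch of chi-squared selection on count features (the toy corpus, labels, and choice of k are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ['cats purr softly', 'cats nap often',
          'dogs bark loudly', 'dogs fetch sticks']
labels = [0, 0, 1, 1]

X = CountVectorizer().fit_transform(corpus)      # 10 distinct terms
# keep only the 4 terms most associated with the labels
X_reduced = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(X.shape, X_reduced.shape)
```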
Leaky labels: If you compute tf-idf across training + test, you leak information. Fit vectorizer on training only and transform test data.
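In scikit-learn terms, that means fit_transform on the training texts and only transform on the test texts (a sketch with toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ['i love cats', 'cats love me', 'dogs love walks', 'walks are great']
labels = [1, 1, 0, 0]

X_train_texts, X_test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

vec = TfidfVectorizer()
X_train = vec.fit_transform(X_train_texts)  # vocab + idf learned from train only
X_test = vec.transform(X_test_texts)        # reuses the training statistics
```

Calling fit (or fit_transform) on the test texts would let their document frequencies influence the idf weights, which is exactly the leakage being warned about.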
Common misunderstandings (and how to avoid them)
"TF-IDF will make my model understand semantics." No. It helps surface distinctive words but still ignores meaning and context. For semantics, consider word embeddings or transformer models.
"Higher dimensionality is always better." No — more features often mean more noise and overfitting. Use pruning or regularization.
"TF-IDF always beats counts." Not always. For some tasks raw frequency works better, especially with models that internally weight features or if relative frequency conveys meaning (e.g., topic modeling with counts).
Small pipeline example (conceptual)
- Clean text (lowercase, strip punctuation, optional lemmatize)
- Split training/test
- Fit TfidfVectorizer on train
- Transform train and test
- Fit classifier (e.g., logistic regression with L2)
- Evaluate
train_texts -> TfidfVectorizer.fit_transform -> X_train_sparse
X_train_sparse -> LogisticRegression.fit -> model
test_texts -> TfidfVectorizer.transform -> X_test_sparse
X_test_sparse -> model.predict
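The same flow can be wired up with scikit-learn's Pipeline helper, which keeps the fit/transform discipline automatic (the toy corpus and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ['i love cats', 'cats are lovely',
               'dogs bark a lot', 'loud dogs bark']
train_labels = [1, 1, 0, 0]  # 1 = cats, 0 = dogs

# vectorizer and classifier fit together; test data only ever sees transform
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

print(pipe.predict(['lovely cats']))
print(pipe.predict(['loud dogs']))
```

Because the pipeline is a single estimator, it also slots straight into cross_val_score or GridSearchCV without any risk of fitting the vectorizer on held-out folds.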
Closing: Key Takeaways (memorize like an exam cheat sheet)
- Bag-of-Words: raw counts, simple, interpretable, can be dominated by common words.
- TF-IDF: counts weighted by corpus-wide rarity, better for highlighting distinctive terms for classification and retrieval.
- Both ignore word order and deep semantics — they are blunt instruments but useful, fast, and often surprisingly effective.
"If BoW/TF-IDF were kitchen tools: BoW is a spoon, TF-IDF is a spoon with a sieve — both scoop, one sifts the common stuff away."
Use them as your first line of attack for text problems: quick, cheap, and explainable. When they fail — and they will for subtle semantic tasks — graduate to embeddings and transformers. But until then, let these classics carry your model to victory.