Data Wrangling and Feature Engineering
Practical techniques to clean, encode, scale, and construct informative features while avoiding leakage.
Text Features: Bag-of-Words and TF-IDF — The Text-to-Model Magic Trick
"Words are features. Features are numbers. Numbers soothe the model."
You already learned how to handle categorical features: ordinal vs nominal encodings and the whole buffet of categorical encoding schemes. Text data is basically categorical on steroids — variable length, massive vocabulary, and a flair for drama. In short: we need ways to convert raw text into numeric features that models actually understand. Enter Bag-of-Words and TF-IDF: the classic dynamic duo for turning sentences into feature vectors.
Why this matters (and why your model cares)
- Models need numeric input. Text is not numeric. We must encode it.
- Unlike simple category labels, text has order, frequency, and context — but early transforms usually ignore order and focus on presence and importance.
- High-cardinality risk: a huge vocabulary leads to very high-dimensional, sparse matrices. That's a practical problem for storage, speed, and overfitting.
Think back to categorical encoding: one-hot exploded dimensionally; target encoding leaked information unless done carefully. Text is similar, except the cardinality is often orders of magnitude larger and the semantics are sneakier.
Bag-of-Words (BoW): The Basic Translation
What it does: It treats each document as a bag (unordered) of words and counts how often each word appears.
- Representation: vector of word counts for a fixed vocabulary.
- Intuition: a document is summarized by how often words appear; grammar and order are ignored.
Example (micro-drama)
Document A: I love cats
Document B: Cats love me
Over the shared vocabulary ['cats', 'i', 'love', 'me'], A becomes [1, 1, 1, 0] and B becomes [1, 0, 1, 1]. The counts for 'love' and 'cats' are identical, and the word order that distinguishes the two sentences is thrown away. To a bag-of-words model the documents look nearly the same: sometimes useful, sometimes tragic.
Quick scikit-learn snippet
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['i love cats', 'cats love me']
# the default token pattern drops 1-letter tokens; widen it to keep 'i'
cv = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = cv.fit_transform(corpus)  # X is a sparse matrix of counts
# cv.get_feature_names_out() -> ['cats', 'i', 'love', 'me']
```
Pros and cons
- Pros: simple, fast, interpretable
- Cons: treats all words equally; common words (stop words) dominate; ignores importance and semantics
TF-IDF: Bag-of-Words with Common-Sense Weighting
TF-IDF stands for term frequency–inverse document frequency. It keeps the term frequency idea but down-weights words that are frequent across many documents.
- Term frequency (tf): how often term t appears in document d (often normalized).
- Inverse document frequency (idf): how rare a term is across the corpus. Rare terms get higher idf.
A common formula:
tf(t,d) = count of t in d / total terms in d
idf(t) = log( (N + 1) / (df(t) + 1) ) + 1
tfidf(t,d) = tf(t,d) * idf(t)
Where N is total documents and df(t) is the number of documents containing term t.
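To make the formula concrete, here is a minimal hand computation of tf and the smoothed idf above (a sketch; the three-document corpus is invented for illustration):

```python
import math

def tf(term, doc):
    # term frequency: count of term / total terms in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # smoothed inverse document frequency, as in the formula above
    N = len(corpus)
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log((N + 1) / (df + 1)) + 1

corpus = ['i love cats', 'cats love me', 'dogs love walks']
# 'love' appears in every document, so its idf bottoms out:
# log(4 / 4) + 1 = 1.0
print(idf('love', corpus))
# 'dogs' appears in one document, so it gets a higher idf:
# log(4 / 2) + 1 ≈ 1.693
print(idf('dogs', corpus))
# full tf-idf score for 'cats' in the first document
print(tf('cats', 'i love cats') * idf('cats', corpus))
```

Note how the common word ends up with the minimum weight while the rare one is boosted, which is exactly the behavior motivating the next section.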
Why tf-idf helps
- It reduces the influence of stop words like 'the', 'and', 'is'.
- Elevates words that are distinctive for specific documents, which can help classification.
Quick scikit-learn snippet
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['i love cats', 'cats love me']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # X contains tf-idf scores instead of raw counts
```
When to use which
| Aspect | Bag-of-Words (Count) | TF-IDF |
|---|---|---|
| Keeps raw frequency | Yes | Scaled (normalized) |
| Penalizes common words | No | Yes |
| Good for | Models that consume raw counts (e.g., multinomial naive Bayes, topic models), simple baselines | Classification, retrieval, when rare but important words matter |
| Interpretability | Very interpretable | Interpretable with weighting nuance |
| Sensitivity to document length | High | Lower (if normalized) |
Real-world analogies (because metaphors stick like gum)
- Bag-of-Words is like counting the ingredients in a soup. You know how many carrots and potatoes, but not the recipe or order.
- TF-IDF is like judging which spices make the soup unique. Salt appears in every kitchen; saffron in only a few — saffron tells you more about the soup.
Practical wrinkles and engineering tips
Stop words and preprocessing: Remove obvious stop words, punctuation, and do basic normalization (lowercasing, maybe stemming/lemmatization). But beware: sometimes stop words carry signal (think sentiment: "not").
N-grams: If word order matters a little, include n-grams (bigrams/trigrams). This raises dimensionality fast — apply frequency thresholds.
Vocabulary size: Limit vocab to top-k frequent terms or terms with df thresholds to control sparsity.
Normalization: For counts, consider length normalization. For tf-idf, scikit-learn normalizes by default (L2) which helps for classifiers that care about direction more than magnitude.
Sparse matrices: Always work with sparse representations to save memory. Most linear models accept sparse input.
Feature selection: After conversion, you can apply chi-squared selection, mutual information, or L1 regularization to reduce dimensionality.
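A minimal sketch of chi-squared selection on count features (the toy corpus, labels, and choice of k are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ['cats purr softly', 'cats nap often',
          'dogs bark loudly', 'dogs fetch sticks']
labels = [0, 0, 1, 1]

X = CountVectorizer().fit_transform(corpus)      # 10 distinct terms
# keep only the 4 terms most associated with the labels
X_reduced = SelectKBest(chi2, k=4).fit_transform(X, labels)
print(X.shape, X_reduced.shape)
```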
Leaky labels: If you compute tf-idf across training + test, you leak information. Fit vectorizer on training only and transform test data.
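In scikit-learn terms, that means fit_transform on the training texts and only transform on the test texts (a sketch with toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ['i love cats', 'cats love me', 'dogs love walks', 'walks are great']
labels = [1, 1, 0, 0]

X_train_texts, X_test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0
)

vec = TfidfVectorizer()
X_train = vec.fit_transform(X_train_texts)  # vocab + idf learned from train only
X_test = vec.transform(X_test_texts)        # reuses the training statistics
```

Calling fit (or fit_transform) on the test texts would let their document frequencies influence the idf weights, which is exactly the leakage being warned about.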
Common misunderstandings (and how to avoid them)
"TF-IDF will make my model understand semantics." No. It helps surface distinctive words but still ignores meaning and context. For semantics, consider word embeddings or transformer models.
"Higher dimensionality is always better." No — more features often mean more noise and overfitting. Use pruning or regularization.
"TF-IDF always beats counts." Not always. For some tasks raw frequency works better, especially with models that internally weight features or if relative frequency conveys meaning (e.g., topic modeling with counts).
Small pipeline example (conceptual)
- Clean text (lowercase, strip punctuation, optional lemmatize)
- Split training/test
- Fit TfidfVectorizer on train
- Transform train and test
- Fit classifier (e.g., logistic regression with L2)
- Evaluate
train_texts -> TfidfVectorizer.fit_transform -> X_train_sparse
X_train_sparse -> LogisticRegression.fit -> model
test_texts -> TfidfVectorizer.transform -> X_test_sparse
X_test_sparse -> model.predict
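The same flow can be wired up with scikit-learn's Pipeline helper, which keeps the fit/transform discipline automatic (the toy corpus and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ['i love cats', 'cats are lovely',
               'dogs bark a lot', 'loud dogs bark']
train_labels = [1, 1, 0, 0]  # 1 = cats, 0 = dogs

# vectorizer and classifier fit together; test data only ever sees transform
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

print(pipe.predict(['lovely cats']))
print(pipe.predict(['loud dogs']))
```

Because the pipeline is a single estimator, it also slots straight into cross_val_score or GridSearchCV without any risk of fitting the vectorizer on held-out folds.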
Closing: Key Takeaways (memorize like an exam cheat sheet)
- Bag-of-Words: raw counts, simple, interpretable, can be dominated by common words.
- TF-IDF: counts weighted by corpus-wide rarity, better for highlighting distinctive terms for classification and retrieval.
- Both ignore word order and deep semantics — they are blunt instruments but useful, fast, and often surprisingly effective.
"If BoW/TF-IDF were kitchen tools: BoW is a spoon, TF-IDF is a spoon with a sieve — both scoop, one sifts the common stuff away."
Use them as your first line of attack for text problems: quick, cheap, and explainable. When they fail — and they will for subtle semantic tasks — graduate to embeddings and transformers. But until then, let these classics carry your model to victory.