Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Datetime Parsing and Features — Practical Guide for Python Data Science
“Dates are sneaky: they look simple until you try to sort, group, or compute with them.”
This lesson jumps straight into transforming messy timestamp strings into analysis-ready datetime features — building on your pandas time-series skills and the text-cleaning tricks you learned earlier (yes, remove the "stupid suffixes" like "1st", "2nd" first). We'll also connect this to feature engineering ideas you saw with polynomials and interactions: timestamps are features too, and how you encode them matters.
Why this matters (no, really)
- Time features frequently drive model performance in forecasting, churn, click-through, and fraud detection.
- Bad datetime handling = subtle bugs: wrong timezone conversions, DST horrors, or accidentally treating categorical months as continuous.
- Good datetime engineering gives you both interpretable and powerful features (seasonality, recency, cyclic patterns).
Quick checklist (what you’ll learn)
- Robust parsing from strings to pandas datetime
- Common derived features: year, month, weekday, hour, is_weekend
- Cyclical encodings (sin/cos) — the seasonal equivalent of feature interactions
- Time deltas, lags, rolling features, and exponential-weighted features
- Timezone localization & conversion, and common pitfalls
1) Parsing: strings → timestamps (the boring, crucial step)
Basic parsing with pandas
import pandas as pd
s = pd.Series(["2021-03-05 14:22", "03/06/2021 02:15 PM", "June 7, 2021"])
pd.to_datetime(s, errors='coerce', format='mixed')
- use errors='coerce' to turn bad entries into NaT instead of crashing.
- format='mixed' (pandas 2.0+) parses each element individually when formats vary; the old infer_datetime_format=True is deprecated since pandas 2.0, where strict format inference from the first element is the default behavior.
Pre-clean common nuisances (link to Text Cleaning Basics)
Remove ordinal suffixes and stray text first:
s_clean = s.str.replace(r"(\d)(st|nd|rd|th)\b", r"\1", regex=True)
Why: "1st Jan 2021" breaks naive parsers. Your text-cleaning skills from earlier pay off here.
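Putting the cleaning and parsing steps together, a minimal runnable sketch (the sample strings are made up for illustration):

```python
import pandas as pd

raw = pd.Series(["1st Jan 2021", "2nd Feb 2021", "not a date"])

# strip ordinal suffixes ("1st" -> "1") before parsing
cleaned = raw.str.replace(r"(\d)(st|nd|rd|th)\b", r"\1", regex=True)

# coerce unparseable entries to NaT instead of raising
parsed = pd.to_datetime(cleaned, errors="coerce")
print(parsed)
```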
Explicit formats for speed & accuracy
If you know the format, supply it — faster and safer:
pd.to_datetime(df['timestamp'], format='%d/%m/%Y %H:%M', errors='coerce')
2) Vectorized extraction with .dt accessor
Once you have a datetime dtype, use pandas' .dt to extract features efficiently.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['weekday'] = df['timestamp'].dt.weekday # 0=Mon
df['hour'] = df['timestamp'].dt.hour
Small, fast, interpretable features. Great as categorical inputs or as bases for interactions (remember polynomial/interaction features? same idea — combine month with product-of-features or categorical encodings).
Helpful extras
- is_weekend:
- is_weekend:
df['is_weekend'] = df['weekday'] >= 5
- is_month_start / is_month_end:
df['is_month_start'] = df['timestamp'].dt.is_month_start
- quarter:
df['quarter'] = df['timestamp'].dt.quarter
3) Cyclical features: sin/cos transforms for periodic behavior
Months and hours are circular: month 12 is close to month 1. If you feed raw 1..12 into many models, the model learns a fake distance. Use sine and cosine transforms to capture cyclicity.
import numpy as np
# Example: hour of day -> two features
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
Think of this as the temporal version of polynomial features — you're creating features that let models express smooth cyclical patterns instead of awkward piecewise jumps.
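A quick sanity check of why this helps: as raw integers, hour 23 sits far from hour 0, but in (sin, cos) space they are exactly as close as hours 0 and 1. A small illustrative sketch:

```python
import numpy as np

hours = np.array([0, 1, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def dist(i, j):
    # Euclidean distance between two hours in (sin, cos) space
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

# raw integers say |23 - 0| = 23, but on the circle
# hour 23 is as close to hour 0 as hour 1 is
print(dist(0, 2))  # 0 vs 23
print(dist(0, 1))  # 0 vs 1
```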
4) Durations, deltas, and lag features (recency is king)
Calculate intervals and convert to numeric units for predictions:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['duration_sec'] = (df['end'] - df['start']).dt.total_seconds()
Create lags and rolling aggregates (key in time-series or event streams):
# sort by entity + time
df = df.sort_values(['user_id', 'timestamp'])
# lag: time since last event
df['prev_timestamp'] = df.groupby('user_id')['timestamp'].shift(1)
df['time_since_prev'] = (df['timestamp'] - df['prev_timestamp']).dt.total_seconds()
# rolling counts: last 7 days events
df = df.set_index('timestamp')
rolling = df.groupby('user_id')['event_id'].rolling('7D').count()
# groupby adds a user_id index level; row order matches the sort above, so assign by position
df['events_last_7d'] = rolling.droplevel(0).to_numpy()
Exponential-weighted features (recent events matter more):
# transform preserves the original row index, so the assignment aligns safely
df['ewm_val'] = df.groupby('user_id')['metric'].transform(lambda x: x.ewm(alpha=0.3).mean())
5) Time zones, localization, and daylight savings (the booby traps)
- Naive datetime = no tz info. Use tz_localize to mark data as coming from a particular timezone (don't convert yet).
- Use tz_convert to convert an aware datetime to another timezone.
df['ts'] = pd.to_datetime(df['ts'])
# mark as US/Eastern (localize) then convert to UTC
df['ts'] = df['ts'].dt.tz_localize('US/Eastern').dt.tz_convert('UTC')
Pitfall: if timestamps are already timezone-aware, calling tz_localize will error. Use .dt.tz_localize(None) to drop tz info if you must.
DST: ambiguous or nonexistent times during DST transitions can raise errors. When localizing, pass ambiguous='NaT' or ambiguous='infer' for the repeated fall-back hour, and nonexistent='NaT' or nonexistent='shift_forward' for the skipped spring-forward hour.
Quick rule: For analytics, store UTC; for display, convert to user locale.
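A small sketch of the fall-back transition (2021-11-07 in US/Eastern, when the 01:00-02:00 hour repeats), showing ambiguous handling and a clean localize-then-convert round trip:

```python
import pandas as pd

# 01:30 on 2021-11-07 occurs twice in US/Eastern (clocks fall back at 02:00)
ts = pd.Series(pd.to_datetime(["2021-11-07 01:30"]))

# ambiguous='NaT' marks the unresolvable wall-clock time as missing
localized = ts.dt.tz_localize("US/Eastern", ambiguous="NaT")
print(localized)

# unambiguous times localize and convert cleanly
ok = pd.Series(pd.to_datetime(["2021-11-07 12:00"]))
utc = ok.dt.tz_localize("US/Eastern").dt.tz_convert("UTC")
print(utc)
```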
6) Performance tips
- Parsing large CSV timestamps: specify the format where possible. In read_csv, use parse_dates (plus date_format in pandas 2.0+); the old date_parser argument is deprecated, so for custom logic read the column as strings and parse afterwards with to_datetime and an explicit format.
- Use categorical dtype for extracted cyclical bins (if using month as category).
- Use vectorized .dt operations — avoid Python loops.
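As a sketch of the CSV advice (the column name `ts` and the inline CSV are made up for illustration):

```python
import io
import pandas as pd

csv = io.StringIO("ts,value\n2021-03-05 14:22,1\n2021-03-06 09:10,2\n")

# option 1: parse while reading
df = pd.read_csv(csv, parse_dates=["ts"])
print(df.dtypes)

# option 2: read as strings, then parse with an explicit format
csv.seek(0)
df2 = pd.read_csv(csv)
df2["ts"] = pd.to_datetime(df2["ts"], format="%Y-%m-%d %H:%M")
```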
7) Small real-world recipe (cheat sheet)
- Clean strings (strip text, remove ordinals, fix punctuation). Refer to "Text Cleaning Basics" for regex patterns.
- pd.to_datetime(..., errors='coerce'), adding an explicit format= (or format='mixed' for heterogeneous data) when you know it
- Extract: year, month, day, weekday, hour, minute
- Create cyclical encodings for hour/month if model benefits
- Build recency: time since last event, duration between events
- Rolling counts/means and EWM features for behavior
- Localize to UTC; store as UTC; convert to user tz when needed
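The recipe can be wrapped in a small helper. This is one possible sketch, with illustrative column names, not a fixed API:

```python
import numpy as np
import pandas as pd

def add_datetime_features(df, col="ts"):
    """Parse one timestamp column, extract calendar parts, add cyclic hour."""
    out = df.copy()
    out[col] = pd.to_datetime(out[col], errors="coerce")
    out["year"] = out[col].dt.year
    out["month"] = out[col].dt.month
    out["weekday"] = out[col].dt.weekday          # 0 = Monday
    out["hour"] = out[col].dt.hour
    out["is_weekend"] = out["weekday"] >= 5
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out

demo = pd.DataFrame({"ts": ["2021-03-06 14:00", "2021-03-07 02:30"]})
print(add_datetime_features(demo))
```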
Common gotchas (short horror stories)
- Parsing "01/02/2021" — is it Jan 2 or Feb 1? Use dayfirst=True where appropriate or explicit formats.
- Treating months as continuous numbers without cyclic encoding — leads to boundary artifacts (Dec→Jan).
- DST transitions creating duplicated or missing times — leads to negative durations or NaT.
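The first gotcha in action:

```python
import pandas as pd

s = "01/02/2021"

# default (month-first) reads this as January 2nd
print(pd.to_datetime(s))
# dayfirst=True reads it as February 1st
print(pd.to_datetime(s, dayfirst=True))
```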
Final quick takeaways
- Always parse strings to dtype datetime early. Until then, your "timestamp" is a liar.
- Use .dt to extract features and vectorized ops for speed.
- Encode circular features with sin/cos instead of naive integers.
- Create recency, lag, rolling, and EWM features — these often beat fancy models.
- Handle timezones explicitly: localize then convert; store UTC.
This is the moment where the concept finally clicks: time isn't just a column — it's structure, memory, and rhythm. Treat it like a first-class feature, and your models will thank you (or at least stop making weird seasonal mistakes).
Want a tiny challenge?
Given a dataset of user events with timestamps and actions, build features for:
- event count in last 24h
- average inter-event time for each user
- user's activity hour_sin/hour_cos
Combine them with interaction features (recall polynomial interactions) — e.g., multiply hour_sin by action_type_dummy to let the model learn action-specific hourly patterns.
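If you want a starting point for the interaction part (column names, timestamps, and action types here are all hypothetical):

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime(
        ["2021-06-01 08:00", "2021-06-01 20:00", "2021-06-02 09:00"]),
    "action": ["click", "buy", "click"],   # hypothetical action types
})

hour = events["timestamp"].dt.hour
events["hour_sin"] = np.sin(2 * np.pi * hour / 24)

# interaction: hour_sin times a per-action dummy lets a model
# learn action-specific hourly patterns
dummies = pd.get_dummies(events["action"], prefix="action")
for c in dummies.columns:
    events[f"hour_sin_x_{c}"] = events["hour_sin"] * dummies[c]

print(events.filter(like="hour_sin_x").round(3))
```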
Good luck. May your timezones be sane and your DST switches gentle.