Data Cleaning and Feature Engineering
Prepare high-quality datasets with robust transformations and informative features while avoiding leakage.
Datetime Parsing and Features — Practical Guide for Python Data Science
“Dates are sneaky: they look simple until you try to sort, group, or compute with them.”
This lesson jumps straight into transforming messy timestamp strings into analysis-ready datetime features — building on your pandas time-series skills and the text-cleaning tricks you learned earlier (yes, remove the "stupid suffixes" like "1st", "2nd" first). We'll also connect this to feature engineering ideas you saw with polynomials and interactions: timestamps are features too, and how you encode them matters.
Why this matters (no, really)
- Time features frequently drive model performance in forecasting, churn, click-through, and fraud detection.
- Bad datetime handling = subtle bugs: wrong timezone conversions, DST horrors, or accidentally treating categorical months as continuous.
- Good datetime engineering gives you both interpretable and powerful features (seasonality, recency, cyclic patterns).
Quick checklist (what you’ll learn)
- Robust parsing from strings to pandas datetime
- Common derived features: year, month, weekday, hour, is_weekend
- Cyclical encodings (sin/cos) — the seasonal equivalent of feature interactions
- Time deltas, lags, rolling features, and exponential-weighted features
- Timezone localization & conversion, and common pitfalls
1) Parsing: strings → timestamps (the boring, crucial step)
Basic parsing with pandas
import pandas as pd
s = pd.Series(["2021-03-05 14:22", "03/06/2021 02:15 PM", "June 7, 2021"])
pd.to_datetime(s, errors='coerce', format='mixed')
- use errors='coerce' to turn bad entries into NaT instead of crashing.
- format='mixed' (pandas 2.0+) parses each element individually when formats vary; the old infer_datetime_format=True is deprecated since pandas 2.0, where strict format inference from the first element is the default behavior.
Pre-clean common nuisances (link to Text Cleaning Basics)
Remove ordinal suffixes and stray text first:
s_clean = s.str.replace(r"(\d)(st|nd|rd|th)\b", r"\1", regex=True)
Why: "1st Jan 2021" breaks naive parsers. Your text-cleaning skills from earlier pay off here.
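Putting the cleaning and parsing steps together, a minimal runnable sketch (the sample strings are made up for illustration):

```python
import pandas as pd

raw = pd.Series(["1st Jan 2021", "2nd Feb 2021", "not a date"])

# strip ordinal suffixes ("1st" -> "1") before parsing
cleaned = raw.str.replace(r"(\d)(st|nd|rd|th)\b", r"\1", regex=True)

# coerce unparseable entries to NaT instead of raising
parsed = pd.to_datetime(cleaned, errors="coerce")
print(parsed)
```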
Explicit formats for speed & accuracy
If you know the format, supply it — faster and safer:
pd.to_datetime(df['timestamp'], format='%d/%m/%Y %H:%M', errors='coerce')
2) Vectorized extraction with .dt accessor
Once you have a datetime dtype, use pandas' .dt to extract features efficiently.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['weekday'] = df['timestamp'].dt.weekday # 0=Mon
df['hour'] = df['timestamp'].dt.hour
Small, fast, interpretable features. Great as categorical inputs or as bases for interactions (remember polynomial/interaction features? same idea — combine month with product-of-features or categorical encodings).
Helpful extras
- is_weekend:
- is_weekend:
df['is_weekend'] = df['weekday'] >= 5
- is_month_start / is_month_end:
df['is_month_start'] = df['timestamp'].dt.is_month_start
- quarter:
df['quarter'] = df['timestamp'].dt.quarter
3) Cyclical features: sin/cos transforms for periodic behavior
Months and hours are circular: month 12 is close to month 1. If you feed raw 1..12 into many models, the model learns a fake distance. Use sine and cosine transforms to capture cyclicity.
import numpy as np
# Example: hour of day -> two features
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
Think of this as the temporal version of polynomial features — you're creating features that let models express smooth cyclical patterns instead of awkward piecewise jumps.
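A quick sanity check of why this helps: as raw integers, hour 23 sits far from hour 0, but in (sin, cos) space they are exactly as close as hours 0 and 1. A small illustrative sketch:

```python
import numpy as np

hours = np.array([0, 1, 23])
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

def dist(i, j):
    # Euclidean distance between two hours in (sin, cos) space
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

# raw integers say |23 - 0| = 23, but on the circle
# hour 23 is as close to hour 0 as hour 1 is
print(dist(0, 2))  # 0 vs 23
print(dist(0, 1))  # 0 vs 1
```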
4) Durations, deltas, and lag features (recency is king)
Calculate intervals and convert to numeric units for predictions:
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['duration_sec'] = (df['end'] - df['start']).dt.total_seconds()
Create lags and rolling aggregates (key in time-series or event streams):
# sort by entity + time
df = df.sort_values(['user_id', 'timestamp'])
# lag: time since last event
df['prev_timestamp'] = df.groupby('user_id')['timestamp'].shift(1)
df['time_since_prev'] = (df['timestamp'] - df['prev_timestamp']).dt.total_seconds()
# rolling counts: last 7 days events
df = df.set_index('timestamp')
rolling = df.groupby('user_id')['event_id'].rolling('7D').count()
# groupby adds a user_id index level; row order matches the sort above, so assign by position
df['events_last_7d'] = rolling.droplevel(0).to_numpy()
Exponential-weighted features (recent events matter more):
# transform preserves the original row index, so the assignment aligns safely
df['ewm_val'] = df.groupby('user_id')['metric'].transform(lambda x: x.ewm(alpha=0.3).mean())
5) Time zones, localization, and daylight savings (the booby traps)
- Naive datetime = no tz info. Use tz_localize to mark data as coming from a particular timezone (don't convert yet).
- Use tz_convert to convert an aware datetime to another timezone.
df['ts'] = pd.to_datetime(df['ts'])
# mark as US/Eastern (localize) then convert to UTC
df['ts'] = df['ts'].dt.tz_localize('US/Eastern').dt.tz_convert('UTC')
Pitfall: if timestamps are already timezone-aware, calling tz_localize will error. Use .dt.tz_localize(None) to drop tz info if you must.
DST: ambiguous or nonexistent times during DST transitions can raise errors. When localizing, pass ambiguous='NaT' or ambiguous='infer' for the repeated fall-back hour, and nonexistent='NaT' or nonexistent='shift_forward' for the skipped spring-forward hour.
Quick rule: For analytics, store UTC; for display, convert to user locale.
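A small sketch of the fall-back transition (2021-11-07 in US/Eastern, when the 01:00-02:00 hour repeats), showing ambiguous handling and a clean localize-then-convert round trip:

```python
import pandas as pd

# 01:30 on 2021-11-07 occurs twice in US/Eastern (clocks fall back at 02:00)
ts = pd.Series(pd.to_datetime(["2021-11-07 01:30"]))

# ambiguous='NaT' marks the unresolvable wall-clock time as missing
localized = ts.dt.tz_localize("US/Eastern", ambiguous="NaT")
print(localized)

# unambiguous times localize and convert cleanly
ok = pd.Series(pd.to_datetime(["2021-11-07 12:00"]))
utc = ok.dt.tz_localize("US/Eastern").dt.tz_convert("UTC")
print(utc)
```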
6) Performance tips
- Parsing large CSV timestamps: specify the format where possible. In read_csv, use parse_dates (plus date_format in pandas 2.0+); the old date_parser argument is deprecated, so for custom logic read the column as strings and parse afterwards with to_datetime and an explicit format.
- Use categorical dtype for extracted cyclical bins (if using month as category).
- Use vectorized .dt operations — avoid Python loops.
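As a sketch of the CSV advice (the column name `ts` and the inline CSV are made up for illustration):

```python
import io
import pandas as pd

csv = io.StringIO("ts,value\n2021-03-05 14:22,1\n2021-03-06 09:10,2\n")

# option 1: parse while reading
df = pd.read_csv(csv, parse_dates=["ts"])
print(df.dtypes)

# option 2: read as strings, then parse with an explicit format
csv.seek(0)
df2 = pd.read_csv(csv)
df2["ts"] = pd.to_datetime(df2["ts"], format="%Y-%m-%d %H:%M")
```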
7) Small real-world recipe (cheat sheet)
- Clean strings (strip text, remove ordinals, fix punctuation). Refer to "Text Cleaning Basics" for regex patterns.
- pd.to_datetime(..., errors='coerce'), adding an explicit format= (or format='mixed' for heterogeneous data) when you know it
- Extract: year, month, day, weekday, hour, minute
- Create cyclical encodings for hour/month if model benefits
- Build recency: time since last event, duration between events
- Rolling counts/means and EWM features for behavior
- Localize to UTC; store as UTC; convert to user tz when needed
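The recipe can be wrapped in a small helper. This is one possible sketch, with illustrative column names, not a fixed API:

```python
import numpy as np
import pandas as pd

def add_datetime_features(df, col="ts"):
    """Parse one timestamp column, extract calendar parts, add cyclic hour."""
    out = df.copy()
    out[col] = pd.to_datetime(out[col], errors="coerce")
    out["year"] = out[col].dt.year
    out["month"] = out[col].dt.month
    out["weekday"] = out[col].dt.weekday          # 0 = Monday
    out["hour"] = out[col].dt.hour
    out["is_weekend"] = out["weekday"] >= 5
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out

demo = pd.DataFrame({"ts": ["2021-03-06 14:00", "2021-03-07 02:30"]})
print(add_datetime_features(demo))
```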
Common gotchas (short horror stories)
- Parsing "01/02/2021" — is it Jan 2 or Feb 1? Use dayfirst=True where appropriate or explicit formats.
- Treating months as continuous numbers without cyclic encoding — leads to boundary artifacts (Dec→Jan).
- DST transitions creating duplicated or missing times — leads to negative durations or NaT.
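The first gotcha in action:

```python
import pandas as pd

s = "01/02/2021"

# default (month-first) reads this as January 2nd
print(pd.to_datetime(s))
# dayfirst=True reads it as February 1st
print(pd.to_datetime(s, dayfirst=True))
```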
Final quick takeaways
- Always parse strings to dtype datetime early. Until then, your "timestamp" is a liar.
- Use .dt to extract features and vectorized ops for speed.
- Encode circular features with sin/cos instead of naive integers.
- Create recency, lag, rolling, and EWM features — these often beat fancy models.
- Handle timezones explicitly: localize then convert; store UTC.
This is the moment where the concept finally clicks: time isn't just a column — it's structure, memory, and rhythm. Treat it like a first-class feature, and your models will thank you (or at least stop making weird seasonal mistakes).
Want a tiny challenge?
Given a dataset of user events with timestamps and actions, build features for:
- event count in last 24h
- average inter-event time for each user
- user's activity hour_sin/hour_cos
Combine them with interaction features (recall polynomial interactions) — e.g., multiply hour_sin by action_type_dummy to let the model learn action-specific hourly patterns.
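If you want a starting point for the interaction part (column names, timestamps, and action types here are all hypothetical):

```python
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime(
        ["2021-06-01 08:00", "2021-06-01 20:00", "2021-06-02 09:00"]),
    "action": ["click", "buy", "click"],   # hypothetical action types
})

hour = events["timestamp"].dt.hour
events["hour_sin"] = np.sin(2 * np.pi * hour / 24)

# interaction: hour_sin times a per-action dummy lets a model
# learn action-specific hourly patterns
dummies = pd.get_dummies(events["action"], prefix="action")
for c in dummies.columns:
    events[f"hour_sin_x_{c}"] = events["hour_sin"] * dummies[c]

print(events.filter(like="hour_sin_x").round(3))
```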
Good luck. May your timezones be sane and your DST switches gentle.