Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Merge, Join, and Concat
Merge, Join, and Concat in pandas — The Matchmaking & Stacking Guide
Imagine you have two spreadsheets that kind of belong together: one has ages, the other has test scores. How do you make them a power couple without losing anyone's data or sanity?
You're coming in hot from NumPy's vectorized wizardry and pandas' GroupBy soirées. Now it's time for relational choreography: combining tables. Merge, join, and concat are your moves — the tools you use to stitch datasets into one clean, analysis-ready DataFrame.
Why this matters (and where you see it)
- Combining feature tables with labels for machine learning.
- Appending monthly/weekly logs to build a full time series.
- Joining user profiles (table A) with transactions (table B).
Think of merge like speed dating, join like inviting someone to your couch by index, and concat like stacking pancakes.
Quick map: merge vs join vs concat
- pd.merge(df1, df2, how=... , on=...) — SQL-style join on columns. Most flexible for relational joins.
- df.join(df2, how=..., on=...) — Convenience method that joins on index (or on a column if you set it). Nice for index-aligned merges.
- pd.concat([df1, df2], axis=0/1) — Stack vertically (axis=0) or horizontally (axis=1). Like gluing DataFrames together end-to-end (rows) or side-by-side (columns).
Micro explanation: When to pick which
- Use merge when you have explicit key columns (like user_id) and want SQL-like control.
- Use join when your DataFrames are aligned by index (time series, reindexed arrays).
- Use concat when you're stacking many DataFrames (log files, batch results) or aligning columns side-by-side.
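To make the distinction concrete, here is a tiny side-by-side sketch of all three moves on the same made-up frames (column names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'user_id': [1, 2], 'age': [25, 30]})
right = pd.DataFrame({'user_id': [2, 3], 'score': [88, 92]})

# merge: SQL-style join on an explicit key column
merged = pd.merge(left, right, on='user_id', how='inner')

# join: aligns on the index, so set the key as the index first
joined = left.set_index('user_id').join(right.set_index('user_id'), how='inner')

# concat: stacks rows end-to-end -- no key matching at all
stacked = pd.concat([left, right], axis=0, ignore_index=True)
```

Notice that merge and join land on the same matched rows (only user_id 2 appears on both sides), while concat simply keeps everything.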
pd.merge — the SQL power move
Basic example:
import pandas as pd
left = pd.DataFrame({'user_id': [1,2,3], 'age': [25, 30, 22]})
right = pd.DataFrame({'user_id': [2,3,4], 'score': [88, 92, 75]})
pd.merge(left, right, on='user_id', how='inner')
Result: rows for user_id 2 and 3 only (inner join).
Key parameters:
- how: 'left', 'right', 'outer', 'inner'
- on: column(s) to join on
- left_on, right_on: when keys have different names
- suffixes=('_L', '_R'): avoid column-name collisions
- indicator=True: adds a _merge column telling you each row's join origin — great for debugging
Example with indicator:
pd.merge(left, right, on='user_id', how='outer', indicator=True)
# _merge values: 'left_only', 'right_only', 'both'
Practical tip: When joining on categorical keys, convert to pd.Categorical to save memory and sometimes speed up the join.
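For example, a repeated string key can be cast on both sides with astype('category') before merging (the data here is made up):

```python
import pandas as pd

left = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF'], 'sales': [10, 20, 30, 40]})
right = pd.DataFrame({'city': ['NY', 'LA', 'SF'], 'region': ['East', 'West', 'West']})

# Cast the repeated string key to categorical on BOTH sides; this shrinks
# memory for long key columns and can speed up the merge itself.
left['city'] = left['city'].astype('category')
right['city'] = right['city'].astype('category')

result = pd.merge(left, right, on='city', how='left')
```

With how='left', the four left-hand rows survive in order, each picking up its region.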
df.join — index-first, lazy cousin of merge
If your DataFrames are time-indexed, or you've already set the key as the index, join is concise:
left = pd.DataFrame({'age': [25,30,22]}, index=[1,2,3])
right = pd.DataFrame({'score': [88,92,75]}, index=[2,3,4])
left.join(right, how='outer')
This aligns by index. You can also do left.join(right, on='some_column'), in which case pandas matches the values in left's some_column against right's index — which can be handy.
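A quick sketch of that on= variant, using a hypothetical user_id column on the left and an indexed table on the right:

```python
import pandas as pd

# 'right' is keyed by its index; 'left' carries the matching key in a column
left = pd.DataFrame({'user_id': [2, 3, 4], 'age': [30, 22, 41]})
right = pd.DataFrame({'score': [88, 92, 75]}, index=[2, 3, 4])

# on='user_id' matches left['user_id'] against right's index
result = left.join(right, on='user_id')
```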
Micro explanation: join is great when your primary key is the DataFrame's index — common in time series, hierarchical indices, or after set_index.
pd.concat — stacking and stitching
Concatenate vertically (append rows):
pd.concat([df_jan, df_feb, df_mar], axis=0, ignore_index=True)
Concatenate horizontally (add columns):
pd.concat([df_left, df_right], axis=1)
Important options:
- axis: 0 (rows) or 1 (columns)
- ignore_index=True: reset the index in the result
- keys=['jan', 'feb']: create a hierarchical index to track origins
- join='inner' or 'outer': controls the column intersection when axis=0 and the frames have differing columns
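The keys= option is worth seeing once — it builds a MultiIndex so every row remembers which frame it came from (frames below are made up):

```python
import pandas as pd

jan = pd.DataFrame({'sales': [100, 200]})
feb = pd.DataFrame({'sales': [150, 250]})

# keys= labels each source frame, producing a two-level index
combined = pd.concat([jan, feb], keys=['jan', 'feb'])

# slice a single month back out by its label
feb_rows = combined.loc['feb']
```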
Use case: you have 120 daily CSVs — pd.concat is your fast friend. Read them into a list of DataFrames first and concat once; concatenating inside a loop copies the growing frame on every iteration, which is an O(n^2) mistake.
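A minimal sketch of the read-then-concat-once pattern. Real code would collect paths with glob and call pd.read_csv on each; here the "files" are simulated with in-memory strings:

```python
import io
import pandas as pd

# Simulated "files" -- in practice these would come from glob.glob('logs/*.csv')
csv_texts = [
    'day,value\n1,10\n2,20\n',
    'day,value\n3,30\n',
    'day,value\n4,40\n5,50\n',
]

# Read everything into a list first, then concat exactly ONCE.
frames = [pd.read_csv(io.StringIO(text)) for text in csv_texts]
all_days = pd.concat(frames, ignore_index=True)
```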
Practical examples and gotchas
- Duplicate column names
pd.merge(df1, df2, on='id', suffixes=('_left','_right'))
- Merging many tables (like feature engineering)
- Chaining merges is fine, but consider reducing early with GroupBy aggregations (you already know how to group and aggregate) so you merge smaller tables.
- Example flow: raw logs -> groupby user_id aggregations (smaller) -> merge with profile table -> model.
- Performance tips
- Keep keys as integers or categoricals; strings are slower.
- If merging large tables, try an inner join early to reduce size, or pre-filter rows.
- Use indexes: set_index on join keys when doing many repeated joins — index-based joins are faster.
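The set_index tip looks like this in practice (hypothetical tables): pay the indexing cost once, then let every subsequent join align on it.

```python
import pandas as pd

profiles = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
scores = pd.DataFrame({'user_id': [1, 2, 3], 'score': [10, 20, 30]})

# Index both frames on the key once...
profiles_idx = profiles.set_index('user_id')
scores_idx = scores.set_index('user_id')

# ...then repeated joins reuse the index instead of re-matching columns
result = profiles_idx.join(scores_idx, how='left')
```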
- Append is deprecated; prefer concat
df.append(df2) used to be common, but it was deprecated in pandas 1.4 and removed in 2.0; pd.concat([df, df2]) is the modern approach (and concatenating once beats appending in a loop).
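The migration is a one-liner:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'x': [3, 4]})

# Old (removed in pandas 2.0): df.append(df2, ignore_index=True)
# Modern equivalent:
combined = pd.concat([df, df2], ignore_index=True)
```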
Example workflow that builds on prior lessons
- NumPy gave you fast arrays and vectorized ops — use them to compute features quickly (vectorized transforms).
- Use GroupBy to aggregate logs per user into a compact table (you already handled GroupBy earlier).
- Use pd.merge to join aggregated features to labels for model training.
Code sketch:
# vectorized transformations using NumPy + pandas
import numpy as np
import pandas as pd

# logs and profiles are assumed to be existing DataFrames
logs['duration_sqrt'] = np.sqrt(logs['duration'])  # NumPy used earlier
# aggregate
agg = logs.groupby('user_id').agg({'duration_sqrt': 'mean', 'n_events': 'sum'}).reset_index()
# final join with profiles
final = pd.merge(profiles, agg, on='user_id', how='left')
That sequence respects the principle: compute/transform efficiently, shrink with aggregation, then join.
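Here is the same sequence made runnable end-to-end with tiny made-up data (names and values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for real log and profile tables
logs = pd.DataFrame({
    'user_id': [1, 1, 2],
    'duration': [4.0, 16.0, 9.0],
    'n_events': [2, 3, 5],
})
profiles = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['ann', 'bob', 'cat']})

# 1) vectorized transform (NumPy)
logs['duration_sqrt'] = np.sqrt(logs['duration'])
# 2) shrink with a GroupBy aggregation
agg = logs.groupby('user_id').agg({'duration_sqrt': 'mean', 'n_events': 'sum'}).reset_index()
# 3) join the compact table onto profiles (left join keeps users with no logs)
final = pd.merge(profiles, agg, on='user_id', how='left')
```

User 3 has no log rows, so the left join leaves its aggregated columns as NaN — exactly the behavior you want for "keep everyone from the profile table".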
Quick checklist before you merge
- Are the keys the same dtype? (int vs str mismatch is a classic bug)
- Do you need left/right/outer/inner? (pick explicitly)
- Will duplicate column names collide? Use suffixes.
- If performance matters, can you aggregate or index first?
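The dtype item on that checklist is the one that bites most often. A sketch of the classic int-vs-str key bug and its fix (data is made up):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': ['1', '2'], 'y': [10, 20]})  # note: string keys!

# Merging a str key against an int key either raises or matches nothing,
# depending on the pandas version. Align the dtypes explicitly first:
right['id'] = right['id'].astype('int64')
fixed = pd.merge(left, right, on='id', how='inner')
```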
"This is the moment where the concept finally clicks." — when you realize joins are just controlled alignment of rows and concat is stacking.
Key takeaways
- Use merge for SQL-like relational joins on columns; join for index-aligned merges; concat for stacking.
- Convert keys to efficient dtypes, prefer categorical for repeated keys, and avoid repeated small concatenations.
- Merge after reducing data with vectorized ops and GroupBy when possible — it keeps joins fast and models happy.
Now go forth and unite your scattered tables. Merge thoughtfully, concat responsibly, and never, ever forget to check dtypes.