Data Wrangling with NumPy and Pandas
Transform raw data into analysis-ready datasets using vectorized operations and powerful tabular transformations with NumPy and Pandas.
Merging and Joining DataFrames
Your DataFrames are about to network better than you at a conference.
You already sliced and diced DataFrames like a pro in Data Selection and Filtering and met the Pandas DataFrame itself at a formal dinner party. Now we put those skills to work: combining tables. Think SQL joins, but in Pandas pajamas. If you remember left/inner/right/outer from Working with Data Sources and SQL, you are already 60% of the way there. The remaining 40% is avoiding chaos when your keys are messy and your joins explode like a popcorn bag left in the microwave for 20 minutes.
What merging means in Pandas (translation layer from SQL)
- SQL brain: JOIN tables on keys using left/inner/right/outer and maybe cross.
- Pandas brain: use pd.merge, DataFrame.join, and pd.concat; choose keys via columns or index; keep track of duplicates and data types.
Here is your quick map:
| Pandas method | What it does | When to use |
|---|---|---|
| pd.merge(left, right, ...) | SQL-style joins on columns or index | Most flexible and explicit |
| left.join(right, ...) | Joins using index (or a column on left to index on right) | When your keys live in the index |
| pd.concat([...], axis=0/1) | Stack rows (axis=0) or align columns (axis=1) | When you are gluing frames without a key |
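Before the key-based joins, a minimal sketch of the no-key case: pd.concat simply stacks frames without matching anything. The jan/feb frames here are invented for illustration.

```python
import pandas as pd

# Two frames with the same columns, e.g. monthly exports
jan = pd.DataFrame({'sku': ['A', 'B'], 'sold': [3, 5]})
feb = pd.DataFrame({'sku': ['A', 'C'], 'sold': [2, 7]})

# axis=0 stacks rows; ignore_index=True renumbers the index 0..n-1
stacked = pd.concat([jan, feb], axis=0, ignore_index=True)
print(len(stacked))  # 4 rows; no key matching involved
```

Note that sku 'A' now appears twice: concat glues, it does not deduplicate or align on keys.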
And the join flavors you know and love:
- how='inner' keeps only matches in both.
- how='left' keeps everything from left; fills missing from right with NaN.
- how='right' mirrors left: keeps everything from right.
- how='outer' union of all keys; NaNs where no match.
- how='cross' Cartesian product; use with caution and snacks.
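The row counts these flavors produce can be sketched on two tiny invented frames (how='cross' assumes pandas 1.2+):

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 2, 3], 'l': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'r': ['x', 'y', 'z']})

counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    counts[how] = len(pd.merge(left, right, on='key', how=how))
counts['cross'] = len(pd.merge(left, right, how='cross'))  # no key: every pair

print(counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4, 'cross': 9}
```

Inner keeps only keys 2 and 3; outer adds the unmatched 1 and 4; cross pairs all 3 x 3 rows.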
Basic merge: customers and orders
Let’s ground this. You pulled customers from SQL and scraped orders from an API. Time to reconcile.
import pandas as pd
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 3],
'name': ['Ari', 'Bao', 'Cam', 'Cam-alt']
})
orders = pd.DataFrame({
'order_id': [10, 11, 12, 13],
'customer_id': [1, 1, 2, 4],
'amount': [20, 35, 15, 50]
})
left_joined = pd.merge(customers, orders, on='customer_id', how='left')
Result (conceptually):
| customer_id | name | order_id | amount |
|---|---|---|---|
| 1 | Ari | 10 | 20 |
| 1 | Ari | 11 | 35 |
| 2 | Bao | 12 | 15 |
| 3 | Cam | NaN | NaN |
| 3 | Cam-alt | NaN | NaN |
Notes you should tattoo on your brain:
- Duplicates on either side can create multiple rows (one-to-many, many-to-one, or many-to-many). That is normal; you just have to expect it.
- If you intended a one-to-one merge but got duplicate rows, something is wrong in your keys.
Pro tip: enforce expectations with validate.
pd.merge(customers, orders, on='customer_id', how='left', validate='one_to_many')
# Raises if left has duplicate customer_id when you expect unique.
Gotchas that will haunt your joins
1) Mismatched dtypes
If customers.customer_id is a string and orders.customer_id is an integer, you get zero matches and infinite sadness.
customers['customer_id'] = customers['customer_id'].astype('Int64') # or str, but match both sides
orders['customer_id'] = orders['customer_id'].astype('Int64')
2) Ambiguous column names
Overlapping non-key columns get suffixed.
merged = pd.merge(customers, orders, on='customer_id', how='left', suffixes=('_cust', '_ord'))
Now you have name_cust vs name_ord instead of cursed overwrites.
3) Many-to-many blow-ups
If both sides have duplicate keys, every match pairs with every match. That can turn 10k rows into 50 million rows faster than you can say memory error.
left = pd.DataFrame({'color': ['red', 'red', 'blue'], 'item': ['shirt', 'hat', 'scarf']})
right = pd.DataFrame({'color': ['red', 'red', 'blue'], 'supplier': ['A', 'B', 'C']})
# Red pairs every match with every match: 2 left x 2 right = 4 rows, plus 1 blue row = 5 total
mm = pd.merge(left, right, on='color', how='inner', validate='many_to_many')
When you did not intend that, dedupe first or validate with one_to_many/many_to_one to catch mistakes.
4) Hidden non-matches
Use indicator to see who matched.
check = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
# check['_merge'] shows 'left_only', 'right_only', 'both'
Filter left_only to find customers with no orders, and right_only to find orders with foreign keys that do not exist in customers.
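A sketch of that audit, using invented frames shaped like the customers/orders example above:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ari', 'Bao', 'Cam']})
orders = pd.DataFrame({'order_id': [10, 11], 'customer_id': [1, 4], 'amount': [20, 50]})

check = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)

no_orders = check[check['_merge'] == 'left_only']   # customers with no orders
orphans = check[check['_merge'] == 'right_only']    # orders with unknown customers

print(sorted(no_orders['customer_id'].tolist()))  # [2, 3]
print(sorted(orphans['customer_id'].tolist()))    # [4]
```

Order 11 points at customer 4, who does not exist on the left: exactly the kind of broken foreign key that a plain inner join would silently drop.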
If your join results surprise you, interrogate the keys: dtype, whitespace, case, and hidden duplicates are the usual suspects.
Joining on index vs columns
You can join on the index with DataFrame.join or pd.merge.
depts = pd.DataFrame({
'dept_name': ['Sales', 'Eng', 'Design']
}, index=pd.Index([10, 20, 30], name='dept_id'))
employees = pd.DataFrame({
'emp': ['Ana', 'Ben', 'Cy'],
'dept_id': [10, 99, 20]
})
# Option A: join using employees column to depts index
joined = employees.join(depts, on='dept_id', how='left')
# Option B: merge using index flags
joined2 = pd.merge(employees, depts, left_on='dept_id', right_index=True, how='left')
Rule of thumb:
- If your natural key is the index, join is elegant and fast.
- If you need to match multiple columns or differently named keys, pd.merge is clearer.
Multiple-key merges
Sometimes one key is not enough. Region plus quarter, or symbol plus date.
sales = pd.DataFrame({
'region': ['NE', 'NE', 'SW'],
'quarter': [1, 2, 1],
'revenue': [100, 110, 90]
})
targets = pd.DataFrame({
'region': ['NE', 'NE', 'SW'],
'quarter': [1, 2, 1],
'target': [95, 120, 85]
})
combo = pd.merge(sales, targets, on=['region', 'quarter'], how='inner')
If your key names differ, use left_on and right_on.
Outer joins and clean-up
Outer joins are your reconciliation tool: bring everything together, then fill or coalesce.
combined = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
combined['amount'] = combined['amount'].fillna(0) # choose sensible defaults
After an outer join, it is common to:
- fillna for numeric columns
- use where or combine_first to prefer non-null columns
- filter by _merge to audit unmatched records
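One way to coalesce is combine_first, sketched here on two hypothetical email sources aligned on the index; non-null values from the left frame win.

```python
import pandas as pd

# Two overlapping sources of customer emails (invented data)
crm = pd.DataFrame({'customer_id': [1, 2],
                    'email': ['a@x.com', None]}).set_index('customer_id')
signup = pd.DataFrame({'customer_id': [2, 3],
                       'email': ['b@y.com', 'c@z.com']}).set_index('customer_id')

# Prefer crm's non-null values, fill the gaps from signup
emails = crm.combine_first(signup)
print(emails['email'].tolist())  # ['a@x.com', 'b@y.com', 'c@z.com']
```

Customer 2's missing crm email is filled from signup, and customer 3 (crm has never heard of them) survives because combine_first takes the union of both indexes.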
Cross joins (aka generate all combos)
Use sparingly, but powerful for grids and parameter sweeps.
A = pd.DataFrame({'size': ['S', 'M', 'L']})
B = pd.DataFrame({'color': ['red', 'black']})
cart = pd.merge(A, B, how='cross')
Advanced merges you will brag about later
- merge_asof: approximate join on nearest key. Great for time-series matching with tolerance.
trades = pd.DataFrame({'time': pd.to_datetime(['2024-01-01 09:00:00', '2024-01-01 09:00:05']),
                       'price': [100, 101]})
quotes = pd.DataFrame({'time': pd.to_datetime(['2024-01-01 08:59:59', '2024-01-01 09:00:03']),
                       'bid': [99, 100]})
matched = pd.merge_asof(trades.sort_values('time'),
                        quotes.sort_values('time'),
                        on='time', direction='backward', tolerance=pd.Timedelta('5s'))
- merge_ordered: like outer join but keeps order and can forward-fill for time series.
ordered = pd.merge_ordered(sales, targets, on=['region','quarter'], fill_method='ffill')
Performance and sanity checks
Because joins can be spicy:
- Get your dtypes aligned before merging.
- Trim columns early: select only the columns you need from the right table.
- Avoid accidental many-to-many: check key uniqueness with groupby or value_counts.
assert customers['customer_id'].is_unique # if you expect one row per customer
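The value_counts audit can be sketched like this; the orders frame mirrors the earlier example, with customer 1 appearing twice:

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 1, 2, 4], 'amount': [20, 35, 15, 50]})

# Keys that appear more than once signal a one-to-many (or worse) relationship
counts = orders['customer_id'].value_counts()
dupes = counts[counts > 1]
print(dupes.to_dict())  # {1: 2}
```

An empty dupes result means the column is safe to treat as the "one" side of a merge.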
- Use validate to enforce relationship shape.
- Consider categoricals for repeated string keys to save memory.
orders['status'] = orders['status'].astype('category') # example column
- Pre-filter large rights by key set to reduce work.
keys = customers['customer_id'].unique()
orders_small = orders[orders['customer_id'].isin(keys)]
- Keep order stable by passing sort=False in pd.merge (default), and only sort when you actually need to.
A join you can explain is a join you can trust. If you cannot summarize what matched and what did not, pause and instrument with indicator and validate.
Mini checklist before you press run
- Do both sides use the same dtype for the key columns?
- Are you clear on the relationship? one-to-one, one-to-many, or many-to-many?
- Do you know which rows you intend to keep? Pick how accordingly.
- Have you planned for unmatched rows? NaNs, fillna, or audits via indicator?
- Are overlapping column names handled via suffixes or renaming?
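One way to bundle several checklist items into a habit is a small wrapper; safe_merge below is a hypothetical helper (not a pandas API) that validates the relationship and reports what matched.

```python
import pandas as pd

def safe_merge(left, right, on, how='left', validate='one_to_many'):
    """Merge with relationship validation plus a match report.

    Hypothetical convenience wrapper; the name and defaults are illustrative.
    """
    merged = pd.merge(left, right, on=on, how=how,
                      validate=validate, indicator=True)
    # Summarize 'both' / 'left_only' / 'right_only' before dropping the marker
    report = merged['_merge'].value_counts().to_dict()
    return merged.drop(columns='_merge'), report

customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ari', 'Bao', 'Cam']})
orders = pd.DataFrame({'customer_id': [1, 1, 4], 'amount': [20, 35, 50]})

result, report = safe_merge(customers, orders, on='customer_id')
print(report)  # {'both': 2, 'left_only': 2, 'right_only': 0}
```

If the relationship assumption is wrong (say, duplicate customer_id on the left), validate raises immediately instead of letting a silent blow-up reach production.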
Wrap-up
Merging and joining are where raw data becomes insight-grade. You used your SQL instincts, layered in Pandas mechanics, and learned how to prevent the classic three disasters: wrong dtypes, duplicate keys, and silent non-matches. From here, your next power-up is to reshape and summarize: groupby for aggregations, and pivot/melt for reformatting. But today, you earned the right to say: my DataFrames have a healthy social life.
Final mantra: join deliberately, validate aggressively, and log the weird stuff. Future you will send past you a fruit basket.