© 2026 jypi. All rights reserved.

Python for Data Science, AI & Development
Data Analysis with pandas

Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.

Merge, Join, and Concat in pandas — The Matchmaking & Stacking Guide

Imagine you have two spreadsheets that kind of belong together: one has ages, the other has test scores. How do you make them a power couple without losing anyone's data or sanity?

You're coming in hot from NumPy's vectorized wizardry and pandas' GroupBy soirées. Now it's time for relational choreography: combining tables. Merge, join, and concat are your moves — the tools you use to stitch datasets into one clean, analysis-ready DataFrame.


Why this matters (and where you see it)

  • Combining feature tables with labels for machine learning.
  • Appending monthly/weekly logs to build a full time series.
  • Joining user profiles (table A) with transactions (table B).

Think of merge like speed dating, join like inviting someone to your couch by index, and concat like stacking pancakes.


Quick map: merge vs join vs concat

  • pd.merge(df1, df2, how=... , on=...) — SQL-style join on columns. Most flexible for relational joins.
  • df.join(df2, how=..., on=...) — Convenience method that joins on index (or on a column if you set it). Nice for index-aligned merges.
  • pd.concat([df1, df2], axis=0/1) — Stack vertically (axis=0) or horizontally (axis=1). Equivalent to gluing DataFrames end-to-end.

Micro explanation: When to pick which

  • Use merge when you have explicit key columns (like user_id) and want SQL-like control.
  • Use join when your DataFrames are aligned by index (time series, reindexed arrays).
  • Use concat when you're stacking many DataFrames (log files, batch results) or aligning columns side-by-side.
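To make the distinction concrete, here is a minimal sketch (hypothetical user_id data) showing all three on the same pair of tables:

```python
import pandas as pd

left = pd.DataFrame({'user_id': [1, 2], 'age': [25, 30]})
right = pd.DataFrame({'user_id': [2, 3], 'score': [88, 92]})

# merge: SQL-style join on a key column
merged = pd.merge(left, right, on='user_id', how='inner')

# join: align by index (set the key as the index first)
joined = left.set_index('user_id').join(right.set_index('user_id'), how='inner')

# concat: stack rows end-to-end, no key matching at all
stacked = pd.concat([left, right], axis=0, ignore_index=True)

print(merged)   # one row: user_id 2
print(joined)   # same data, keyed by the index
print(stacked)  # four rows, NaN where columns don't overlap
```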

pd.merge — the SQL power move

Basic example:

import pandas as pd

left = pd.DataFrame({'user_id': [1,2,3], 'age': [25, 30, 22]})
right = pd.DataFrame({'user_id': [2,3,4], 'score': [88, 92, 75]})

pd.merge(left, right, on='user_id', how='inner')

Result: rows for user_id 2 and 3 only (inner join).

Key parameters:

  • how: 'left', 'right', 'outer', 'inner'
  • on: column(s) to join
  • left_on, right_on: when keys have different names
  • suffixes=('_L','_R'): avoid column-name collisions
  • indicator=True: adds a column _merge telling you join origin — great for debugging
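A quick sketch of left_on/right_on and suffixes together, with made-up column names (uid vs user_id, and a score column on both sides):

```python
import pandas as pd

users = pd.DataFrame({'uid': [1, 2], 'name': ['Ana', 'Bo'], 'score': [10, 20]})
results = pd.DataFrame({'user_id': [1, 2], 'score': [88, 92]})

# keys have different names, and both tables carry a 'score' column
out = pd.merge(users, results,
               left_on='uid', right_on='user_id',
               how='inner', suffixes=('_profile', '_test'))

# the colliding 'score' columns come out as score_profile / score_test
print(out.columns.tolist())
```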

Example with indicator:

pd.merge(left, right, on='user_id', how='outer', indicator=True)
# _merge values: 'left_only', 'right_only', 'both'

Practical tip: When joining on categorical keys, convert to pd.Categorical to save memory and sometimes speed up the join.
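For example (illustrative country codes), sharing one CategoricalDtype across both frames keeps the categories aligned on both sides of the merge:

```python
import pandas as pd

orders = pd.DataFrame({'country': ['US', 'DE', 'US', 'FR'] * 1000,
                       'amount': range(4000)})
rates = pd.DataFrame({'country': ['US', 'DE', 'FR'],
                      'vat': [0.0, 0.19, 0.20]})

# one shared dtype so the categories line up in both tables
cat = pd.CategoricalDtype(categories=['US', 'DE', 'FR'])
orders['country'] = orders['country'].astype(cat)
rates['country'] = rates['country'].astype(cat)

merged = orders.merge(rates, on='country', how='left')
print(merged['vat'].isna().sum())  # 0: every country matched
```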


df.join — index-first, lazy cousin of merge

If your DataFrames are time-indexed, or you've already set the key as the index, join is concise:

left = pd.DataFrame({'age': [25,30,22]}, index=[1,2,3])
right = pd.DataFrame({'score': [88,92,75]}, index=[2,3,4])

left.join(right, how='outer')

This aligns by index. You can also do left.join(right, on='some_column'), in which case pandas matches the values of left's some_column against right's index — handy when one table carries the key as a column and the other as its index.

Micro explanation: join is great when your primary key is the DataFrame's index — common in time series, hierarchical indices, or after set_index.
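A short sketch of the on= variant described above, with hypothetical data: left carries the key as a column, right as its index:

```python
import pandas as pd

left = pd.DataFrame({'user_id': [1, 2, 3], 'age': [25, 30, 22]})
right = pd.DataFrame({'score': [88, 92, 75]}, index=[2, 3, 4])

# left's user_id column is matched against right's index (left join by default)
out = left.join(right, on='user_id')
print(out)
# user_id 1 gets NaN (no index 1 in right); 2 and 3 pick up their scores
```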


pd.concat — stacking and stitching

Concatenate vertically (append rows):

pd.concat([df_jan, df_feb, df_mar], axis=0, ignore_index=True)

Concatenate horizontally (add columns):

pd.concat([df_left, df_right], axis=1)

Important options:

  • axis: 0 (rows) or 1 (columns)
  • ignore_index=True: reset index in the result
  • keys=['jan','feb']: create a hierarchical index to track origins
  • join='inner' or 'outer': when frames have differing columns (axis=0), 'outer' keeps the union of columns (filling NaN) and 'inner' keeps only the intersection
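A small sketch of keys= and join='inner' with made-up monthly frames:

```python
import pandas as pd

jan = pd.DataFrame({'sales': [100, 200], 'region': ['N', 'S']})
feb = pd.DataFrame({'sales': [150, 250], 'returns': [5, 8]})

# keys= builds a hierarchical index so each row remembers its origin
tracked = pd.concat([jan, feb], keys=['jan', 'feb'])
print(tracked.loc['feb'])  # just the February rows

# join='inner' keeps only the columns both frames share
common = pd.concat([jan, feb], join='inner')
print(common.columns.tolist())  # ['sales']
```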

Use case: you have 120 daily CSVs — pd.concat is your fast friend. Read them into a list of DataFrames and concat once; repeated concatenation inside a loop copies the growing result on every iteration, which makes the total work quadratic.
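Here is that pattern as a sketch; io.StringIO stands in for the real file paths:

```python
import io
import pandas as pd

# simulate three daily CSVs (in practice these would be file paths)
daily_files = [
    io.StringIO("day,sales\n1,100\n1,50"),
    io.StringIO("day,sales\n2,75"),
    io.StringIO("day,sales\n3,120\n3,30"),
]

# read into a list first, then concat once (never concat inside the loop)
frames = [pd.read_csv(f) for f in daily_files]
all_days = pd.concat(frames, ignore_index=True)

print(len(all_days))  # 5 rows with a clean 0..4 index
```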


Practical examples and gotchas

  1. Duplicate column names
pd.merge(df1, df2, on='id', suffixes=('_left','_right'))
  2. Merging many tables (like feature engineering)
  • Chaining merges is fine, but consider reducing early with GroupBy operations (you already know how to GroupBy and aggregate) so you merge smaller tables.
  • Example flow: raw logs -> groupby user_id aggregations (smaller) -> merge with profile table -> model.
  3. Performance tips
  • Keep keys as integers or categoricals; string keys are slower.
  • If merging large tables, try an inner join early to reduce size, or pre-filter rows.
  • Use indexes: set_index on join keys when doing many repeated joins — index-based joins are faster.
  4. append is deprecated; prefer concat

df.append(df2) used to be common, but it was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat([df, df2]) is the modern approach.
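The migration is mechanical; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# old (removed in pandas 2.0): df.append(df2)
# modern equivalent:
combined = pd.concat([df, df2], ignore_index=True)
print(combined['a'].tolist())  # [1, 2, 3, 4]
```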


Example workflow that builds on prior lessons

  1. NumPy gave you fast arrays and vectorized ops — use them to compute features quickly (vectorized transforms).
  2. Use GroupBy to aggregate logs per user into a compact table (you already handled GroupBy earlier).
  3. Use pd.merge to join aggregated features to labels for model training.

Code sketch:

import numpy as np
import pandas as pd

# vectorized transformations using NumPy + pandas
# (logs and profiles are assumed to exist from earlier steps)
logs['duration_sqrt'] = np.sqrt(logs['duration'])  # NumPy used earlier

# aggregate
agg = logs.groupby('user_id').agg({'duration_sqrt':'mean', 'n_events':'sum'}).reset_index()

# final join with profiles
final = pd.merge(profiles, agg, on='user_id', how='left')

That sequence respects the principle: compute/transform efficiently, shrink with aggregation, then join.


Quick checklist before you merge

  • Are the keys the same dtype? (int vs str mismatch is a classic bug)
  • Do you need left/right/outer/inner? (pick explicitly)
  • Will duplicate column names collide? Use suffixes.
  • If performance matters, can you aggregate or index first?
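The first two checklist items can be handled in a few lines; a sketch with deliberately mismatched dtypes (string ids on one side):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'id': ['1', '2'], 'y': [10, 20]})  # ids are strings!

# checklist: compare key dtypes before merging, and align them
if left['id'].dtype != right['id'].dtype:
    right['id'] = right['id'].astype(left['id'].dtype)

# checklist: pick `how` explicitly, and use indicator=True to audit the match
out = pd.merge(left, right, on='id', how='outer', indicator=True)
print(out['_merge'].value_counts())
```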

"This is the moment where the concept finally clicks." — when you realize joins are just controlled alignment of rows and concat is stacking.


Key takeaways

  • Use merge for SQL-like relational joins on columns; join for index-aligned merges; concat for stacking.
  • Convert keys to efficient dtypes, prefer categorical for repeated keys, and avoid repeated small concatenations.
  • Merge after reducing data with vectorized ops and GroupBy when possible — it keeps joins fast and models happy.

Now go forth and unite your scattered tables. Merge thoughtfully, concat responsibly, and never, ever forget to check dtypes.
