Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Series and DataFrame Basics
Pandas Series and DataFrame Basics — Fast, Labeled, and Slightly Dramatic
"If NumPy is the muscle of numerical computing, pandas is the brain—labels, context, and the occasional opinion about your column names."
You already learned how to make NumPy scream with vectorized ops, how memory layout and strides whisper sweet performance tips, and how to stash arrays to disk. Pandas builds on that muscle-memory: it wraps NumPy arrays into labeled, table-like structures and adds a ton of ergonomics for real-world data work. Let’s turn that raw power into usable insight without getting lost in index existentialism.
What are Series and DataFrame, and why they matter
- Series: a 1‑D labeled array. Think of it as a NumPy array with an index (row labels) — the labeled sibling your NumPy array always wished it had.
- DataFrame: a 2‑D tabular container — rows and columns, each column is a Series. Imagine a spreadsheet where each column has consistent dtype and each row has an index label.
Why this matters:
- Real datasets rarely come with perfectly ordered, nameless arrays. Labels, missing values, mixed dtypes, and metadata are the norm.
- Pandas gives you indexing, alignment, fast group operations, and convenience methods (head, describe, value_counts) that make exploratory data analysis fast and joyful.
Quick creation recipes (because typing beats memorizing docs)
From Python structures
import pandas as pd
import numpy as np
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({
    'age': [25, 30, 22],
    'salary': [50000, 60000, 45000]
}, index=['alice', 'bob', 'carol'])
From NumPy arrays (remember those strides?)
arr = np.arange(12).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])
Note: DataFrame stores data in NumPy-backed blocks. Your previous study of memory layout and strides matters: contiguous arrays are fast for vectorized ops. Pandas may copy or view depending on the operation — prefer .to_numpy() over the older .values attribute when you need the raw array and care about memory.
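A quick sketch of pulling the raw array back out of a frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])

# .to_numpy() hands you the underlying values; for a homogeneous
# numeric frame this is typically cheap, but pandas does not
# guarantee zero-copy in every case
raw = df2.to_numpy()
print(raw.shape)   # (4, 3)
```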
Indexing, alignment, and selection — stop guessing where the data went
Why do people keep misunderstanding indexing? Because there are at least three ways to do it.
- .loc — label-based selection (rows and columns by labels)
- .iloc — integer-position selection (like NumPy)
- [] — column selection or boolean filtering depending on context
Examples:
# label-based
df.loc['bob', 'salary']
# position-based
df.iloc[1, 1]
# select column (returns Series)
salaries = df['salary']
# boolean mask (vectorized, fast)
high_paid = df[df['salary'] > 50000]
Alignment magic: when you operate on Series with different indices, pandas aligns by index. This is both incredibly useful and the root of many "where did my NaNs come from" bugs. Always check your indices before arithmetic.
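To see alignment (and the NaN party it can throw) in action, here is a minimal sketch with made-up labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Arithmetic aligns on index labels, not position.
# Labels present in only one Series ('x' and 'w') come out NaN.
total = a + b
print(total)
```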
Vectorized ops, broadcasting, and performance notes
Pandas operations are often vectorized — behind the scenes they're calling NumPy. So your existing knowledge of broadcasting applies. But remember:
- Pandas may create copies when aligning indices; this can cost memory and time.
- Use .values or .to_numpy() when you need raw NumPy performance for tight loops, but beware: you lose label safety.
- For expression-heavy workloads, try pd.eval() or df.eval() — these can use NumExpr under the hood, which you saw earlier as a performance booster for numerical expressions.
Example:
# vectorized column arithmetic
df['net'] = df['salary'] * 0.8 - df['age'] * 100
# using eval for potentially faster evaluation
df.eval('net = salary * 0.8 - age * 100', inplace=True)
Handling missing data (the polite way to say "NaN party")
Pandas treats missing values explicitly with NaN or NaT. Methods to handle missingness:
- df.dropna() — remove rows/cols with missing values
- df.fillna(value) — fill with a value
- df.isna(), df.notna() — boolean masks for missingness
A pragmatic pattern:
- Inspect df.isna().sum() to find culprits
- Decide remove vs impute (domain knowledge helps)
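The inspect-then-decide pattern above, sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 22],
    'salary': [50000, 60000, np.nan]
})

# Step 1: find the culprits — missing count per column
print(df.isna().sum())

# Step 2, option A: drop rows with any missing value
dropped = df.dropna()

# Step 2, option B: impute, e.g. with the column mean
imputed = df.fillna(df.mean())
```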
Dtypes and memory: bring NumPy's memory lessons
Columns have dtypes: int64, float64, object, category, datetime64, etc. Object dtype is slow and memory-heavy — it's the ragtag union of Python objects.
Tips:
- Convert text columns with few unique values to category to save memory (like replacing an object array with an integer-backed categorical).
- Use smaller numeric dtypes if values permit (int32, float32).
- When reading large files, explicitly set dtypes or use pandas' memory-saving parameters.
This ties back to what you learned about memory layout: structured, contiguous numeric blocks are faster. Pandas tries to store columns in homogeneous blocks; minimize object dtypes.
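A quick way to see the category savings for yourself (the column here is invented for illustration):

```python
import pandas as pd

# A low-cardinality text column: many rows, few unique values
df = pd.DataFrame({'city': ['Oslo', 'Paris', 'Oslo', 'Lyon'] * 1000})

# deep=True counts the actual Python string objects, not just pointers
before = df.memory_usage(deep=True)['city']
df['city'] = df['city'].astype('category')
after = df.memory_usage(deep=True)['city']

print(before, after)  # the categorical column should be much smaller
```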
Common idioms that make you look like you know what you're doing
- df.head(), df.tail() — peek at the data
- df.describe() — quick numeric summary
- df.info() — dtypes and memory usage
- df['col'].value_counts() — frequency counts
- df.sort_values('col') — order it up
Why this matters: efficient exploration is 80% of data science. Pandas is your Swiss Army knife for this.
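The idioms above in one toy exploration pass (data made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['eng', 'sales', 'eng', 'hr'],
    'salary': [60000, 50000, 55000, 45000]
})

print(df.head(2))                 # peek at the first rows
print(df['dept'].value_counts())  # frequency per department
print(df.sort_values('salary', ascending=False))  # highest paid first
```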
Mini example: combine everything
# Suppose you have a NumPy array of monthly sales
sales = np.array([100, 200, 150, np.nan])
months = ['Jan', 'Feb', 'Mar', 'Apr']
s = pd.Series(sales, index=months, name='sales')
# Fill missing data
s_filled = s.fillna(s.mean())
# Make a DataFrame of product A and B
df = pd.DataFrame({
'productA': s_filled,
'productB': s_filled * 0.9
})
# Add computed column using eval (fast)
df.eval('total = productA + productB', inplace=True)
# Quick numeric summary
print(df.describe())
This shows: create from NumPy, handle NaN, vectorized ops, and eval (hello NumExpr benefits).
Key takeaways (aka the little list you’ll refer to at 2am)
- Series = 1D labeled array, DataFrame = 2D labeled table. Labels matter.
- Pandas leverages NumPy under the hood — memory layout and dtype choices still shape performance.
- Use .loc/.iloc for predictable indexing; expect alignment during arithmetic.
- Convert heavy object columns to category and use smaller numeric dtypes where appropriate.
- For expression-heavy operations, consider df.eval() to leverage NumExpr speedups you learned earlier.
"Treat labels like contracts: they promise alignment. Break them only with intention."
Where to go next
Next: deeper groupby patterns, joins/merges, time-series tricks, and IO (efficient formats: parquet, feather — those are your friends after CSV). Also practice: take a messy CSV, load with pandas, inspect dtypes, convert types, and run a few vectorized transforms.
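As a warm-up for that practice run, here is a sketch of the load-inspect-convert loop; the CSV content is an in-memory stand-in (via io.StringIO) for a file on disk, and the columns are invented:

```python
import io
import pandas as pd

# Stand-in for a messy CSV file — note the missing values
csv_text = "name,age,salary\nalice,25,50000\nbob,,60000\ncarol,22,\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # inspect what pandas inferred (age/salary become float)

# Convert and clean: impute missing ages, downcast to a smaller dtype
df['age'] = df['age'].fillna(df['age'].mean()).astype('float32')
print(df.head())
```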
Happy data wrangling. Pandas will save your time, but it won't rescue terrible column names. Rename them.