Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Series and DataFrame Basics
Pandas Series and DataFrame Basics — Fast, Labeled, and Slightly Dramatic
"If NumPy is the muscle of numerical computing, pandas is the brain—labels, context, and the occasional opinion about your column names."
You already learned how to make NumPy scream with vectorized ops, how memory layout and strides whisper sweet performance tips, and how to stash arrays to disk. Pandas builds on that muscle-memory: it wraps NumPy arrays into labeled, table-like structures and adds a ton of ergonomics for real-world data work. Let’s turn that raw power into usable insight without getting lost in index existentialism.
What are Series and DataFrame, and why they matter
- Series: a 1‑D labeled array. Think of it as a NumPy array with an index (row labels) — the labeled sibling your NumPy array always wished it had.
- DataFrame: a 2‑D tabular container — rows and columns, each column is a Series. Imagine a spreadsheet where each column has consistent dtype and each row has an index label.
Why this matters:
- Real datasets rarely come with perfectly ordered, nameless arrays. Labels, missing values, mixed dtypes, and metadata are the norm.
- Pandas gives you indexing, alignment, fast group operations, and convenience methods (head, describe, value_counts) that make exploratory data analysis fast and joyful.
Quick creation recipes (because typing beats memorizing docs)
From Python structures
import pandas as pd
import numpy as np
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({
    'age': [25, 30, 22],
    'salary': [50000, 60000, 45000]
}, index=['alice', 'bob', 'carol'])
From NumPy arrays (remember those strides?)
arr = np.arange(12).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])
Note: DataFrame stores data in NumPy-backed blocks. Your previous study of memory layout and strides matters: contiguous arrays are fast for vectorized ops. Pandas may copy or view depending on the operation — prefer .to_numpy() over the older .values attribute when you need the raw array and care about memory.
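A quick sketch of pulling the raw array back out of a frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])

# .to_numpy() hands you the underlying values; for a homogeneous
# numeric frame this is typically cheap, but pandas does not
# guarantee zero-copy in every case
raw = df2.to_numpy()
print(raw.shape)   # (4, 3)
```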
Indexing, alignment, and selection — stop guessing where the data went
Why do people keep misunderstanding indexing? Because there are at least three ways to do it.
- .loc — label-based selection (rows and columns by labels)
- .iloc — integer-position selection (like NumPy)
- [] — column selection or boolean filtering depending on context
Examples:
# label-based
df.loc['bob', 'salary']
# position-based
df.iloc[1, 1]
# select column (returns Series)
salaries = df['salary']
# boolean mask (vectorized, fast)
high_paid = df[df['salary'] > 50000]
Alignment magic: when you operate on Series with different indices, pandas aligns by index. This is both incredibly useful and the root of many "where did my NaNs come from" bugs. Always check your indices before arithmetic.
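To see alignment (and the NaN party it can throw) in action, here is a minimal sketch with made-up labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Arithmetic aligns on index labels, not position.
# Labels present in only one Series ('x' and 'w') come out NaN.
total = a + b
print(total)
```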
Vectorized ops, broadcasting, and performance notes
Pandas operations are often vectorized — behind the scenes they're calling NumPy. So your existing knowledge of broadcasting applies. But remember:
- Pandas may create copies when aligning indices; this can cost memory and time.
- Use .values or .to_numpy() when you need raw NumPy performance for tight loops, but beware: you lose label safety.
- For expression-heavy workloads, try pd.eval() or df.eval() — these can use NumExpr under the hood, which you saw earlier as a performance booster for numerical expressions.
Example:
# vectorized column arithmetic
df['net'] = df['salary'] * 0.8 - df['age'] * 100
# using eval for potentially faster evaluation
df.eval('net = salary * 0.8 - age * 100', inplace=True)
Handling missing data (the polite way to say "NaN party")
Pandas treats missing values explicitly with NaN or NaT. Methods to handle missingness:
- df.dropna() — remove rows/cols with missing values
- df.fillna(value) — fill with a value
- df.isna(), df.notna() — boolean masks for missingness
A pragmatic pattern:
- Inspect df.isna().sum() to find culprits
- Decide remove vs impute (domain knowledge helps)
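The inspect-then-decide pattern above, sketched on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 22],
    'salary': [50000, 60000, np.nan]
})

# Step 1: find the culprits — missing count per column
print(df.isna().sum())

# Step 2, option A: drop rows with any missing value
dropped = df.dropna()

# Step 2, option B: impute, e.g. with the column mean
imputed = df.fillna(df.mean())
```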
Dtypes and memory: bring NumPy's memory lessons
Columns have dtypes: int64, float64, object, category, datetime64, etc. Object dtype is slow and memory-heavy — it's the ragtag union of Python objects.
Tips:
- Convert text columns with few unique values to category to save memory (like replacing an object array with an integer-backed categorical).
- Use smaller numeric dtypes if values permit (int32, float32).
- When reading large files, explicitly set dtypes or use pandas' memory-saving parameters.
This ties back to what you learned about memory layout: structured, contiguous numeric blocks are faster. Pandas tries to store columns in homogeneous blocks; minimize object dtypes.
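A quick way to see the category savings for yourself (the column here is invented for illustration):

```python
import pandas as pd

# A low-cardinality text column: many rows, few unique values
df = pd.DataFrame({'city': ['Oslo', 'Paris', 'Oslo', 'Lyon'] * 1000})

# deep=True counts the actual Python string objects, not just pointers
before = df.memory_usage(deep=True)['city']
df['city'] = df['city'].astype('category')
after = df.memory_usage(deep=True)['city']

print(before, after)  # the categorical column should be much smaller
```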
Common idioms that make you look like you know what you're doing
- df.head(), df.tail() — peek at the data
- df.describe() — quick numeric summary
- df.info() — dtypes and memory usage
- df['col'].value_counts() — frequency counts
- df.sort_values('col') — order it up
Why this matters: efficient exploration is 80% of data science. Pandas is your Swiss Army knife for this.
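The idioms above in one toy exploration pass (data made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['eng', 'sales', 'eng', 'hr'],
    'salary': [60000, 50000, 55000, 45000]
})

print(df.head(2))                 # peek at the first rows
print(df['dept'].value_counts())  # frequency per department
print(df.sort_values('salary', ascending=False))  # highest paid first
```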
Mini example: combine everything
# Suppose you have a NumPy array of monthly sales
sales = np.array([100, 200, 150, np.nan])
months = ['Jan', 'Feb', 'Mar', 'Apr']
s = pd.Series(sales, index=months, name='sales')
# Fill missing data
s_filled = s.fillna(s.mean())
# Make a DataFrame of product A and B
df = pd.DataFrame({
'productA': s_filled,
'productB': s_filled * 0.9
})
# Add computed column using eval (fast)
df.eval('total = productA + productB', inplace=True)
# Quick numeric summary
print(df.describe())
This shows: create from NumPy, handle NaN, vectorized ops, and eval (hello NumExpr benefits).
Key takeaways (aka the little list you’ll refer to at 2am)
- Series = 1D labeled array, DataFrame = 2D labeled table. Labels matter.
- Pandas leverages NumPy under the hood — memory layout and dtype choices still shape performance.
- Use .loc/.iloc for predictable indexing; expect alignment during arithmetic.
- Convert heavy object columns to category and use smaller numeric dtypes where appropriate.
- For expression-heavy operations, consider df.eval() to leverage NumExpr speedups you learned earlier.
"Treat labels like contracts: they promise alignment. Break them only with intention."
Where to go next
Next: deeper groupby patterns, joins/merges, time-series tricks, and IO (efficient formats: parquet, feather — those are your friends after CSV). Also practice: take a messy CSV, load with pandas, inspect dtypes, convert types, and run a few vectorized transforms.
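As a warm-up for that practice run, here is a sketch of the load-inspect-convert loop; the CSV content is an in-memory stand-in (via io.StringIO) for a file on disk, and the columns are invented:

```python
import io
import pandas as pd

# Stand-in for a messy CSV file — note the missing values
csv_text = "name,age,salary\nalice,25,50000\nbob,,60000\ncarol,22,\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # inspect what pandas inferred (age/salary become float)

# Convert and clean: impute missing ages, downcast to a smaller dtype
df['age'] = df['age'].fillna(df['age'].mean()).astype('float32')
print(df.head())
```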
Happy data wrangling. Pandas will save your time, but it won't rescue terrible column names. Rename them.