
Python for Data Science, AI & Development

Data Analysis with pandas


Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.


Pandas Series and DataFrame Basics — Fast, Labeled, and Slightly Dramatic

"If NumPy is the muscle of numerical computing, pandas is the brain—labels, context, and the occasional opinion about your column names."

You already learned how to make NumPy scream with vectorized ops, how memory layout and strides whisper sweet performance tips, and how to stash arrays to disk. Pandas builds on that muscle-memory: it wraps NumPy arrays into labeled, table-like structures and adds a ton of ergonomics for real-world data work. Let’s turn that raw power into usable insight without getting lost in index existentialism.


What are Series and DataFrame, and why they matter

  • Series: a 1‑D labeled array. Think of it as a NumPy array with an index (row labels). It stores values and an index: the sibling your NumPy array always wished it had.
  • DataFrame: a 2‑D tabular container — rows and columns, each column is a Series. Imagine a spreadsheet where each column has consistent dtype and each row has an index label.

Why this matters:

  • Real datasets rarely come with perfectly ordered, nameless arrays. Labels, missing values, mixed dtypes, and metadata are the norm.
  • Pandas gives you indexing, alignment, fast group operations, and convenience methods (head, describe, value_counts) that make exploratory data analysis fast and joyful.

Quick creation recipes (because typing beats memorizing docs)

From Python structures

import pandas as pd
import numpy as np

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

df = pd.DataFrame({
    'age': [25, 30, 22],
    'salary': [50000, 60000, 45000]
}, index=['alice', 'bob', 'carol'])

From NumPy arrays (remember those strides?)

arr = np.arange(12).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])

Note: a DataFrame stores its data in NumPy-backed blocks, so your earlier study of memory layout and strides still pays off: contiguous arrays are fast for vectorized ops. Depending on the operation, pandas may copy data or return a view — keep an eye on .to_numpy() (and the older .values) when you care about memory.
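A quick way to see the copy-vs-view distinction in action (a minimal sketch; the column names are invented, and exact view/copy behavior can vary by pandas version and dtype mix):

```python
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=np.int64).reshape(4, 3)
df2 = pd.DataFrame(arr, columns=['A', 'B', 'C'])

# to_numpy() on a homogeneous frame may avoid a copy...
a = df2.to_numpy()
print(a.shape)          # (4, 3)

# ...while copy=True forces a fresh array you can mutate safely
b = df2.to_numpy(copy=True)
b[0, 0] = 999
print(df2.iloc[0, 0])   # original frame is unchanged: 0
```

The copy=True variant costs memory but guarantees the frame stays untouched — a worthwhile trade when you hand the array to code that mutates in place.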


Indexing, alignment, and selection — stop guessing where the data went

Why do people keep misunderstanding indexing? Because there are at least three ways to do it.

  • .loc — label-based selection (rows and columns by labels)
  • .iloc — integer-position selection (like NumPy)
  • [] — column selection or boolean filtering depending on context

Examples:

# label-based
df.loc['bob', 'salary']

# position-based
df.iloc[1, 1]

# select column (returns Series)
salaries = df['salary']

# boolean mask (vectorized, fast)
high_paid = df[df['salary'] > 50000]

Alignment magic: when you operate on Series with different indices, pandas aligns by index. This is both incredibly useful and the root of many "where did my NaNs come from" bugs. Always check your indices before arithmetic.
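Here is a minimal sketch of where those surprise NaNs come from (the index labels are invented for illustration):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Arithmetic aligns on the union of the two indices; labels present
# in only one Series produce NaN in the result.
total = s1 + s2
print(total)
# a     NaN
# b    12.0
# c    23.0
# d    NaN
```

Note that the result also silently upcasts to float64 to hold the NaNs — another reason to inspect indices before arithmetic.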


Vectorized ops, broadcasting, and performance notes

Pandas operations are often vectorized — behind the scenes they're calling NumPy. So your existing knowledge of broadcasting applies. But remember:

  • Pandas may create copies when aligning indices; this can cost memory and time.
  • Use .to_numpy() (preferred over the older .values attribute) when you need raw NumPy performance in tight loops, but beware: you lose label safety.
  • For expression-heavy workloads, try pd.eval() or df.eval() — these can use NumExpr under the hood, which you saw earlier as a performance booster for numerical expressions.

Example:

# vectorized column arithmetic
df['net'] = df['salary'] * 0.8 - df['age'] * 100

# using eval for potentially faster evaluation
df.eval('net = salary * 0.8 - age * 100', inplace=True)

Handling missing data (the polite way to say "NaN party")

Pandas treats missing values explicitly with NaN or NaT. Methods to handle missingness:

  • df.dropna() — remove rows/cols with missing values
  • df.fillna(value) — fill with a value
  • df.isna(), df.notna() — boolean masks for missingness

A pragmatic pattern:

  1. Inspect df.isna().sum() to find culprits
  2. Decide remove vs impute (domain knowledge helps)
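That inspect-then-decide loop might look like this sketch (the age/salary columns are illustrative, and median imputation is just one common default — domain knowledge should drive the choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 22, 31],
    'salary': [50000, 60000, np.nan, 45000],
})

# Step 1: find the culprits — one missing value per column here
print(df.isna().sum())

# Step 2: impute numeric columns with their median,
# or drop rows instead when too much is missing
cleaned = df.fillna(df.median(numeric_only=True))
print(cleaned.isna().sum().sum())   # 0 — nothing missing remains
```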

Dtypes and memory: bring NumPy's memory lessons

Columns have dtypes: int64, float64, object, category, datetime64, etc. Object dtype is slow and memory-heavy — it's the ragtag union of Python objects.

Tips:

  • Convert text columns with few unique values to category to save memory (like replacing an object array with an integer-backed categorical).
  • Use smaller numeric dtypes if values permit (int32, float32).
  • When reading large files, set dtypes explicitly (e.g. the dtype= argument to read_csv) or load only the columns you need with usecols.

This ties back to what you learned about memory layout: structured, contiguous numeric blocks are faster. Pandas tries to store columns in homogeneous blocks; minimize object dtypes.
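A hedged sketch of the category conversion (the city names are made up, and exact byte counts vary by pandas version and platform — the point is the direction of the saving):

```python
import pandas as pd

# A text column with few unique values repeated many times:
# a classic candidate for the category dtype
city = pd.Series(['NYC', 'LA', 'NYC', 'SF'] * 1000)

as_object = city.memory_usage(deep=True)
as_category = city.astype('category').memory_usage(deep=True)

# The categorical stores small integer codes plus one copy of each
# unique string, so it comes out much smaller here
print(as_object > as_category)   # True
```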


Common idioms that make you look like you know what you're doing

  • df.head(), df.tail() — peek at the data
  • df.describe() — quick numeric summary
  • df.info() — dtypes and memory usage
  • df['col'].value_counts() — frequency counts
  • df.sort_values('col') — order it up

Why this matters: efficient exploration is 80% of data science. Pandas is your Swiss Army knife for this.
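Run against a toy frame (column names invented for the sketch), those idioms look like:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'carol', 'bob'],
    'score': [88, 72, 95, 72],
})

print(df.head(2))                    # peek at the first two rows
print(df['name'].value_counts())     # 'bob' appears twice
print(df.describe())                 # count/mean/std/quartiles of 'score'

# sort descending, then grab the top scorer's name
top = df.sort_values('score', ascending=False).iloc[0]['name']
print(top)   # 'carol'
```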


Mini example: combine everything

# Suppose you have a NumPy array of monthly sales
sales = np.array([100, 200, 150, np.nan])
months = ['Jan', 'Feb', 'Mar', 'Apr']

s = pd.Series(sales, index=months, name='sales')

# Fill missing data
s_filled = s.fillna(s.mean())

# Make a DataFrame of product A and B
df = pd.DataFrame({
    'productA': s_filled,
    'productB': s_filled * 0.9
})

# Add computed column using eval (fast)
df.eval('total = productA + productB', inplace=True)

# Grouping and quick summary
print(df.describe())

This shows: create from NumPy, handle NaN, vectorized ops, and eval (hello NumExpr benefits).


Key takeaways (aka the little list you’ll refer to at 2am)

  • Series = 1D labeled array, DataFrame = 2D labeled table. Labels matter.
  • Pandas leverages NumPy under the hood — memory layout and dtype choices still shape performance.
  • Use .loc/.iloc for predictable indexing; expect alignment during arithmetic.
  • Convert heavy object columns to category and use smaller numeric dtypes where appropriate.
  • For expression-heavy operations, consider df.eval() to leverage NumExpr speedups you learned earlier.

"Treat labels like contracts: they promise alignment. Break them only with intention."


Where to go next

Next: deeper groupby patterns, joins/merges, time-series tricks, and IO (efficient formats: parquet, feather — those are your friends after CSV). Also practice: take a messy CSV, load with pandas, inspect dtypes, convert types, and run a few vectorized transforms.

Happy data wrangling. Pandas will save your time, but it won't rescue terrible column names. Rename them.
