Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Reading CSV and Excel
Read CSV and Excel Files with pandas — Practical, Fast, and a Little Sassy
"You already learned NumPy vectorization — now stop feeding pandas strings and expect it to be fast."
You know how NumPy taught you to think in arrays — broadcasting, vectorized ops, and dtype discipline? Reading files with pandas is the moment you either continue that good habit or fall into the slow, memory-hungry swamp of object dtype and stringy numbers. This guide shows how to read CSV and Excel files the pandas way while preserving your NumPy gains.
Why this matters (short and punchy)
- CSV/Excel are the gates through which most real-world data enters your pipeline.
- If you read them poorly, all your downstream NumPy vectorization and performance tips from the previous module go to waste.
- Read smart: specify dtypes, parse dates, use chunks, and you’ll stay fast and sane.
Quick checklist before reading files
- Do you know the data types in the file? If not, inspect a sample first.
- Will the file fit into memory? If not, plan for chunksize or use a database.
- Are there date columns? Use parse_dates.
- Excel file with multiple sheets? Plan sheet_name.
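Before committing to a full read, it helps to peek at a few rows and see what pandas guesses. A minimal sketch, using an in-memory buffer as a stand-in for a real file (any CSV path works the same way):

```python
import io

import pandas as pd

# Stand-in for a file on disk; pd.read_csv accepts a path the same way.
csv_text = "user_id,signup_date,score\n1,2024-01-05,3.5\n2,2024-02-10,4.0\n"

# Peek at a handful of rows first: nrows caps how much gets parsed.
sample = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(sample.dtypes)  # see what pandas guessed before a full read
```

If `signup_date` comes back as `object` and `user_id` as `int64`, you know exactly which columns need `parse_dates` and `dtype` in the real read.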
Reading CSV files (the meat)
Basic usage
import pandas as pd
# simple read
df = pd.read_csv('data.csv')
But that’s the naive approach. Here’s how to read like a pro.
Important parameters and why they matter
- dtype — explicit types prevent pandas from guessing and using object dtype.
- Example: dtype={'id': 'int64', 'amount': 'float32'}
- parse_dates — parse columns as datetimes at read time (pass a single column name or a list of names).
- usecols — read only necessary columns — huge memory saver.
- chunksize or iterator — stream rows in chunks for big files.
- na_values — custom missing value tokens.
- engine='c' — the default C engine is fast; pandas falls back to the slower python engine for parsing features the C engine can't handle.
- compression — pandas infers gzip/zip/bz2/xz from the file extension; you can also pass compression='gzip' explicitly.
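Here is na_values in action, sketched with an in-memory CSV standing in for a file. The tokens '-' and 'missing' are hypothetical examples; they are not in pandas' default NA list, so without na_values they would drag the whole column into object dtype:

```python
import io

import pandas as pd

# '-' and 'missing' are not default NA tokens; left alone they would
# force 'temp' to object dtype instead of a clean numeric column.
raw = io.StringIO("city,temp\nOslo,-\nRome,missing\nLima,21\n")
df = pd.read_csv(raw, na_values=['-', 'missing'])
```

With the tokens declared, the column parses straight to float64 with proper NaNs, ready for vectorized math.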
Example: typed, date-parsed, columns-only read
dtype_map = {'user_id': 'int32', 'score': 'float32', 'category': 'category'}
parse = ['signup_date']
use = ['user_id', 'signup_date', 'score']
df = pd.read_csv('large.csv', dtype=dtype_map, parse_dates=parse, usecols=use)
Why 'category'? Because categorical dtype is a memory superhero for repeated strings — think labels.
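A quick sketch of the savings, reading the same repetitive column twice (in-memory data, illustrative column names):

```python
import io

import pandas as pd

# Same column read twice: once as plain strings, once as categorical.
csv_text = "label\n" + "\n".join(["alpha", "beta", "gamma"] * 1000)
as_object = pd.read_csv(io.StringIO(csv_text))
as_category = pd.read_csv(io.StringIO(csv_text), dtype={'label': 'category'})

# deep=True counts the actual string payloads, not just pointers.
obj_bytes = as_object['label'].memory_usage(deep=True)
cat_bytes = as_category['label'].memory_usage(deep=True)
```

The categorical column stores each distinct string once plus small integer codes, so with only three labels repeated thousands of times the footprint shrinks dramatically.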
Streaming large CSVs and using NumPy tricks
If the file is too big for memory, process it in chunks and apply vectorized NumPy ops per chunk.
import numpy as np
chunks = pd.read_csv('huge.csv', chunksize=100_000, dtype={'value': 'float32'})
accum = 0.0
count = 0
for chunk in chunks:
    arr = chunk['value'].to_numpy()  # fast NumPy array
    accum += np.sum(arr)
    count += arr.size
mean = accum / count
This keeps memory usage bounded and uses NumPy for the heavy lifting — remember that sweet vectorization from earlier? Use it.
Reading Excel files (because spreadsheets are still a thing)
Basic usage
# default engine uses openpyxl for .xlsx
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
Excel-specific tips
- sheet_name accepts a string, int, list, or None. If None, you get a dict of DataFrames for all sheets.
- engine — use 'openpyxl' for modern .xlsx files; 'xlrd' now handles only legacy .xls (xlrd 2.0 dropped xlsx support).
- dtype works in read_excel too, but Excel cells carry their own types, so per-column converters are often the more reliable route for custom parsing.
- usecols and skiprows work here too.
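A sketch of converters: it maps column names to callables applied to each raw cell. It's demonstrated here with read_csv so the example runs without an .xlsx file on hand; pd.read_excel accepts the same parameter:

```python
import io

import pandas as pd

# converters maps column name -> callable applied to each raw cell value.
# Shown with read_csv so no .xlsx file is needed; pd.read_excel
# accepts the same converters parameter.
raw = io.StringIO('item,amount\nwidget,"$1,200"\ngadget,$350\n')
clean = pd.read_csv(
    raw,
    converters={'amount': lambda s: float(s.replace('$', '').replace(',', ''))},
)
```

The callable runs once per cell, so keep it cheap; for heavy cleaning, read as strings and vectorize afterwards.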
Example: multiple sheets
sheets = pd.read_excel('multi_sheet.xlsx', sheet_name=['Jan', 'Feb'])
# sheets is a dict: {'Jan': DataFrame, 'Feb': DataFrame}
When Excel bites back
- Excel files can have merged cells, header rows spread across multiple rows, or notes in the top rows. Use skiprows, header, and usecols to cleanly extract the table.
- If types are messy, read columns as strings and coerce later with pandas/NumPy after cleaning.
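The skiprows trick looks like this; the sketch uses read_csv on in-memory text with made-up note lines, but header and skiprows behave the same in pd.read_excel:

```python
import io

import pandas as pd

# Two note lines above the real header, as spreadsheet exports often have.
messy = io.StringIO(
    "Report generated 2024-03-01\n"
    "Internal use only\n"
    "id,value\n"
    "1,10\n"
    "2,20\n"
)
# skiprows=2 discards the notes before the header is read;
# the same parameters work in pd.read_excel.
df = pd.read_csv(messy, skiprows=2)
```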
Common pitfalls (and how to not be that person)
- Reading everything without dtype -> object dtype hell (slow, no vectorization).
- Letting pandas guess integers with missing values -> float64 result. Use nullable dtypes like 'Int64' (pandas >= 1.0) or specify converters.
- Parsing dates post-hoc with apply(pd.to_datetime) on strings -> slow. Prefer parse_dates at read time.
"Why do people keep misunderstanding this?" — because files often look clean, but the type guessing is lazy and expensive.
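The integer-with-gaps pitfall in miniature, using in-memory data: default inference silently promotes the column to float64, while the nullable 'Int64' dtype (capital I) keeps integers and represents the gap as pd.NA:

```python
import io

import pandas as pd

raw = "id,qty\n1,5\n2,\n3,7\n"

# Default inference: the missing value forces qty to float64.
guessed = pd.read_csv(io.StringIO(raw))

# 'Int64' (capital I) is pandas' nullable integer dtype (pandas >= 1.0):
# the values stay integers and the gap becomes pd.NA.
nullable = pd.read_csv(io.StringIO(raw), dtype={'qty': 'Int64'})
```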
Handy recipes (copy–paste ready)
- Read compressed CSV directly:
df = pd.read_csv('data.csv.gz', compression='gzip')
- Read only first N rows for quick inspection:
sample = pd.read_csv('data.csv', nrows=100)
- Convert problematic numeric columns safely:
df = pd.read_csv('data.csv', dtype={'amount': 'str'})
# clean
df['amount'] = pd.to_numeric(df['amount'].str.replace(',', ''), errors='coerce')
- Read Excel and stack sheets into one DataFrame with an indicator:
all_sheets = pd.read_excel('multi.xlsx', sheet_name=None)
df = pd.concat(all_sheets.values(), keys=all_sheets.keys(), names=['sheet', 'row'])
Performance checklist (short)
- Specify dtype where possible.
- Use parse_dates for date columns.
- Use usecols to limit columns.
- For very large files: use chunksize, process chunk-by-chunk with NumPy operations, then aggregate.
- Convert repetitive strings to categorical dtype.
Closing: key takeaways
- Reading is part of performance. If you read badly, no amount of NumPy trickery saves you.
- Explicit is fast. Tell pandas your dtypes and it won’t waste time guessing.
- Stream large files in chunks and use .to_numpy() to apply fast NumPy vectorized functions.
- Excel needs extra care. Use converters and sheet_name wisely; sometimes exporting to CSV is the saner path.
"This is the moment where the concept finally clicks: read smart, keep dtypes tidy, and let NumPy do the heavy math."
Quick summary (one-sentence wisdom)
When ingesting data with pandas, be intentional: specify types, parse dates, read only what you need, and process big files in chunks so your NumPy-accelerated pipelines stay fast and memory-friendly.
Tags: pandas, reading-csv-excel, dtype, chunksize