Data Analysis with pandas
Manipulate and analyze tabular data using pandas for indexing, joins, time series, and robust I/O.
Reading CSV and Excel
Read CSV and Excel Files with pandas — Practical, Fast, and a Little Sassy
"You already learned NumPy vectorization — now stop feeding pandas strings and expect it to be fast."
You know how NumPy taught you to think in arrays — broadcasting, vectorized ops, and dtype discipline? Reading files with pandas is the moment you either continue that good habit or fall into the slow, memory-hungry swamp of object dtype and stringy numbers. This guide shows how to read CSV and Excel files the pandas way while preserving your NumPy gains.
Why this matters (short and punchy)
- CSV/Excel are the gates through which most real-world data enters your pipeline.
- If you read them poorly, all your downstream NumPy vectorization and performance tips from the previous module go to waste.
- Read smart: specify dtypes, parse dates, use chunks, and you’ll stay fast and sane.
Quick checklist before reading files
- Do you know the data types in the file? If not, inspect a sample first.
- Will the file fit into memory? If not, plan for chunksize or use a database.
- Are there date columns? Use parse_dates.
- Excel file with multiple sheets? Plan sheet_name.
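Before committing to a full read, it helps to peek at a few rows and see what pandas guesses. A minimal sketch, using an in-memory buffer as a stand-in for a real file (any CSV path works the same way):

```python
import io

import pandas as pd

# Stand-in for a file on disk; pd.read_csv accepts a path the same way.
csv_text = "user_id,signup_date,score\n1,2024-01-05,3.5\n2,2024-02-10,4.0\n"

# Peek at a handful of rows first: nrows caps how much gets parsed.
sample = pd.read_csv(io.StringIO(csv_text), nrows=5)
print(sample.dtypes)  # see what pandas guessed before a full read
```

If `signup_date` comes back as `object` and `user_id` as `int64`, you know exactly which columns need `parse_dates` and `dtype` in the real read.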
Reading CSV files (the meat)
Basic usage
import pandas as pd
# simple read
df = pd.read_csv('data.csv')
But that’s the naive approach. Here’s how to read like a pro.
Important parameters and why they matter
- dtype — explicit types prevent pandas from guessing and using object dtype.
- Example: dtype={'id': 'int64', 'amount': 'float32'}
- parse_dates — parse columns as datetimes at read time (pass a single column name or a list of names).
- usecols — read only necessary columns — huge memory saver.
- chunksize or iterator — stream rows in chunks for big files.
- na_values — custom missing value tokens.
- engine='c' — the default C engine is fast; pandas falls back to the slower python engine for parsing features the C engine can't handle.
- compression — pandas infers gzip/zip/bz2/xz from the file extension; you can also pass compression='gzip' explicitly.
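Here is na_values in action, sketched with an in-memory CSV standing in for a file. The tokens '-' and 'missing' are hypothetical examples; they are not in pandas' default NA list, so without na_values they would drag the whole column into object dtype:

```python
import io

import pandas as pd

# '-' and 'missing' are not default NA tokens; left alone they would
# force 'temp' to object dtype instead of a clean numeric column.
raw = io.StringIO("city,temp\nOslo,-\nRome,missing\nLima,21\n")
df = pd.read_csv(raw, na_values=['-', 'missing'])
```

With the tokens declared, the column parses straight to float64 with proper NaNs, ready for vectorized math.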
Example: typed, date-parsed, columns-only read
dtype_map = {'user_id': 'int32', 'score': 'float32', 'category': 'category'}
parse = ['signup_date']
use = ['user_id', 'signup_date', 'score']
df = pd.read_csv('large.csv', dtype=dtype_map, parse_dates=parse, usecols=use)
Why 'category'? Because categorical dtype is a memory superhero for repeated strings — think labels.
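A quick sketch of the savings, reading the same repetitive column twice (in-memory data, illustrative column names):

```python
import io

import pandas as pd

# Same column read twice: once as plain strings, once as categorical.
csv_text = "label\n" + "\n".join(["alpha", "beta", "gamma"] * 1000)
as_object = pd.read_csv(io.StringIO(csv_text))
as_category = pd.read_csv(io.StringIO(csv_text), dtype={'label': 'category'})

# deep=True counts the actual string payloads, not just pointers.
obj_bytes = as_object['label'].memory_usage(deep=True)
cat_bytes = as_category['label'].memory_usage(deep=True)
```

The categorical column stores each distinct string once plus small integer codes, so with only three labels repeated thousands of times the footprint shrinks dramatically.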
Streaming large CSVs and using NumPy tricks
If the file is too big for memory, process it in chunks and apply vectorized NumPy ops per chunk.
import numpy as np
chunks = pd.read_csv('huge.csv', chunksize=100_000, dtype={'value': 'float32'})
accum = 0.0
count = 0
for chunk in chunks:
    arr = chunk['value'].to_numpy()  # fast NumPy array
    accum += np.sum(arr)
    count += arr.size
mean = accum / count
This keeps memory usage bounded and uses NumPy for the heavy lifting — remember that sweet vectorization from earlier? Use it.
Reading Excel files (because spreadsheets are still a thing)
Basic usage
# default engine uses openpyxl for .xlsx
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
Excel-specific tips
- sheet_name accepts a string, int, list, or None. If None, you get a dict of DataFrames for all sheets.
- engine — use 'openpyxl' for modern .xlsx files; 'xlrd' now handles only legacy .xls (xlrd 2.0 dropped xlsx support).
- dtype works in read_excel too, but Excel cells carry their own types, so per-column converters are often the more reliable route for custom parsing.
- usecols and skiprows work here too.
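A sketch of converters: it maps column names to callables applied to each raw cell. It's demonstrated here with read_csv so the example runs without an .xlsx file on hand; pd.read_excel accepts the same parameter:

```python
import io

import pandas as pd

# converters maps column name -> callable applied to each raw cell value.
# Shown with read_csv so no .xlsx file is needed; pd.read_excel
# accepts the same converters parameter.
raw = io.StringIO('item,amount\nwidget,"$1,200"\ngadget,$350\n')
clean = pd.read_csv(
    raw,
    converters={'amount': lambda s: float(s.replace('$', '').replace(',', ''))},
)
```

The callable runs once per cell, so keep it cheap; for heavy cleaning, read as strings and vectorize afterwards.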
Example: multiple sheets
sheets = pd.read_excel('multi_sheet.xlsx', sheet_name=['Jan', 'Feb'])
# sheets is a dict: {'Jan': DataFrame, 'Feb': DataFrame}
When Excel bites back
- Excel files can have merged cells, header rows spread across multiple rows, or notes in the top rows. Use skiprows, header, and usecols to cleanly extract the table.
- If types are messy, read columns as strings and coerce later with pandas/NumPy after cleaning.
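The skiprows trick looks like this; the sketch uses read_csv on in-memory text with made-up note lines, but header and skiprows behave the same in pd.read_excel:

```python
import io

import pandas as pd

# Two note lines above the real header, as spreadsheet exports often have.
messy = io.StringIO(
    "Report generated 2024-03-01\n"
    "Internal use only\n"
    "id,value\n"
    "1,10\n"
    "2,20\n"
)
# skiprows=2 discards the notes before the header is read;
# the same parameters work in pd.read_excel.
df = pd.read_csv(messy, skiprows=2)
```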
Common pitfalls (and how to not be that person)
- Reading everything without dtype -> object dtype hell (slow, no vectorization).
- Letting pandas guess integers with missing values -> float64 result. Use nullable dtypes like 'Int64' (pandas >= 1.0) or specify converters.
- Parsing dates post-hoc with apply(pd.to_datetime) on strings -> slow. Prefer parse_dates at read time.
"Why do people keep misunderstanding this?" — because files often look clean, but the type guessing is lazy and expensive.
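The integer-with-gaps pitfall in miniature, using in-memory data: default inference silently promotes the column to float64, while the nullable 'Int64' dtype (capital I) keeps integers and represents the gap as pd.NA:

```python
import io

import pandas as pd

raw = "id,qty\n1,5\n2,\n3,7\n"

# Default inference: the missing value forces qty to float64.
guessed = pd.read_csv(io.StringIO(raw))

# 'Int64' (capital I) is pandas' nullable integer dtype (pandas >= 1.0):
# the values stay integers and the gap becomes pd.NA.
nullable = pd.read_csv(io.StringIO(raw), dtype={'qty': 'Int64'})
```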
Handy recipes (copy–paste ready)
- Read compressed CSV directly:
df = pd.read_csv('data.csv.gz', compression='gzip')
- Read only first N rows for quick inspection:
sample = pd.read_csv('data.csv', nrows=100)
- Convert problematic numeric columns safely:
df = pd.read_csv('data.csv', dtype={'amount': 'str'})
# clean
df['amount'] = pd.to_numeric(df['amount'].str.replace(',', ''), errors='coerce')
- Read Excel and stack sheets into one DataFrame with an indicator:
all_sheets = pd.read_excel('multi.xlsx', sheet_name=None)
df = pd.concat(all_sheets.values(), keys=all_sheets.keys(), names=['sheet', 'row'])
Performance checklist (short)
- Specify dtype where possible.
- Use parse_dates for date columns.
- Use usecols to limit columns.
- For very large files: use chunksize, process chunk-by-chunk with NumPy operations, then aggregate.
- Convert repetitive strings to categorical dtype.
Closing: key takeaways
- Reading is part of performance. If you read badly, no amount of NumPy trickery saves you.
- Explicit is fast. Tell pandas your dtypes and it won’t waste time guessing.
- Stream large files in chunks and use .to_numpy() to apply fast NumPy vectorized functions.
- Excel needs extra care. Use converters and sheet_name wisely; sometimes exporting to CSV is the saner path.
"This is the moment where the concept finally clicks: read smart, keep dtypes tidy, and let NumPy do the heavy math."
Quick summary (one-sentence wisdom)
When ingesting data with pandas, be intentional: specify types, parse dates, read only what you need, and process big files in chunks so your NumPy-accelerated pipelines stay fast and memory-friendly.
Tags: pandas, reading-csv-excel, dtype, chunksize