
Python for Data Science, AI & Development
Data Sources, Engineering, and Deployment

Acquire data from files, web, and databases; then test, package, version, and deploy reliable services.

Working with Files and Formats — Practical Guide for Data Scientists

This is the moment where file formats stop feeling like boring admin work and start saving you GPU cycles, deployment headaches, and sanity.

If you just trained a model on GPUs and experimented with heavy data augmentation, congratulations: you peeked behind the curtain and saw that data is the other half of performance. Now we focus on the plumbing: how your data actually lives on disk and moves into memory, into the GPU, and into production.


Why files and formats matter (short version)

  • Speed: reading a 100 GB CSV with pandas is a different kind of suffering than reading a 100 GB Parquet file with pyarrow.
  • Reproducibility: formats capture schema/metadata and make pipelines reliable.
  • Memory & IO: encoding, compression, and layout determine whether you can stream data or must fit it in RAM.
  • Deployment: serving expects small, validated payloads; format choices affect latency and cost.

You already know model architecture and GPU tricks from earlier sections. Now learn how to feed them data reliably and fast.


Common formats and when to use them

Tabular

  • CSV: universal, human-readable. Use for small data, demos. Avoid for big datasets due to parsing cost and lack of schema.
  • Parquet / Feather (columnar): best for analytics and big data. Columnar storage + compression = faster column reads and lower IO.
  • HDF5: hierarchical, good for scientific arrays and metadata.
  • TFRecord / RecordIO / Avro: binary, schema-driven; common for streaming and production ML pipelines.

Binary & scientific

  • NumPy .npy / .npz: arrays, fast for numeric workloads.
  • HDF5: multi-dataset storage, chunking and compression.

Images, Video, Audio

  • JPEG / PNG: JPEG is lossy (smaller files); PNG is lossless (larger files). For training, JPEG decoding is often the CPU bottleneck; consider TFRecord or WebDataset to prepack decoded tensors.
  • WebP / AVIF: newer image formats with better compression, but toolchain support varies.
  • MP4 / WAV: usual suspects for video and audio.

Serialization

  • JSON: great for small structured payloads and APIs; not for huge datasets.
  • Pickle: Python-native but insecure and brittle across versions; avoid for cross-system storage.
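A quick round-trip shows why JSON is the safer default for interchange (the field names here are made up for illustration):

```python
import json

record = {"user_id": 42, "scores": [0.91, 0.87], "model": "v3"}

# JSON round-trips cleanly across languages and services
payload = json.dumps(record)
restored = json.loads(payload)
assert restored == record

# Pickle, by contrast, can execute arbitrary code on load: never call
# pickle.loads() on bytes from an untrusted source.
```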

Practical Python patterns (code-first, minimal fluff)

Read CSV in chunks to avoid OOM

import pandas as pd
for chunk in pd.read_csv('big.csv', chunksize=100_000, dtype={'id': 'int32'}):
    process(chunk)  # placeholder: your per-chunk logic

Tips: specify dtypes, use parse_dates selectively, and set low_memory=False when types are mixed.

Parquet via pyarrow / pandas

import pandas as pd
# write
df.to_parquet('dataset.parquet', engine='pyarrow', compression='snappy')
# read
df = pd.read_parquet('dataset.parquet', engine='pyarrow')

Parquet keeps schema and is fast for column filtering. Great for model feature stores and analytics.

Memory map large numpy arrays

import numpy as np
# Use np.load with mmap_mode for .npy files: it parses the header and
# memory-maps the data; a raw np.memmap would read the header bytes as data.
arr = np.load('big.npy', mmap_mode='r')
# slice without loading the whole file
batch = arr[:1024]

HDF5 for hierarchical data

import h5py
with h5py.File('data.h5', 'r') as f:
    images = f['images']  # lazy access
    img0 = images[0]

HDF5 supports chunking and compression; good for fixed-shape tensors.

Packing images for efficient training

  • WebDataset / tar files: store files in large tar shards and stream with multiple workers. Great for distributed training.
  • TFRecord or custom binary formats: store preprocessed tensors to avoid costly decode/resize at training time.

Example: using torchvision + WebDataset to stream from tar shards is common for large-scale GPU training.
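The sharding idea itself needs nothing beyond the standard library. A minimal sketch that packs in-memory samples into fixed-size tar shards (the shard size and naming scheme here are arbitrary choices):

```python
import io
import os
import tarfile
import tempfile

def write_shards(samples, out_dir, shard_size=2):
    """Pack (name, bytes) samples into fixed-size tar shards."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    for i in range(0, len(samples), shard_size):
        path = os.path.join(out_dir, f'shard-{i // shard_size:05d}.tar')
        with tarfile.open(path, 'w') as tar:
            for name, data in samples[i:i + shard_size]:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        shard_paths.append(path)
    return shard_paths

# five fake "images" -> three shards of at most two files each
samples = [(f'img{i}.jpg', bytes(64)) for i in range(5)]
shards = write_shards(samples, tempfile.mkdtemp())
```

Training loaders then stream each shard sequentially, which turns millions of tiny random reads into a few large sequential ones.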


Engineering for performance and reliability

1) Schema & validation

  • Define a schema for your dataset (column types, ranges, required fields).
  • Use Great Expectations or pydantic for validation during ingestion.

Why: early detection of corrupted files or drift prevents silent model rot.
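A hand-rolled sketch of the idea; in practice you would express the same rules as pydantic models or Great Expectations suites (the field names and rules below are purely illustrative):

```python
# schema: field -> (expected type, range/content check)
SCHEMA = {
    'id':    (int,   lambda v: v >= 0),
    'price': (float, lambda v: v >= 0.0),
    'label': (str,   lambda v: len(v) > 0),
}

def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    for field, (typ, check) in SCHEMA.items():
        if field not in row:
            errors.append(f'missing field: {field}')
        elif not isinstance(row[field], typ):
            errors.append(f'{field}: expected {typ.__name__}')
        elif not check(row[field]):
            errors.append(f'{field}: failed range check')
    return errors

ok = validate_row({'id': 1, 'price': 9.5, 'label': 'cat'})      # []
bad = validate_row({'id': -1, 'price': 'free'})                 # 3 errors
```

Running checks like these at ingestion time, and failing loudly, is what keeps bad files out of training sets.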

2) Compression & CPU tradeoffs

  • Snappy or LZ4 give fast decompression and decent compression; gzip has smaller files but slower CPU decompress.
  • Columnar formats + compression reduce IO and sometimes overall time despite CPU usage.
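Snappy and LZ4 need third-party packages, but the stdlib's gzip levels illustrate the same speed-versus-size dial:

```python
import gzip

# highly repetitive sample data compresses well at any level
data = b'timestamp,value\n' + b'2024-01-01,3.14\n' * 10_000

fast = gzip.compress(data, compresslevel=1)   # faster, usually larger
small = gzip.compress(data, compresslevel=9)  # slower, usually smaller

ratio_fast = len(fast) / len(data)
ratio_small = len(small) / len(data)
```

The same tradeoff applies when picking `compression=` for Parquet or HDF5: pay CPU to save IO, or vice versa, depending on where your pipeline is bottlenecked.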

3) Streaming, chunking, and parallel IO

  • Use chunksize or iterators when data > RAM.
  • For many small files (images), prefer sharding (tar/zip) to avoid filesystem overhead.
  • Use multithreaded decoders or GPU-accelerated decoding where available.
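Chunked iteration is easy to roll by hand (Python 3.12 ships `itertools.batched` for exactly this; below is a portable sketch):

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items without materializing everything."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

stream = range(10)  # stand-in for a lazy record source
sizes = [len(b) for b in batched(stream, 4)]  # [4, 4, 2]
```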

4) Metadata & versioning

  • Store schema, data version, and provenance. Use DVC or a data catalog for versioned datasets.
  • Keep a manifest file listing shards and checksums.
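A minimal manifest builder using only the standard library (the manifest layout below is an assumption for illustration, not a standard format):

```python
import hashlib
import json
import os
import tempfile

def build_manifest(shard_dir, version='v1'):
    """List shards with SHA-256 checksums so pipelines can detect corruption."""
    entries = []
    for name in sorted(os.listdir(shard_dir)):
        path = os.path.join(shard_dir, name)
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        entries.append({'file': name,
                        'sha256': digest,
                        'bytes': os.path.getsize(path)})
    return {'version': version, 'shards': entries}

# demo with two fake shards
d = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(d, f'shard-{i}.tar'), 'wb') as f:
        f.write(bytes(16))

manifest = build_manifest(d)
print(json.dumps(manifest, indent=2))
```

Checking a shard's hash against the manifest before training catches silent corruption and partial uploads early.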

5) Security and privacy

  • Never pickle untrusted files. Validate and sanitize inputs in production.
  • For sensitive data, encrypt at rest and follow access controls.

Deployment-specific considerations

  • For low-latency model serving, prefer compact payloads. Use JSON for small structured requests and avoid base64-encoded images for non-trivial payloads (base64 inflates size by about 33% and adds encode/decode overhead). Prefer URLs or multipart uploads.
  • If serving directly from blob storage, generate signed URLs and stream content into the model server rather than embedding large binary payloads.
  • Prepack model inputs in the same format as training (e.g., TFRecord or normalized numpy arrays) to avoid preprocessing mismatch.
  • When pushing models to edge devices, choose lightweight formats (ONNX for models, protobuf or flatbuffers for data).

Relate this back to what you learned earlier: if your GPU pipeline is starved because your image decoding is single-threaded, no amount of mixed precision or larger batch sizes will fix it. File formats and IO are the other half of performance tuning.


Common pitfalls

  • Unexpected encodings or newline styles in CSVs.
  • Missing schema information causing type inference to be inconsistent across runs.
  • Using pickle for cross-service data exchange.
  • Packing millions of tiny files into cloud object storage without sharding.

Key takeaways

  • Choose the right format: CSV for small and simple; Parquet/Feather for big tabular data; TFRecord/WebDataset for large ML workloads; HDF5 for scientific arrays.
  • Optimize IO, not just compute: preprocess and pack data to reduce CPU decode time and GPU idle time.
  • Validate and version: schema + manifest + checksums = reproducible pipelines.
  • Design for deployment: small payloads, predictable schemas, and secure serialization.

Remember: a fast model with a slow data pipeline is like a racecar stuck in traffic. Fix the roads (files and formats), and the car finally gets to race.


Want a quick workflow checklist?

  1. Pick format: small demo -> CSV/JSON; production -> Parquet/TFRecord/WebDataset.
  2. Define schema and validation tests.
  3. Use chunking or memmap for large files.
  4. Compress with fast codecs (snappy/lz4) when using columnar formats.
  5. Shard many-small-file datasets into tar/record shards for training.
  6. Match training and serving formats to avoid runtime surprises.

Happy data engineering — may your IO be fast and your GPUs never starve.
