
Python for Data Science, AI & Development
Data Sources, Engineering, and Deployment

Acquire data from files, web, and databases; then test, package, version, and deploy reliable services.

Working with Files and Formats — Practical Guide for Data Scientists

This is the moment where file formats stop feeling like boring admin work and start saving you GPU cycles, deployment headaches, and sanity.

If you just trained a model on GPUs and experimented with heavy data augmentation, congratulations: you peeked behind the curtain and saw that data is the other half of performance. Now we focus on the plumbing: how your data actually lives on disk and moves into memory, into the GPU, and into production.


Why files and formats matter (short version)

  • Speed: reading a 100 GB CSV with pandas is a different kind of suffering than reading a 100 GB Parquet file with pyarrow.
  • Reproducibility: formats capture schema/metadata and make pipelines reliable.
  • Memory & IO: encoding, compression, and layout determine whether you can stream data or must fit it in RAM.
  • Deployment: serving expects small, validated payloads; format choices affect latency and cost.

You already know model architecture and GPU tricks from earlier sections. Now learn how to feed them data reliably and fast.


Common formats and when to use them

Tabular

  • CSV: universal, human-readable. Use for small data, demos. Avoid for big datasets due to parsing cost and lack of schema.
  • Parquet / Feather (columnar): best for analytics and big data. Columnar storage + compression = faster column reads and lower IO.
  • HDF5: hierarchical, good for scientific arrays and metadata.
  • TFRecord / RecordIO / Avro: binary, schema-driven; common for streaming and production ML pipelines.

Binary & scientific

  • NumPy .npy / .npz: arrays, fast for numeric workloads.
  • HDF5: multi-dataset storage, chunking and compression.

Images, Video, Audio

  • JPEG / PNG: JPEG is lossy (smaller files); PNG is lossless (larger files). For training, JPEG decoding is often the CPU bottleneck; consider TFRecord or WebDataset to prepack decoded tensors.
  • WebP / AVIF: newer image formats with better compression, but toolchain support varies.
  • MP4 / WAV: usual suspects for video and audio.

Serialization

  • JSON: great for small structured payloads and APIs; not for huge datasets.
  • Pickle: Python-native but insecure and brittle across versions; avoid for cross-system storage.
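A quick round-trip shows why JSON is the safer default for interchange (the field names here are made up for illustration):

```python
import json

record = {"user_id": 42, "scores": [0.91, 0.87], "model": "v3"}

# JSON round-trips cleanly across languages and services
payload = json.dumps(record)
restored = json.loads(payload)
assert restored == record

# Pickle, by contrast, can execute arbitrary code on load: never call
# pickle.loads() on bytes from an untrusted source.
```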

Practical Python patterns (code-first, minimal fluff)

Read CSV in chunks to avoid OOM

import pandas as pd
for chunk in pd.read_csv('big.csv', chunksize=100_000, dtype={'id': 'int32'}):
    process(chunk)  # placeholder: your per-chunk logic

Tips: specify dtypes, use parse_dates selectively, and set low_memory=False when types are mixed.

Parquet via pyarrow / pandas

import pandas as pd
# write
df.to_parquet('dataset.parquet', engine='pyarrow', compression='snappy')
# read
df = pd.read_parquet('dataset.parquet', engine='pyarrow')

Parquet keeps schema and is fast for column filtering. Great for model feature stores and analytics.

Memory map large numpy arrays

import numpy as np
# Use np.load with mmap_mode for .npy files: it parses the header and
# memory-maps the data; a raw np.memmap would read the header bytes as data.
arr = np.load('big.npy', mmap_mode='r')
# slice without loading the whole file
batch = arr[:1024]

HDF5 for hierarchical data

import h5py
with h5py.File('data.h5', 'r') as f:
    images = f['images']  # lazy access
    img0 = images[0]

HDF5 supports chunking and compression; good for fixed-shape tensors.

Packing images for efficient training

  • WebDataset / tar files: store files in large tar shards and stream with multiple workers. Great for distributed training.
  • TFRecord or custom binary formats: store preprocessed tensors to avoid costly decode/resize at training time.

Example: using torchvision + WebDataset to stream from tar shards is common for large-scale GPU training.
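The sharding idea itself needs nothing beyond the standard library. A minimal sketch that packs in-memory samples into fixed-size tar shards (the shard size and naming scheme here are arbitrary choices):

```python
import io
import os
import tarfile
import tempfile

def write_shards(samples, out_dir, shard_size=2):
    """Pack (name, bytes) samples into fixed-size tar shards."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths = []
    for i in range(0, len(samples), shard_size):
        path = os.path.join(out_dir, f'shard-{i // shard_size:05d}.tar')
        with tarfile.open(path, 'w') as tar:
            for name, data in samples[i:i + shard_size]:
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        shard_paths.append(path)
    return shard_paths

# five fake "images" -> three shards of at most two files each
samples = [(f'img{i}.jpg', bytes(64)) for i in range(5)]
shards = write_shards(samples, tempfile.mkdtemp())
```

Training loaders then stream each shard sequentially, which turns millions of tiny random reads into a few large sequential ones.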


Engineering for performance and reliability

1) Schema & validation

  • Define a schema for your dataset (column types, ranges, required fields).
  • Use Great Expectations or pydantic for validation during ingestion.

Why: early detection of corrupted files or drift prevents silent model rot.
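A hand-rolled sketch of the idea; in practice you would express the same rules as pydantic models or Great Expectations suites (the field names and rules below are purely illustrative):

```python
# schema: field -> (expected type, range/content check)
SCHEMA = {
    'id':    (int,   lambda v: v >= 0),
    'price': (float, lambda v: v >= 0.0),
    'label': (str,   lambda v: len(v) > 0),
}

def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    for field, (typ, check) in SCHEMA.items():
        if field not in row:
            errors.append(f'missing field: {field}')
        elif not isinstance(row[field], typ):
            errors.append(f'{field}: expected {typ.__name__}')
        elif not check(row[field]):
            errors.append(f'{field}: failed range check')
    return errors

ok = validate_row({'id': 1, 'price': 9.5, 'label': 'cat'})      # []
bad = validate_row({'id': -1, 'price': 'free'})                 # 3 errors
```

Running checks like these at ingestion time, and failing loudly, is what keeps bad files out of training sets.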

2) Compression & CPU tradeoffs

  • Snappy or LZ4 give fast decompression and decent compression; gzip has smaller files but slower CPU decompress.
  • Columnar formats + compression reduce IO and sometimes overall time despite CPU usage.
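Snappy and LZ4 need third-party packages, but the stdlib's gzip levels illustrate the same speed-versus-size dial:

```python
import gzip

# highly repetitive sample data compresses well at any level
data = b'timestamp,value\n' + b'2024-01-01,3.14\n' * 10_000

fast = gzip.compress(data, compresslevel=1)   # faster, usually larger
small = gzip.compress(data, compresslevel=9)  # slower, usually smaller

ratio_fast = len(fast) / len(data)
ratio_small = len(small) / len(data)
```

The same tradeoff applies when picking `compression=` for Parquet or HDF5: pay CPU to save IO, or vice versa, depending on where your pipeline is bottlenecked.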

3) Streaming, chunking, and parallel IO

  • Use chunksize or iterators when data > RAM.
  • For many small files (images), prefer sharding (tar/zip) to avoid filesystem overhead.
  • Use multithreaded decoders or GPU-accelerated decoding where available.
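Chunked iteration is easy to roll by hand (Python 3.12 ships `itertools.batched` for exactly this; below is a portable sketch):

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items without materializing everything."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

stream = range(10)  # stand-in for a lazy record source
sizes = [len(b) for b in batched(stream, 4)]  # [4, 4, 2]
```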

4) Metadata & versioning

  • Store schema, data version, and provenance. Use DVC or a data catalog for versioned datasets.
  • Keep a manifest file listing shards and checksums.
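A minimal manifest builder using only the standard library (the manifest layout below is an assumption for illustration, not a standard format):

```python
import hashlib
import json
import os
import tempfile

def build_manifest(shard_dir, version='v1'):
    """List shards with SHA-256 checksums so pipelines can detect corruption."""
    entries = []
    for name in sorted(os.listdir(shard_dir)):
        path = os.path.join(shard_dir, name)
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        entries.append({'file': name,
                        'sha256': digest,
                        'bytes': os.path.getsize(path)})
    return {'version': version, 'shards': entries}

# demo with two fake shards
d = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(d, f'shard-{i}.tar'), 'wb') as f:
        f.write(bytes(16))

manifest = build_manifest(d)
print(json.dumps(manifest, indent=2))
```

Checking a shard's hash against the manifest before training catches silent corruption and partial uploads early.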

5) Security and privacy

  • Never pickle untrusted files. Validate and sanitize inputs in production.
  • For sensitive data, encrypt at rest and follow access controls.

Deployment-specific considerations

  • For low-latency model serving, prefer compact payloads. Use JSON for small structured requests and avoid base64-encoded images for non-trivial payloads (base64 inflates size by about 33% and adds encode/decode overhead). Prefer URLs or multipart uploads.
  • If serving directly from blob storage, generate signed URLs and stream content into the model server rather than embedding large binary payloads.
  • Prepack model inputs in the same format as training (e.g., TFRecord or normalized numpy arrays) to avoid preprocessing mismatch.
  • When pushing models to edge devices, choose lightweight formats (ONNX for models, protobuf or flatbuffers for data).

Relate this back to what you learned earlier: if your GPU pipeline is starved because your image decoding is single-threaded, no amount of mixed precision or larger batch sizes will fix it. File formats and IO are the other half of performance tuning.


Common pitfalls

  • Unexpected encodings or newline styles in CSVs.
  • Missing schema information causing type inference to be inconsistent across runs.
  • Using pickle for cross-service data exchange.
  • Packing millions of tiny files into cloud object storage without sharding.

Key takeaways

  • Choose the right format: CSV for small and simple; Parquet/Feather for big tabular data; TFRecord/WebDataset for large ML workloads; HDF5 for scientific arrays.
  • Optimize IO, not just compute: preprocess and pack data to reduce CPU decode time and GPU idle time.
  • Validate and version: schema + manifest + checksums = reproducible pipelines.
  • Design for deployment: small payloads, predictable schemas, and secure serialization.

Remember: a fast model with a slow data pipeline is like a racecar stuck in traffic. Fix the roads (files and formats), and the car finally gets to race.


Want a quick workflow checklist?

  1. Pick format: small demo -> CSV/JSON; production -> Parquet/TFRecord/WebDataset.
  2. Define schema and validation tests.
  3. Use chunking or memmap for large files.
  4. Compress with fast codecs (snappy/lz4) when using columnar formats.
  5. Shard many-small-file datasets into tar/record shards for training.
  6. Match training and serving formats to avoid runtime surprises.

Happy data engineering — may your IO be fast and your GPUs never starve.
