Foundations of AI and Data Science
Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.
Data Types and Formats: The IKEA Furniture of AI (Instructions Included)
You cannot out-model bad data typing. You can only suffer gloriously.
We framed the problem. We mapped CRISP-DM like pros. Now we open the box labeled 'Data Understanding' and realize it is 10% insights and 90% trying to figure out why a column called age has a value of 'twenty-five-ish'. This session is your survival guide to data types and formats — the difference between a smooth modeling pipeline and a chaos gremlin that eats weekends.
Why this matters (and where it bites in CRISP-DM)
Data types and formats show up everywhere:
- Business Understanding: types determine what is even measurable. You cannot A/B test vibes (yet).
- Data Understanding: detect the species of data you have, not the mythical one you want.
- Data Preparation: casting, parsing, encoding — also known as 'Gym for Data'.
- Modeling: models assume types; violate them and you get nonsense or errors.
- Evaluation: metrics change by type. RMSE for numbers, F1 for categories, BLEU/ROUGE for text.
- Deployment: serialization and schemas keep prod from catching fire.
If you choose the wrong type early, every step after that becomes interpretive dance.
The Big Typology: What kind of data is this goblin?
1) Structured (tabular)
- Numeric: integers, floats. Beware 0.1 + 0.2 != 0.3 in floating point land. Use Decimal for money.
- Boolean: True/False. (Not 'Y', 'N', 'maybe', and the pumpkin emoji.)
- Categorical:
- Nominal: unordered labels (color: red, blue).
- Ordinal: ordered labels (small < medium < large), but distances are not equal.
- Datetime/Time: timestamps, date ranges, durations; time zones are spicy.
- Text: often arrives as the dreaded 'object' dtype.
- Identifiers: user_id, sku; they are labels, not numbers. Do not average them.
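The float warning above is easy to demo; a minimal sketch using Python's stdlib `decimal` (the price and tax values are illustrative):

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# Decimal works in base 10: construct from strings, never from floats
price = Decimal('19.99')
tax = Decimal('0.07')
total = (price * (1 + tax)).quantize(Decimal('0.01'))
print(total)             # 21.39
```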
Measurement scales (psychometrics crash course):
- Nominal (categories), ordinal (order), interval (differences are meaningful; no true zero), ratio (all of the above plus true zero). This matters when choosing encodings and metrics.
Pandas dtype vibe check:
```python
import pandas as pd

df = pd.DataFrame({
    'age': pd.Series([23, 35, None], dtype='float64'),
    'is_active': pd.Series([True, False, True], dtype='boolean'),
    'size': pd.Series(['S', 'M', 'L'], dtype='category'),
    'signup_ts': pd.to_datetime(['2023-01-01', '2023-01-02', None]),
    'notes': pd.Series(['hi', 'ok', '...'], dtype='string'),
})
```
Tip: use category for low-cardinality labels; it saves memory and models love it.
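The memory claim is checkable; a quick sketch (exact byte counts vary by pandas version, but the ratio holds for any low-cardinality column):

```python
import pandas as pd

# The same low-cardinality label column stored two ways
labels = pd.Series(['small', 'medium', 'large'] * 10_000, dtype='string')
as_cat = labels.astype('category')

# category stores each distinct label once plus small integer codes
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(f'string: {obj_bytes:,} B, category: {cat_bytes:,} B')
```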
2) Semi-structured
- JSON, YAML, XML, and log lines. They have structure, but it is nested like a family drama. Great for events, configs, and APIs.
- JSON Lines (NDJSON): one JSON object per line. Stream-friendly, analytics-friendly.
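The one-object-per-line layout means pandas can ingest NDJSON directly; a small sketch with an in-memory buffer standing in for a log file (the event fields are made up):

```python
import io
import pandas as pd

# Three events, one JSON object per line (NDJSON)
raw = io.StringIO(
    '{"user_id": 1, "event": "click"}\n'
    '{"user_id": 2, "event": "view"}\n'
    '{"user_id": 1, "event": "purchase"}\n'
)
events = pd.read_json(raw, lines=True)
print(events.shape)  # (3, 2)
```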
3) Unstructured
- Text blobs, PDFs, images, audio, video. Models can extract structure, but the raw format is vibes-first.
- Pro move: store raw, plus a structured index (embeddings, metadata, OCR) so you can search and learn.
File and Storage Formats: Choose your fighter
| Format | Type | Best for | Pros | Watch-outs |
|---|---|---|---|---|
| CSV/TSV | Row, text | Quick tabular exchange | Human-readable, ubiquitous | No schema, commas-in-text drama, big files |
| JSON | Semi-structured | APIs, configs, events | Nested, flexible, web-native | Inconsistent schemas, expensive to parse |
| JSON Lines | Semi-structured | Logs, streaming | Append-friendly, line-by-line | Mixed schemas across lines |
| Parquet | Columnar, binary | Analytics at scale | Compressed, typed, column-pruning | Harder to eyeball, requires libs |
| Avro | Row, binary | Streaming with schema | Strong schema + evolution | You must manage schema registry |
| ORC | Columnar, binary | Hadoop-ish analytics | Compression, predicate pushdown | Ecosystem-specific |
| HDF5 | Hierarchical, binary | Arrays, scientific data | Fast random access, large arrays | Portability concerns |
| TFRecord | Binary | TensorFlow pipelines | Sequential I/O for training | TF-centric, tooling overhead |
| Images (PNG/JPEG) | Binary | Vision | PNG lossless; JPEG small files | Color space, metadata, compression |
| Audio (WAV/MP3) | Binary | Speech/audio | WAV raw; MP3 compressed | Sample rate/bit depth issues |
| Video (MP4/MKV) | Container | Vision/time | Wide support | Container vs codec confusion |
Row vs columnar: row formats (JSON, Avro) are great for writes and transactions; columnar formats (Parquet, ORC) shine for analytics, where queries read only the columns they need.
Compression: gzip is everywhere, zstd compresses better and faster, snappy optimizes for raw speed. Columnar formats compress per column, which works well because neighboring values in a column tend to be similar.
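You can feel the trade-offs with the stdlib alone (zstd and snappy need third-party libraries, so this sketch substitutes gzip, bz2, and lzma; the CSV payload is fabricated):

```python
import bz2
import gzip
import lzma

# Repetitive tabular text compresses extremely well; columnar storage
# exploits the same effect by grouping similar values together
data = ('user_id,country,price\n' + '1042,DE,19.99\n' * 5_000).encode('utf-8')

for name, compress in [('gzip', gzip.compress),
                       ('bz2', bz2.compress),
                       ('lzma', lzma.compress)]:
    print(f'{name}: {len(compress(data)):,} B (raw: {len(data):,} B)')
```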
Encoding, Precision, and Time: The unholy trinity
- Text encoding: use UTF-8 like it is 2025. Watch for BOMs and curly quotes. Normalize Unicode; 'é' can be one code point or two.
- Floats: do not compare for exact equality. For money, use Decimal or integer cents.
- Time zones: store UTC internally; display in local. Beware DST gaps and overlaps.
```python
import pandas as pd

# DST jumps at 2 a.m.; nonexistent local times must be handled explicitly
s = pd.to_datetime(['2021-03-14 01:59', '2021-03-14 02:01']).tz_localize(
    'US/Pacific', nonexistent='shift_forward'
)
```
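The 'one code point or two' claim about 'é' is worth seeing once, because it silently breaks string equality and joins; a minimal stdlib sketch:

```python
import unicodedata

# 'é' as one precomposed code point (NFC) vs 'e' + combining accent (NFD)
nfc = 'caf\u00e9'
nfd = 'cafe\u0301'
print(nfc == nfd)   # False: different code points, identical rendering
print(len(nfc), len(nfd))  # 4 5

# Normalizing to one form before comparing or joining fixes it
print(unicodedata.normalize('NFC', nfd) == nfc)  # True
```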
Schemas: Your contract with future-you
A schema states: field names, types, nullability, and constraints.
- JSON Schema for JSON; Avro schemas for Avro; Parquet has embedded schema.
- Python validation: Pydantic or Marshmallow.
- Data quality: Great Expectations to assert 'column X is non-null and between 0 and 1'.
Schema evolution is real: add optional fields, deprecate carefully, version everything.
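To make "names, types, nullability" concrete, here is a hand-rolled sketch of a schema check in plain Python; real projects would reach for Pydantic or Great Expectations as noted above, and the field names are invented:

```python
# Schema as a dict: field -> (type, nullable). Illustrative fields only.
SCHEMA = {
    'user_id': (int, False),
    'score': (float, True),
    'country': (str, False),
}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, (ftype, nullable) in schema.items():
        if field not in record:
            errors.append(f'missing field: {field}')
        elif record[field] is None:
            if not nullable:
                errors.append(f'null not allowed: {field}')
        elif not isinstance(record[field], ftype):
            errors.append(f'wrong type for {field}: {type(record[field]).__name__}')
    return errors

print(validate({'user_id': 1, 'score': None, 'country': 'DE'}, SCHEMA))  # []
print(validate({'user_id': 'abc', 'country': None}, SCHEMA))
```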
Tidy data: The Marie Kondo of tables
Tidy rules:
- Each variable is a column.
- Each observation is a row.
- Each value is a cell.
Melting and pivoting turn reporting Franken-tables into model-ready delight.
```python
df_tidy = df.melt(id_vars=['user_id'], var_name='metric', value_name='value')
```
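A self-contained round trip shows the melt in action on a toy reporting table (the columns are invented):

```python
import pandas as pd

# A reporting-style wide table: one column per metric
wide = pd.DataFrame({
    'user_id': [1, 2],
    'clicks': [10, 3],
    'purchases': [1, 0],
})

# Melt: one row per (observation, metric) pair - tidy and model-ready
tidy = wide.melt(id_vars=['user_id'], var_name='metric', value_name='value')
print(tidy.shape)  # (4, 3)

# Pivot restores the wide shape when reporting needs it back
back = tidy.pivot(index='user_id', columns='metric', values='value')
```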
Text, Images, Audio, Video: The sensory buffet
- Text: keep raw text plus processed features (tokens, embeddings). Store language code. Normalize whitespace and Unicode.
- Images: arrays (H x W x C). Channels: RGB vs BGR shenanigans. Know color spaces (RGB, HSV). Keep EXIF with caution (it can leak GPS).
- Audio: sample rate (e.g., 16 kHz for speech), bit depth (16-bit), mono vs stereo. Spectrograms become 2D arrays you can CNN.
- Video: frames + time. Understand container (MP4) vs codec (H.264). Extract frames with consistent FPS.
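The RGB-vs-BGR shenanigans reduce to one axis flip once images are arrays; a tiny NumPy sketch (the pixel values are made up):

```python
import numpy as np

# A 2x2 'image': height x width x channels, in RGB order
img_rgb = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# OpenCV-style libraries expect BGR; reversing the channel axis converts
img_bgr = img_rgb[..., ::-1]
print(img_rgb[0, 0], img_bgr[0, 0])  # [255 0 0] [0 0 255]
```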
Databases and Streams
- SQL (relational): strict schema, ACID, great for OLTP and clean joins.
- NoSQL flavors: document (MongoDB), key-value (Redis), wide-column (Cassandra), graph (Neo4j). Pick based on access pattern.
- Streams: Kafka, Kinesis. Message payloads often JSON/Avro/Protobuf. Use a schema registry.
Practical transformations that save projects
- Cast early and loudly:
```python
import pandas as pd

usecols = ['price_cents', 'created_at', 'country']
df = pd.read_csv('orders.csv', usecols=usecols,
                 dtype={'price_cents': 'Int64', 'country': 'string'})
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['country'] = df['country'].astype('category')
```
- Missing values: distinguish NA (unknown) from empty string (known empty) and zero (real zero). Use sentinel categories like 'Unknown' for labels.
- Categorical encodings: one-hot for linear models and low cardinalities; target or ordinal encoding for high-cardinality features (many tree libraries also handle categoricals natively). For deep learning, embeddings.
- Sparse data: use sparse matrices when 99% zeros.
- Save columnar for analytics:
```python
# Round-trip Parquet
import pandas as pd

df.to_parquet('clean.parquet', index=False)
restored = pd.read_parquet('clean.parquet')
```
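To make the encoding options concrete, a small sketch of one-hot vs ordinal codes on a toy category column:

```python
import pandas as pd

sizes = pd.DataFrame({'size': pd.Series(['S', 'M', 'L', 'M'], dtype='category')})

# One-hot: one indicator column per category
onehot = pd.get_dummies(sizes['size'], prefix='size')
print(list(onehot.columns))  # ['size_L', 'size_M', 'size_S']

# Ordinal: category codes keep a single compact integer column
sizes['size_code'] = sizes['size'].cat.codes
```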
Security and ethics footguns
- Pickle is convenient and also a remote code execution party if untrusted. Avoid for interchange.
- CSV injection: fields starting with '=', '+', '-', or '@' can execute as formulas when the file is opened in a spreadsheet. Escape or prefix them.
- PII: redact or tokenize. Hash emails with a salt; do not log secrets.
- PDFs and images: strip metadata if distributing.
If your dataset has vibes, your model will have a mood; if your dataset leaks PII, your model will have a lawyer.
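The salted-hash idea above can be sketched with the stdlib; note that emails are low-entropy, so plain salted hashing is still vulnerable to dictionary attacks (a keyed HMAC with a secret key is stronger):

```python
import hashlib
import secrets

# One salt per dataset; store it separately and never log it
salt = secrets.token_bytes(16)

def pseudonymize(email: str, salt: bytes) -> str:
    """Stable pseudonym: same email + salt -> same token, not reversible."""
    return hashlib.sha256(salt + email.lower().encode('utf-8')).hexdigest()

a = pseudonymize('Ada@example.com', salt)
b = pseudonymize('ada@example.com', salt)
print(a == b)  # True: case-normalized emails map to the same token
```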
Mini case: e-commerce soup
You have orders (tabular), product images (PNG), reviews (text), and clickstream (JSON Lines). Sensible choices:
- Store raw events as JSON Lines; batch them into Parquet for analytics.
- Orders in a SQL warehouse; ensure monetary types use integers or Decimal.
- Images in object storage; keep a table with paths, size, and labels.
- Reviews as text; compute sentiment and embeddings, store as numeric arrays in Parquet or a vector DB.
- Define schemas for each and validate in CI with Great Expectations.
Now modeling loves you: gradient-boosted trees for tabular, CNN for images, transformer for text, and a late-fusion model for combined predictions.
Common gotchas (aka why your model is weird)
- Mixed types in a column: '3', 3, and 'three' are not a personality; they are a bug.
- Datetime parsed as local on one machine and UTC on another.
- Categories re-encoded with different mappings between train and prod.
- JSON fields that drift over time (field missing or renamed silently).
- Training on JPEG-compressed images and evaluating on PNG; distribution shift via subtle compression artifacts.
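The mixed-type gotcha at the top of this list is easy to diagnose in pandas; a quick sketch:

```python
import pandas as pd

# A column that looks numeric but mixes Python types
s = pd.Series([3, '3', 'three', 2.5])
print(s.dtype)                    # object: the classic warning sign
print(s.map(type).value_counts())

# Coerce loudly: anything unparseable becomes NaN you can then inspect
nums = pd.to_numeric(s, errors='coerce')
print(s[nums.isna()])             # the offending values, here 'three'
```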
TL;DR and marching orders
- Choose representations on purpose: type first, format second.
- Prefer columnar (Parquet) for analytics and typed schemas for stability.
- Normalize text and time; treat money like it owes you receipts.
- Validate with schemas and tests; version data like code.
- Keep raw data immutable; derive clean, typed, model-ready layers.
Data types and formats are not housekeeping. They are strategy. Get them right, and the rest of CRISP-DM becomes a highlight reel instead of a blooper montage.