Foundations of AI and Data Science
Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.
Data Types and Formats: The IKEA Furniture of AI (Instructions Included)
You cannot out-model bad data typing. You can only suffer gloriously.
We framed the problem. We mapped CRISP-DM like pros. Now we open the box labeled 'Data Understanding' and realize it is 10% insights and 90% trying to figure out why a column called age has a value of 'twenty-five-ish'. This session is your survival guide to data types and formats — the difference between a smooth modeling pipeline and a chaos gremlin that eats weekends.
Why this matters (and where it bites in CRISP-DM)
Data types and formats show up everywhere:
- Business Understanding: types determine what is even measurable. You cannot A/B test vibes (yet).
- Data Understanding: detect the species of data you have, not the mythical one you want.
- Data Preparation: casting, parsing, encoding — also known as 'Gym for Data'.
- Modeling: models assume types; violate them and you get nonsense or errors.
- Evaluation: metrics change by type. RMSE for numbers, F1 for categories, BLEU/ROUGE for text.
- Deployment: serialization and schemas keep prod from catching fire.
If you choose the wrong type early, every step after that becomes interpretive dance.
The Big Typology: What kind of data is this goblin?
1) Structured (tabular)
- Numeric: integers, floats. Beware 0.1 + 0.2 != 0.3 in floating point land. Use Decimal for money.
- Boolean: True/False. (Not 'Y', 'N', 'maybe', and the pumpkin emoji.)
- Categorical:
- Nominal: unordered labels (color: red, blue).
- Ordinal: ordered labels (small < medium < large), but distances are not equal.
- Datetime/Time: timestamps, date ranges, durations; time zones are spicy.
- Text: often arrives as the dreaded 'object' dtype.
- Identifiers: user_id, sku; they are labels, not numbers. Do not average them.
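The float warning above is easy to demo; a minimal sketch using Python's stdlib `decimal` (the price and tax values are illustrative):

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# Decimal works in base 10: construct from strings, never from floats
price = Decimal('19.99')
tax = Decimal('0.07')
total = (price * (1 + tax)).quantize(Decimal('0.01'))
print(total)             # 21.39
```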
Measurement scales (psychometrics crash course):
- Nominal (categories), ordinal (order), interval (differences are meaningful; no true zero), ratio (all of the above plus true zero). This matters when choosing encodings and metrics.
Pandas dtype vibe check:
```python
import pandas as pd

df = pd.DataFrame({
    'age': pd.Series([23, 35, None], dtype='float64'),
    'is_active': pd.Series([True, False, True], dtype='boolean'),
    'size': pd.Series(['S', 'M', 'L'], dtype='category'),
    'signup_ts': pd.to_datetime(['2023-01-01', '2023-01-02', None]),
    'notes': pd.Series(['hi', 'ok', '...'], dtype='string'),
})
```
Tip: use category for low-cardinality labels; it saves memory and models love it.
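The memory claim is checkable; a quick sketch (exact byte counts vary by pandas version, but the ratio holds for any low-cardinality column):

```python
import pandas as pd

# The same low-cardinality label column stored two ways
labels = pd.Series(['small', 'medium', 'large'] * 10_000, dtype='string')
as_cat = labels.astype('category')

# category stores each distinct label once plus small integer codes
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(f'string: {obj_bytes:,} B, category: {cat_bytes:,} B')
```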
2) Semi-structured
- JSON, YAML, XML, and log lines. They have structure, but it is nested like a family drama. Great for events, configs, and APIs.
- JSON Lines (NDJSON): one JSON object per line. Stream-friendly, analytics-friendly.
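The one-object-per-line layout means pandas can ingest NDJSON directly; a small sketch with an in-memory buffer standing in for a log file (the event fields are made up):

```python
import io
import pandas as pd

# Three events, one JSON object per line (NDJSON)
raw = io.StringIO(
    '{"user_id": 1, "event": "click"}\n'
    '{"user_id": 2, "event": "view"}\n'
    '{"user_id": 1, "event": "purchase"}\n'
)
events = pd.read_json(raw, lines=True)
print(events.shape)  # (3, 2)
```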
3) Unstructured
- Text blobs, PDFs, images, audio, video. Models can extract structure, but the raw format is vibes-first.
- Pro move: store raw, plus a structured index (embeddings, metadata, OCR) so you can search and learn.
File and Storage Formats: Choose your fighter
| Format | Type | Best for | Pros | Watch-outs |
|---|---|---|---|---|
| CSV/TSV | Row, text | Quick tabular exchange | Human-readable, ubiquitous | No schema, commas-in-text drama, big files |
| JSON | Semi-structured | APIs, configs, events | Nested, flexible, web-native | Inconsistent schemas, expensive to parse |
| JSON Lines | Semi-structured | Logs, streaming | Append-friendly, line-by-line | Mixed schemas across lines |
| Parquet | Columnar, binary | Analytics at scale | Compressed, typed, column-pruning | Harder to eyeball, requires libs |
| Avro | Row, binary | Streaming with schema | Strong schema + evolution | You must manage schema registry |
| ORC | Columnar, binary | Hadoop-ish analytics | Compression, predicate pushdown | Ecosystem-specific |
| HDF5 | Hierarchical, binary | Arrays, scientific data | Fast random access, large arrays | Portability concerns |
| TFRecord | Binary | TensorFlow pipelines | Sequential I/O for training | TF-centric, tooling overhead |
| Images (PNG/JPEG) | Binary | Vision | PNG lossless; JPEG small files | Color space, metadata, compression |
| Audio (WAV/MP3) | Binary | Speech/audio | WAV raw; MP3 compressed | Sample rate/bit depth issues |
| Video (MP4/MKV) | Container | Vision/time | Wide support | Container vs codec confusion |
Row vs columnar: row formats (JSON, Avro) are great for writes and transactions; columnar formats (Parquet, ORC) shine for analytics, where queries read only the columns they need.
Compression: gzip is everywhere, zstd compresses better and faster, snappy optimizes for raw speed. Columnar formats compress per column, which works well because neighboring values in a column tend to be similar.
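You can feel the trade-offs with the stdlib alone (zstd and snappy need third-party libraries, so this sketch substitutes gzip, bz2, and lzma; the CSV payload is fabricated):

```python
import bz2
import gzip
import lzma

# Repetitive tabular text compresses extremely well; columnar storage
# exploits the same effect by grouping similar values together
data = ('user_id,country,price\n' + '1042,DE,19.99\n' * 5_000).encode('utf-8')

for name, compress in [('gzip', gzip.compress),
                       ('bz2', bz2.compress),
                       ('lzma', lzma.compress)]:
    print(f'{name}: {len(compress(data)):,} B (raw: {len(data):,} B)')
```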
Encoding, Precision, and Time: The unholy trinity
- Text encoding: use UTF-8 like it is 2025. Watch for BOMs and curly quotes. Normalize Unicode; 'é' can be one code point or two.
- Floats: do not compare for exact equality. For money, use Decimal or integer cents.
- Time zones: store UTC internally; display in local. Beware DST gaps and overlaps.
```python
import pandas as pd

# DST jumps at 2 a.m.; nonexistent local times must be handled explicitly
s = pd.to_datetime(['2021-03-14 01:59', '2021-03-14 02:01']).tz_localize(
    'US/Pacific', nonexistent='shift_forward'
)
```
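The 'one code point or two' claim about 'é' is worth seeing once, because it silently breaks string equality and joins; a minimal stdlib sketch:

```python
import unicodedata

# 'é' as one precomposed code point (NFC) vs 'e' + combining accent (NFD)
nfc = 'caf\u00e9'
nfd = 'cafe\u0301'
print(nfc == nfd)   # False: different code points, identical rendering
print(len(nfc), len(nfd))  # 4 5

# Normalizing to one form before comparing or joining fixes it
print(unicodedata.normalize('NFC', nfd) == nfc)  # True
```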
Schemas: Your contract with future-you
A schema states: field names, types, nullability, and constraints.
- JSON Schema for JSON; Avro schemas for Avro; Parquet has embedded schema.
- Python validation: Pydantic or Marshmallow.
- Data quality: Great Expectations to assert 'column X is non-null and between 0 and 1'.
Schema evolution is real: add optional fields, deprecate carefully, version everything.
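To make "names, types, nullability" concrete, here is a hand-rolled sketch of a schema check in plain Python; real projects would reach for Pydantic or Great Expectations as noted above, and the field names are invented:

```python
# Schema as a dict: field -> (type, nullable). Illustrative fields only.
SCHEMA = {
    'user_id': (int, False),
    'score': (float, True),
    'country': (str, False),
}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, (ftype, nullable) in schema.items():
        if field not in record:
            errors.append(f'missing field: {field}')
        elif record[field] is None:
            if not nullable:
                errors.append(f'null not allowed: {field}')
        elif not isinstance(record[field], ftype):
            errors.append(f'wrong type for {field}: {type(record[field]).__name__}')
    return errors

print(validate({'user_id': 1, 'score': None, 'country': 'DE'}, SCHEMA))  # []
print(validate({'user_id': 'abc', 'country': None}, SCHEMA))
```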
Tidy data: The Marie Kondo of tables
Tidy rules:
- Each variable is a column.
- Each observation is a row.
- Each value is a cell.
Melting and pivoting turn reporting Franken-tables into model-ready delight.
```python
df_tidy = df.melt(id_vars=['user_id'], var_name='metric', value_name='value')
```
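A self-contained round trip shows the melt in action on a toy reporting table (the columns are invented):

```python
import pandas as pd

# A reporting-style wide table: one column per metric
wide = pd.DataFrame({
    'user_id': [1, 2],
    'clicks': [10, 3],
    'purchases': [1, 0],
})

# Melt: one row per (observation, metric) pair - tidy and model-ready
tidy = wide.melt(id_vars=['user_id'], var_name='metric', value_name='value')
print(tidy.shape)  # (4, 3)

# Pivot restores the wide shape when reporting needs it back
back = tidy.pivot(index='user_id', columns='metric', values='value')
```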
Text, Images, Audio, Video: The sensory buffet
- Text: keep raw text plus processed features (tokens, embeddings). Store language code. Normalize whitespace and Unicode.
- Images: arrays (H x W x C). Channels: RGB vs BGR shenanigans. Know color spaces (RGB, HSV). Keep EXIF with caution (it can leak GPS).
- Audio: sample rate (e.g., 16 kHz for speech), bit depth (16-bit), mono vs stereo. Spectrograms become 2D arrays you can CNN.
- Video: frames + time. Understand container (MP4) vs codec (H.264). Extract frames with consistent FPS.
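The RGB-vs-BGR shenanigans reduce to one axis flip once images are arrays; a tiny NumPy sketch (the pixel values are made up):

```python
import numpy as np

# A 2x2 'image': height x width x channels, in RGB order
img_rgb = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# OpenCV-style libraries expect BGR; reversing the channel axis converts
img_bgr = img_rgb[..., ::-1]
print(img_rgb[0, 0], img_bgr[0, 0])  # [255 0 0] [0 0 255]
```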
Databases and Streams
- SQL (relational): strict schema, ACID, great for OLTP and clean joins.
- NoSQL flavors: document (MongoDB), key-value (Redis), wide-column (Cassandra), graph (Neo4j). Pick based on access pattern.
- Streams: Kafka, Kinesis. Message payloads often JSON/Avro/Protobuf. Use a schema registry.
Practical transformations that save projects
- Cast early and loudly:
```python
import pandas as pd

usecols = ['price_cents', 'created_at', 'country']
df = pd.read_csv('orders.csv', usecols=usecols,
                 dtype={'price_cents': 'Int64', 'country': 'string'})
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['country'] = df['country'].astype('category')
```
- Missing values: distinguish NA (unknown) from empty string (known empty) and zero (real zero). Use sentinel categories like 'Unknown' for labels.
- Categorical encodings: one-hot for linear models and low cardinalities; target or ordinal encoding for high-cardinality features (many tree libraries also handle categoricals natively). For deep learning, embeddings.
- Sparse data: use sparse matrices when 99% zeros.
- Save columnar for analytics:
```python
# Round-trip Parquet
import pandas as pd

df.to_parquet('clean.parquet', index=False)
restored = pd.read_parquet('clean.parquet')
```
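To make the encoding options concrete, a small sketch of one-hot vs ordinal codes on a toy category column:

```python
import pandas as pd

sizes = pd.DataFrame({'size': pd.Series(['S', 'M', 'L', 'M'], dtype='category')})

# One-hot: one indicator column per category
onehot = pd.get_dummies(sizes['size'], prefix='size')
print(list(onehot.columns))  # ['size_L', 'size_M', 'size_S']

# Ordinal: category codes keep a single compact integer column
sizes['size_code'] = sizes['size'].cat.codes
```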
Security and ethics footguns
- Pickle is convenient and also a remote code execution party if untrusted. Avoid for interchange.
- CSV injection: fields starting with '=', '+', '-', or '@' can execute as formulas when the file is opened in a spreadsheet. Escape or prefix them.
- PII: redact or tokenize. Hash emails with a salt; do not log secrets.
- PDFs and images: strip metadata if distributing.
If your dataset has vibes, your model will have a mood; if your dataset leaks PII, your model will have a lawyer.
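The salted-hash idea above can be sketched with the stdlib; note that emails are low-entropy, so plain salted hashing is still vulnerable to dictionary attacks (a keyed HMAC with a secret key is stronger):

```python
import hashlib
import secrets

# One salt per dataset; store it separately and never log it
salt = secrets.token_bytes(16)

def pseudonymize(email: str, salt: bytes) -> str:
    """Stable pseudonym: same email + salt -> same token, not reversible."""
    return hashlib.sha256(salt + email.lower().encode('utf-8')).hexdigest()

a = pseudonymize('Ada@example.com', salt)
b = pseudonymize('ada@example.com', salt)
print(a == b)  # True: case-normalized emails map to the same token
```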
Mini case: e-commerce soup
You have orders (tabular), product images (PNG), reviews (text), and clickstream (JSON Lines). Sensible choices:
- Store raw events as JSON Lines; batch them into Parquet for analytics.
- Orders in a SQL warehouse; ensure monetary types use integers or Decimal.
- Images in object storage; keep a table with paths, size, and labels.
- Reviews as text; compute sentiment and embeddings, store as numeric arrays in Parquet or a vector DB.
- Define schemas for each and validate in CI with Great Expectations.
Now modeling loves you: gradient-boosted trees for tabular, CNN for images, transformer for text, and a late-fusion model for combined predictions.
Common gotchas (aka why your model is weird)
- Mixed types in a column: '3', 3, and 'three' are not a personality; they are a bug.
- Datetime parsed as local on one machine and UTC on another.
- Categories re-encoded with different mappings between train and prod.
- JSON fields that drift over time (field missing or renamed silently).
- Training on JPEG-compressed images and evaluating on PNG; distribution shift via subtle compression artifacts.
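The mixed-type gotcha at the top of this list is easy to diagnose in pandas; a quick sketch:

```python
import pandas as pd

# A column that looks numeric but mixes Python types
s = pd.Series([3, '3', 'three', 2.5])
print(s.dtype)                    # object: the classic warning sign
print(s.map(type).value_counts())

# Coerce loudly: anything unparseable becomes NaN you can then inspect
nums = pd.to_numeric(s, errors='coerce')
print(s[nums.isna()])             # the offending values, here 'three'
```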
TL;DR and marching orders
- Choose representations on purpose: type first, format second.
- Prefer columnar (Parquet) for analytics and typed schemas for stability.
- Normalize text and time; treat money like it owes you receipts.
- Validate with schemas and tests; version data like code.
- Keep raw data immutable; derive clean, typed, model-ready layers.
Data types and formats are not housekeeping. They are strategy. Get them right, and the rest of CRISP-DM becomes a highlight reel instead of a blooper montage.