
Full Stack AI and Data Science Professional

Foundations of AI and Data Science


Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.


Data Types and Formats: The IKEA Furniture of AI (Instructions Included)

You cannot out-model bad data typing. You can only suffer gloriously.

We framed the problem. We mapped CRISP-DM like pros. Now we open the box labeled 'Data Understanding' and realize it is 10% insights and 90% trying to figure out why a column called age has a value of 'twenty-five-ish'. This session is your survival guide to data types and formats — the difference between a smooth modeling pipeline and a chaos gremlin that eats weekends.


Why this matters (and where it bites in CRISP-DM)

Data types and formats show up everywhere:

  1. Business Understanding: types determine what is even measurable. You cannot A/B test vibes (yet).
  2. Data Understanding: detect the species of data you have, not the mythical one you want.
  3. Data Preparation: casting, parsing, encoding — also known as 'Gym for Data'.
  4. Modeling: models assume types; violate them and you get nonsense or errors.
  5. Evaluation: metrics change by type. RMSE for numbers, F1 for categories, BLEU/ROUGE for text.
  6. Deployment: serialization and schemas keep prod from catching fire.

If you choose the wrong type early, every step after that becomes interpretive dance.


The Big Typology: What kind of data is this goblin?

1) Structured (tabular)

  • Numeric: integers, floats. Beware 0.1 + 0.2 != 0.3 in floating point land. Use Decimal for money.
  • Boolean: True/False. (Not 'Y', 'N', 'maybe', and the pumpkin emoji.)
  • Categorical:
    • Nominal: unordered labels (color: red, blue).
    • Ordinal: ordered labels (small < medium < large), but distances are not equal.
  • Datetime/Time: timestamps, date ranges, durations; time zones are spicy.
  • Text: often arrives as the dreaded 'object' dtype.
  • Identifiers: user_id, sku; they are labels, not numbers. Do not average them.

Measurement scales (psychometrics crash course):

  • Nominal (categories), ordinal (order), interval (differences are meaningful; no true zero), ratio (all of the above plus true zero). This matters when choosing encodings and metrics.

Pandas dtype vibe check:

import pandas as pd

df = pd.DataFrame({
    'age': pd.Series([23, 35, None], dtype='float64'),
    'is_active': pd.Series([True, False, True], dtype='boolean'),
    'size': pd.Series(['S','M','L'], dtype='category'),
    'signup_ts': pd.to_datetime(['2023-01-01', '2023-01-02', None]),
    'notes': pd.Series(['hi', 'ok', '...'], dtype='string')
})

Tip: use category for low-cardinality labels; it saves memory and models love it.
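The memory claim is easy to check yourself; a minimal sketch with a synthetic low-cardinality column:

```python
import pandas as pd

# 300,000 rows of a three-label column, stored two ways
labels = pd.Series(['red', 'green', 'blue'] * 100_000, dtype='string')
as_category = labels.astype('category')

# category stores each label once plus a small integer code per row,
# so deep memory usage drops by roughly an order of magnitude here
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
```

The exact ratio depends on label length and cardinality, but the shape of the win is the same: one copy of each label, tiny codes everywhere else.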

2) Semi-structured

  • JSON, YAML, XML, and log lines. They have structure, but it is nested like a family drama. Great for events, configs, and APIs.
  • JSON Lines (NDJSON): one JSON object per line. Stream-friendly, analytics-friendly.
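The append-friendly, line-by-line pattern in a minimal sketch (file name and fields are illustrative):

```python
import json
import os
import tempfile

events = [{'user': 'a', 'action': 'click'}, {'user': 'b', 'action': 'view'}]

# Write one JSON object per line -- appending never touches earlier lines
path = os.path.join(tempfile.mkdtemp(), 'events.jsonl')
with open(path, 'w', encoding='utf-8') as f:
    for event in events:
        f.write(json.dumps(event) + '\n')

# Read back line by line -- no need to hold the whole file in memory
with open(path, encoding='utf-8') as f:
    restored = [json.loads(line) for line in f]
```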

3) Unstructured

  • Text blobs, PDFs, images, audio, video. Models can extract structure, but the raw format is vibes-first.
  • Pro move: store raw, plus a structured index (embeddings, metadata, OCR) so you can search and learn.

File and Storage Formats: Choose your fighter

| Format | Type | Best for | Pros | Watch-outs |
|---|---|---|---|---|
| CSV/TSV | Row, text | Quick tabular exchange | Human-readable, ubiquitous | No schema, commas-in-text drama, big files |
| JSON | Semi-structured | APIs, configs, events | Nested, flexible, web-native | Inconsistent schemas, expensive to parse |
| JSON Lines | Semi-structured | Logs, streaming | Append-friendly, line-by-line | Mixed schemas across lines |
| Parquet | Columnar, binary | Analytics at scale | Compressed, typed, column pruning | Harder to eyeball, requires libs |
| Avro | Row, binary | Streaming with schema | Strong schema + evolution | You must manage a schema registry |
| ORC | Columnar, binary | Hadoop-ish analytics | Compression, predicate pushdown | Ecosystem-specific |
| HDF5 | Hierarchical, binary | Arrays, scientific data | Fast random access, large arrays | Portability concerns |
| TFRecord | Binary | TensorFlow pipelines | Sequential I/O for training | TF-centric, tooling overhead |
| Images (PNG/JPEG) | Binary | Vision | PNG lossless; JPEG small files | Color space, metadata, compression |
| Audio (WAV/MP3) | Binary | Speech/audio | WAV raw; MP3 compressed | Sample rate/bit depth issues |
| Video (MP4/MKV) | Container | Vision/time | Wide support | Container vs codec confusion |

Row vs columnar: row formats (JSON, Avro) are great for writes and transactions; columnar (Parquet, ORC) are great for analytics and SELECT only what you need.

Compression: gzip good, zstd great, snappy fast. Columnar formats compress per-column with magic.


Encoding, Precision, and Time: The unholy trinity

  • Text encoding: use UTF-8 like it is 2025. Watch for BOMs and curly quotes. Normalize Unicode; 'é' can be one code point or two.
  • Floats: do not compare for exact equality. For money, use Decimal or integer cents.
  • Time zones: store UTC internally; display in local. Beware DST gaps and overlaps.
import pandas as pd
s = pd.to_datetime(['2021-03-14 01:59', '2021-03-14 02:01']).tz_localize('US/Pacific', nonexistent='shift_forward')
# DST jumps at 2am; nonexistent times must be handled explicitly
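The text and float caveats above, made concrete with the standard library:

```python
import unicodedata
from decimal import Decimal

# 'é' as one code point vs 'e' plus a combining accent:
# they look identical but compare unequal until you normalize
one = '\u00e9'
two = 'e\u0301'
assert one != two
assert unicodedata.normalize('NFC', two) == one

# Binary floats cannot represent 0.1 exactly, so sums drift
assert 0.1 + 0.2 != 0.3

# Decimal (or integer cents) keeps money arithmetic exact
assert Decimal('0.10') + Decimal('0.20') == Decimal('0.30')
```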

Schemas: Your contract with future-you

A schema states: field names, types, nullability, and constraints.

  • JSON Schema for JSON; Avro schemas for Avro; Parquet has embedded schema.
  • Python validation: Pydantic or Marshmallow.
  • Data quality: Great Expectations to assert 'column X is non-null and between 0 and 1'.

Schema evolution is real: add optional fields, deprecate carefully, version everything.
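What a schema buys you, as a hand-rolled sketch (field names are illustrative; Pydantic and Great Expectations express the same contract declaratively, with far more power):

```python
# A schema is names + types + nullability + constraints, checked at the boundary
def validate_order(record: dict) -> list[str]:
    errors = []
    if not isinstance(record.get('price_cents'), int):
        errors.append('price_cents must be an integer')
    if record.get('country') is None:
        errors.append('country is required')
    discount = record.get('discount')
    if discount is not None and not (0 <= discount <= 1):
        errors.append('discount must be between 0 and 1')
    return errors

assert validate_order({'price_cents': 1999, 'country': 'DE', 'discount': 0.1}) == []
assert validate_order({'price_cents': '19.99', 'country': None}) != []
```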


Tidy data: The Marie Kondo of tables

Tidy rules:

  • Each variable is a column.
  • Each observation is a row.
  • Each value is a cell.

Melting and pivoting turn reporting Franken-tables into model-ready delight.

df_tidy = df.melt(id_vars=['user_id'], var_name='metric', value_name='value')
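The melt above assumes a table with a user_id column; a self-contained round trip (column names illustrative):

```python
import pandas as pd

# Wide reporting table: one column per metric
wide = pd.DataFrame({'user_id': [1, 2], 'clicks': [10, 20], 'views': [100, 200]})

# Tidy: one row per (user, metric) observation
tidy = wide.melt(id_vars=['user_id'], var_name='metric', value_name='value')

# Pivot back to wide when a report needs it
back = tidy.pivot(index='user_id', columns='metric', values='value').reset_index()
```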

Text, Images, Audio, Video: The sensory buffet

  • Text: keep raw text plus processed features (tokens, embeddings). Store language code. Normalize whitespace and Unicode.
  • Images: arrays (H x W x C). Channels: RGB vs BGR shenanigans. Know color spaces (RGB, HSV). Keep EXIF with caution (it can leak GPS).
  • Audio: sample rate (e.g., 16 kHz for speech), bit depth (16-bit), mono vs stereo. Spectrograms become 2D arrays you can CNN.
  • Video: frames + time. Understand container (MP4) vs codec (H.264). Extract frames with consistent FPS.
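The RGB-vs-BGR shenanigans in code, on a synthetic 2x2 image (real loaders differ: OpenCV hands you BGR, PIL hands you RGB):

```python
import numpy as np

# A tiny H x W x C image: every pixel pure red in RGB ordering
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255  # channel 0 = red

# Reversing the channel axis converts RGB <-> BGR
bgr = rgb[..., ::-1]
assert bgr[0, 0, 2] == 255  # red now lives in channel 2
```

Feed a BGR array to a model trained on RGB and nothing errors; your reds and blues just quietly swap.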

Databases and Streams

  • SQL (relational): strict schema, ACID, great for OLTP and clean joins.
  • NoSQL flavors: document (MongoDB), key-value (Redis), wide-column (Cassandra), graph (Neo4j). Pick based on access pattern.
  • Streams: Kafka, Kinesis. Message payloads often JSON/Avro/Protobuf. Use a schema registry.

Practical transformations that save projects

  • Cast early and loudly:
import pandas as pd

usecols = ['price_cents','created_at','country']
df = pd.read_csv('orders.csv', usecols=usecols, dtype={'price_cents': 'Int64', 'country': 'string'})
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['country'] = df['country'].astype('category')
  • Missing values: distinguish NA (unknown) from empty string (known empty) and zero (real zero). Use sentinel categories like 'Unknown' for labels.
  • Categorical encodings: one-hot for tree models, target/ordinal encoding for high-cardinality. For deep learning, embeddings.
  • Sparse data: use sparse matrices when 99% zeros.
  • Save columnar for analytics:
# Round-trip Parquet
import pandas as pd
df.to_parquet('clean.parquet', index=False)
restored = pd.read_parquet('clean.parquet')
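The one-hot encoding mentioned above, sketched with pandas (in production, sklearn's OneHotEncoder is the safer route because it remembers the train-time mapping):

```python
import pandas as pd

df = pd.DataFrame({'size': pd.Series(['S', 'M', 'L', 'M'], dtype='category')})

# One boolean/int column per category: size_L, size_M, size_S
encoded = pd.get_dummies(df, columns=['size'])
```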

Security and ethics footguns

  • Pickle is convenient and also a remote-code-execution party if you unpickle untrusted data. Avoid it for interchange.
  • CSV injection: fields starting with '=', '+', '-', or '@' can execute as formulas when opened in a spreadsheet. Escape or prefix them.
  • PII: redact or tokenize. Hash emails with a salt; do not log secrets.
  • PDFs and images: strip metadata if distributing.
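The salted-hash idea for emails, sketched with the standard library (in production the salt comes from a secret store, never from source code):

```python
import hashlib
import hmac

SALT = b'load-me-from-a-secret-store'  # illustrative; never hard-code a real salt

def pseudonymize_email(email: str) -> str:
    # Keyed hash: stable per email, useless to anyone without the salt
    return hmac.new(SALT, email.lower().encode('utf-8'), hashlib.sha256).hexdigest()

token = pseudonymize_email('Ada@example.com')
assert token == pseudonymize_email('ada@example.com')  # case-normalized, stable
assert '@' not in token
```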

If your dataset has vibes, your model will have a mood; if your dataset leaks PII, your model will have a lawyer.


Mini case: e-commerce soup

You have orders (tabular), product images (PNG), reviews (text), and clickstream (JSON Lines). Sensible choices:

  • Store raw events as JSON Lines; batch them into Parquet for analytics.
  • Orders in a SQL warehouse; ensure monetary types use integers or Decimal.
  • Images in object storage; keep a table with paths, size, and labels.
  • Reviews as text; compute sentiment and embeddings, store as numeric arrays in Parquet or a vector DB.
  • Define schemas for each and validate in CI with Great Expectations.

Now modeling loves you: gradient-boosted trees for tabular, CNN for images, transformer for text, and a late-fusion model for combined predictions.


Common gotchas (aka why your model is weird)

  • Mixed types in a column: '3', 3, and 'three' are not a personality; they are a bug.
  • Datetime parsed as local on one machine and UTC on another.
  • Categories re-encoded with different mappings between train and prod.
  • JSON fields that drift over time (field missing or renamed silently).
  • Training on JPEG-compressed images and evaluating on PNG; distribution shift via subtle compression artifacts.
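The mixed-types gotcha can be flushed out with pd.to_numeric(errors='coerce'): whatever fails to parse becomes NaN, which you can then inspect instead of discovering at model time:

```python
import pandas as pd

raw = pd.Series(['3', 3, 'three', '4.5'])  # a column with a personality disorder

# Coerce: parse what parses, NaN the rest
parsed = pd.to_numeric(raw, errors='coerce')

# The offenders are exactly the rows that refused to become numbers
offenders = raw[parsed.isna()]
```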

TL;DR and marching orders

  • Choose representations on purpose: type first, format second.
  • Prefer columnar (Parquet) for analytics and typed schemas for stability.
  • Normalize text and time; treat money like it owes you receipts.
  • Validate with schemas and tests; version data like code.
  • Keep raw data immutable; derive clean, typed, model-ready layers.

Data types and formats are not housekeeping. They are strategy. Get them right, and the rest of CRISP-DM becomes a highlight reel instead of a blooper montage.
