Understanding Data
Learn the data concepts that underpin effective AI systems.
Data Types & Modalities — The Chaotic Symphony of Inputs
"Data is the raw matter of intelligence; how you treat it decides whether you build a cathedral or a papier-mâché volcano." — definitely a TA I made up
You're coming off 'Machine Learning Essentials' (where we sampled algorithm families and debated batch vs online inference like caffeine-fueled philosophers). Now we zoom in on the stuff those models actually swallow: the data. This is not a rerun of 'what's a dataset?' — instead, think of this as a deep-dive into the different flavors of data (types) and the channels they arrive through (modalities), and why that matters for model choice, deployment, and real-world performance.
Why this chapter matters (and why your model will fail otherwise)
- Pick the wrong algorithm for the data type and you'll be sad (and wrong). Remember our chat about common algorithm families? Some are built for tabular numbers, others for sequences or images.
- Deployment constraints (latency, memory, streaming vs batch) are shaped by data modality. A 4K video stream is a different beast than a CSV row.
- Preprocessing, labeling cost, and failure modes (class imbalance, concept drift) vary radically by type.
Imagine treating audio like text, or graphs like images. It’s like trying to wear sunglasses in a pitch-dark room — useless and slightly tragic.
Big distinctions: Data type vs Modality
- Data type = how the data is structured: numbers, categories, timestamps, etc. It's about schema and primitives.
- Modality = the sensory channel or format: text, image, audio, video, graphs, time-series, sensors, 3D point clouds.
Think: data type is the brick; modality is whether you're building a wall, a sculpture, or a hoverboard.
Common data types (and the real-world things they map to)
Numerical (continuous / discrete)
- Examples: temperature readings, prices, counts.
- Favored models: linear models, tree ensembles, neural nets.
Categorical
- Examples: country, gender, product id.
- Needs encoding (one-hot, target, embedding).
Ordinal
- Examples: survey ratings (1-5), education level.
- Preserve order when encoding.
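To make the categorical-vs-ordinal distinction concrete, here's a minimal hand-rolled sketch (stdlib only; the column names and category lists are made up for illustration — real pipelines would use a library encoder):

```python
def one_hot(value, categories):
    """Nominal categories get independent binary columns -- no implied order."""
    return [1 if value == c else 0 for c in categories]

def ordinal_encode(value, ordered_levels):
    """Ordinal categories map to integers that preserve their ranking."""
    return ordered_levels.index(value)

countries = ["DE", "FR", "US"]                              # nominal: no natural order
education = ["primary", "secondary", "bachelor", "master"]  # ordered levels

print(one_hot("FR", countries))              # [0, 1, 0]
print(ordinal_encode("bachelor", education)) # 2
```

Note that feeding the ordinal integers into a model is only safe because the order is real; doing the same to `countries` would invent a ranking that isn't there.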
Text (string)
- Examples: reviews, logs, transcripts.
- Requires tokenization, embeddings, or language models.
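The bare minimum of that text pipeline — tokenize, build a vocabulary, map tokens to ids — fits in a few lines. This is a toy whitespace tokenizer to show the idea; real systems use subword tokenizers (BPE, WordPiece) from a pretrained model:

```python
def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocab(corpus):
    """Assign an integer id to every token seen; id 0 is the unknown token."""
    vocab = {"<unk>": 0}
    for doc in corpus:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Map a new text to ids; unseen tokens fall back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

vocab = build_vocab(["great product", "terrible product"])
print(encode("great service", vocab))  # [1, 0] -- "service" is unknown
```

The `<unk>` fallback is the important bit: any tokenizer must decide what happens to words it never saw during training.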
Time-series
- Examples: stock prices, IoT sensor data.
- Needs temporal features, windowing, seasonality handling.
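Windowing is the core trick here: you turn a raw series into supervised (past-window, next-value) pairs a model can train on. A minimal sketch (the window size of 3 is an illustrative choice, not a recommendation):

```python
def sliding_windows(series, window):
    """Yield (features, target) pairs: `window` past values predict the next one."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]

prices = [10, 11, 13, 12, 15, 16]
pairs = sliding_windows(prices, window=3)
print(pairs[0])  # ([10, 11, 13], 12)
```

One consequence worth internalizing: train/test splits on windowed data must respect time order, or windows from the "future" leak into training.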
Graphs / Networks
- Examples: social networks, molecules, knowledge graphs.
- Requires GNNs or graph algorithms.
Images / Video / Audio / 3D
- Examples: photos, surveillance feeds, speech, LiDAR point clouds.
- Each has specialized pipelines (CNNs, CNN+RNN/Transformer, spectrograms, point-networks).
Mixed / Multimodal
- When two or more modalities are combined (e.g., captioned images, video with audio and text).
Modalities — the sensory palette
Here's a compact table to keep your brain tidy:
| Modality | Characteristics | Typical preprocessing | Example models |
|---|---|---|---|
| Tabular (structured) | Rows x columns, heterogeneous types | Imputation, encoding, scaling | XGBoost, Random Forests, MLPs |
| Text | Sequential, discrete tokens | Tokenize, embed, clean | Transformers, RNNs, LMs |
| Images | Spatial grid, high-dim pixels | Resize, normalize, augment | CNNs, Vision Transformers |
| Audio | 1D waveform, time-frequency | Resample, spectrograms | CNNs, RNNs, audio Transformers |
| Video | Sequence of images (+audio) | Frame sampling, compression | 3D CNNs, video Transformers |
| Graphs | Nodes & edges, relational | Node features, adjacency | GNNs, graph algorithms |
| 3D Point Clouds | Unordered points in space | Voxelization, sampling | PointNet, sparse CNNs |
Practical consequences: model choice, labeling, and pipelines
- Tabular data? Tree models (XGBoost) often win in business settings. Deep nets can help but need more data/engineering.
- Text or audio? Pretrained language/speech models save months of effort and are gold for transfer learning.
- Images/videos? Data augmentation and large labelled sets matter; compute and storage grow fast.
- Graphs? If relationships are the signal (fraud rings, molecule bonds), use graph-specific models.
Labeling costs differ wildly: labeling a CSV is cheap; labeling a video frame-by-frame is expensive and slow. That affects whether you can iterate quickly or need active learning.
Multimodal — when your model needs to be an orchestra conductor
Multimodal systems combine inputs: image + text, audio + transcript, sensor arrays + metadata.
Why bother?
- Complementary information -> improved accuracy (e.g., both video and audio supply context).
- Robustness: if one sensor fails, another can fill in.
Challenges:
- Alignment: temporally syncing audio and video, or aligning text tokens and image regions.
- Fusion strategy: early (combine raw features), mid (combine learned features), or late (ensemble outputs).
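Late fusion is the easiest of the three to sketch: each modality gets its own model, and you combine their output probabilities at the end. The two models below are dummy stand-ins returning hard-coded scores, purely to show the fusion step:

```python
def video_model(frames):
    """Stand-in for a real video classifier (dummy per-class probabilities)."""
    return {"cat": 0.6, "dog": 0.4}

def audio_model(waveform):
    """Stand-in for a real audio classifier."""
    return {"cat": 0.2, "dog": 0.8}

def late_fusion(predictions, weights=None):
    """Weighted average of per-modality probability dicts (late fusion)."""
    weights = weights or [1 / len(predictions)] * len(predictions)
    classes = predictions[0].keys()
    return {
        c: sum(w * p[c] for w, p in zip(weights, predictions))
        for c in classes
    }

fused = late_fusion([video_model(None), audio_model(None)])
print(fused)  # {'cat': 0.4, 'dog': 0.6} -> fused decision: "dog"
```

Early and mid fusion replace this averaging with feature concatenation before or inside the model, which is more powerful but demands aligned inputs — exactly the alignment problem above.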
Quick thought experiment: You're building a real-time captioner for videos (online inference). You must stream audio, transcribe quickly, align to frames, and produce captions with low latency. Now remember the deployment issues we discussed: latency budgets, memory, and fallback behaviors. Multimodality multiplies the constraints.
Data pitfalls and the guardrails you need
- Imbalance — common in medical and fraud datasets; counter it with resampling, class weighting, or reframing the task as anomaly detection.
- Concept drift — especially for time-series or streaming data; monitor and retrain (online vs batch decisions!).
- Label noise — human annotators disagree; use consensus, quality checks, or noise-robust losses.
- Bias — modality-specific prejudices (face datasets with demographic skew) require careful auditing.
- Volume & velocity — video + audio needs lots of storage and throughput; choose streaming pipelines for low-latency online inference.
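For the imbalance guardrail, the standard fix is inverse-frequency class weights — the same idea behind `class_weight="balanced"` in scikit-learn. A small sketch with a made-up fraud label set:

```python
from collections import Counter

def balanced_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)): rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = ["ok"] * 90 + ["fraud"] * 10
weights = balanced_weights(labels)
print(weights)  # fraud gets weight 5.0, ok gets ~0.56 -- a ~9x penalty ratio
```

Passing these weights into the loss function makes each mistake on the rare class cost roughly nine times as much, so the model can't win by always predicting "ok".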
Small code-like checklist (pseudocode) to decide approach
if modality == 'tabular':
    try_tree_ensemble()                      # e.g. XGBoost first
elif modality in ('text', 'audio'):
    use_pretrained_transformer()
elif modality == 'image':
    use_pretrained_cnn_or_vit(augment=True)
elif modality == 'graph':
    use_gnn()

if multimodal:
    decide_alignment_and_fusion()

# Always: check deployment constraints -- latency, memory, streaming vs batch.
Closing: TL;DR and a motivational jab
- Data type = schema/primitives (numbers, categories). Modality = sensory channel (text, image, audio, graph).
- Algorithms love specific modalities: pick wisely. Remember our earlier notes on algorithm families — they’re not interchangeable accessories.
- Deployment choices (online vs batch, compute limits) are dictated by modality: streaming audio is not a batch CSV.
- Multimodal is powerful but expensive in engineering and inference complexity.
Final mic drop: treat data like cuisine. Don’t expect a microwave meal to taste like a chef’s tasting menu. Learn the modality, choose the right recipe, and the model will actually feed your project instead of eating it.
Key takeaways:
- Map modality -> preprocessing -> model family -> deployment pattern.
- Audit for bias, drift, and label noise early.
- When in doubt: prototype simple, validate fast, and scale thoughtfully.
Now go look at your dataset like a food critic with a clipboard. What modality is it? What’s the cheapest, nastiest thing that will make your model fail? Fix that first.