Data Science Foundations and Workflow
Understand the data science landscape, roles, workflows, and tools. Learn problem framing, reproducibility, and ethical principles that guide successful projects from idea to impact.
Structured vs Unstructured Data: The Spreadsheet vs The Chaos Goblin
Data science is just organized curiosity. The structure part is the organized bit. The unstructured part is the curiosity yelling from the void.
You already wrangled data types and formats, and you framed problems like a responsible hypothesis adult. Now let’s aim the flashlight at a big fork in the data road: structured vs unstructured data. This choice influences everything from how you store data to which model you trust with your weekend.
Opening: A Tale of Two Datasets
Imagine you are handed two gifts:
- Gift A: A tidy table with columns like customer_id, signup_date, churned. You can feel the order radiating off it.
- Gift B: A folder of email threads, screenshots, and voice notes titled maybe important. The folder whispers chaos.
Both gifts are data gold. But they demand very different workflows. And if you try to treat them the same, your analysis will scream quietly in a corner.
Why does this matter? Because back when we framed hypotheses, we asked things like What predicts churn? and Does feature X affect outcome Y? The type of data you have shapes how you even measure X and Y in the first place.
What do we mean by structure, actually?
- Structured data: Data that lives in a predefined schema. Think rows and columns with datatypes you can validate. SQL loves this.
- Unstructured data: Data without a fixed tabular schema. Text, images, audio, video, PDFs, whole novels written by customers in a feedback form.
- Semi-structured data: Not a neat table, but still has consistent tags or keys. JSON, XML, logs. The vibe is: I do what I want… but also here are some keys.
Structure is a contract. The more structure you have, the easier it is to query, validate, and do math. The less you have, the richer and messier the world you can model.
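To make the contract concrete, here is the same customer story in all three shapes. Everything below is invented for illustration (the field names like customer_id are stand-ins, not a real schema):

```python
# The same customer, three shapes of data.

# Structured: a fixed schema -- every record has the same typed fields
structured_row = {"customer_id": 42, "signup_date": "2024-01-15", "churned": False}

# Semi-structured: consistent keys, but nesting and optional fields
semi_structured = {
    "customer_id": 42,
    "events": [
        {"type": "login", "ts": "2024-02-01T09:00:00"},
        {"type": "support_ticket", "ts": "2024-02-03T14:22:00", "priority": "high"},
    ],
}

# Unstructured: no schema at all -- meaning must be extracted by a parser or model
unstructured = "Honestly love the product but the billing page made me want to scream."
```

Notice the gradient: the first is query-ready, the second needs parsing, and the third needs an actual model before you can do any math with it.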
Structured Data: The Spreadsheet That Pays Taxes
What it looks like
- Rows = entities (users, transactions, sensors)
- Columns = features (age, price, timestamp)
- Examples: CSVs, tables in a relational database, Parquet files
Why it is lovely
- Schema enforces sanity: integers behave like integers
- SQL can query it fast
- Easy to compute aggregates, join tables, and run classic ML (logistic regression, decision trees)
Where it shines
- Churn prediction with customer demographics and usage counts
- A/B test analysis with clear metrics
- Finance and operations dashboards
Favorite tools
- SQL, pandas, dbt, columnar formats like Parquet
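A quick sketch of why structured data is lovely in practice: aggregates and joins are one-liners. The tables and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical customer and usage tables
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "tier": ["free", "pro", "pro"],
})
usage = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "sessions": [2, 3, 10, 1, 4, 2],
})

# Aggregate, then join: the bread and butter of structured workflows
per_customer = usage.groupby("customer_id", as_index=False)["sessions"].sum()
joined = customers.merge(per_customer, on="customer_id", how="left")
# customer 1 -> 5 sessions, customer 2 -> 10, customer 3 -> 7
```

Because the schema is a contract, you never have to ask what type `sessions` is or whether `customer_id` means the same thing in both tables.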
Unstructured Data: The Chaos Goblin With Infinite Potential
What it looks like
- Freeform text, images, audio, video, PDFs, social posts, call transcripts
- Stored as blobs or files with some metadata tagged on
Why it’s powerful
- Contains nuance and context that tables flatten out
- Lets you measure things you couldn’t before: sentiment, intent, topics, objects in images, speaker emotion
The tradeoffs
- You need to extract features before doing math
- Annotation can be expensive
- Compute-heavy; pipelines are more complex
Typical playbook
- Text: tokenization, embeddings, topic modeling, classification, summarization
- Images: feature extraction with CNNs or vision transformers, object detection
- Audio: spectrograms, MFCCs, speech-to-text
Favorite tools
- NLP libraries (spaCy), transformer ecosystems (Hugging Face), vision (OpenCV), deep learning (PyTorch, TensorFlow)
Unstructured does not mean useless. It means you have to bring the structure yourself.
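Here is the smallest possible version of bringing the structure yourself: turning freeform text into fixed-length count vectors with nothing but the standard library. A real pipeline would reach for spaCy tokenizers or transformer embeddings instead; this sketch only illustrates the principle:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

docs = [
    "Refund please, the app crashed twice today",
    "Love the app, the new update is great",
]

# Build a shared vocabulary so every document maps to the same columns
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

def vectorize(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [vectorize(d) for d in docs]
# Each vector has one entry per vocabulary word: structure extracted from chaos
```

The chaos goblin just became a table. Everything downstream (classification, clustering, math in general) now works exactly like the structured case.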
Semi-structured: The Middle Child Who Reads The Docs
- JSON, XML, logs, clickstream events
- Not a table, but keys and nesting give it shape
- Lives happily in data lakes, document stores, or gets squashed into tables with ETL
- Great for flexible schemas, evolving products, and streaming systems
Common tools: NoSQL databases, Spark, Kafka, schema-on-read processing
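Schema-on-read in miniature: raw log lines carry no declared schema, but a pattern you choose at read time (not write time) extracts one. The log format below is invented for illustration:

```python
import re

LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<level>INFO|WARN|ERROR) (?P<service>\w+): (?P<msg>.*)"
)

raw_lines = [
    "2024-05-01T10:00:00 INFO auth: user 42 logged in",
    "2024-05-01T10:00:03 ERROR billing: card declined",
]

# Apply the schema as you read: each line becomes a dict with consistent keys
records = [m.groupdict() for line in raw_lines if (m := LOG_PATTERN.match(line))]
```

This is the middle child's superpower: the raw lines stay flexible, and you decide later which fields matter.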
The Spectrum, Not a Binary
Why do people keep misunderstanding this?
- Because language tricks us. Unstructured sounds like trash data. It’s not.
- Structure can be extracted. A PDF invoice becomes a table after OCR and parsing. Now it’s structured.
- Semi-structured logs can be exploded into analytics-ready columns, then modeled.
Imagine this in your everyday life: Your notes app is unstructured when you brain-dump. But the moment you add tags or convert action items into a checklist, you are adding structure. Same for data pipelines.
Quick Comparison Table
| Dimension | Structured | Semi-structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, declared | Flexible keys, nested | None enforced |
| Examples | SQL tables, CSVs | JSON, XML, logs | Text, images, audio, video |
| Storage | Relational DB, warehouses | Document stores, data lakes | Object storage, file systems |
| Querying | SQL, fast joins | Schema-on-read, JSON queries | Search, vector similarity, metadata filters |
| Preprocessing | Cleaning, encoding | Parsing, flattening | Feature extraction, embeddings, OCR |
| Typical models | Linear/logistic, trees, boosting | Trees after flattening, or sequence models | NLP, CV, audio models, multimodal |
| Metrics | AUC, RMSE, MAE | Same as structured after transformation | F1, BLEU, ROUGE, mAP, WER, retrieval metrics |
Workflow Consequences: Choose Your Adventure
Remember from problem framing: you need measurable variables and a path from data to decision. Your path changes by data type.
If your data is structured
- Define target and features clearly
- Validate types, handle missingness, fix outliers
- Split data, train baseline models, iterate
- Document feature lineage in your warehouse
If your data is semi-structured
- Parse and normalize (flatten JSON, standardize timestamps)
- Decide which fields become columns vs arrays
- Store raw and parsed versions
- Proceed like structured
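The parse-and-normalize step above can be sketched with a minimal recursive flattener. This assumes nested dicts only (lists of events would need a separate explode step), and in a real pipeline you would likely reach for pandas.json_normalize instead:

```python
def flatten(doc, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in doc.items():
        col = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{col}."))
        else:
            flat[col] = value
    return flat

# A hypothetical clickstream event
event = {
    "user": {"id": 42, "plan": "pro"},
    "action": "upgrade",
    "ts": "2024-03-01T12:00:00",
}

row = flatten(event)
# {'user.id': 42, 'user.plan': 'pro', 'action': 'upgrade', 'ts': '2024-03-01T12:00:00'}
```

Once the nesting is gone, the record is just another row, and you proceed exactly like the structured case.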
If your data is unstructured
- Decide the task: classification, extraction, generation, retrieval
- Annotate or weak-label if needed
- Extract features (embeddings, image features, transcriptions)
- Option A: Train a task-specific model
- Option B: Use pretrained models and fine-tune or prompt
- Store derived features for reuse (feature store or vector DB)
Here is a tiny pseudocode sketch of a hybrid pipeline:
```python
if data.type == 'structured':
    X = clean_encode(table)                  # validate types, encode categoricals
    model = train_baseline(X, y)
elif data.type == 'semi':
    table = flatten(json_docs)               # parse and normalize first
    X = clean_encode(table)
    model = train_baseline(X, y)
else:  # unstructured
    if modality == 'text':
        embeddings = embed(text_docs)
    elif modality == 'image':
        embeddings = vision_features(images)
    elif modality == 'audio':
        embeddings = audio_features(clips)
    X = concat(embeddings, metadata)         # derived features + tabular context
    model = train_classifier(X, y)
```
Two Real-World Mashups
1) Support tickets: predicting escalation
- Structured: product, customer tier, time to first reply
- Unstructured: message body, attachments, sentiment
- Approach: extract text embeddings, combine with tabular features, train a classifier; use subject line as a strong hint but watch for leakage
- Bonus move: topic modeling to inform staffing and FAQs
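The combine step can be sketched as a single feature-builder that concatenates tabular fields with model-derived text features. All names here are illustrative, and sentiment_score stands in for the output of an upstream NLP model:

```python
def build_features(ticket):
    tabular = [
        ticket["customer_tier"],          # e.g. 0 = free, 1 = pro
        ticket["hours_to_first_reply"],
    ]
    derived = [
        ticket["sentiment_score"],        # produced upstream by an NLP model
        len(ticket["body"].split()),      # crude length feature
    ]
    return tabular + derived              # one row for a downstream classifier

ticket = {
    "customer_tier": 1,
    "hours_to_first_reply": 4.5,
    "sentiment_score": -0.7,
    "body": "This is the third time billing has failed",
}
features = build_features(ticket)  # [1, 4.5, -0.7, 8]
```

The classifier never knows which features came from a table and which came from a goblin; by this point they are all just columns.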
Engaging question: What happens to your hypothesis if sentiment flips from negative to neutral after a first response? You need time-aware features and maybe sequence models.
2) Predictive maintenance: will this machine cry soon
- Structured: sensor readings every minute
- Unstructured: technician notes, machine audio
- Approach: time series features plus audio anomaly detection; cross-check with notes for ground truth
- Lesson: labeling is a budget line item, not a nice-to-have
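The time-series half of that approach can be sketched as a rolling-window anomaly flag over the minute-level readings. The window size and z-score threshold below are illustrative choices, not recommendations:

```python
from statistics import mean, stdev

def rolling_zscore_flags(readings, window=5, threshold=3.0):
    """Flag readings that sit far outside the recent window's behavior."""
    flags = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu, sigma = mean(history), stdev(history)
        z = (readings[i] - mu) / sigma if sigma > 0 else 0.0
        flags.append(abs(z) > threshold)
    return flags

# Hypothetical vibration sensor readings, one per minute
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 1.02, 9.5, 1.0]
flags = rolling_zscore_flags(vibration)
# The spike at 9.5 stands out against the stable window before it
```

Flags like these become candidate labels, which you then cross-check against technician notes before trusting them as ground truth.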
Pitfalls and Power-ups
- Beware silent schema drift: semi-structured fields appearing or disappearing over time
- Unstructured privacy landmines: text often leaks names, addresses, secrets; redact before processing
- Compute budgeting: unstructured feature extraction is the hungry beast; cache and reuse embeddings
- Evaluation alignment: for unstructured tasks, accuracy may mislead; pick task-appropriate metrics
- Governance: store raw, processed, and feature-level lineage; you will thank yourself during audits
The model is only as honest as the features you made and the labels you trusted.
How structure ties back to hypothesis work
When framing hypotheses, you asked "What would I measure if I could?" Structure answers "How will I measure it today." For unstructured data, the measurement step is a model itself. For example, measuring customer sentiment is not a column you were given; it is a feature you extracted with an NLP model that has its own error bars. Acknowledge that uncertainty in your conclusions.
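One concrete way to acknowledge that uncertainty: when "fraction of negative tickets" comes from a classifier, the observed rate mixes true signal with classifier error. The Rogan-Gladen correction backs out the true rate from the classifier's sensitivity and specificity; the numbers below are assumed for illustration, not measured:

```python
def corrected_rate(observed_rate, sensitivity, specificity):
    # observed = true * sensitivity + (1 - true) * (1 - specificity)
    # Solving for the true rate gives the Rogan-Gladen estimator:
    return (observed_rate + specificity - 1) / (sensitivity + specificity - 1)

observed = 0.30  # fraction of tickets the model labels negative
rate = corrected_rate(observed, sensitivity=0.85, specificity=0.90)
# roughly 0.267: false positives inflate the naive 0.30 estimate
```

The point is not this particular formula; it is that a model-made measurement deserves an error bar before it feeds a conclusion.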
Quick Tooling Map
- Structured: SQL, pandas, scikit-learn, dbt, Parquet
- Semi-structured: Spark, Kafka, document stores; UDFs to parse JSON; schema registries
- Unstructured: OCR, NLP libraries, vector databases for retrieval, deep learning frameworks
Use the warehouse when your schema is stable and analytics-heavy. Use the lake when you ingest raw artifacts and worry about structure later.
Summary and Takeaways
- Structured data is the neat table ready for math. Unstructured is the raw world waiting to be distilled. Semi-structured is your flexible friend.
- Structure affects storage, preprocessing, modeling, metrics, and budgets.
- The spectrum matters: you can add structure to unstructured data; you can relax structure when you need flexibility.
- In workflows, unstructured tasks add a feature extraction step and often a labeling step. Plan for them.
- Tie back to your hypothesis: define what you measure, and be explicit when a model creates that measurement.
Final thought: Structure is not the enemy of richness. It is the scaffolding that lets complexity climb without collapsing.