Data Sources, Engineering, and Deployment
Acquire data from files, web, and databases; then test, package, version, and deploy reliable services.
SQL Fundamentals — The Query Fuel for Data Science Pipelines
You already know how to fetch data from REST APIs and keep secrets safe with tokens. Now let’s learn how to store, query, and shape that data in a database so your models actually have something reliable to chew on.
Why SQL matters in the pipeline
If REST APIs are the faucets that pour raw data into your system, SQL is the plumbing that routes, filters, aggregates, and stores it in the shape your models and applications expect. After training models in the Deep Learning Foundations module, you learned to feed tensors into networks — now learn how to serve, slice, and maintain the tabular data that often becomes those tensors.
SQL is everywhere in production: analytics, feature stores, logging, experiment tracking, and even small projects using SQLite or managed cloud databases.
What this section covers (quick roadmap)
- Core SQL operations (CRUD) with concrete examples
- Joins, GROUP BY and window functions for real-world feature engineering
- Indexes and transactions — why DBs don’t collapse under load
- Safe querying from Python (parametrized queries, Pandas integration) — connecting the dots with previous REST & Auth topics
Basic building blocks: CRUD in 60 seconds
- CREATE — insert new rows (the "C" in CRUD is INSERT; defining tables uses the separate CREATE TABLE DDL statement)
- READ — retrieve rows with SELECT (the part people usually mean by "the query language")
- UPDATE — modify existing rows
- DELETE — remove rows
Example schema and insert:
CREATE TABLE users (
id INTEGER PRIMARY KEY,
email TEXT UNIQUE,
signup_ts TIMESTAMP,
is_active BOOLEAN
);
INSERT INTO users (email, signup_ts, is_active)
VALUES ('alice@example.com', '2024-01-10', 1);
Micro explanation: the primary key uniquely identifies each row, UNIQUE prevents duplicate emails, and timestamps let you compute recency features for models.
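The same schema can be exercised from Python's built-in sqlite3 module. A minimal sketch using a throwaway in-memory database (the table and values follow the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory DB, nothing touches disk
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT UNIQUE,
        signup_ts TIMESTAMP,
        is_active BOOLEAN
    )
""")
conn.execute(
    "INSERT INTO users (email, signup_ts, is_active) VALUES (?, ?, ?)",
    ("alice@example.com", "2024-01-10", 1),
)

# The UNIQUE constraint rejects a second row with the same email
try:
    conn.execute(
        "INSERT INTO users (email, signup_ts, is_active) VALUES (?, ?, ?)",
        ("alice@example.com", "2024-02-01", 1),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The duplicate insert raises `sqlite3.IntegrityError`, so the table still holds exactly one row — the constraint does the deduplication for you.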
SELECT, WHERE, LIMIT — the query trifecta
SELECT id, email
FROM users
WHERE is_active = 1
ORDER BY signup_ts DESC
LIMIT 100;
Why it matters: this is how you get the recent active users to compute features or send to a training job you set up in the Deep Learning Foundations section.
JOINs — putting related tables together (the social mixer)
Analogy: Tables are guests at a party. JOINs are the conversations.
- INNER JOIN: only people who know each other
- LEFT JOIN: everyone from the left table, with data from the right if it exists
- RIGHT JOIN / FULL OUTER JOIN: mirror or combine both sides; less common, and support varies by DB
Example: combine users and purchases to make a feature table
SELECT u.id, u.email, COUNT(p.id) AS purchase_count, MAX(p.amount) AS max_purchase
FROM users u
LEFT JOIN purchases p ON u.id = p.user_id
WHERE u.signup_ts >= '2024-01-01'
GROUP BY u.id, u.email;
Tip: use LEFT JOIN when you want users even if they have zero purchases (important for negative examples in model training).
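To see why the LEFT JOIN matters, here is a runnable sketch with invented sample data: alice has two purchases, bob has none, and bob still appears in the feature table with a count of zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, signup_ts TIMESTAMP);
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'alice@example.com', '2024-01-10');
    INSERT INTO users VALUES (2, 'bob@example.com',   '2024-02-05');
    INSERT INTO purchases VALUES (1, 1, 19.99);
    INSERT INTO purchases VALUES (2, 1, 5.00);
""")

# LEFT JOIN keeps bob (no purchases); COUNT(p.id) ignores the NULLs, giving 0
rows = conn.execute("""
    SELECT u.id, COUNT(p.id) AS purchase_count, MAX(p.amount) AS max_purchase
    FROM users u
    LEFT JOIN purchases p ON u.id = p.user_id
    WHERE u.signup_ts >= '2024-01-01'
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()
# rows -> [(1, 2, 19.99), (2, 0, None)]
```

With an INNER JOIN, bob would vanish from the result entirely — and with him, your negative examples.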
Aggregation and GROUP BY — engineer features at scale
Common aggregations: COUNT, SUM, AVG, MAX, MIN
Use GROUP BY to compute per-entity features (per user, per device, per cohort). This is the mini ETL that often feeds ML experiments.
Micro example: churn features
SELECT user_id,
COUNT(*) FILTER (WHERE event_type = 'login') AS logins_last_30d,
SUM(CASE WHEN purchase_date >= DATE('now', '-30 days') THEN 1 ELSE 0 END) AS purchases_30d
FROM events
GROUP BY user_id;
Note: syntax like FILTER and DATE functions vary across DBs (Postgres, MySQL, SQLite) — small differences, big consequences.
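One portable alternative is SUM over CASE WHEN, which works on every major engine (FILTER requires Postgres or SQLite 3.30+). A sketch with an invented events table and a fixed cutoff date standing in for "30 days ago":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_type TEXT, event_ts TEXT);
    INSERT INTO events VALUES (1, 'login',    '2024-06-10');
    INSERT INTO events VALUES (1, 'login',    '2024-06-20');
    INSERT INTO events VALUES (1, 'purchase', '2024-06-15');
    INSERT INTO events VALUES (2, 'purchase', '2024-04-01');  -- outside the window
""")

# CASE WHEN counts matching rows without engine-specific FILTER syntax
features = conn.execute("""
    SELECT user_id,
           SUM(CASE WHEN event_type = 'login'    AND event_ts >= '2024-06-01' THEN 1 ELSE 0 END) AS logins_30d,
           SUM(CASE WHEN event_type = 'purchase' AND event_ts >= '2024-06-01' THEN 1 ELSE 0 END) AS purchases_30d
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
# features -> [(1, 2, 1), (2, 0, 0)]
```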
Window functions — the elegant time-travel tool
Window functions let you compute features without collapsing rows into groups (useful for per-row context):
SELECT
user_id,
event_ts,
event_type,
COUNT(*) OVER (PARTITION BY user_id ORDER BY event_ts RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW) AS events_last_7d
FROM events;
Use-cases: rolling averages, ranks, time-since-last-event — essential for sequence modeling and time-series features before you feed data to a neural network.
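The interval-based RANGE frame above is Postgres syntax; SQLite (3.25+) supports window functions but not interval offsets, so a row-based frame is the closest portable sketch. Here each row gets a running per-user event count while keeping its own context:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_ts TEXT);
    INSERT INTO events VALUES (1, '2024-06-01');
    INSERT INTO events VALUES (1, '2024-06-03');
    INSERT INTO events VALUES (1, '2024-06-10');
    INSERT INTO events VALUES (2, '2024-06-02');
""")

# Running count per user: note the rows are NOT collapsed, unlike GROUP BY
rows = conn.execute("""
    SELECT user_id, event_ts,
           COUNT(*) OVER (
               PARTITION BY user_id
               ORDER BY event_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS events_so_far
    FROM events
    ORDER BY user_id, event_ts
""").fetchall()
# rows -> [(1, '2024-06-01', 1), (1, '2024-06-03', 2),
#          (1, '2024-06-10', 3), (2, '2024-06-02', 1)]
```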
Indexing & Transactions: performance and safety
- Indexes: like an index in a book — speed up reads, slow down writes and use space. Index columns you filter or join on frequently (e.g., user_id, email).
- Transactions: group multiple statements into an atomic unit — either all succeed or none do. Use transactions for multi-step writes (e.g., updating user balance and writing a ledger entry).
Quick rule: index selectively. Measure with EXPLAIN. Blindly indexing everything is like giving everyone a jetpack — chaotic and expensive.
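The balance-plus-ledger example can be sketched with sqlite3, whose connection context manager commits on success and rolls back on error. Here a simulated crash between the two writes leaves the database untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE ledger (account_id INTEGER, delta REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
conn.commit()

# Multi-step write: debit the balance AND record a ledger entry, atomically
try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("INSERT INTO ledger VALUES (1, -30)")
        raise RuntimeError("simulated crash between the two writes")
except RuntimeError:
    pass

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
ledger_rows = conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]
# balance stays 100.0 and the ledger stays empty: all-or-nothing
```

Without the transaction, the crash would leave a debited balance with no matching ledger entry — exactly the inconsistency transactions exist to prevent.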
Querying from Python (securely) — connect the dots with Auth & REST
You’ve used tokens to authenticate to APIs. Similarly, databases have credentials. Don’t hardcode them — use secret managers or environment variables.
Small example with sqlite3 + pandas (local, no credentials needed):
import sqlite3
import pandas as pd
conn = sqlite3.connect('data.db')
q = "SELECT id, email FROM users WHERE is_active = 1 LIMIT ?"
# parameterized to avoid SQL injection
df = pd.read_sql_query(q, conn, params=(100,))
print(df.head())
conn.close()
If you connect to a cloud DB, combine this with your Auth & Tokens lessons: request temporary credentials from your token service, then create the DB session.
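A minimal sketch of the "don't hardcode credentials" rule: read the connection target from an environment variable. `DB_PATH` is a hypothetical name here; for a real cloud DB it would hold a DSN or be replaced by a secret-manager lookup.

```python
import os
import sqlite3

# Hypothetical env var; in production, set by your deploy tooling or a
# secret manager, never committed to source control
os.environ.setdefault("DB_PATH", ":memory:")  # local fallback for this demo

db_path = os.environ["DB_PATH"]
conn = sqlite3.connect(db_path)
ok = conn.execute("SELECT 1").fetchone()[0]
conn.close()
```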
Quick comparison table
| Operation | SQL Example | When to use |
|---|---|---|
| Read subsets | SELECT ... WHERE | Exploratory analysis, features |
| Aggregate | GROUP BY, SUM/COUNT | Feature engineering, dashboards |
| Combine tables | JOINs | Enrich data for training |
| Time/window | WINDOW functions | Time-series or sequence features |
| Safety | Transactions | Consistent writes, experiment metadata |
| Speed | Indexes | High-read systems / online features |
Common pitfalls (and how to avoid them)
- SELECT * in production — avoid; explicit columns are safer and faster.
- Doing heavy joins per web-request — precompute (materialized views or feature tables) for low latency.
- Ignoring NULLs — handle nulls explicitly when creating features.
- String concatenation for queries — always parameterize to prevent injection.
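The injection pitfall is easy to demonstrate with a throwaway table: the same malicious string matches nothing when passed as a parameter, but rewrites the WHERE clause when concatenated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES (1, 'alice@example.com');
""")

malicious = "' OR '1'='1"

# Parameterized: the input is treated as data, not SQL, so nothing matches
safe_rows = conn.execute(
    "SELECT id FROM users WHERE email = ?", (malicious,)
).fetchall()

# Concatenated: the input becomes part of the SQL and matches every row
unsafe_rows = conn.execute(
    "SELECT id FROM users WHERE email = '" + malicious + "'"
).fetchall()
# safe_rows -> [], unsafe_rows -> [(1,)]
```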
Closing: key takeaways
- SQL is the glue between raw data (from APIs) and training-ready datasets (for models from Deep Learning Foundations). Use it to filter, aggregate, and create features.
- JOINs and GROUP BY are your most-used tools for feature building; learn window functions next for advanced temporal features.
- Index wisely and use transactions to ensure performance and safety in production.
- From Python, query securely with parameterized statements and load results into Pandas for the rest of your ML pipeline.
"Think of SQL as a high-powered kitchen: if your ingredients (data) are messy, even the best chef (model) can’t make a Michelin dish. SQL does the mise en place."
Summary: Master the core SQL constructs (SELECT, JOIN, GROUP BY, window functions), understand indexes and transactions, and practice secure, efficient queries from Python. This will let you move cleanly from fetching data with APIs to training and deploying models that actually work well in the wild.
Tags: beginner, data-engineering, SQL, python