Data Science Foundations and Workflow
Understand the data science landscape, roles, workflows, and tools. Learn problem framing, reproducibility, and ethical principles that guide successful projects from idea to impact.
Content
Roles in a Data Team
Roles in a Data Team: The Heist Crew Edition
“Data science isn’t a solo sport. It’s a very nerdy Avengers film where Excel can kill you faster than Thanos.”
We already defined what data science is and why it matters (see: previous episode, where we convinced you that data is less like oil and more like sourdough starter). Today: who actually does the work? Because “data person” is not a single job—it's a squad.
Think of the data team as a restaurant. If you want a Michelin-star dish, you don’t ask the pastry chef to also fix the walk-in freezer and run marketing. Similarly, a clean, useful, ethical model requires many roles working together—each with a mission, toolkit, and vibe.
The Cast: Who’s in the Room (and Why)
Here’s the TL;DR roster. Bookmark this. Tattoo it. Or at least send it to your PM.
| Role | Core Quest | Common Tools | Primary Deliverables |
|---|---|---|---|
| Data Engineer | Make data flow reliably | SQL, Python/Scala, Spark, Airflow, Kafka, cloud storage | Ingestion pipelines, data lakes/warehouses |
| Analytics Engineer | Make data understandable | dbt, SQL, versioned models, tests | Cleaned tables, metrics layer, documentation |
| Data Analyst | Turn questions into insights | SQL, BI tools (Looker, Power BI), Python/R | Dashboards, analyses, A/B readouts |
| Data Scientist | Build and validate models | Python/R, notebooks, scikit-learn, stats | Experiments, models, insights, lift estimates |
| ML Engineer | Ship and scale ML | Python, Docker, CI/CD, feature stores, model serving | APIs, deployment pipelines, monitoring |
| Data Product Manager | Align work with business value | Roadmaps, user research, OKRs | Problem framing, prioritization, success metrics |
| Data Architect | Design the big picture | Cloud architecture, security, lineage | Schemas, governance standards, cost strategy |
| Data Steward/Governance | Keep data legal and sane | Catalogs, PII policies, quality rules | Data dictionary, access policies, audits |
Also sighted in the wild: BI Developer (focused on dashboards), Statistician (deep inference), Research Scientist (novel methods), Data QA (tests everything), and Subject-Matter Expert (the person who knows what “active user” actually means).
How the Workflow Actually Flows
Remember our Data Science lifecycle? We had questions, data, modeling, deployment, and feedback loops. Here’s who leads when:
- Define the problem (PM + Analyst + SME)
- “Are we trying to reduce churn by 10% or summon the data demons?”
- Acquire and validate data (Data Engineer + Steward)
- Source, permissions, PII handling, quality checks.
- Model the business in the warehouse (Analytics Engineer)
- Transform raw logs into clean, testable tables.
- Explore and hypothesize (Data Scientist + Analyst)
- EDA, feature ideas, back-of-the-envelope math.
- Experiment and model (Data Scientist)
- Baselines, CV, metrics, error analysis.
- Ship it (ML Engineer)
- Containerize, deploy, monitor. No, the notebook is not prod.
- Observe and iterate (Everyone)
- Dashboards, alerts, post-mortems, v2 roadmap.
If it’s not monitored, it’s not deployed. It’s performance art.
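To make the "observe and iterate" step concrete, here is a minimal sketch of one common monitoring check: population stability index (PSI) for feature drift. The equal-width bucketing, the smoothing trick, and the 0.2 threshold mentioned in the docstring are illustrative rules of thumb, not a standard.

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between two numeric samples.

    Buckets are derived from the expected (training) distribution;
    a common rule of thumb flags PSI > 0.2 as significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def frac(sample):
        counts = [0] * buckets
        for x in sample:
            idx = sum(x > e for e in edges)  # which bucket x falls into
            counts[idx] += 1
        # Smooth zero counts so the log term stays finite.
        return [max(c, 1) / len(sample) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical weekly session counts: training data vs. live traffic.
train_sessions = [5, 7, 6, 8, 5, 9, 6, 7, 8, 6] * 50
live_sessions  = [2, 3, 2, 4, 3, 2, 5, 3, 4, 2] * 50  # usage dropped

print(round(psi(train_sessions, live_sessions), 3))
```

If this number spikes, the alert goes to the ML Engineer and the Data Scientist before the CFO notices the dashboards.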
A Day in the Life: Churn Prediction, But Make It Real
Imagine a subscription app is bleeding users. Crying. Screaming. CFO hyperventilating into a spreadsheet.
- PM frames the bet: “If we can flag at-risk users 2 weeks early, we can save 8% of churn via targeted offers.” Success metric: retained users at 90 days.
- Analyst defines current churn and segments. Finds retention cliff at day 14. Spicy.
- Data Engineer ingests event logs from mobile/web, ensures user IDs aren’t a game of bingo.
- Steward ensures we don’t email people who opted out. Because fines are spicy too.
- Analytics Engineer builds a dbt model: one row per user per week with clean features (sessions, support tickets, plan type) and tests them.
- Data Scientist prototypes a baseline (logistic regression), then adds features and compares to XGBoost. AUC improves from 0.68 to 0.79. Calibrates probabilities.
- ML Engineer productionizes: feature store for live signals, model service endpoint, shadow deployment, latency <80ms.
- Analyst + PM run an A/B test: targeted retention offer vs. control. Net lift = 6.7%, with a confidence interval comfortably clear of zero.
- Everyone monitors drift and ROI. Finance buys cookies.
Notice the baton passes. Notice how nobody tried to be all the roles at once. That’s the magic.
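The Data Scientist's step in that story, baseline first, then see whether complexity pays, can be sketched with scikit-learn on synthetic data. Everything here is illustrative: the synthetic feature table stands in for the dbt model, and sklearn's gradient boosting stands in for XGBoost; the AUC numbers will not match the 0.68 → 0.79 in the story.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-row-per-user-per-week feature table.
X, y = make_classification(n_samples=5000, n_features=8,
                           n_informative=4, weights=[0.85],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

# Step 1: a simple, interpretable baseline.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])

# Step 2: a stronger model has to earn its extra complexity.
boosted = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
auc_boost = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])

print(f"baseline AUC: {auc_base:.3f}  boosted AUC: {auc_boost:.3f}")
```

If the fancy model doesn't clearly beat the baseline, ship the baseline. It is cheaper to explain, debug, and monitor.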
RACI: Who Does What (and Who Approves the Chaos)
A tiny, honest example for a model launch:
```yaml
churn_model_v1:
  requirements_doc:
    responsible: [data_product_manager]
    accountable: [head_of_data]
    consulted: [analyst, sme]
    informed: [legal]
  feature_table:
    responsible: [analytics_engineer]
    accountable: [data_architect]
    consulted: [data_scientist]
    informed: [ml_engineer]
  training_pipeline:
    responsible: [data_scientist]
    accountable: [head_of_ml]
    consulted: [ml_engineer]
    informed: [analyst]
  deployment:
    responsible: [ml_engineer]
    accountable: [head_of_ml]
    consulted: [security, sre]
    informed: [pm, analyst]
  monitoring:
    responsible: [ml_engineer, analyst]
    accountable: [head_of_data]
    consulted: [data_scientist]
    informed: [business_ops]
```
Clarity prevents calendar crimes.
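A nice side effect of writing RACI down as data: you can lint it. Here is a hypothetical check that enforces the RACI convention of exactly one accountable party (and at least one responsible party) per deliverable; the role names mirror the example above, but the matrix is abbreviated.

```python
raci = {
    "requirements_doc": {
        "responsible": ["data_product_manager"],
        "accountable": ["head_of_data"],
    },
    "deployment": {
        "responsible": ["ml_engineer"],
        "accountable": ["head_of_ml"],
    },
    # remaining deliverables omitted for brevity
}

def lint_raci(matrix):
    """Return a list of problems: RACI convention requires exactly one
    accountable party, and at least one responsible party, per item."""
    problems = []
    for item, roles in matrix.items():
        if len(roles.get("accountable", [])) != 1:
            problems.append(f"{item}: needs exactly one accountable role")
        if not roles.get("responsible"):
            problems.append(f"{item}: needs at least one responsible role")
    return problems

print(lint_raci(raci))  # → [] when the matrix is well-formed
```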
Anti-Patterns (a.k.a. How to Accidentally Sabotage Your Team)
- The Lone Wolf Data Scientist: Brilliant model, nowhere to run it. Dies in a notebook.
- ETL by Vibes: Unversioned SQL scattered across 14 dashboards. Metrics never match. Trust evaporates.
- Dashboard Theater: Beautiful charts, zero decisions. KPI karaoke.
- PM-less Chaos: Everyone sprinting, no North Star, budget becomes interpretive dance.
- Governance Last: “We’ll fix PII later.” Later is subpoenas.
Collaboration Blueprints That Actually Work
- Shared glossary + metrics layer: One definition of “active user,” not five.
- Git all the things: dbt, notebooks (nbdev/Jupytext), ML code, infra as code. PR reviews = shared brain.
- Contracts between layers: Schemas and SLAs for data products. Break it, you bought it.
- Experiment registry: Track hypotheses, datasets, metrics, and results. Repeatability = credibility.
- Observability: Data tests (dbt tests, Great Expectations), model monitoring (drift, latency, fairness), and incident playbooks.
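Stripped of any framework, the "data tests" idea above is just assertions over every batch before it reaches a model. Tools like dbt tests and Great Expectations package checks of roughly this shape (the column names and the sessions range here are made up for illustration):

```python
def validate_user_weeks(rows):
    """Lightweight data-quality checks in the spirit of dbt tests /
    Great Expectations: not-null, uniqueness, and range sanity."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        key = (row.get("user_id"), row.get("week"))
        if None in key:
            errors.append(f"row {i}: missing user_id or week (not-null test)")
        if key in seen_keys:
            errors.append(f"row {i}: duplicate {key} (uniqueness test)")
        seen_keys.add(key)
        if not 0 <= row.get("sessions", -1) <= 10_000:
            errors.append(f"row {i}: sessions out of range (sanity test)")
    return errors

good = [{"user_id": 1, "week": "2024-01-01", "sessions": 12}]
bad  = good + [{"user_id": 1, "week": "2024-01-01", "sessions": -3}]
print(validate_user_weeks(good))  # → []
print(validate_user_weeks(bad))   # duplicate key + out-of-range sessions
```

The frameworks add scheduling, reporting, and documentation on top, but the contract is the same: a failing test blocks the batch.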
What Tools Go Where? (Because Tool Sprawl Is Real)
- Ingestion + Orchestration: Airflow, Dagster, Prefect
- Storage + Compute: Snowflake/BigQuery/Redshift, Spark/Databricks
- Transform + Metrics: dbt, semantic layers (LookML, MetricFlow)
- Analysis + Viz: Jupyter, RStudio, Looker, Power BI, Mode
- Modeling: scikit-learn, XGBoost, PyTorch, TensorFlow
- MLOps: MLflow, Weights & Biases, Feature Stores, BentoML, SageMaker
- Governance: Data catalogs (DataHub, Amundsen), access control, PII scanners
The rule of thumb: choose boring, reliable tools until your scale genuinely demands something fancier.
Career Ladders and Crossovers (Yes, You Can Switch Lanes)
- Analyst → Analytics Engineer: If you love SQL craftsmanship and reproducibility.
- Analyst → Data Scientist: If you’re into inference, modeling, experiments.
- Data Scientist → ML Engineer: If you enjoy shipping and infra.
- Data Engineer → Architect: If big-picture design and cost/perf trade-offs thrill you.
- Any → PM: If you’re allergic to ambiguity and love herding cats with Gantt charts.
Your superpower is not the tool; it’s the taste for trade-offs.
Quick Self-Check: Who Do You Call When…
- The nightly pipeline failed and dashboards are blank? Data Engineer.
- Two teams argue about “conversion rate” definitions? Analytics Engineer + Steward.
- You need to choose between A/B test vs. quasi-experiment? Data Scientist + Analyst.
- The model is great offline but flops in prod? ML Engineer (then Data Scientist).
- The CEO wants a roadmap that saves money and makes money? Data PM + Architect.
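When the answer is "run an A/B test," the readout itself is arithmetic you can do with the standard library. A sketch of a two-proportion z-test, the conversion counts below are invented, and a real readout would also report a confidence interval and pre-registered power:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, z, p_value

# Hypothetical 90-day retention: control 1200/10000, treatment 1330/10000.
lift, z, p = two_proportion_ztest(1200, 10_000, 1330, 10_000)
print(f"absolute lift: {lift:.3%}, z = {z:.2f}, p = {p:.4f}")
```

The Analyst owns the metric definition, the Data Scientist owns the test design; neither should be improvising the stats at readout time.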
Mini Script: From Question to Production
```
Business Question → Analyst + PM        → clarified KPI and users
Raw Data          → Data Engineer       → ingested with quality checks
Clean Views       → Analytics Engineer  → modeled + tested tables
Model             → Data Scientist      → trained, validated, documented
Service           → ML Engineer         → deployed + monitored
Governance        → Steward/Architect   → compliant, cost-aware
Iteration         → Everyone            → lessons → v2 hypotheses
```
That linear list hides loops. In reality, it’s a spiral staircase: every turn gives you better views and a slight fear of heights.
Closing: The One-Slide Summary
- Roles exist to reduce cognitive overload and increase reliability.
- Great teams design handoffs, definitions, and monitoring before the first model.
- Value = problem clarity × data quality × deployment discipline. Zero out any one, and the product is zero.
Put humans in the loop, put tests on the data, and put humility in the roadmap.
Next up, we’ll dive into the data lifecycle mechanics you can practice—so you’re not just admiring the team, you’re playing your position like an all-star.