Data Science Foundations and Workflow
Understand the data science landscape, roles, workflows, and tools. Learn problem framing, reproducibility, and ethical principles that guide successful projects from idea to impact.
Content
Roles in a Data Team
Roles in a Data Team: The Heist Crew Edition
“Data science isn’t a solo sport. It’s a very nerdy Avengers film where Excel can kill you faster than Thanos.”
We already defined what data science is and why it matters (see: previous episode, where we convinced you that data is less like oil and more like sourdough starter). Today: who actually does the work? Because “data person” is not a single job—it's a squad.
Think of the data team as a restaurant. If you want a Michelin-star dish, you don’t ask the pastry chef to also fix the walk-in freezer and run marketing. Similarly, a clean, useful, ethical model requires many roles working together—each with a mission, toolkit, and vibe.
The Cast: Who’s in the Room (and Why)
Here’s the TL;DR roster. Bookmark this. Tattoo it. Or at least send it to your PM.
| Role | Core Quest | Common Tools | Primary Deliverables |
|---|---|---|---|
| Data Engineer | Make data flow reliably | SQL, Python/Scala, Spark, Airflow, Kafka, cloud storage | Ingestion pipelines, data lakes/warehouses |
| Analytics Engineer | Make data understandable | dbt, SQL, versioned models, tests | Cleaned tables, metrics layer, documentation |
| Data Analyst | Turn questions into insights | SQL, BI tools (Looker, Power BI), Python/R | Dashboards, analyses, A/B readouts |
| Data Scientist | Build and validate models | Python/R, notebooks, scikit-learn, stats | Experiments, models, insights, lift estimates |
| ML Engineer | Ship and scale ML | Python, Docker, CI/CD, feature stores, model serving | APIs, deployment pipelines, monitoring |
| Data Product Manager | Align work with business value | Roadmaps, user research, OKRs | Problem framing, prioritization, success metrics |
| Data Architect | Design the big picture | Cloud architecture, security, lineage | Schemas, governance standards, cost strategy |
| Data Steward/Governance | Keep data legal and sane | Catalogs, PII policies, quality rules | Data dictionary, access policies, audits |
Also sighted in the wild: BI Developer (focused on dashboards), Statistician (deep inference), Research Scientist (novel methods), Data QA (tests everything), and Subject-Matter Expert (the person who knows what “active user” actually means).
How the Workflow Actually Flows
Remember our Data Science lifecycle? We had questions, data, modeling, deployment, and feedback loops. Here’s who leads when:
- Define the problem (PM + Analyst + SME)
- “Are we trying to reduce churn by 10% or summon the data demons?”
- Acquire and validate data (Data Engineer + Steward)
- Source, permissions, PII handling, quality checks.
- Model the business in the warehouse (Analytics Engineer)
- Transform raw logs into clean, testable tables.
- Explore and hypothesize (Data Scientist + Analyst)
- EDA, feature ideas, back-of-the-envelope math.
- Experiment and model (Data Scientist)
- Baselines, CV, metrics, error analysis.
- Ship it (ML Engineer)
- Containerize, deploy, monitor. No, the notebook is not prod.
- Observe and iterate (Everyone)
- Dashboards, alerts, post-mortems, v2 roadmap.
If it’s not monitored, it’s not deployed. It’s performance art.
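To make the "observe and iterate" step concrete, here is a minimal sketch of one common monitoring check: population stability index (PSI) for feature drift. The equal-width bucketing, the smoothing trick, and the 0.2 threshold mentioned in the docstring are illustrative rules of thumb, not a standard.

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between two numeric samples.

    Buckets are derived from the expected (training) distribution;
    a common rule of thumb flags PSI > 0.2 as significant drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def frac(sample):
        counts = [0] * buckets
        for x in sample:
            idx = sum(x > e for e in edges)  # which bucket x falls into
            counts[idx] += 1
        # Smooth zero counts so the log term stays finite.
        return [max(c, 1) / len(sample) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical weekly session counts: training data vs. live traffic.
train_sessions = [5, 7, 6, 8, 5, 9, 6, 7, 8, 6] * 50
live_sessions  = [2, 3, 2, 4, 3, 2, 5, 3, 4, 2] * 50  # usage dropped

print(round(psi(train_sessions, live_sessions), 3))
```

If this number spikes, the alert goes to the ML Engineer and the Data Scientist before the CFO notices the dashboards.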
A Day in the Life: Churn Prediction, But Make It Real
Imagine a subscription app is bleeding users. Crying. Screaming. CFO hyperventilating into a spreadsheet.
- PM frames the bet: “If we can flag at-risk users 2 weeks early, we can save 8% of churn via targeted offers.” Success metric: retained users at 90 days.
- Analyst defines current churn and segments. Finds retention cliff at day 14. Spicy.
- Data Engineer ingests event logs from mobile/web, ensures user IDs aren’t a game of bingo.
- Steward ensures we don’t email people who opted out. Because fines are spicy too.
- Analytics Engineer builds a dbt model: one row per user per week with clean features (sessions, support tickets, plan type) and tests them.
- Data Scientist prototypes a baseline (logistic regression), then adds features and compares to XGBoost. AUC improves from 0.68 to 0.79. Calibrates probabilities.
- ML Engineer productionizes: feature store for live signals, model service endpoint, shadow deployment, latency <80ms.
- Analyst + PM run an A/B test: targeted retention offer vs. control. Net lift = 6.7%, with a confidence interval comfortably clear of zero.
- Everyone monitors drift and ROI. Finance buys cookies.
Notice the baton passes. Notice how nobody tried to be all the roles at once. That’s the magic.
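The Data Scientist's step in that story, baseline first, then see whether complexity pays, can be sketched with scikit-learn on synthetic data. Everything here is illustrative: the synthetic feature table stands in for the dbt model, and sklearn's gradient boosting stands in for XGBoost; the AUC numbers will not match the 0.68 → 0.79 in the story.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the one-row-per-user-per-week feature table.
X, y = make_classification(n_samples=5000, n_features=8,
                           n_informative=4, weights=[0.85],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

# Step 1: a simple, interpretable baseline.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])

# Step 2: a stronger model has to earn its extra complexity.
boosted = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
auc_boost = roc_auc_score(y_te, boosted.predict_proba(X_te)[:, 1])

print(f"baseline AUC: {auc_base:.3f}  boosted AUC: {auc_boost:.3f}")
```

If the fancy model doesn't clearly beat the baseline, ship the baseline. It is cheaper to explain, debug, and monitor.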
RACI: Who Does What (and Who Approves the Chaos)
A tiny, honest example for a model launch:
```yaml
churn_model_v1:
  requirements_doc:
    responsible: [data_product_manager]
    accountable: [head_of_data]
    consulted: [analyst, sme]
    informed: [legal]
  feature_table:
    responsible: [analytics_engineer]
    accountable: [data_architect]
    consulted: [data_scientist]
    informed: [ml_engineer]
  training_pipeline:
    responsible: [data_scientist]
    accountable: [head_of_ml]
    consulted: [ml_engineer]
    informed: [analyst]
  deployment:
    responsible: [ml_engineer]
    accountable: [head_of_ml]
    consulted: [security, sre]
    informed: [pm, analyst]
  monitoring:
    responsible: [ml_engineer, analyst]
    accountable: [head_of_data]
    consulted: [data_scientist]
    informed: [business_ops]
```
Clarity prevents calendar crimes.
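A nice side effect of writing RACI down as data: you can lint it. Here is a hypothetical check that enforces the RACI convention of exactly one accountable party (and at least one responsible party) per deliverable; the role names mirror the example above, but the matrix is abbreviated.

```python
raci = {
    "requirements_doc": {
        "responsible": ["data_product_manager"],
        "accountable": ["head_of_data"],
    },
    "deployment": {
        "responsible": ["ml_engineer"],
        "accountable": ["head_of_ml"],
    },
    # remaining deliverables omitted for brevity
}

def lint_raci(matrix):
    """Return a list of problems: RACI convention requires exactly one
    accountable party, and at least one responsible party, per item."""
    problems = []
    for item, roles in matrix.items():
        if len(roles.get("accountable", [])) != 1:
            problems.append(f"{item}: needs exactly one accountable role")
        if not roles.get("responsible"):
            problems.append(f"{item}: needs at least one responsible role")
    return problems

print(lint_raci(raci))  # → [] when the matrix is well-formed
```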
Anti-Patterns (a.k.a. How to Accidentally Sabotage Your Team)
- The Lone Wolf Data Scientist: Brilliant model, nowhere to run it. Dies in a notebook.
- ETL by Vibes: Unversioned SQL scattered across 14 dashboards. Metrics never match. Trust evaporates.
- Dashboard Theater: Beautiful charts, zero decisions. KPI karaoke.
- PM-less Chaos: Everyone sprinting, no North Star, budget becomes interpretive dance.
- Governance Last: “We’ll fix PII later.” Later is subpoenas.
Collaboration Blueprints That Actually Work
- Shared glossary + metrics layer: One definition of “active user,” not five.
- Git all the things: dbt, notebooks (nbdev/Jupytext), ML code, infra as code. PR reviews = shared brain.
- Contracts between layers: Schemas and SLAs for data products. Break it, you bought it.
- Experiment registry: Track hypotheses, datasets, metrics, and results. Repeatability = credibility.
- Observability: Data tests (dbt tests, Great Expectations), model monitoring (drift, latency, fairness), and incident playbooks.
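Stripped of any framework, the "data tests" idea above is just assertions over every batch before it reaches a model. Tools like dbt tests and Great Expectations package checks of roughly this shape (the column names and the sessions range here are made up for illustration):

```python
def validate_user_weeks(rows):
    """Lightweight data-quality checks in the spirit of dbt tests /
    Great Expectations: not-null, uniqueness, and range sanity."""
    errors = []
    seen_keys = set()
    for i, row in enumerate(rows):
        key = (row.get("user_id"), row.get("week"))
        if None in key:
            errors.append(f"row {i}: missing user_id or week (not-null test)")
        if key in seen_keys:
            errors.append(f"row {i}: duplicate {key} (uniqueness test)")
        seen_keys.add(key)
        if not 0 <= row.get("sessions", -1) <= 10_000:
            errors.append(f"row {i}: sessions out of range (sanity test)")
    return errors

good = [{"user_id": 1, "week": "2024-01-01", "sessions": 12}]
bad  = good + [{"user_id": 1, "week": "2024-01-01", "sessions": -3}]
print(validate_user_weeks(good))  # → []
print(validate_user_weeks(bad))   # duplicate key + out-of-range sessions
```

The frameworks add scheduling, reporting, and documentation on top, but the contract is the same: a failing test blocks the batch.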
What Tools Go Where? (Because Tool Sprawl Is Real)
- Ingestion + Orchestration: Airflow, Dagster, Prefect
- Storage + Compute: Snowflake/BigQuery/Redshift, Spark/Databricks
- Transform + Metrics: dbt, semantic layers (LookML, MetricFlow)
- Analysis + Viz: Jupyter, RStudio, Looker, Power BI, Mode
- Modeling: scikit-learn, XGBoost, PyTorch, TensorFlow
- MLOps: MLflow, Weights & Biases, Feature Stores, BentoML, SageMaker
- Governance: Data catalogs (DataHub, Amundsen), access control, PII scanners
The rule of thumb: choose boring, reliable tools until your scale genuinely demands something fancier.
Career Ladders and Crossovers (Yes, You Can Switch Lanes)
- Analyst → Analytics Engineer: If you love SQL craftsmanship and reproducibility.
- Analyst → Data Scientist: If you’re into inference, modeling, experiments.
- Data Scientist → ML Engineer: If you enjoy shipping and infra.
- Data Engineer → Architect: If big-picture design and cost/perf trade-offs thrill you.
- Any → PM: If you’re allergic to ambiguity and love herding cats with Gantt charts.
Your superpower is not the tool; it’s the taste for trade-offs.
Quick Self-Check: Who Do You Call When…
- The nightly pipeline failed and dashboards are blank? Data Engineer.
- Two teams argue about “conversion rate” definitions? Analytics Engineer + Steward.
- You need to choose between A/B test vs. quasi-experiment? Data Scientist + Analyst.
- The model is great offline but flops in prod? ML Engineer (then Data Scientist).
- The CEO wants a roadmap that saves money and makes money? Data PM + Architect.
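When the answer is "run an A/B test," the readout itself is arithmetic you can do with the standard library. A sketch of a two-proportion z-test, the conversion counts below are invented, and a real readout would also report a confidence interval and pre-registered power:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, z, p_value

# Hypothetical 90-day retention: control 1200/10000, treatment 1330/10000.
lift, z, p = two_proportion_ztest(1200, 10_000, 1330, 10_000)
print(f"absolute lift: {lift:.3%}, z = {z:.2f}, p = {p:.4f}")
```

The Analyst owns the metric definition, the Data Scientist owns the test design; neither should be improvising the stats at readout time.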
Mini Script: From Question to Production
```
Business Question → Analyst + PM        → clarified KPI and users
Raw Data          → Data Engineer       → ingested with quality checks
Clean Views       → Analytics Engineer  → modeled + tested tables
Model             → Data Scientist      → trained, validated, documented
Service           → ML Engineer         → deployed + monitored
Governance        → Steward/Architect   → compliant, cost-aware
Iteration         → Everyone            → lessons → v2 hypotheses
```
That linear list hides loops. In reality, it’s a spiral staircase: every turn gives you better views and a slight fear of heights.
Closing: The One-Slide Summary
- Roles exist to reduce cognitive overload and increase reliability.
- Great teams design handoffs, definitions, and monitoring before the first model.
- Value = problem clarity × data quality × deployment discipline. Zero out any one, and the product is zero.
Put humans in the loop, put tests on the data, and put humility in the roadmap.
Next up, we’ll dive into the data lifecycle mechanics you can practice—so you’re not just admiring the team, you’re playing your position like an all-star.