
Data Science: Beginner to Advanced
Chapters

1. Data Science Foundations and Workflow
  • What is Data Science
  • Roles in a Data Team
  • Data Science Lifecycle
  • CRISP-DM and OSEMN
  • Problem Framing and Hypotheses
  • Data Types and Formats
  • Structured vs Unstructured Data
  • Reproducibility and Version Control Basics
  • Notebooks vs Scripts
  • Environments and Package Management
  • Data Ethics and Bias Overview
  • Experiment Tracking Concepts
  • Documentation and Reporting Basics
  • Project Scoping and KPIs
  • Essential Tools Overview
2. Python Programming Essentials for Data Science
3. Working with Data Sources and SQL
4. Data Wrangling with NumPy and Pandas
5. Data Cleaning and Preprocessing
6. Exploratory Data Analysis and Visualization
7. Probability and Statistics for Data Science
8. Machine Learning Foundations
9. Supervised Learning Algorithms
10. Unsupervised Learning and Dimensionality Reduction
11. Model Evaluation, Validation, and Tuning
12. Feature Engineering and ML Pipelines
13. Time Series Analysis and Forecasting
14. Natural Language Processing
15. Deep Learning, Deployment, and MLOps


Data Science Foundations and Workflow


Understand the data science landscape, roles, workflows, and tools. Learn problem framing, reproducibility, and ethical principles that guide successful projects from idea to impact.


Roles in a Data Team: The Heist Crew Edition

“Data science isn’t a solo sport. It’s a very nerdy Avengers film where Excel can kill you faster than Thanos.”

We already defined what data science is and why it matters (see: previous episode, where we convinced you that data is less like oil and more like sourdough starter). Today: who actually does the work? Because “data person” is not a single job—it's a squad.

Think of the data team as a restaurant. If you want a Michelin-star dish, you don’t ask the pastry chef to also fix the walk-in freezer and run marketing. Similarly, a clean, useful, ethical model requires many roles working together—each with a mission, toolkit, and vibe.


The Cast: Who’s in the Room (and Why)

Here’s the TL;DR roster. Bookmark this. Tattoo it. Or at least send it to your PM.

  • Data Engineer. Core quest: make data flow reliably. Tools: SQL, Python/Scala, Spark, Airflow, Kafka, cloud storage. Delivers: ingestion pipelines, data lakes/warehouses.
  • Analytics Engineer. Core quest: make data understandable. Tools: dbt, SQL, versioned models, tests. Delivers: cleaned tables, metrics layer, documentation.
  • Data Analyst. Core quest: turn questions into insights. Tools: SQL, BI tools (Looker, Power BI), Python/R. Delivers: dashboards, analyses, A/B readouts.
  • Data Scientist. Core quest: build and validate models. Tools: Python/R, notebooks, scikit-learn, stats. Delivers: experiments, models, insights, lift estimates.
  • ML Engineer. Core quest: ship and scale ML. Tools: Python, Docker, CI/CD, feature stores, model serving. Delivers: APIs, deployment pipelines, monitoring.
  • Data Product Manager. Core quest: align work with business value. Tools: roadmaps, user research, OKRs. Delivers: problem framing, prioritization, success metrics.
  • Data Architect. Core quest: design the big picture. Tools: cloud architecture, security, lineage. Delivers: schemas, governance standards, cost strategy.
  • Data Steward/Governance. Core quest: keep data legal and sane. Tools: catalogs, PII policies, quality rules. Delivers: data dictionary, access policies, audits.

Also sighted in the wild: BI Developer (focused on dashboards), Statistician (deep inference), Research Scientist (novel methods), Data QA (tests everything), and Subject-Matter Expert (the person who knows what “active user” actually means).


How the Workflow Actually Flows

Remember our Data Science lifecycle? We had questions, data, modeling, deployment, and feedback loops. Here’s who leads when:

  1. Define the problem (PM + Analyst + SME)
    • “Are we trying to reduce churn by 10% or summon the data demons?”
  2. Acquire and validate data (Data Engineer + Steward)
    • Source, permissions, PII handling, quality checks.
  3. Model the business in the warehouse (Analytics Engineer)
    • Transform raw logs into clean, testable tables.
  4. Explore and hypothesize (Data Scientist + Analyst)
    • EDA, feature ideas, back-of-the-envelope math.
  5. Experiment and model (Data Scientist)
    • Baselines, CV, metrics, error analysis.
  6. Ship it (ML Engineer)
    • Containerize, deploy, monitor. No, the notebook is not prod.
  7. Observe and iterate (Everyone)
    • Dashboards, alerts, post-mortems, v2 roadmap.

If it’s not monitored, it’s not deployed. It’s performance art.


A Day in the Life: Churn Prediction, But Make It Real

Imagine a subscription app is bleeding users. Crying. Screaming. CFO hyperventilating into a spreadsheet.

  • PM frames the bet: “If we can flag at-risk users 2 weeks early, we can save 8% of churn via targeted offers.” Success metric: retained users at 90 days.
  • Analyst defines current churn and segments. Finds retention cliff at day 14. Spicy.
  • Data Engineer ingests event logs from mobile/web, ensures user IDs aren’t a game of bingo.
  • Steward ensures we don’t email people who opted out. Because fines are spicy too.
  • Analytics Engineer builds a dbt model: one row per user per week with clean features (sessions, support tickets, plan type) and tests them.
  • Data Scientist prototypes a baseline (logistic regression), then adds features and compares to XGBoost. AUC improves from 0.68 to 0.79. Calibrates probabilities.
  • ML Engineer productionizes: feature store for live signals, model service endpoint, shadow deployment, latency <80ms.
  • Analyst + PM run an A/B test: targeted retention offer vs. control. Net lift = 6.7% with solid confidence.
  • Everyone monitors drift and ROI. Finance buys cookies.

Notice the baton passes. Notice how nobody tried to be all the roles at once. That’s the magic.
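About that "AUC improves from 0.68 to 0.79": AUC is just the probability that a randomly chosen churner gets a higher score than a randomly chosen non-churner, which you can compute with a rank statistic and nothing but the standard library. A minimal sketch, with made-up labels and scores (in practice you'd reach for scikit-learn's `roc_auc_score`):

```python
def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formulation:
    the probability that a random positive outscores a random negative."""
    pairs = sorted(zip(scores, y_true))
    # Assign average 1-based ranks so tied scores are handled fairly.
    ranks = {}
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(ranks[k] for k, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical churn labels (1 = churned) and model scores.
y = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.62, 0.7, 0.9]
print(round(roc_auc(y, scores), 3))  # → 0.75
```

The same formulation explains why AUC ignores calibration entirely: it only looks at ranking, which is why the Data Scientist still has to calibrate probabilities before anyone prices a retention offer off them.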


RACI: Who Does What (and Who Approves the Chaos)

A tiny, honest example for a model launch:

churn_model_v1:
  requirements_doc:
    responsible: [data_product_manager]
    accountable: [head_of_data]
    consulted: [analyst, sme]
    informed: [legal]
  feature_table:
    responsible: [analytics_engineer]
    accountable: [data_architect]
    consulted: [data_scientist]
    informed: [ml_engineer]
  training_pipeline:
    responsible: [data_scientist]
    accountable: [head_of_ml]
    consulted: [ml_engineer]
    informed: [analyst]
  deployment:
    responsible: [ml_engineer]
    accountable: [head_of_ml]
    consulted: [security, sre]
    informed: [pm, analyst]
  monitoring:
    responsible: [ml_engineer, analyst]
    accountable: [head_of_data]
    consulted: [data_scientist]
    informed: [business_ops]

Clarity prevents calendar crimes.


Anti-Patterns (a.k.a. How to Accidentally Sabotage Your Team)

  • The Lone Wolf Data Scientist: Brilliant model, nowhere to run it. Dies in a notebook.
  • ETL by Vibes: Unversioned SQL scattered across 14 dashboards. Metrics never match. Trust evaporates.
  • Dashboard Theater: Beautiful charts, zero decisions. KPI karaoke.
  • PM-less Chaos: Everyone sprinting, no North Star, budget becomes interpretive dance.
  • Governance Last: “We’ll fix PII later.” Later is subpoenas.

Collaboration Blueprints That Actually Work

  • Shared glossary + metrics layer: One definition of “active user,” not five.
  • Git all the things: dbt, notebooks (nbdev/Jupytext), ML code, infra as code. PR reviews = shared brain.
  • Contracts between layers: Schemas and SLAs for data products. Break it, you bought it.
  • Experiment registry: Track hypotheses, datasets, metrics, and results. Repeatability = credibility.
  • Observability: Data tests (dbt tests, Great Expectations), model monitoring (drift, latency, fairness), and incident playbooks.

What Tools Go Where? (Because Tool Sprawl Is Real)

  • Ingestion + Orchestration: Airflow, Dagster, Prefect
  • Storage + Compute: Snowflake/BigQuery/Redshift, Spark/Databricks
  • Transform + Metrics: dbt, semantic layers (LookML, MetricFlow)
  • Analysis + Viz: Jupyter, RStudio, Looker, Power BI, Mode
  • Modeling: scikit-learn, XGBoost, PyTorch, TensorFlow
  • MLOps: MLflow, Weights & Biases, Feature Stores, BentoML, SageMaker
  • Governance: Data catalogs (DataHub, Amundsen), access control, PII scanners

The rule of thumb: choose boring, reliable tools until your scale demands fancy.


Career Ladders and Crossovers (Yes, You Can Switch Lanes)

  • Analyst → Analytics Engineer: If you love SQL craftsmanship and reproducibility.
  • Analyst → Data Scientist: If you’re into inference, modeling, experiments.
  • Data Scientist → ML Engineer: If you enjoy shipping and infra.
  • Data Engineer → Architect: If big-picture design and cost/perf trade-offs thrill you.
  • Any → PM: If you’re allergic to ambiguity and love herding cats with Gantt charts.

Your superpower is not the tool; it’s the taste for trade-offs.


Quick Self-Check: Who Do You Call When…

  • The nightly pipeline failed and dashboards are blank? Data Engineer.
  • Two teams argue about “conversion rate” definitions? Analytics Engineer + Steward.
  • You need to choose between A/B test vs. quasi-experiment? Data Scientist + Analyst.
  • The model is great offline but flops in prod? ML Engineer (then Data Scientist).
  • The CEO wants a roadmap that saves money and makes money? Data PM + Architect.

Mini Script: From Question to Production

Business Question → Analyst + PM → clarified KPI and users
Raw Data → Data Engineer → ingested with quality checks
Clean Views → Analytics Engineer → modeled + tested tables
Model → Data Scientist → trained, validated, documented
Service → ML Engineer → deployed + monitored
Governance → Steward/Architect → compliant, cost-aware
Iteration → Everyone → lessons → v2 hypotheses

That linear list hides loops. In reality, it’s a spiral staircase: every turn gives you better views and a slight fear of heights.


Closing: The One-Slide Summary

  • Roles exist to reduce cognitive overload and increase reliability.
  • Great teams design handoffs, definitions, and monitoring before the first model.
  • Value = problem clarity × data quality × deployment discipline. Zero out any one, and the product is zero.

Put humans in the loop, put tests on the data, and put humility in the roadmap.

Next up, we’ll dive into the data lifecycle mechanics you can practice—so you’re not just admiring the team, you’re playing your position like an all-star.
