
Full Stack AI and Data Science Professional

Foundations of AI and Data Science

Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.

Ocean's Eleven, But For Data: The No-Chill Roles & Workflows Guide

Foundations of AI & Data Science — Roles and Workflows

Previously on "AI vs Data Science Landscape": we learned who brings the vibes (AI), who brings the receipts (Data Science), and why they’re not the same thing but do text each other at 2 AM. Today: who does what, when, and how do we not all step on the same Jupyter notebook?


Opening: The Sprint Planning Where Everyone Says "It Depends"

Imagine the meeting: product wants a "smart feature," legal wants "not a lawsuit," engineering wants "reproducibility," and your data is in five warehouses and two Google Sheets. Who moves first? Who owns what? Who gets yelled at at 4:59 PM on Friday?

This is the map. We’re going to:

  • Identify the roles in full-stack AI/Data projects.
  • Outline the workflows they use (analytics, predictive ML, and GenAI/RAG).
  • Show the handoffs and artifacts that keep the chaos civilized.
  • Give you a real-world walkthrough so your brain doesn’t stage a walkout.

TL;DR: Roles are the cast; workflows are the script; artifacts are the receipts.


The Cast List: Who Does What (and Why Your Model Keeps Crying)

| Role | Core Mission | Key Deliverables | Likely Tools |
|---|---|---|---|
| Data Engineer | Make data exist, fast, and not on someone's laptop | ETL/ELT pipelines, data models, SLAs | SQL, Spark, Airflow, dbt, Kafka, Snowflake/BigQuery |
| Analytics Engineer | Turn raw data into analytics-ready joy | Semantic layers, curated tables, tests | dbt, SQL, metrics layers, Great Expectations |
| Data Analyst / BI | Answer questions, build narratives | Dashboards, ad-hoc analyses, KPI definitions | SQL, Tableau/Looker/Power BI, Python/R |
| Data Scientist | Frame questions, model uncertainty | Experiments, models, insights, A/B designs | Python/R, scikit-learn, stats, notebooks |
| ML Engineer | Productionize models without breaking prod | Training pipelines, serving APIs, feature stores | Python, TensorFlow/PyTorch, Kubeflow, Feast |
| AI Engineer (GenAI) | Build with LLMs and retrieval like it's LEGO | RAG pipelines, prompts, eval harness | LangChain/LlamaIndex, vector DBs, OpenAI/Anthropic, eval suites |
| MLOps/Platform | Make the whole circus repeatable | CI/CD/CT, model registry, monitoring | Kubernetes, MLflow, Seldon/Bento, Grafana/Prometheus |
| Research Scientist | Invent the new math/methods | Papers, prototypes, beakers full of loss curves | PyTorch/JAX, custom training loops |
| Product Manager (Data/AI) | Define value and success | Problem framing, roadmap, success metrics | Docs, PRDs, OKRs, stakeholder wrangling |
| Domain Expert | Reality check | Labels, constraints, business logic | Their brain + annotation tools |
| Responsible AI / Risk | Keep you out of ethical/fraud land | Impact assessments, bias reports, guardrails | Fairlearn, audit frameworks |
| Labeling Ops / QA | Ground truth farmers | Labeled datasets, QA audits | Label Studio, Mechanical Turk |

Quote for your wall: "If everyone owns it, no one monitors it." — Every Postmortem Ever


Workflows: Three Archetypes You’ll Actually Use

We’re not repeating the whole "AI vs DS" thing; you remember: AI builds systems that act; DS builds understanding that informs action. Now, let’s see the actual plays.

1) Analytics/Insights (CRISP-DM / OSEMN)

Think: revenue analysis, funnel drop-offs, churn drivers.

  • Business understanding → Data understanding → Data prep → Modeling/Analysis → Evaluation → Deployment (dashboard/report)

ASCII Vibes:

Question → Data → Clean → Explore → Model/Test → Interpret → Dashboard

Roles in motion:

  • PM + Analyst define KPIs and questions.
  • Data/Analytics Engineers create reliable tables/metrics.
  • Data Scientist runs exploratory analysis, causal inference, or hypothesis tests.
  • Analyst/PM deploy dashboard and run A/Bs.

Pitfall: Shipping a dashboard without a definition of the metric. Congrats, you built a very pretty argument.
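The Question → Data → Clean → Explore → Model → Dashboard loop can be sketched in miniature. A toy funnel drop-off calculation in pure Python (stage names and counts are fabricated for illustration):

```python
# Hypothetical signup funnel: users remaining at each stage (made-up numbers).
funnel = [
    ("visited", 10_000),
    ("signed_up", 2_400),
    ("activated", 1_100),
    ("paid", 180),
]

def drop_offs(stages):
    """Return (stage, conversion-from-previous-stage) pairs, skipping the first stage."""
    rates = []
    for (_, prev_n), (name, n) in zip(stages, stages[1:]):
        rates.append((name, n / prev_n))
    return rates

for stage, rate in drop_offs(funnel):
    print(f"{stage}: {rate:.1%} of previous stage")
```

The real work, of course, is agreeing on what "activated" means before anyone computes a rate — which is exactly the pitfall above.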


2) Predictive ML Lifecycle (Supervised/Time Series)

Think: fraud detection, demand forecasting, recommendations.

Frame → Data & Labels → Features → Train → Evaluate → Ship → Monitor → Retrain

  • Frame: Who benefits? What decision changes? What is the offline metric vs. online metric?
  • Data/Labels: Source events, create label definitions, handle leakage.
  • Features: Batch and online features, feature store registration, transformations.
  • Train: Experiments tracked (MLflow/W&B). Reproducible configs.
  • Evaluate: Offline metrics + fairness + robustness + cost curves.
  • Ship: Containerized model; canary or shadow deployments.
  • Monitor: Data drift, concept drift, latency, cost, model performance.
  • Retrain: Triggered by schedule, drift, or performance degradation.

Hot take: "Accuracy" is a vibes-based metric if your base rate is 0.5%.
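That hot take is easy to verify: with a 0.5% fraud base rate, a "model" that never flags anything scores 99.5% accuracy while catching zero fraud. A minimal demonstration with synthetic labels (no real model involved):

```python
# 1,000 transactions, 5 fraudulent — a 0.5% base rate.
labels = [1] * 5 + [0] * 995

# Degenerate "model" that always predicts not-fraud.
preds = [0] * len(labels)

# Accuracy looks stellar; recall exposes the lie.
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / sum(labels)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")  # accuracy=0.995, recall=0.000
```

This is why imbalanced problems report precision/recall, PR-AUC, or cost-weighted metrics instead of raw accuracy.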


3) GenAI / RAG Lifecycle (LLMs in the Wild)

Think: support chatbot, document Q&A, code assistants.

Use-case → Data ingestion → Chunk & embed → Index → Prompting → Guardrails → Eval → Deploy → Observe → Iterate

  • Use-case: Who asks what? What sources are authoritative? Latency/cost targets?
  • Ingest: PDFs, HTML, APIs → clean, dedupe, version.
  • Chunk & Embed: Chunk strategies matter; storage in vector DBs.
  • Index: Hybrid search (BM25 + embeddings), metadata filters.
  • Prompting: System prompts, few-shot, tools/functions.
  • Guardrails: PII redaction, ground-truth citation, rate limits, safety filters.
  • Eval: Groundedness, hallucination rate, task success, human-in-the-loop (HITL).
  • Deploy: API, chat UI, caching, observability.
  • Iterate: Prompt tweaks, reranking, retrieval improvements, finetuning if needed.

Meme line: "We’ll fix it in prompt" is the GenAI version of "works on my machine."


Handoffs and Artifacts: The Social Contract of Data Work

  • Data contracts: schemas, freshness SLAs, versioning.
  • Feature definitions: transformation code + documentation + owners.
  • Experiment records: config, seed, data version, metrics, charts.
  • Model registry: versioned models with stage (staging/production/archived).
  • Evaluation reports: offline metrics, fairness analysis, ablation studies.
  • Serving contracts: API spec, latency/SLOs, fallback behavior.
  • Monitoring dashboards: prediction distributions, drift, cost per 1k calls.
  • Risk docs: DPIA, model cards, safety tests.
  • Prompt libraries: versioned prompts, test cases, guardrail rules.

If it’s not versioned, it’s a ghost. If it’s not monitored, it’s a liability.
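A data contract can be as small as a schema plus a freshness check enforced in code. A minimal sketch in pure Python — the field names, types, and the 24-hour SLA are all assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract for an `orders` feed: required fields, types, freshness SLA.
CONTRACT = {
    "fields": {"order_id": str, "amount": float, "created_at": str},
    "max_staleness": timedelta(hours=24),
}

def validate(record, loaded_at, contract=CONTRACT):
    """Return a list of contract violations (empty list means the record passes)."""
    errors = []
    for name, typ in contract["fields"].items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], typ):
            errors.append(f"bad type for {name}: {type(record[name]).__name__}")
    if datetime.now(timezone.utc) - loaded_at > contract["max_staleness"]:
        errors.append("freshness SLA violated")
    return errors

rec = {"order_id": "A-1", "amount": 19.99, "created_at": "2026-01-01T00:00:00Z"}
print(validate(rec, loaded_at=datetime.now(timezone.utc)))  # [] when fresh and well-typed
```

Tools like Great Expectations or dbt tests do this at scale; the point is that the contract is executable, not a wiki page.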


Real-World Walkthrough: Building a Support Chatbot with RAG

Scenario: Your SaaS company wants a chatbot that answers customer questions from docs and tickets with citations.

  1. Framing (PM + Domain Expert)

    • Goal: Deflect 30% of tier-1 tickets while keeping CSAT ≥ 4.5/5.
    • Constraints: No PII leaks, latency < 2s, cost < $0.01/request.
  2. Data Ingestion (Data/Analytics Engineer)

    • Pull docs from Confluence/GitHub, tickets from Zendesk.
    • Clean HTML, strip navigation, dedupe near-identical pages.
    • Artifact: Cleaned corpus v1.2, change log.
  3. Chunk & Embed (AI Engineer)

    • Chunk 400–800 tokens with semantic overlap.
    • Embed with model X; store in vector DB with metadata: product, version, date.
    • Artifact: Index v1.2.3, embedding config.
  4. Retrieval & Reranking (AI Engineer)

    • Hybrid retrieval (BM25 + embeddings), rerank top-100 → top-5.
    • Add citation extraction and snippet highlighting.
  5. Prompting & Tools (AI Engineer)

    • System prompt: "Answer strictly from sources; cite them; if unsure, ask for clarification."
    • Tools: "search_docs", "fetch_article", "get_account_status" (with RBAC).
  6. Guardrails (Responsible AI + Security)

    • Block PII exposure, profanity filters, jailbreak tests.
    • Red-team prompts; add refusal patterns.
  7. Evaluation (Data Scientist + AI Engineer)

    • Build eval set of 300 real questions with human-labeled answers and citations.
    • Metrics: groundedness (factuality), citation accuracy, answer correctness, latency, cost.
    • HITL loop: weekly sampling + rubric.
  8. Deploy (ML Engineer + MLOps)

    • Containerize service, enable caching, request tracing.
    • Canary rollout to 10% traffic; monitor CSAT and containment rate.
  9. Monitor & Iterate (Everyone)

    • Drift: doc updates → re-embed changed chunks nightly.
    • Observability: top failure modes → add few-shot exemplars or retrieval tweaks.

Result: Not magic, just many boring, correct steps in a spicy trench coat.
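Step 7's eval harness can be sketched as a few lines that score citation accuracy against human-labeled gold citations. The eval records below are fabricated examples; a real harness would also score groundedness and answer correctness:

```python
def citation_accuracy(evals):
    """Fraction of answers that cite at least one source, and only gold sources."""
    ok = sum(
        bool(e["cited"]) and set(e["cited"]) <= set(e["gold"])
        for e in evals
    )
    return ok / len(evals)

# Toy eval set: one correct citation, one wrong source, one missing citation.
evals = [
    {"q": "How do I reset my password?", "cited": ["doc-12"], "gold": ["doc-12", "doc-40"]},
    {"q": "What plans support SSO?", "cited": ["doc-7"], "gold": ["doc-3"]},
    {"q": "Is there an API rate limit?", "cited": [], "gold": ["doc-9"]},
]
print(f"citation accuracy: {citation_accuracy(evals):.0%}")
```

Running this weekly against the 300-question set, and diffing the failures, is the HITL loop in practice.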


Misunderstandings To Retire

  • "Data scientists should own production." They should own science quality; ML/AI engineers own runtime machinery.
  • "We don’t need data engineering with LLMs." You need it more. Garbage in → poetic garbage out.
  • "If offline AUC is high, ship it." Ship a rollout plan and monitoring with it.
  • "Prompting replaces evaluation." If you can’t measure it, you can’t fix it (or defend it to Legal).

Mini Blueprint: End-to-End ML Project Skeleton

project:
  name: demand_forecast
  stages:
    - frame: {owner: pm, success_metric: MAPE<12%, horizon: 4w}
    - data: {owner: data_engineer, sources: [orders, inventory], version: v0.3}
    - features: {owner: ml_engineer, store: feast, tests: great_expectations}
    - train: {owner: ds, tracker: mlflow, seed: 42, model: xgboost}
    - eval: {owner: ds, metrics: [MAPE, WAPE], bias_check: true}
    - deploy: {owner: ml_engineer, strategy: canary, SLO: p95<120ms}
    - monitor: {owner: mlops, drift: kolmogorov, alerts: pagerduty}
    - retrain: {trigger: weekly|drift, approval: pm+ds}
artifacts:
  registry: mlflow://models/demand_forecast
  dashboards: grafana://dashboards/forecast-health

Choose-Your-Own-Adventure: Which Hat Fits?

  • Love SQL and building reliable tables? You’re an Analytics Engineer.
  • Obsessed with experiments and causality? Data Scientist energy.
  • Want models to survive production? ML Engineer/MLOps.
  • Love playing with LLMs, prompts, and retrieval tricks? AI Engineer.
  • Can you explain trade-offs to execs and engineers? Data/AI PM.

Career cheat code: pick a lane, then learn the two roles to your left and right.


Closing: The Orchestra, Not the Solo

  • Roles exist so focus can exist. Respect the handoffs.
  • Workflows are guardrails. Pick the right archetype (analytics, predictive ML, GenAI/RAG) and stick to its receipts.
  • Artifacts are the audit trail. Version everything. Monitor everything.

Big idea to tattoo on your project wiki:

AI and Data Science work when insight, engineering, and responsibility move in lockstep — not in heroics.

Now go make fewer meetings chaotic and more models useful. Bonus points if nothing is secretly running off your intern’s laptop.
