Foundations of AI and Data Science
Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.
Git and Branching: The Multiverse of Your Codebase
You conquered the command line and made virtual environments behave. Now it’s time to stop naming files final_final_really_final.ipynb and let Git keep your receipts.
Git isn’t just a tool; it’s a time machine with parallel universes. In AI and data science, where experiments multiply like gremlins after midnight, Git and branching keep your work reproducible, collaborative, and less “oops I overwrote the only model that worked.”
Quick Mental Model (a.k.a. Git in Three Rooms)
- Working directory: Where your files live now. Messy. Loud. Full of ideas.
- Staging area (index): The waiting room. You curate what changes you want to save.
- Repository history: The museum. Every commit is a framed snapshot of your project.
Commit early, commit often. Small commits are like snack-sized history: easy to digest and less likely to choke future-you.
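To make the three rooms concrete, here is a minimal sketch that walks one file through them in a throwaway repository. The file name notes.txt, the commit message, and the identity values are placeholders chosen for illustration, not part of any real project.

```shell
set -eu

# Work in a throwaway directory so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repo
git config user.email "demo@example.com"

# Room 1: working directory. Git sees the file but is not tracking it yet.
echo "first idea" > notes.txt
untracked="$(git status --short)"
echo "working dir : $untracked"

# Room 2: staging area. The file is queued for the next commit.
git add notes.txt
staged="$(git status --short)"
echo "staged      : $staged"

# Room 3: repository history. The snapshot is stored and nothing is pending.
git commit -q -m "docs: add first notes"
after_commit="$(git status --short)"
commit_count="$(git rev-list --count HEAD)"
echo "commits     : $commit_count"
```

The two-letter codes from git status --short are the quickest way to see which room a change is in: ?? means untracked, and a letter in the first column means staged.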
The Essentials You Actually Use
Remember your command-line superpowers? You’ll use them here. Here’s the Git core, minus the guilt.
| Command | What it does | When to use |
|---|---|---|
| git init | Starts a repo | New project folder you want versioned |
| git clone <url> | Copies a remote repo locally | Joining an existing project |
| git status | Shows what changed | Constantly; it’s your dashboard |
| git add <files> | Stages changes | Prepare exactly what to commit |
| git commit -m "message" | Saves a snapshot | After a logical unit of work |
| git log --oneline --graph --all | Visualizes history | When you’re lost in the sauce |
| git branch / git switch -c feature | Lists branches / creates and switches to a new one | Start isolated work |
| git merge | Combines histories | Bring feature back to main |
| git rebase | Replays commits elsewhere | Clean up your branch (carefully) |
| git remote -v | Shows remotes | Where you push/pull |
| git fetch / git pull | Downloads remote changes / downloads and integrates them | Sync regularly |
| git push | Sends your work to remote | Share with the team |
Pro tip from environments land: pair commits with environment changes. If you add a new library, update requirements.txt or environment.yml in the same commit. Keep changes atomic.
Branching 101: How to Live Many Lives (Safely)
Think of branches as parallel universes for your code. The default branch is often main. You spin off feature branches to experiment.
Create and switch
# Create and switch to a new branch for data cleaning
git switch -c feature/data-cleaning
# Do your thing (edit scripts, notebooks)
# Stage and commit granularly
git add scripts/clean.py
git commit -m "feat(clean): add basic outlier removal"
Merge back when ready
# Switch to main and bring in your feature
git switch main
git pull # ensure you’re up to date
# Merge the feature branch
git merge feature/data-cleaning
# If no conflicts, great! Then push.
git push
Naming branches? Be descriptive: feature/eda, bugfix/na-handling, experiments/xgboost-params. Your future self will write you a thank-you note.
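If you want to try the create, commit, and merge cycle without touching a real project, the sketch below runs it end to end in a temporary repository. It assumes there is no remote, so the git pull and git push steps are left out; clean.py and its contents are placeholder text, not real pipeline code.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch main, whatever the local default is
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

# A first commit on main so there is something to branch from.
echo "raw = load()" > clean.py
git add clean.py
git commit -q -m "chore: initial pipeline stub"

# Spin off the feature branch and commit there.
git switch -q -c feature/data-cleaning
echo "clean = drop_outliers(raw)" >> clean.py
git commit -q -am "feat(clean): add basic outlier removal"

# Back on main, the feature work is not visible yet.
git switch -q main
lines_before="$(wc -l < clean.py | tr -d ' ')"

# main has not moved since the branch point, so this merge is a fast-forward.
git merge -q feature/data-cleaning
lines_after="$(wc -l < clean.py | tr -d ' ')"

# Delete the merged branch to keep the branch list short.
git branch -q -d feature/data-cleaning
branches="$(git branch --format='%(refname:short)')"
echo "lines on main: $lines_before -> $lines_after; branches left: $branches"
```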
Merge vs Rebase: History, but Make It Drama
- Merge: Creates a merge commit that ties two histories together. Keeps the truth of what happened.
- Rebase: Replays your commits on top of a target branch, making it look like you developed straight from there. Clean history, slightly spicy.
# Rebase your feature branch on top of updated main
git switch feature/data-cleaning
git fetch
git rebase origin/main
# Resolve conflicts if any, then continue
git rebase --continue
Rule of thumb: rebase your own feature branches before merge; do not rebase public/shared branches others might have pulled. Think: tidy your room, don’t bulldoze the neighborhood.
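The difference shows up clearly if you count merge commits. The sketch below builds a small diverged history in a temporary repository, then merges on a scratch branch and rebases the feature branch. The branch name demo/merged and the file names are made up for the demo.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch main
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "chore: base"

# Feature work happens in one file...
git switch -q -c feature/data-cleaning
echo "clean" > clean.txt
git add clean.txt
git commit -q -m "feat: cleaning step"

# ...while main moves on in another, so the two histories diverge.
git switch -q main
echo "docs" > README.txt
git add README.txt
git commit -q -m "docs: add readme"

# Option 1: merge on a scratch copy of main. A commit with two parents appears.
git switch -q -c demo/merged main
git merge -q --no-edit feature/data-cleaning
merge_commits="$(git rev-list --merges --count demo/merged)"

# Option 2: rebase the feature branch onto main. Its commit is replayed on top; no merge commit.
git switch -q feature/data-cleaning
git rebase -q main
rebased_merges="$(git rev-list --merges --count feature/data-cleaning)"
rebased_parent="$(git rev-parse feature/data-cleaning~1)"
main_tip="$(git rev-parse main)"

echo "merge commits after merge : $merge_commits"
echo "merge commits after rebase: $rebased_merges"
```

After the rebase, the parent of the feature commit is the tip of main, which is exactly the “developed straight from there” look described above.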
Merge Conflicts: The Inevitable, Not the End
Conflicts happen when two branches edited the same lines. Git marks the file with conflict markers:
<<<<<<< HEAD
scaled_df = scaler.fit_transform(df)
=======
scaled_df = scaler.transform(df)
>>>>>>> feature/predict-only
How to fix:
- Open the file, choose or combine the correct lines.
- Test your code (run unit tests or a quick notebook cell).
- Mark as resolved and continue:
git add path/to/file.py
git commit # for merges
# or during rebase
git rebase --continue
Breathe. Conflicts mean you and a teammate both cared enough to improve the same logic. That’s collaboration, baby.
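You can rehearse a conflict safely before meeting one in the wild. This sketch manufactures the scaler conflict shown above in a temporary repository and resolves it by keeping main’s line. The file name scale.py and the starting line scaler.fit(df) are invented for the demo.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the first branch main
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "scaled_df = scaler.fit(df)" > scale.py
git add scale.py
git commit -q -m "chore: scaling stub"

# Two branches change the same line in different ways.
git switch -q -c feature/predict-only
echo "scaled_df = scaler.transform(df)" > scale.py
git commit -q -am "feat: transform only at predict time"

git switch -q main
echo "scaled_df = scaler.fit_transform(df)" > scale.py
git commit -q -am "feat: fit and transform in training"

# The merge stops with a conflict, so tolerate its non-zero exit status here.
git merge feature/predict-only >/dev/null 2>&1 || true
conflicted="$(git diff --name-only --diff-filter=U)"
markers="$(grep -c '^<<<<<<<' scale.py)"
echo "conflicted files: $conflicted (conflict blocks: $markers)"

# Resolve by writing the intended final content, then stage and conclude the merge.
echo "scaled_df = scaler.fit_transform(df)" > scale.py
git add scale.py
git commit -q --no-edit
remaining="$(git diff --name-only --diff-filter=U)"
parents="$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')"
```

Listing unmerged paths with --diff-filter=U is a handy check that nothing is left to resolve before you commit.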
Keep the Junk Out: .gitignore for Data Folks
Your environments lesson said: isolate dependencies. Now: do not version every artifact your pipeline sneezes out.
# Python cruft
__pycache__/
*.pyc
.venv/
venv/
# Jupyter detritus
.ipynb_checkpoints/
# OS fluff
.DS_Store
# Secrets and configs
.env
*.secret
# Data and models (use LFS/DVC instead)
/data/
/models/
/artifacts/
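Not sure whether a rule is actually catching a file? git check-ignore will tell you. The sketch below uses a trimmed-down version of the rules above in a temporary repository; the file names and the fake API key are placeholders.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q

# A trimmed-down version of the ignore rules above.
cat > .gitignore <<'EOF'
__pycache__/
.ipynb_checkpoints/
.env
/data/
EOF

mkdir -p data src
echo "id,value" > data/raw.csv
echo "API_KEY=not-a-real-key" > .env
echo "print('hello')" > src/train.py

# check-ignore -v reports the rule (file:line:pattern) that matches a path.
rule_for_csv="$(git check-ignore -v data/raw.csv)"
echo "$rule_for_csv"

# Only paths that are not ignored show up as untracked.
untracked="$(git status --short --untracked-files=all)"
echo "$untracked"
```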
Big files and datasets
- Use Git LFS for large, immutable binaries like pretrained weights:
git lfs install
git lfs track "models/*.bin"
git add .gitattributes
git commit -m "chore: track model binaries with LFS"
(If you go this route, drop /models/ from the .gitignore above; Git refuses to add ignored paths unless you force it.)
- For evolving datasets and pipelines, consider DVC (Data Version Control). It stores metadata in Git and your blobs in remote storage. Chef’s kiss for reproducibility.
Never commit API keys. If you do, rotate the key first, then purge it from history with git filter-repo (a separate install, not part of core Git) or the GitHub UI tools. Secrets are not collectibles.
Remotes, Forks, and PRs (a.k.a. Social Git)
- origin: Your primary remote. Usually on GitHub/GitLab.
- upstream: The source repo you forked from (if you’re contributing to an open-source project).
# Add an upstream remote for syncing with the source repo
git remote add upstream https://github.com/org/project.git
# Keep your main fresh
git fetch upstream
git switch main
git merge upstream/main
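The fork-sync dance is easier to trust once you have watched it work. This sketch fakes the whole setup with local directories: upstream.git stands in for the source project, fork.git for your fork on the host, and maintainer for someone else pushing to upstream. All of those names are assumptions for the demo; with a real project the remotes would be URLs.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# "upstream.git" plays the source project.
git init -q --bare upstream.git
git --git-dir=upstream.git symbolic-ref HEAD refs/heads/main

# A maintainer clone seeds upstream with a first commit.
git clone -q upstream.git maintainer 2>/dev/null
cd maintainer
git config user.name "Maintainer"         # placeholder identity
git config user.email "maintainer@example.com"
git symbolic-ref HEAD refs/heads/main
echo "v1" > README.txt
git add README.txt
git commit -q -m "docs: initial readme"
git push -q origin main
cd "$work"

# Fork it, then clone the fork: origin is the fork, upstream is added by hand.
git clone -q --bare upstream.git fork.git
git clone -q fork.git local
cd local
git remote add upstream "$work/upstream.git"

# Upstream moves on after you forked.
cd "$work/maintainer"
echo "v2" >> README.txt
git commit -q -am "docs: expand readme"
git push -q origin main

# The sync from the text: fetch upstream, then merge it into your main.
cd "$work/local"
git fetch -q upstream
behind_before="$(git rev-list --count main..upstream/main)"
git switch -q main
git merge -q upstream/main
behind_after="$(git rev-list --count main..upstream/main)"
echo "commits behind upstream: $behind_before -> $behind_after"
```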
Pull Requests (PRs)
- Branch off main.
- Commit small, with clear messages (try Conventional Commits):
  - feat: add stratified split helper
  - fix: handle NaNs in normalization
  - chore: update requirements
- Push and open a PR with a checklist: tests pass, docs updated, no giant data files, environment files in sync.
Tip: Pair each PR with a short “how to test” note. Reviewers love not guessing.
Branching Strategies Without Tears
- Trunk-based (recommended for DS/ML): Small, short-lived branches. Merge to main daily with CI running tests and linting. Fast feedback.
- Git Flow (heavier): develop, release, hotfix branches. Better for big products, overkill for most notebooky work.
- Experiment branches: Prefix with exp/ and set expectations: may never merge; used for exploring ideas.
Default to trunk-based. Your experiments are many; your patience is finite.
A Day in the Life: Git x Data Science Workflow
# 0) Sync and branch
git switch main && git pull
git switch -c feature/eda-correlation
# 1) Work in small steps; track deps
python -m venv .venv && source .venv/bin/activate
pip install pandas seaborn
pip freeze > requirements.txt
git add notebooks/eda.ipynb requirements.txt
git commit -m "feat(eda): add correlation heatmap; record deps"
# 2) Push and share
git push -u origin feature/eda-correlation
# 3) Open PR, get review, address comments
# (make more commits on the same branch, then push again)
# 4) Merge via PR (squash if you want tidy history) and delete branch
# 5) Bring the merged result back to your local main
git switch main && git pull
Bonus visual when you’re lost:
git log --oneline --graph --decorate --all
Notebooks and Diffing Like a Pro
- Use jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace <notebook>.ipynb or an extension to clear outputs before committing.
- Consider nbdime for nicer notebook diffs.
- Or keep notebooks as artifacts and push the logic into .py modules; notebooks call those modules. Cleaner diffs, happier reviews.
Common Pitfalls (And How Not to Cry)
- Committed secrets? Rotate keys, then purge history with git filter-repo.
- Huge data clogging repo? Move to LFS/DVC; add directories to .gitignore.
- “Detached HEAD” panic? Just create a branch where you are: git switch -c rescue/my-work.
- Rebase gone sideways? git rebase --abort or use the reflog: git reflog then git reset --hard <good_commit>.
Reflog is the black box recorder of your repository. When you think all is lost, it whispers, “Try again.”
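Here is a minimal sketch of a reflog rescue, run in a temporary repository so the “accident” is harmless. The file model.txt and the commit messages are placeholders. The demo uses HEAD@{1}, meaning “where HEAD was one move ago”, which is enough for this two-step history; in a real repository you would read the git reflog output and pick the commit yourself.

```shell
set -eu

repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "baseline" > model.txt
git add model.txt
git commit -q -m "feat: baseline model"

echo "tuned" > model.txt
git commit -q -am "feat: tuned model"
good_commit="$(git rev-parse HEAD)"

# Simulate the accident: a hard reset throws the latest commit off the branch.
git reset -q --hard HEAD~1
after_accident="$(cat model.txt)"

# The reflog still remembers where HEAD pointed before the reset.
git reflog -n 3
previous_head="$(git rev-parse 'HEAD@{1}')"

# Point the branch back at the commit we want.
git reset -q --hard "$previous_head"
recovered="$(cat model.txt)"
echo "after accident: $after_accident; recovered: $recovered"
```

Keep in mind that git reset --hard discards uncommitted changes, so the reflog can only bring back work that was committed at some point.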
Wrap-Up: Your Key Takeaways
- Branches let you experiment without wrecking main. Live your best multiverse life.
- Commit small, message clearly, and tie environment updates to code changes.
- Merge for honesty; rebase for tidiness. Don’t rewrite shared history.
- Ignore junk; version code and metadata, not gigabytes of raw data.
- Use PRs for feedback, visibility, and fewer “it works on my machine” tragedies.
Final thought: Git doesn’t make you perfect; it makes you recoverable. In data science, that’s the difference between a one-off miracle and a reproducible result.