Foundations of AI and Data Science
Core concepts, roles, workflows, and ethics that frame end‑to‑end AI projects.
Command Line Essentials: The Power Tools Your GUI Was Hiding From You
"The command line is like the gym for your brain — minimal decor, no distractions, wildly effective. Also a little scary until you learn where the weights go."
We just wrangled environments and dependencies, and had a civil-yet-spicy debate about notebooks vs scripts. Now it’s time to learn the thing that stitches those worlds together: the command line. The CLI is how you glue workflows, automate the boring parts, and yeet friction out of your data life. If you’ve ever thought, "There must be a faster way," the CLI politely says, "There is."
What Even Is a Shell (And Why Should AI People Care)?
- A shell is your text-based interface to the computer. Common ones:
- bash/zsh (macOS/Linux)
- PowerShell (Windows)
- You type commands; it does your bidding (usually). This is where you:
- Spin up/activate environments
- Run scripts and notebooks
- Inspect data files quickly
- Fetch datasets and wire up pipelines
Expert take: If your workflow can’t be expressed on the command line, it’ll be hard to automate, version, and scale. GUI clicks don’t commit to Git.
Navigating Like a Pro (aka: Stop Getting Lost)
You live in a filesystem. Know the neighborhood.
- `pwd` — print working directory (where am I?)
- `ls -lah` — list files (show me everything, including hidden dotfiles)
- `cd path/to/place` — go somewhere
- `cd ..` — go up one level; `cd ~` — go home
- `mkdir -p data/raw` — make directories, parents included
- `touch notes.txt` — create an empty file
- `cp src.py backup/src.py` — copy; `mv a b` — move/rename
- `rm file`; `rm -r folder` — remove (careful)
Paths & globs you will meet:
- `.` = current dir, `..` = parent, `~` = home
- `*.csv` matches all CSVs; `data/{raw,processed}` expands to both paths (handy with `mkdir -p`)
- Quote paths with spaces: `cd 'My Data'`
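Brace expansion in action, as a quick sketch (note: brace expansion is a bash/zsh feature, not plain POSIX sh):

```shell
# One command, two sibling directories via brace expansion
mkdir -p data/{raw,processed}

# Confirm both landed
ls data
```

This is why project-skeleton one-liners like `mkdir -p ds-project/{data/raw,src,notebooks}` work: the shell expands the braces before `mkdir` ever runs.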
Quick peek at files:
- `head -n 5 big.csv` — first 5 lines
- `tail -n 5` — last 5 lines
- `wc -l big.csv` — how many rows
- `du -sh data/` — folder size
Pipes, Redirection, and The Art of Doing 5 Things At Once
- `>` redirect output to a file; `>>` append
- `|` pipe output of one command into the next
Examples you’ll use on day one:
# Count unique values in a column (CSV, comma-separated)
cut -d, -f3 data.csv | sort | uniq -c | sort -nr | head
# Save the first 1000 rows of a huge file
head -n 1000 big.csv > sample.csv
# Log output while still seeing it in the terminal
python train.py | tee logs/train.out
Working with compressed files:
zcat big.csv.gz | head
zgrep -i 'error' logs.gz
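No giant `.gz` lying around? Fabricate one and try it — a small sketch assuming GNU gzip tools (on macOS, `gunzip -c` or `gzcat` stands in for `zcat`):

```shell
# Fabricate a tiny compressed log file (the contents are made up)
printf 'INFO: started\nERROR: disk full\nINFO: done\n' | gzip > demo-logs.gz

# Peek at the first line without decompressing to disk
zcat demo-logs.gz | head -n 1

# Case-insensitive search inside the archive
zgrep -i 'error' demo-logs.gz
```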
Your pipeline is a conveyor belt. Each command adds a transformation. Lego, but for text.
Find Stuff Fast: grep, find, jq (Your New Besties)
- `grep -R 'pattern' .` — search recursively for text in files
- `grep -R --line-number --ignore-case 'todo' src/`
- `find . -maxdepth 2 -name '*.ipynb'` — find notebooks nearby (`-maxdepth` goes before the match tests)
For JSON (APIs, logs), meet jq:
# Pretty-print JSON
echo '{"acc":0.91,"loss":0.23}' | jq .
# Extract a field from a JSONL dataset
jq -r '.label' data.jsonl | sort | uniq -c
Lightweight text surgery:
# Replace tabs with commas in a TSV
sed 's/\t/,/g' data.tsv > data.csv
# Sum the 2nd column (numbers only)
awk -F, '{sum += $2} END {print sum}' data.csv
Why do people keep misunderstanding this? Because grep/awk/sed look like line noise. But they’re fast, composable, and perfect for quick checks without spinning up Python.
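Here's the frequency-count idiom from earlier, end to end on fabricated data so you can watch each stage of the conveyor belt:

```shell
# Fabricate a three-column CSV (made-up rows)
printf 'id,city,amount\n1,Oslo,10\n2,Bergen,5\n3,Oslo,7\n' > demo.csv

# Skip the header, grab column 2, count occurrences, most frequent first
tail -n +2 demo.csv | cut -d, -f2 | sort | uniq -c | sort -nr
# → "2 Oslo" then "1 Bergen" (uniq -c pads counts with spaces)
```

`tail -n +2` is the small-but-crucial step: without it the header row gets counted as a category.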
Environments & Dependencies — But Make It CLI
Remember our environment saga? Here’s the command-line muscle behind it.
Conda:
conda create -n ds-env python=3.11
conda activate ds-env
conda install numpy pandas scikit-learn
conda env export > environment.yml
venv + pip:
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip freeze > requirements.txt
Path sanity checks:
which python # macOS/Linux
where python # Windows
python -c 'import sys; print(sys.executable)'
Environment variables (for API keys, secrets):
export OPENAI_API_KEY=sk-...
export WANDB_PROJECT=my-experiment
python train.py
Pro-tip: use a .env file with a loader (e.g., python-dotenv) or direnv so you don’t accidentally leak secrets in bash history.
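One plain-shell way to load a `.env` file, as a sketch (python-dotenv or direnv handle this more robustly; the key names below are made up):

```shell
# A throwaway .env file
printf 'API_KEY=abc123\nWANDB_PROJECT=my-experiment\n' > .env

# set -a marks every variable assigned while sourcing for export,
# so child processes (like python train.py) inherit them
set -a
. ./.env
set +a

echo "$WANDB_PROJECT"   # → my-experiment
```

The `set -a` / `set +a` bracket is the whole trick: anything assigned between them is auto-exported, no `export` per line needed.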
Notebooks vs Scripts: Command-Line Edition
- Start a notebook server:
jupyter lab # or: jupyter notebook
- Run a notebook headlessly (great for CI):
jupyter nbconvert --to notebook --execute notebook.ipynb --output executed.ipynb
- Run a script with arguments:
python train.py --epochs 10 --lr 3e-4 --data data/processed
- Make a script directly executable:
# In train.py, make the very first line this shebang (no space after the #):
#!/usr/bin/env python
chmod +x train.py
./train.py --help
Notebooks are for exploration; scripts are for repeatability. The CLI is how you move from vibes to verified.
Git, Quickly (Because Future You Deserves Nice Things)
git init
git status
git add src/ notebook.ipynb requirements.txt
git commit -m 'Add baseline model'
- Use `.gitignore` to avoid committing gigantic datasets and environment folders:
# .gitignore
*.pyc
.venv/
__pycache__/
.env
/data/
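To confirm a rule actually matches, `git check-ignore -v` reports which `.gitignore` line fired — a quick sketch in a throwaway repo (paths are hypothetical):

```shell
# Fresh demo repo with a couple of the rules above
mkdir -p ignore-demo
git -C ignore-demo init -q
printf '.venv/\n/data/\n' > ignore-demo/.gitignore

# -v prints the file, line number, and pattern responsible for the match
git -C ignore-demo check-ignore -v data/big.csv
```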
Bonus: git lfs for large artifacts, or use dataset registries and keep repos lean.
Fetch Data Like a Hacker (Legally)
curl -L -o data/raw/housing.csv https://example.com/housing.csv
wget -P data/raw https://example.com/housing.csv
# Test an API and parse JSON
curl -s 'https://api.example.com/items?limit=5' | jq '.items[] | {id, name}'
Remote machines:
ssh user@server
scp model.pkl user@server:/home/user/models/
Permissions, Sudo, and Other Spicy Buttons
- Who am I? `whoami`
- What’s executable? `ls -l`
- Make it executable: `chmod u+x script.sh`
- Ownership: `chown user:group file`
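Here’s the permission dance end to end (the script name is made up):

```shell
# Write a minimal script
printf '#!/bin/sh\necho hello from script\n' > hello.sh

# Without the execute bit, ./hello.sh would be refused with "Permission denied"
chmod u+x hello.sh

# Now it runs directly; ls -l shows the x in the user column
./hello.sh   # → hello from script
ls -l hello.sh
```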
Use sudo sparingly. If you need it to install Python packages, consider fixing your environment instead.
A good rule: if a command makes you sweat, try a dry-run or read the `--help` output first.
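One cheap dry-run trick: prefix the scary command with `echo` so you see what *would* run without running it — a sketch with made-up file names:

```shell
# A couple of throwaway files to rename
touch report_a.csv report_b.csv

# Dry run: prints the mv commands instead of executing them
for f in report_*.csv; do
  echo mv "$f" "archive/$f"
done

# Happy with the preview? Delete the echo and run it for real.
```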
Customize Your Shell (Treat Yo’Self)
- Add aliases and functions in `~/.bashrc` or `~/.zshrc`:
alias gs='git status'
alias ll='ls -lah'
function mkcd() { mkdir -p "$1" && cd "$1"; }
- Persistent environment setup:
export PYTHONBREAKPOINT=ipdb.set_trace
export PIP_INDEX_URL=https://pypi.org/simple
Reload with source ~/.zshrc (or open a new terminal).
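Once sourced, `mkcd` collapses the make-then-enter two-step into one command:

```shell
# Same helper as above: make a directory (parents included) and enter it
mkcd() { mkdir -p "$1" && cd "$1"; }

mkcd scratch/deep/nest
pwd   # now ends in scratch/deep/nest
```

Note it has to be a function, not a script: a script’s `cd` happens in a child process and vanishes when the script exits.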
Cross-Platform Notes (So You Don’t Cry Later)
- Windows: PowerShell is not bash. Install WSL for a Linux-like environment.
- Paths: Windows uses backslashes; bash uses slashes. Many tools expect `/`.
- Quoting rules differ; when scripts must run everywhere, prefer Python entrypoints.
Cheat Sheet: Commands You’ll Actually Use
| Command | What it does | Why a data person cares |
|---|---|---|
| `ls -lah` | List files with sizes | Spot giant CSVs before RAM screams |
| `head`/`tail` | Peek at files | Sanity-check data quickly |
| `wc -l` | Count lines | Instant row count |
| `cut`/`sort`/`uniq` | Column ops + dedupe | Explore categories and frequency |
| `grep -R` | Search text recursively | Find code, configs, log patterns |
| `find` | Locate files by name/type | Hunt notebooks or models |
| `jq` | JSON query | APIs, logs, configs at speed |
| `conda`/`venv` | Manage environments | Reproducible science |
| `python script.py` | Run scripts | Batch jobs, automation |
| `jupyter nbconvert` | Execute notebooks | CI and reproducibility |
| `curl`/`wget` | Download data | Pipeline inputs |
| `git` | Version control | Collaborate without chaos |
Small Frictions That Cause Big Headaches (and Fixes)
- Spaces in filenames? Use quotes: `cd 'My Data'`
- Accidentally nuked a folder with `rm -r`? Consider `trash-cli` to send to system trash instead.
- Mysterious 'command not found'? Check `echo $PATH`. If a tool isn’t on PATH, either reinstall or export its path.
- Python mismatch? `which python`, then `python -V`. Activate the right environment.
- Slow notebook? Check running processes with `top` or `htop` (install it first), and watch that memory.
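To read `$PATH` without squinting at one long colon-separated line, split it one directory per line:

```shell
# PATH entries are separated by colons; tr swaps them for newlines
echo "$PATH" | tr ':' '\n'
```

If the directory holding your tool isn’t in that list, that’s your 'command not found' right there.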
Try This Mini-Workflow
# 1) Create project skeleton
mkdir -p ds-project/{data/raw,data/processed,src,notebooks}
cd ds-project
# 2) Environment
python -m venv .venv && source .venv/bin/activate
pip install pandas scikit-learn jupyter
# 3) Get data
curl -L -o data/raw/titanic.csv https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
# 4) Quick checks
wc -l data/raw/titanic.csv
head data/raw/titanic.csv | cut -d, -f3 | sort | uniq -c
# 5) Start notebook for exploration
jupyter lab
If it feels smooth, you’ve tasted CLI power. If it feels chaotic, that’s normal — you just leveled up from tourist to apprentice.
Wrap-Up: The CLI Is Your Exoskeleton
- The command line gives you speed, automation, and reproducibility.
- Environments, notebooks, and scripts all become more useful when you can glue them with pipes, redirection, and a few trusty utilities.
- Your future self (and your teammates) will thank you for commands that can be documented, versioned, and rerun.
Final insight: Tools change; text interfaces endure. Learn the CLI once, and every new stack bows a little faster.
Now go open a terminal and make your computer do tricks.