Data Sources, Engineering, and Deployment
Acquire data from files, web, and databases; then test, package, version, and deploy reliable services.
Data Versioning with DVC — Make Your Data and Models Reproducible (Without a Meltdown)
"You already know how to version code with Git and scale data with Spark — now let’s stop pretending datasets and model weights are just ‘files’ and start treating them like first-class citizens."
Why DVC matters (building on Git & Spark)
You learned Git & GitHub workflows for code and Spark for huge datasets. Great. But Git does not like giant CSVs, binary model weights, or multi-GB intermediate artifacts. DVC (Data Version Control) is the tool that fills that gap: it versions data, models, and pipelines while letting Git keep the lightweight metadata. Think: Git for code + luggage service for heavy files.
Where you'll use it in real life:
- Tracking raw datasets, preprocessing outputs, and trained PyTorch model weights across experiments.
- Reproducing training runs (so deployment doesn’t become a mysterious ritual).
- Collaborating: teammates can check out a commit and run `dvc pull` to get the exact data and models used.
Core concepts — short and punchy
- Metadata-only in Git: DVC stores tiny pointer files (.dvc and dvc.yaml) in Git, not the raw GBs.
- Remote storage: large files live in S3/GCS/SSH or other remotes; dvc push/pull syncs them.
- Cache: DVC uses a local cache to avoid re-downloading unchanged files.
- Pipelines: dvc.yaml describes stages (deps → cmd → outs), so runs are reproducible.
- Experiments: dvc exp helps run and compare hyperparameter variations without polluting Git history.
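Under the hood, the metadata/cache split rests on content addressing: DVC hashes each file and stores the bytes under that hash, so unchanged content is never stored or transferred twice. A simplified sketch of the idea (DVC's real cache also handles directories, remotes, and hard links):

```python
import hashlib
import tempfile
from pathlib import Path

def cache_file(path: Path, cache_dir: Path) -> str:
    """Copy `path` into a content-addressed cache and return its MD5 digest."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    # Like DVC, shard the cache by the first two hex characters of the hash.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(path.read_bytes())
    return digest  # this digest is what the tiny .dvc pointer file records

# Usage: caching identical content twice hits the same cache entry.
tmp = Path(tempfile.mkdtemp())
(tmp / "data.csv").write_text("a,b\n1,2\n")
cache = tmp / "cache"
h1 = cache_file(tmp / "data.csv", cache)
h2 = cache_file(tmp / "data.csv", cache)
print(h1 == h2)  # True: same bytes, same cache entry, nothing duplicated
```

This is also why `dvc pull` is cheap after small changes: only files whose hashes differ move over the network.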
Quick example workflow (the commands you'll actually run)
- Initialize:

      git init
      dvc init

- Add a large dataset or model:

      dvc add data/raw/images
      git add data/raw/.gitignore data/raw/images.dvc
      git commit -m "Add raw images metadata"

- Configure remote storage and push:

      dvc remote add -d myremote s3://my-bucket/dvc-store
      dvc push    # uploads cached files to the S3 remote

- Reproduce the pipeline locally (or in CI):

      git checkout <commit>
      dvc pull    # download the data and models for this commit
      dvc repro   # rerun pipeline stages to reproduce outputs
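For context, the pointer file that `dvc add` generates (and that Git tracks) is tiny. A `data/raw/images.dvc` looks roughly like this (the hash, size, and file count are illustrative):

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir  # hash of the directory's contents
  size: 104857600
  nfiles: 1200
  path: images
```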
DVC pipelines — connect your PyTorch training and Spark steps
Create a dvc.yaml that chains data prep (maybe a Spark job), training (PyTorch), and evaluation:
    stages:
      prep:
        cmd: spark-submit scripts/prep.py data/raw data/prepared
        deps:
          - scripts/prep.py
          - data/raw
        outs:
          - data/prepared
      train:
        cmd: python src/train.py --config params.yaml
        deps:
          - src/train.py
          - data/prepared
        params:
          - training.epochs
          - training.lr
        outs:
          - models/model.pt
      eval:
        cmd: python src/eval.py models/model.pt data/val metrics.json
        deps:
          - src/eval.py
          - models/model.pt
          - data/val
        metrics:
          - metrics.json:
              cache: false
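On the evaluation side, all DVC needs is for `eval.py` to write a JSON file at the declared path; a flat mapping of metric names to numbers is what `dvc metrics show` displays best. A minimal hypothetical sketch (the numbers are placeholders for real validation results):

```python
import json

def write_metrics(path: str, metrics: dict) -> None:
    """Write a flat metrics dict as JSON, the format DVC's metrics commands read."""
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)

# In the real eval.py these values would come from running the model on data/val.
write_metrics("metrics.json", {"accuracy": 0.93, "val_loss": 0.21})
print(json.load(open("metrics.json"))["accuracy"])  # 0.93
```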
Why this is 🔑:
- DVC records every stage's dependencies and outputs, so `dvc repro` only reruns the stages whose inputs actually changed.
- You can have a Spark stage produce cleaned Parquet files, then a PyTorch stage that consumes them.
Experiments and hyperparameters (PyTorch lovers, listen up)
Use params.yaml to keep hyperparameters tracked and versionable:
    training:
      epochs: 10
      lr: 0.001
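Your training script then reads the same file, so code and hyperparameters stay in sync. In a real project you would load it with PyYAML (`yaml.safe_load`); the sketch below hand-parses only the flat two-level layout shown above, just to keep the example dependency-free:

```python
def load_params(text: str) -> dict:
    """Tiny parser for the flat section/key layout of the params.yaml above.
    Real code should use yaml.safe_load from PyYAML instead."""
    params, section = {}, None
    for line in text.splitlines():
        if not line.strip() or line.strip().startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(":")
        if line.startswith((" ", "\t")):  # indented line -> key inside a section
            value = value.strip()
            params[section][key.strip()] = float(value) if "." in value else int(value)
        else:                             # top-level line -> new section
            section = key.strip()
            params[section] = {}
    return params

cfg = load_params("training:\n  epochs: 10\n  lr: 0.001")
print(cfg["training"]["epochs"], cfg["training"]["lr"])  # 10 0.001
```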
Change the params, run an experiment, and compare metrics, all without committing to Git history:

    dvc exp run            # runs the pipeline with current params
    dvc exp show           # tabular view of experiments
    dvc metrics diff HEAD  # compare workspace metrics against the last commit

When you're ready to keep an experiment permanently, run `dvc exp apply <exp-name>`, then commit (and optionally `dvc push` the DVC-tracked outputs).
Collaboration & CI: how DVC interacts with Git & GitHub Workflows
You already use GitHub Actions for tests — add a couple of DVC steps so CI can reproduce and validate models before deployment.
Minimal GitHub Actions snippet:
    - uses: actions/checkout@v3
    - name: Setup DVC
      uses: iterative/setup-dvc@v2
    - run: dvc pull --remote myremote
    - run: dvc repro
    - run: dvc metrics show
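One common gotcha: `dvc pull` needs credentials for the remote in CI. Keep them in repository secrets and expose them as environment variables on the pull step; for the S3 remote configured above, that looks roughly like:

```yaml
- run: dvc pull --remote myremote
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```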
Tips:
- Keep only the lightweight metadata (`.dvc` files, `dvc.yaml`, `dvc.lock`) in Git.
- Store real data in a secure remote (S3 with proper IAM or a private GCS bucket).
- Use Git tags/releases to mark model-ready commits, and push the corresponding DVC outputs with `dvc push`.
Best practices (so your team doesn’t suffer)
- Small metadata in Git, large files in remotes. Never commit raw dataset binaries to Git.
- Track parameters and metrics. Put hyperparams in params.yaml and metrics in JSON (DVC reads metrics automatically).
- Use branches or dvc experiments for exploratory work. Merge only the successful experiments.
- Be disciplined with remotes. Configure a default remote and backup policy; treat storage costs seriously.
- Document data provenance. Use `dvc import-url` for external datasets so provenance is explicit.
Caveats & real-world considerations
- DVC is not a data catalog or monitoring system — pair it with tools like Great Expectations or a metadata catalog if you need schema checks and lineage dashboards.
- Remote bandwidth and storage costs can add up. Use lifecycle rules on S3 and sensible retention.
- The DVC cache can grow large; use `dvc gc` to clean unused cache entries when it's safe.
Quick comparison table
| Capability | Git | DVC |
|---|---|---|
| Large files | No (too big) | Yes (remote + cache) |
| Lightweight metadata | Yes | Yes (.dvc, dvc.yaml) |
| Pipelines | Basic (hooks) | Full dependency graph (dvc repro) |
| Experiments | Ad-hoc branches | Built-in (dvc exp) |
Final takeaways — for the caffeinated student
- DVC is the bridge between your code (Git) and heavy data/model artifacts (S3/etc). Use it to make ML reproducible.
- It integrates naturally with Spark preprocessing and PyTorch model training: declare stages and let DVC handle the wiring.
- Use experiments to try many hyperparameters without polluting Git; when one wins, apply and commit.
"Treat your data like your code’s trusted coauthor: versioned, referenced, and never left behind in someone’s Downloads folder."