Bonus Labs: Hands-on with Hugging Face PEFT and QLoRA on Llama/Mistral
Hands-on, lab-focused learning with real models to solidify PEFT workflows, QLoRA experimentation, and end-to-end fine-tuning that mirrors production setups.
12.1 Lab Setup: Environment and Reproducibility
"If you mess up the environment, your model will be right 100% of the time — wrong in a reproducible way."
You already learned to budget for bug bashes, negotiate vendor credits, and measure capex vs opex in the Cost Modeling arc. Good. Now we convert that fiscal wisdom into something practical: an environment that is stable, repeatable, and cheap enough to not bankrupt your team's snack fund. This lab gets you from chaos to deterministic(ish) training runs for PEFT and QLoRA on Llama or Mistral-style weights.
Why this matters (quick recap to connect to previous units)
- Cost modeling taught you which instance types and preemption strategies save money. But if your environment is flaky, those savings are eaten by wasted runs and mysterious performance regressions.
- Vendor negotiation might have secured model licenses or infra credits. Use reproducibility to actually spend those credits wisely — not on reruns chasing nondeterministic failures.
In short: tight environments = predictable spend + predictable outcomes. Now let us provision that predictability.
1) Core principles for reproducible PEFT/QLoRA labs
- Pin everything: Python, CUDA, torch, bitsandbytes, transformers, peft, accelerate. Versions matter.
- Containerize: Docker keeps your local machine from being the wild card. Use the same container on dev and CI.
- Seed everything: Python random, NumPy, Torch, dataloader workers, and any library RNGs.
- Document and log: commit a requirements file, accelerate config, and a small README. Use W&B or MLflow for run metadata and config.
2) Example environment artifacts
Below are practical snippets to drop into your repo. Treat them as templates — not holy scripture.
Minimal requirements.txt (pin versions)
torch==2.2.0
transformers==4.33.2
accelerate==0.20.3
peft==0.4.0
bitsandbytes==0.41.0
safetensors==0.4.2
datasets==2.13.0
tokenizers==0.15.2
wandb==0.15.2
numpy==1.26.0
Tip: pip freeze > requirements.txt after setting up a golden env, then use pip install -r requirements.txt for reproducibility.
Dockerfile skeleton
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y git python3-pip
COPY requirements.txt /tmp/
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r /tmp/requirements.txt
WORKDIR /workspace
accelerate config (generating it interactively via the accelerate config CLI is fine too)
compute_environment: LOCAL_MACHINE
distributed_type: NO
mixed_precision: bf16
num_processes: 1
3) Seed and deterministic settings (Python snippet)
import os
import random

import numpy as np
import torch

SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
# Required by torch.use_deterministic_algorithms for some CUDA matmul kernels
os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# cuDNN options: deterministic helps repeatability but may slow things down
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Optional stricter API; raises on ops that lack deterministic implementations
try:
    torch.use_deterministic_algorithms(True)
except Exception:
    pass
Note: full determinism is often impossible with mixed precision, some GPU kernels, or non-deterministic CUDA ops. The goal is to reduce noise, not banish it entirely.
4) Hardware, cost trade-offs, and instance choices
Pick hardware with the same mindset you used for budgeting: balance cost, availability, and time-to-result.
| GPU family | Memory | Best for | Cost/efficiency note |
|---|---|---|---|
| A100 40GB | 40 GB | Larger QLoRA runs, 8-bit/4-bit PEFT | Very reliable for multi-GPU, priced for enterprise |
| H100 | 80+ GB | Heavy training, best perf | Expensive, great if budget allows |
| RTX 4090 | 24 GB | Single-GPU 4-bit QLoRA experiments | Cheap for fast iteration, limited multi-GPU scale |
Operational tips:
- Use spot/preemptible VMs for cheap experimentation but checkpoint often. Align checkpoint cadence with cost model from 11.15.
- If you negotiated vendor credits, use them on H100s for the heavy sweeps and run cheaper spot A100s for tuning.
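The checkpoint-cadence trade-off above can be sketched numerically. Everything in this snippet is hypothetical: the step time, checkpoint cost, preemption rate, and spot price are made-up inputs, and the model makes the simplifying assumption that each preemption loses, on average, half a checkpoint interval of work.

```python
# Back-of-envelope checkpoint cadence model. All numbers are hypothetical;
# plug in your own spot price, preemption rate, and measured step timings.
def expected_cost_per_run(step_time_s, total_steps, ckpt_every,
                          ckpt_time_s, preempt_per_hour, price_per_hour):
    """Rough expected cost of one training run on a preemptible VM."""
    # Hours spent on useful steps plus checkpoint overhead
    base_hours = (total_steps * step_time_s +
                  (total_steps // ckpt_every) * ckpt_time_s) / 3600
    # Expected number of preemptions over the run
    expected_preemptions = preempt_per_hour * base_hours
    # Each preemption loses ~half a checkpoint interval of compute
    lost_hours = expected_preemptions * (ckpt_every * step_time_s / 2) / 3600
    return (base_hours + lost_hours) * price_per_hour

# Compare two cadences under the same (made-up) conditions:
# 2 s/step, 10k steps, 30 s/checkpoint, 0.5 preemptions/hour, $1.20/hour
cheap = expected_cost_per_run(2.0, 10_000, 200, 30, 0.5, 1.20)
lazy = expected_cost_per_run(2.0, 10_000, 2_000, 30, 0.5, 1.20)
print(f"every 200 steps: ${cheap:.2f}, every 2000 steps: ${lazy:.2f}")
```

At this preemption rate, frequent checkpointing wins despite the extra checkpoint overhead; with cheap on-demand hardware and rare preemptions the balance flips, which is exactly why the cadence belongs in your cost model rather than in folklore.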
5) BitsAndBytes and QLoRA setup caveats
- bitsandbytes requires a matching CUDA + torch combination. Verify against the compatibility matrix in the bitsandbytes repo before building.
- For 4-bit QLoRA, enable the bnb quantization options and prefer safetensors-format weights. Avoid trust_remote_code unless you have audited the code it pulls in.
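As a sketch, the 4-bit configuration objects look like this, assuming the pinned transformers/peft/bitsandbytes versions above. The LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative defaults, not tuned values; adjust target_modules to match your model's attention layer names.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA-style 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, the QLoRA default
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative LoRA config for a Llama/Mistral-style causal LM
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Then load and wrap the model:
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, quantization_config=bnb_config, device_map="auto")
#   model = get_peft_model(model, lora_config)
```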
Quick validation commands after environment build:
python -c "import torch; print('cuda',torch.cuda.is_available(), 'version', torch.version.cuda)"
python -c "import bitsandbytes as bnb; print('bnb ok')"
python -c "from transformers import AutoTokenizer; print('hf ok')"
6) Data handling and deterministic dataloaders
- Shuffle with a fixed seed. For a PyTorch DataLoader, pass a generator seeded with manual_seed; set num_workers=0 if you need strict ordering, or seed each worker via worker_init_fn when you need parallelism.
- Use fixed preprocessing scripts. Commit preprocessing outputs or record dataset hashes (sha256) to ensure you trained on the same bytes.
import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(SEED)
loader = DataLoader(dataset, batch_size=8, shuffle=True, generator=g, num_workers=0)
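Recording the dataset hash mentioned above needs nothing beyond the standard library. A minimal sketch (the file path is an example):

```python
# Record a checksum of the exact bytes you trained on, so later reruns can
# verify they consumed the same data. Pure stdlib, streams in chunks.
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# e.g. digest = file_sha256('data/train.jsonl')  # store alongside run metadata
```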
7) Checkpoints, experiment tracking, and CI
- Checkpoint frequency is both a reliability and cost decision. More frequent checkpoints cost storage but cut recompute on preemptions. Use your budget model to pick a cadence.
- Track metadata: commit hash, requirements.txt, accelerate config, seed, model config, and dataset checksum. Store these alongside W&B run or MLflow run.
- Add a simple CI job that builds the Docker image and runs a smoke test to confirm the environment loads the model and tokenizes an input.
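The metadata-tracking bullet above can be a ten-line helper. A minimal sketch, using only the standard library; the file name and field names are examples, not a required schema:

```python
# Snapshot run metadata (commit hash, seed, dataset checksum) to a JSON file
# you store next to checkpoints and attach to the W&B/MLflow run.
import json
import subprocess

def collect_metadata(seed, dataset_sha256, extra=None):
    """Gather the minimum metadata needed to reproduce a run."""
    try:
        commit = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD'], text=True).strip()
    except Exception:
        commit = 'unknown'  # e.g. not running inside a git checkout
    meta = {'commit': commit, 'seed': seed, 'dataset_sha256': dataset_sha256}
    meta.update(extra or {})
    return meta

def write_metadata(meta, path='run_metadata.json'):
    with open(path, 'w') as f:
        json.dump(meta, f, indent=2)
```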
8) Quick troubleshooting checklist
- nvidia-smi shows the GPU and driver. If absent, check Docker runtime or VM driver installation.
- CUDA/Torch mismatch: ensure torch was installed from the correct wheel for the CUDA version.
- bitsandbytes errors: re-check the CUDA toolkit and bnb build compatibility.
- If results differ unexpectedly, incrementally disable nondeterministic features like mixed_precision and compare.
Closing — TL;DR + a little wisdom
- Pin it, containerize it, seed it, log it. These are your new commandments.
- Use the budgeting lessons from earlier: choose instances that minimize total cost of repeat runs, not just per-hour price. Factor checkpointing, preemption, and vendor credits into the math.
Final thought: reproducibility is not a single switch. It is a discipline. Every saved config, pinned version, and committed Dockerfile is a tiny investment that saves hours, money, and cognitive sanity later. Train that discipline like you train your models: iteratively, and with metrics.
version_name: "Deterministic-ish Lab Setup — Pin It and Ship It"
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!