Bonus Labs: Hands-on with Hugging Face PEFT and QLoRA on Llama/Mistral
Hands-on, lab-focused learning with real models to solidify PEFT workflows, QLoRA experimentation, and end-to-end fine-tuning that mirrors production setups.
12.1 Lab Setup: Environment and Reproducibility
"If you mess up the environment, your model will be right 100% of the time — wrong in a reproducible way."
You already learned to budget for bug bashes, negotiate vendor credits, and measure capex vs opex in the Cost Modeling arc. Good. Now we convert that fiscal wisdom into something practical: an environment that is stable, repeatable, and cheap enough to not bankrupt your team's snack fund. This lab gets you from chaos to deterministic(ish) training runs for PEFT and QLoRA on Llama or Mistral-style weights.
Why this matters (quick recap to connect to previous units)
- Cost modeling taught you which instance types and preemption strategies save money. But if your environment is flaky, those savings are eaten by wasted runs and mysterious performance regressions.
- Vendor negotiation might have secured model licenses or infra credits. Use reproducibility to actually spend those credits wisely — not on reruns chasing nondeterministic failures.
In short: tight environments = predictable spend + predictable outcomes. Now let us provision that predictability.
1) Core principles for reproducible PEFT/QLoRA labs
- Pin everything: Python, CUDA, torch, bitsandbytes, transformers, peft, accelerate. Versions matter.
- Containerize: Docker keeps your local machine from being the wild card. Use the same container on dev and CI.
- Seed everything: Python random, NumPy, Torch, dataloader workers, and any library RNGs.
- Document and log: commit a requirements file, accelerate config, and a small README. Use W&B or MLflow for run metadata and config.
2) Example environment artifacts
Below are practical snippets to drop into your repo. Treat them as templates — not holy scripture.
Minimal requirements.txt (pin versions)
torch==2.2.0
transformers==4.33.2
accelerate==0.20.3
peft==0.4.0
bitsandbytes==0.41.0
safetensors==0.4.2
datasets==2.13.0
tokenizers==0.15.2
wandb==0.15.2
numpy==1.26.0
Tip: pip freeze > requirements.txt after setting up a golden env, then use pip install -r requirements.txt for reproducibility.
Dockerfile skeleton
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y git python3-pip
COPY requirements.txt /tmp/
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r /tmp/requirements.txt
WORKDIR /workspace
accelerate config (generating it interactively via the accelerate config CLI is fine too)
compute_environment: LOCAL_MACHINE
distributed_type: NO
mixed_precision: bf16
num_processes: 1
3) Seed and deterministic settings (Python snippet)
import os
import random

import numpy as np
import torch

SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
# Required by torch.use_deterministic_algorithms for some CUDA matmul kernels
os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# cuDNN options: deterministic helps repeatability but may slow things down
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Optional stricter API; raises on ops that lack deterministic implementations
try:
    torch.use_deterministic_algorithms(True)
except Exception:
    pass
Note: full determinism is often impossible with mixed precision, some GPU kernels, or non-deterministic CUDA ops. The goal is to reduce noise, not banish it entirely.
4) Hardware, cost trade-offs, and instance choices
Pick hardware with the same mindset you used for budgeting: balance cost, availability, and time-to-result.
| GPU family | Memory | Best for | Cost/efficiency note |
|---|---|---|---|
| A100 40GB | 40 GB | Larger QLoRA runs, 8-bit/4-bit PEFT | Very reliable for multi-GPU, priced for enterprise |
| H100 | 80+ GB | Heavy training, best perf | Expensive, great if budget allows |
| RTX 4090 | 24 GB | Single-GPU 4-bit QLoRA experiments | Cheap for fast iteration, limited multi-GPU scale |
Operational tips:
- Use spot/preemptible VMs for cheap experimentation but checkpoint often. Align checkpoint cadence with cost model from 11.15.
- If you negotiated vendor credits, use them on H100s for the heavy sweeps and run cheaper spot A100s for tuning.
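The checkpoint-cadence trade-off above can be sketched numerically. Everything in this snippet is hypothetical: the step time, checkpoint cost, preemption rate, and spot price are made-up inputs, and the model makes the simplifying assumption that each preemption loses, on average, half a checkpoint interval of work.

```python
# Back-of-envelope checkpoint cadence model. All numbers are hypothetical;
# plug in your own spot price, preemption rate, and measured step timings.
def expected_cost_per_run(step_time_s, total_steps, ckpt_every,
                          ckpt_time_s, preempt_per_hour, price_per_hour):
    """Rough expected cost of one training run on a preemptible VM."""
    # Hours spent on useful steps plus checkpoint overhead
    base_hours = (total_steps * step_time_s +
                  (total_steps // ckpt_every) * ckpt_time_s) / 3600
    # Expected number of preemptions over the run
    expected_preemptions = preempt_per_hour * base_hours
    # Each preemption loses ~half a checkpoint interval of compute
    lost_hours = expected_preemptions * (ckpt_every * step_time_s / 2) / 3600
    return (base_hours + lost_hours) * price_per_hour

# Compare two cadences under the same (made-up) conditions:
# 2 s/step, 10k steps, 30 s/checkpoint, 0.5 preemptions/hour, $1.20/hour
cheap = expected_cost_per_run(2.0, 10_000, 200, 30, 0.5, 1.20)
lazy = expected_cost_per_run(2.0, 10_000, 2_000, 30, 0.5, 1.20)
print(f"every 200 steps: ${cheap:.2f}, every 2000 steps: ${lazy:.2f}")
```

At this preemption rate, frequent checkpointing wins despite the extra checkpoint overhead; with cheap on-demand hardware and rare preemptions the balance flips, which is exactly why the cadence belongs in your cost model rather than in folklore.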
5) BitsAndBytes and QLoRA setup caveats
- bitsandbytes requires a matching CUDA + torch combination. Verify against the compatibility matrix in the bitsandbytes repo before building.
- For 4-bit QLoRA, enable the bnb quantization options and prefer safetensors-format weights. Avoid trust_remote_code unless you have audited the code it pulls in.
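As a sketch, the 4-bit configuration objects look like this, assuming the pinned transformers/peft/bitsandbytes versions above. The LoRA hyperparameters (r, lora_alpha, target_modules) are illustrative defaults, not tuned values; adjust target_modules to match your model's attention layer names.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA-style 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, the QLoRA default
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative LoRA config for a Llama/Mistral-style causal LM
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Then load and wrap the model:
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, quantization_config=bnb_config, device_map="auto")
#   model = get_peft_model(model, lora_config)
```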
Quick validation commands after environment build:
python -c "import torch; print('cuda',torch.cuda.is_available(), 'version', torch.version.cuda)"
python -c "import bitsandbytes as bnb; print('bnb ok')"
python -c "from transformers import AutoTokenizer; print('hf ok')"
6) Data handling and deterministic dataloaders
- Shuffle with a fixed seed. For a PyTorch DataLoader, pass a generator seeded with manual_seed; set num_workers=0 if you need strict ordering, or seed each worker via worker_init_fn when you need parallelism.
- Use fixed preprocessing scripts. Commit preprocessing outputs or record dataset hashes (sha256) to ensure you trained on the same bytes.
import torch
from torch.utils.data import DataLoader

g = torch.Generator()
g.manual_seed(SEED)
loader = DataLoader(dataset, batch_size=8, shuffle=True, generator=g, num_workers=0)
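Recording the dataset hash mentioned above needs nothing beyond the standard library. A minimal sketch (the file path is an example):

```python
# Record a checksum of the exact bytes you trained on, so later reruns can
# verify they consumed the same data. Pure stdlib, streams in chunks.
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# e.g. digest = file_sha256('data/train.jsonl')  # store alongside run metadata
```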
7) Checkpoints, experiment tracking, and CI
- Checkpoint frequency is both a reliability and cost decision. More frequent checkpoints cost storage but cut recompute on preemptions. Use your budget model to pick a cadence.
- Track metadata: commit hash, requirements.txt, accelerate config, seed, model config, and dataset checksum. Store these alongside W&B run or MLflow run.
- Add a simple CI job that builds the Docker image and runs a smoke test to confirm the environment loads the model and tokenizes an input.
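The metadata-tracking bullet above can be a ten-line helper. A minimal sketch, using only the standard library; the file name and field names are examples, not a required schema:

```python
# Snapshot run metadata (commit hash, seed, dataset checksum) to a JSON file
# you store next to checkpoints and attach to the W&B/MLflow run.
import json
import subprocess

def collect_metadata(seed, dataset_sha256, extra=None):
    """Gather the minimum metadata needed to reproduce a run."""
    try:
        commit = subprocess.check_output(
            ['git', 'rev-parse', 'HEAD'], text=True).strip()
    except Exception:
        commit = 'unknown'  # e.g. not running inside a git checkout
    meta = {'commit': commit, 'seed': seed, 'dataset_sha256': dataset_sha256}
    meta.update(extra or {})
    return meta

def write_metadata(meta, path='run_metadata.json'):
    with open(path, 'w') as f:
        json.dump(meta, f, indent=2)
```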
8) Quick troubleshooting checklist
- nvidia-smi shows the GPU and driver. If absent, check Docker runtime or VM driver installation.
- CUDA/Torch mismatch: ensure torch was installed from the correct wheel for the CUDA version.
- bitsandbytes errors: re-check the CUDA toolkit and bnb build compatibility.
- If results differ unexpectedly, incrementally disable nondeterministic features like mixed_precision and compare.
Closing — TL;DR + a little wisdom
- Pin it, containerize it, seed it, log it. These are your new commandments.
- Use the budgeting lessons from earlier: choose instances that minimize total cost of repeat runs, not just per-hour price. Factor checkpointing, preemption, and vendor credits into the math.
Final thought: reproducibility is not a single switch. It is a discipline. Every saved config, pinned version, and committed Dockerfile is a tiny investment that saves hours, money, and cognitive sanity later. Train that discipline like you train your models: iteratively, and with metrics.
version_name: "Deterministic-ish Lab Setup — Pin It and Ship It"
Comments (0)
Please sign in to leave a comment.
No comments yet. Be the first to comment!