Bonus Labs: Hands-on with Hugging Face PEFT and QLoRA on Llama/Mistral
Hands-on, lab-focused learning with real models to solidify PEFT workflows, QLoRA experimentation, and end-to-end fine-tuning that mirrors production setups.
12.2 Quickstart: PEFT with LoRA on Llama 2
Hands-on, pragmatic, and slightly caffeinated — we're taking Llama 2 for a micro-tune with PEFT's LoRA so you can get big-model results without selling your cloud credits to a mysterious island nation.
You already set up the environment and reproducibility in 12.1, and you remember how cost modeling (from the budgeting chapters) keeps us humble about GPU choices. Good — we’re building on that. This quickstart skips the fluff and shows the fast path from "I have Llama 2" to "I have a LoRA adapter I can deploy." Expect code snippets, real-world tips, and a few metaphors that will stick.
Why LoRA again? (The TL;DR economics)
- LoRA (Low-Rank Adaptation) trains only small low-rank matrices attached to the large weight matrices, instead of updating the whole multi-billion-parameter model.
- Result: massively reduced trainable parameters, lower GPU memory, shorter turnaround, and far lower opex compared to full fine-tuning — exactly the levers we discussed in the cost-modeling chapters.
Imagine tuning a 7B model like swapping the engine’s spark plugs instead of rebuilding the engine.
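A back-of-envelope count shows how big the reduction is. This sketch assumes Llama-2-7B's usual shapes (hidden size 4096, 32 decoder layers) and LoRA on q_proj and v_proj at rank 8, matching the config used later in this quickstart:

```python
hidden, layers, targets, r = 4096, 32, 2, 8

# each adapted matrix gets two low-rank factors: A (r x hidden) and B (hidden x r)
lora_params = layers * targets * 2 * hidden * r
full_params = 7_000_000_000  # the frozen base weights

print(f"trainable: {lora_params:,}")                         # ~4.2M
print(f"fraction of base: {lora_params / full_params:.3%}")  # well under 0.1%
```

Roughly four million trainable parameters versus seven billion frozen ones: that ratio is the entire economic argument in two lines of arithmetic.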
Quick checklist before you start
- You followed 12.1: environment reproducible, HF token available, deterministic seed set. ✅
- You considered costs: GPU type, spot pricing, and how many hours you can afford (see 11.15 & 11.14). ✅
- You have a small training set (instructions or few-shot examples) — start tiny and iterate. ✅
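If you need a refresher on the "deterministic seed" item, here is a minimal sketch. The seed value is an arbitrary assumption (use whatever you fixed in 12.1); with torch installed you would also call torch.manual_seed, or let transformers.set_seed handle all three at once:

```python
import random

import numpy as np

SEED = 1234  # assumption: pick one value and reuse it across every run


def seed_everything(seed: int) -> None:
    """Seed the RNGs we control here; transformers.set_seed(seed)
    additionally covers torch when it is installed."""
    random.seed(seed)
    np.random.seed(seed)


seed_everything(SEED)
first = np.random.rand(3)
seed_everything(SEED)
second = np.random.rand(3)
assert (first == second).all()  # identical draws after reseeding
```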
Installation (one-liners)
pip install --upgrade transformers accelerate datasets peft bitsandbytes huggingface-hub
BitsAndBytes gives you 8-bit/4-bit loading (critical for memory) and PEFT is the Hugging Face library for adapters like LoRA.
Minimal script: PEFT + LoRA on Llama 2 (conceptual, copy-paste friendly)
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

MODEL = "meta-llama/Llama-2-7b-hf"  # or your HF path

tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# load in 8-bit to save memory (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    load_in_8bit=True,
    device_map="auto",
)

# make the model friendly for 8-bit training
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=8,  # LoRA rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# load a tiny dataset (replace with your instruction-tuning JSONL)
dataset = load_dataset("huggingface/instruction-following-demo", split="train[:200]")

# tokenize with a small max_length; adapt the field name to your dataset schema
def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset_tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# causal-LM collator: dynamic padding, labels copied from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    output_dir="./lora-llama2-quick",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized,
    data_collator=data_collator,
)
trainer.train()
model.save_pretrained("./lora-llama2-quick")
Note: replace dataset and tokenization bits with your actual preprocessing. The above shows the critical pieces: 8-bit loading, prepare_model_for_int8_training, LoraConfig, get_peft_model.
Key hyperparameters and pragmatic defaults
- r (rank): 4–16. Lower = cheaper, higher = more capacity. Start at 8.
- alpha: 16–64. Scales the update. 32 is a solid default.
- target_modules: pick the attention projections (e.g., q_proj, v_proj); these capture much of the model's behavior. If you're feeling bold, include k_proj, o_proj, or the MLP gate_proj/up_proj for instruction-style tuning.
- lora_dropout: 0.05–0.2 to regularize.
- batch/accum: use small per-device batch (1–4) with gradient_accumulation to simulate larger batches.
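The batch/accum bullet in concrete numbers, using the values from the TrainingArguments above (single-GPU run assumed):

```python
per_device_batch = 1
grad_accum_steps = 8
num_gpus = 1  # assumption: a single-GPU quickstart

# gradients accumulate over 8 micro-batches before each optimizer step,
# so the optimizer sees an effective batch of 8 sequences
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 8

# with the 200-example demo split, that's 25 optimizer steps per epoch
print(200 // effective_batch)  # 25
```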
Why target only some modules? Because the attention matrices are where the model routes meaning — adding LoRA here gives disproportionate impact for fewer parameters.
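A toy numpy sketch of the LoRA update itself (W_eff = W + (alpha/r) * B @ A) makes the parameter math concrete. Dimensions here are illustrative, not Llama's 4096-wide projections:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64       # toy weight dimension (Llama's q_proj is 4096 x 4096)
r = 8        # LoRA rank
alpha = 32   # scaling factor

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable, initialized small
B = np.zeros((d, r))                 # trainable, initialized to zero

# the effective weight the forward pass uses
W_eff = W + (alpha / r) * (B @ A)

# B starts at zero, so training begins exactly at the pretrained weights
assert np.allclose(W_eff, W)

# trainable params: 2*d*r for the factors vs d*d for the full matrix
print(2 * d * r, "trainable vs", d * d, "full")  # 1024 vs 4096
```

Only A and B ever receive gradients; W stays frozen, which is exactly why the 8-bit base model plus fp16 adapters combination works.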
Mini table: rough resource comparison (illustrative)
| Method | Trainable params | GPU mem footprint | Typical cost (7B model) |
|---|---|---|---|
| Full fine-tune | ~7B | Very high (≥A100 80GB class) | $$$$ |
| LoRA (r=8) | ~tens of millions | Low (fits on 40GB/80GB with 8-bit) | $ |
Numbers are illustrative — the point: LoRA cuts trainable params by orders of magnitude, which translates to huge opex savings.
Practical tips & gotchas
- Use prepare_model_for_int8_training when loading in 8-bit — it modifies certain layers for stable adapter training.
- Seed everything (you did that in 12.1). Small datasets + nondeterminism = chaos.
- Save LoRA adapters separately (model.save_pretrained) — pushing the adapter to the Hub is tiny and fast.
- If your eval metrics wobble: try increasing r, or include MLP target modules.
- For long instruction fields, use dynamic padding/data collator to avoid wasting tokens and memory.
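A rough size estimate backs up the "tiny and fast" claim about pushing adapters. Assumed Llama-2-7B shapes (hidden 4096, 32 layers, q_proj + v_proj, rank 8) and fp16 adapter weights:

```python
hidden, layers, targets, r = 4096, 32, 2, 8
adapter_params = layers * targets * 2 * hidden * r  # two low-rank factors each
adapter_mb = adapter_params * 2 / 1e6               # fp16 = 2 bytes per param

print(f"adapter: ~{adapter_mb:.0f} MB")          # ~8 MB artifact
print(f"full fp16 7B: ~{7e9 * 2 / 1e9:.0f} GB")  # ~14 GB by comparison
```

An ~8 MB upload versus a ~14 GB one is the difference between sharing an experiment in seconds and scheduling a transfer.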
How this ties back to cost modeling & vendor decisions
- Lower GPU-hours -> smaller budgets and more experiments -> faster iteration. This is the same loop we prioritized in the budgeting sections.
- If a vendor offers managed fine-tuning, compare adapter training cost vs full-fine-tune cost. Adapters win more often, and you can negotiate hosting of the adapter + inference SLA (see 11.14 on negotiation tactics).
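To make that comparison concrete, here is a toy cost model. Every number is a placeholder assumption; substitute your real GPU counts, durations, and rates from the 11.x worksheets:

```python
gpu_hour_rate = 2.50  # $/GPU-hour (placeholder assumption)

full_ft = {"gpus": 8, "hours": 12}  # hypothetical full fine-tune run
lora = {"gpus": 1, "hours": 4}      # hypothetical adapter run


def run_cost(cfg, rate=gpu_hour_rate):
    """Cost of one training run: GPUs x hours x hourly rate."""
    return cfg["gpus"] * cfg["hours"] * rate


print(f"full fine-tune: ${run_cost(full_ft):.0f}")  # $240
print(f"LoRA adapter:   ${run_cost(lora):.0f}")     # $10
```

The absolute dollars are invented; the ratio is the point, since it determines how many experiments your budget buys.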
Quick troubleshooting checklist
- OOM during training: enable load_in_8bit/load_in_4bit, reduce r, enable gradient_checkpointing, lower batch size.
- No improvement in eval: increase r or add MLP modules; check dataset quality and label noise.
- Very slow: use mixed precision, gradient accumulation, or a faster optimizer such as Lion (via a custom training loop).
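For the OOM case, a quick weights-only memory budget shows why 8-bit/4-bit loading matters (activations, gradients, and optimizer state come on top of these figures):

```python
params = 7_000_000_000  # Llama-2-7B

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")
# fp16 ~14.0 GB, int8 ~7.0 GB, 4-bit ~3.5 GB
```

That is why int8 loading fits a 7B model onto a 24 GB card with room to train adapters, while fp16 alone nearly fills it.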
Closing: Key takeaways (read these and go experiment)
- LoRA + PEFT = speed + thrift: you get most of the fine-tuning benefit for a fraction of compute and cost.
- Start small, iterate quickly: small datasets, low r, and short runs give you signal fast — exactly the efficiency playbook we covered in cost modeling.
- Adapters are portable: save and share the LoRA adapter — tiny artifacts that can be deployed independently.
Final pro-tip: treat LoRA experiments like rapid product hypotheses. Test one variable at a time (rank, target modules, dataset size), track hours and costs, and you’ll actually be able to deliver improvements instead of running infinite, expensive experiments.
Now go fire up a GPU, and show Llama some tasteful direction. You’re not rebuilding the beast — you’re whispering into its ear.