Quantization, Pruning, and Compression
Techniques to shrink models and accelerate inference—quantization, pruning, distillation, and end-to-end compression pipelines with attention to accuracy, latency, and hardware support.
5.3 — 8-bit, 4-bit and Beyond: The Wild West of Low‑Bit LLMs
"Quantize like a monk: less weight, more light — but don’t drop the scripture." — your future-debugging self
You already saw the basics (5.1) and the important split between Post‑Training Quantization (PTQ) and Quantization‑Aware Training (QAT) (5.2). Now we get into the part where engineers lean over their laptops, whisper to the GPU, and try to make a trillion-parameter beast sleep in a 16GB dorm room: 8‑bit, 4‑bit, and the experimental beyond. This section builds on those concepts and the earlier module on data efficiency: when you squash models into fewer bits, your curated data matters even more — because every bit of noise gets louder.
TL;DR (for the impatient but curious)
- 8‑bit (int8): The practical, low‑risk first step. Good memory savings (~2× vs fp16) with small model-quality drop when done sensibly. Great for inference and sometimes fine‑tuning with libs like bitsandbytes.
- 4‑bit (nf4, etc.): Much higher memory savings (~4×). Enables fine‑tuning on consumer GPUs (see QLoRA). Requires smarter PTQ (GPTQ/AWQ) or QAT; more brittle but powerful.
- Below 4‑bit (3/2/1‑bit): Research territory. Potentially massive compression, but large quality regressions and heavy engineering cost. Use only if you have a compelling hardware or latency constraint and are prepared for a research sprint.
Why different bit widths? Because hardware + economics
- Memory: bits reduce storage. 8‑bit halves memory vs fp16; 4‑bit halves 8‑bit again.
- Throughput: depends on optimized kernels and hardware support. Not all GPUs accelerate sub‑8‑bit ops natively, so software emulation or custom kernels (bitsandbytes, GPTQ CUDA kernels, AWQ) matter.
- Accuracy: fewer bits = heavier rounding; some weights or activations are particularly sensitive (outliers), so naive quantization breaks things.
Think of 8‑bit as moving from a king‑sized bed to a queen: comfortable. 4‑bit is a futon. Beyond that, you're sleeping in a hammock over a volcano — thrilling, but risky.
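The mattress analogy has simple arithmetic behind it. A minimal sketch of the weight-memory math (the 7B parameter count is illustrative, and activation memory, KV cache, and quantization metadata such as scales are ignored):

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in GB.

    Ignores activations, KV cache, and quantization metadata
    (scales, zero points), which add real-world overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # a hypothetical 7B-parameter model
for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {model_memory_gb(n, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, 4-bit: 3.5 GB
```

This is why 4‑bit is the threshold where 7B-class models start fitting comfortably on consumer GPUs.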
The practical landscape: methods and tools
- Naïve PTQ (per‑tensor symmetric/asymmetric): Quick, broad strokes; fine for 8‑bit on many layers but fails when activations/weights have heavy tails.
- Per‑channel / block quantization: Partition weights to adapt ranges per output channel or block. Helps on heterogeneous weight distributions.
- Double quantization (bitsandbytes): A trick to get extra compression with limited accuracy loss; common in 4‑bit toolchains.
- GPTQ (Post‑Training, second‑order): Uses curvature/second‑order information to minimize final error introduced by quantization. Great for 4‑bit PTQ in LLMs — often used to produce high‑quality 4‑bit weights with modest compute.
- AWQ (Activation-aware Weight Quantization) and SmoothQuant: Recent PTQ improvements that adapt quantization choices to layer characteristics and activation distributions. Useful to tame outliers.
- QLoRA: Practical pipeline combining 4‑bit quantized base model + LoRA adapters to fine‑tune on consumer hardware. A great example of marrying quantization and parameter‑efficient fine‑tuning.
Libraries to know: bitsandbytes (broad support for 8/4‑bit and custom kernels), GPTQ/AWQ implementations (community/academic repos), Hugging Face/transformers integrations for QLoRA workflows.
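To see why per-channel scales help on heterogeneous weight distributions, here is a self-contained NumPy sketch; the toy weight matrix and the single outlier channel are contrived for illustration:

```python
import numpy as np

def quantize_int8(w, axis=None):
    """Symmetric int8 quantize + dequantize: per-tensor (axis=None)
    or per-channel (axis along which each slice gets its own scale)."""
    scale = np.max(np.abs(w), axis=axis, keepdims=axis is not None) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale  # dequantize to measure rounding error

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4, 256)).astype(np.float32)
w[0] *= 50  # one outlier channel blows up the shared per-tensor range

err_tensor = np.mean((w - quantize_int8(w)) ** 2)          # one scale for all
err_channel = np.mean((w - quantize_int8(w, axis=1)) ** 2) # one scale per row
print(err_tensor > err_channel)  # True: per-channel adapts to each row's range
```

The outlier row forces the shared per-tensor scale to be huge, so the small-magnitude rows get crushed onto a few coarse levels; per-channel scales sidestep that.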
When to use which bit width (practical recipe)
- Start with 8‑bit for inference
  - Fast to adopt; minimal degradation on many models.
  - A useful baseline for memory and latency gains.
- Move to 4‑bit when you need bigger jumps in footprint
  - Use PTQ + GPTQ/AWQ for inference deployments.
  - For fine‑tuning: consider QLoRA (4‑bit base + LoRA) or QAT if you can afford training passes.
- Only consider sub‑4‑bit if
  - you're building extremely resource‑constrained edge systems, or
  - you're in a research/optimization project and can invest in calibration and custom kernels.
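For the QLoRA route, a configuration sketch using the transformers, peft, and bitsandbytes stack might look like the following. The base model name and LoRA hyperparameters are placeholders, not recommendations, and actually running this requires a GPU plus a model download:

```python
# Sketch of a QLoRA-style setup: 4-bit quantized base + trainable LoRA adapters.
# Model name and hyperparameters below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4, as in the QLoRA paper
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # hypothetical base model
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"],  # illustrative choice
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)      # frozen 4-bit base, trainable adapters
```

The key design point: the 4‑bit base stays frozen, so gradients only flow through the small fp16/bf16 adapter matrices, which is what makes fine‑tuning fit on consumer hardware.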
Data, calibration, and why your curation efforts pay off more now
Remember the earlier module on Data Efficiency? That work pays dividends here:
- PTQ/GPTQ need a calibration set — representative samples that expose typical activation ranges and outliers. If your dataset is noisy or unrepresentative, the quantizer will calibrate to garbage.
- When you fine‑tune low‑bit models (e.g., QLoRA), highly curated, high‑signal data prevents quantized fine‑tuning from amplifying mistakes.
Practical tip: use a small (hundreds to low thousands) but high‑quality calibration set that mirrors production input distribution.
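One way to build such a set, sketched here with a hypothetical helper name and a synthetic corpus, is to deduplicate and then sample across the length distribution so both short and long inputs expose their activation ranges:

```python
import random

def build_calibration_set(corpus, n=512, seed=0):
    """Pick a small, representative calibration set: deduplicate,
    then take one sample from each of n length-ordered buckets so
    short and long inputs are both represented."""
    unique = list(dict.fromkeys(corpus))  # order-preserving dedup
    unique.sort(key=len)
    rng = random.Random(seed)
    if len(unique) <= n:
        return unique
    step = len(unique) / n
    return [unique[min(int(i * step + rng.random() * step), len(unique) - 1)]
            for i in range(n)]

# Synthetic corpus of 5000 documents with varying lengths (illustrative).
samples = build_calibration_set(
    [f"doc {i} " * (i % 40 + 1) for i in range(5000)], n=256)
print(len(samples))  # 256
```

Stratifying by length is only one axis; if production traffic mixes domains or languages, stratify on those too.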
Quick diagnostic checklist (if results are bad)
- Are attention or embedding layers quantized? Try keeping them in higher precision.
- Do you see logit shifts or a spike in hallucinations? Try recalibrating or using per‑channel quantization for the problematic layers.
- Are outliers present? Consider SmoothQuant or clipping strategies.
- Is runtime slower than expected? Check whether quant/dequant overheads or unoptimized kernels are the bottleneck.
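As a concrete illustration of a clipping strategy, the sketch below compares full-range symmetric quantization against percentile clipping on synthetic data with one extreme outlier; the function names and the 99.9 threshold are illustrative:

```python
import numpy as np

def quantize_symmetric(w, scale):
    """Dequantized int8 representation of w under the given scale."""
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(42)
w = rng.normal(0.0, 1.0, 1_000_000)
w[0] = 100.0  # one extreme activation outlier

full_scale = np.abs(w).max() / 127.0                  # range set by the outlier
clip_scale = np.percentile(np.abs(w), 99.9) / 127.0   # clip the extreme tail

err_full = np.mean((w - quantize_symmetric(w, full_scale)) ** 2)
err_clip = np.mean((w - quantize_symmetric(w, clip_scale)) ** 2)
print(err_clip < err_full)  # finer steps for the bulk beat keeping one outlier exact
```

The trade is explicit: clipping sacrifices the few extreme values so the bulk of the distribution gets much finer quantization steps. If the outliers are numerous or functionally critical, clipping backfires, which is exactly the regime SmoothQuant-style rescaling targets.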
Minimal workflow (pseudocode) — go from base model to deployable quantized + fine‑tuned model
# 1. Evaluate layer sensitivity using small eval set
sensitivity_map = estimate_layer_sensitivity(base_model, eval_samples)
# 2. Choose quant scheme per layer (8-bit for sensitive, 4-bit elsewhere)
quant_plan = make_quant_plan(sensitivity_map, target_memory)
# 3. Calibrate PTQ (or run GPTQ/AWQ) using representative calibration set
quantized_model = apply_gptq_or_awq(base_model, calibration_data, quant_plan)
# 4. (Optional) Fine-tune adapters (LoRA) on curated dataset in 4-bit (QLoRA)
adapters = train_lora(quantized_model, fine_tune_data)
# 5. Validate: perplexity, end-task metrics, and qualitative checks
validate(quantized_model, adapters, validation_set)
# 6. Deploy with a runtime that supports your quant kernels
deploy(quantized_model, adapters)
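The sensitivity-estimation step in the workflow above can be made concrete on a toy network: quantize one layer at a time and measure how much the final output drifts. Everything here is illustrative, not a real library API:

```python
import numpy as np

def int8_dequant(w):
    """Round-trip a weight matrix through symmetric int8."""
    s = np.abs(w).max() / 127.0
    return np.clip(np.round(w / s), -127, 127) * s

def forward(layers, x):
    for w in layers:
        x = np.maximum(x @ w, 0.0)  # toy ReLU MLP stand-in for a real model
    return x

rng = np.random.default_rng(0)
layers = [rng.normal(0, 0.5, (16, 16)) for _ in range(4)]
x = rng.normal(0, 1, (32, 16))
ref = forward(layers, x)  # full-precision reference output

# Sensitivity: quantize one layer at a time, measure output drift vs. reference.
sensitivity = []
for i in range(len(layers)):
    trial = [int8_dequant(w) if j == i else w for j, w in enumerate(layers)]
    sensitivity.append(float(np.mean((forward(trial, x) - ref) ** 2)))
print(sensitivity)  # larger value = keep that layer at higher precision
```

A real pipeline would use calibration batches and a task metric instead of output MSE, but the quantize-one-layer-at-a-time loop is the same idea.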
Pitfalls & guardrails (don’t be that person who deploys blindly)
- Don’t trust a single metric (perplexity). Test downstream tasks, safety checks, and hallucination rates.
- Beware of subtle logit shifts after quantization: decoding behavior can change in ways that perplexity alone won't catch.
- Keep a fallback: if a layer is too sensitive, keep it at fp16. Mixed precision is your friend.
Closing — the pragmatic ethos
Low‑bit quantization is not a magic spell. It’s a toolkit: 8‑bit is the dependable spork; 4‑bit is the Swiss Army knife; below 4‑bit is experimental rocket fuel. Use representative calibration data (you already learned how to curate it), measure layer sensitivity, and pair quantization with parameter‑efficient fine‑tuning like LoRA when you need to train on limited hardware.
Final bite: Always test in the conditions your model will live in. Memory savings are seductive; reliability is non‑negotiable.
Key takeaways:
- Start conservative (8‑bit), escalate only when you have calibration + evaluation pipelines.
- Use PTQ advances (GPTQ/AWQ/SmoothQuant) to make 4‑bit usable for real LLMs.
- Your curated data and calibration set determine quantization success more than fancy kernels.
Version note: this is the "wildly practical TA" edition — take it, tune it, and go make a big model live happily in a small GPU.