Quantization, Pruning, and Compression
Techniques to shrink models and accelerate inference—quantization, pruning, distillation, and end-to-end compression pipelines with attention to accuracy, latency, and hardware support.
5.2 Post-Training Quantization vs. Quantization-Aware Training — The One Where We Shrink Models Without Losing Our Minds
"Quantization is the adulting of model compression: less glamorous than pruning, but it pays rent." — Probably someone who spent a weekend fighting 8-bit round-off errors
You already saw the basics in 5.1: what quantization does (map floating-point weights/activations to fewer bits) and the usual suspects (per-tensor vs per-channel, weight-only vs activation, symmetric vs asymmetric). Now we level up. This section compares two big strategies: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Think of PTQ as the fast haircut and QAT as the months-long hair transplant — both change your looks, but the timelines, costs, and outcomes differ wildly.
TL;DR (Because I know you scrolled)
- PTQ: Fast, cheap, often surprisingly effective for many LLM weight-only quantization schemes (GPTQ, AWQ). Uses calibration data (small curated dataset) to set quant scales. Great when you have limited compute or need immediate deployment.
- QAT: Slower, costlier, but usually more accurate at extreme bitwidths or when activation quantization is required. You train the model to live with quantization noise — like resilience training for neural networks.
Both rely on good data hygiene — remember sections 4.14 and 4.15: choose a representative, high-quality calibration or training dataset. Garbage in = quantized garbage out.
What each method actually does (short version)
Post-Training Quantization (PTQ)
- Quantize a pre-trained float model after training finishes.
- Use calibration data to determine scaling ranges (min/max, percentiles) and sometimes to compute per-channel scales.
- Tools: GPTQ, AWQ, and vendor PTQ toolkits such as Intel Neural Compressor and ONNX Runtime's quantization tooling.
- Pros: fast, cheap, no retraining. Cons: can lose accuracy at low bits, especially if activations have heavy tails.
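To make the PTQ bullets concrete, here is a minimal NumPy sketch of symmetric per-tensor weight quantization. The helpers `quantize_symmetric` and `dequantize` are illustrative, not the API of any particular toolkit:

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    """Toy symmetric per-tensor quantization (illustrative, not a library API)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = np.abs(w).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)
q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()   # rounding error is bounded by half a quantization step
```

The key property: the reconstruction error per weight is at most half a quantization step, so the larger the dynamic range, the coarser the grid and the worse the error. That is exactly why outliers hurt.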
Quantization-Aware Training (QAT)
- Insert fake quantization operations during training so the model learns to operate under quantization noise.
- Backpropagate through the rounding operation with straight-through estimators (STEs) or more sophisticated gradients.
- Pros: better accuracy at low bits, robust activation handling. Cons: expensive, requires training data and compute.
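The fake-quant-plus-STE mechanics can be sketched in a few lines of NumPy. The names `fake_quant` and `ste_grad` are hypothetical; real frameworks wire this into autograd, but the math is the same: the forward pass snaps values to the quantization grid, and the backward pass pretends rounding was the identity, zeroing gradients only where values were clipped:

```python
import numpy as np

def fake_quant(x, scale, bits=8):
    """Quantize-dequantize in one shot; QAT trains the model against this noise."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def ste_grad(x, upstream, scale, bits=8):
    """Straight-through estimator: treat round() as the identity in the
    backward pass, passing gradients through unchanged except where the
    forward value saturated the representable range."""
    qmax = 2 ** (bits - 1) - 1
    inside = (x / scale >= -qmax - 1) & (x / scale <= qmax)
    return upstream * inside

x = np.array([-0.30, 0.004, 2.0])
scale = 0.01                              # in QAT this would be learned or calibrated
y = fake_quant(x, scale)                  # snaps to the grid; 2.0 saturates at 1.27
g = ste_grad(x, np.ones_like(x), scale)   # gradient is zeroed for the clipped value
```

Without the STE trick, the gradient of `round()` is zero almost everywhere and training would stall; with it, the model can still learn while experiencing realistic quantization noise.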
When to pick PTQ vs QAT (practical rules)
- You need speed and low budget: choose PTQ. Especially fine for weight-only quantization (8-bit or even 4-bit using GPTQ/AWQ) where activations remain in higher precision.
- You need the absolute best accuracy at 2–4 bits including activations: choose QAT.
- You have a tiny but representative calibration set and want 8-bit everywhere: try PTQ first. If accuracy is unacceptable, escalate to QAT.
- Hardware constraints: some accelerators only support limited quant formats — match your quant strategy to hardware.
Quick comparison table
| Aspect | PTQ | QAT |
|---|---|---|
| Time to apply | Minutes–hours | Hours–days (retraining) |
| Compute cost | Low | High |
| Need full training data? | No (small calibration set) | Yes (or at least lots of representative data) |
| Best for | Weight-only quant, 8-bit | Activation quant, <8-bit, robust accuracy |
| Typical accuracy loss | Small at 8-bit, grows at ≤4-bit | Small even at very low bits |
Calibration data — your secret weapon (and where data curation pays off)
Remember 4.14/4.15? That work of curating a clean, representative sample matters here more than ever. For PTQ, the calibration set should mirror inference inputs: prompts, length distributions, token statistics. A few hundred to a few thousand samples often suffice.
For QAT, the training set should be more extensive and match the downstream distribution. If you can only afford to curate a small set, prefer PTQ and per-channel scaling.
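As a sketch of how calibration statistics turn into scales, here is a hedged per-channel example. The percentile choice, shapes, and `per_channel_scales` helper are illustrative; real toolkits differ in the details:

```python
import numpy as np

def per_channel_scales(calib_acts, bits=8, percentile=99.9):
    """One scale per channel from calibration activations of shape
    (num_samples, num_channels). Uses a percentile instead of the raw
    max to tame outliers (a common PTQ heuristic)."""
    qmax = 2 ** (bits - 1) - 1
    ranges = np.percentile(np.abs(calib_acts), percentile, axis=0)
    return ranges / qmax

rng = np.random.default_rng(1)
acts = rng.normal(0.0, 1.0, size=(2048, 4))
acts[:, 3] *= 10.0                     # pretend one channel is much "wider"
scales = per_channel_scales(acts)      # the wide channel gets a larger scale
```

Notice that a single per-tensor scale would be dictated by the wide channel, wasting almost all of the integer grid on the three quiet ones; per-channel scaling sidesteps that.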
Mini-pipeline cheatsheet
PTQ pseudocode:
# Pseudocode
model = load_fp32_model()
calib_set = load_calibration_data()
for layer in quantizable_layers(model):
    stats = collect_activation_stats(layer, calib_set)
    scale, zero_point = compute_scale_zero(stats, mode='per-channel')  # or 'per-tensor'
    quantize_weights(layer, scale)
    # optionally run a quick reconstruction/GPTQ pass per layer
save_quantized_model()
QAT pseudocode (high-level):
# Pseudocode
model = load_fp32_model()
insert_fake_quant_ops(model)  # quant-dequant ops on weights/activations
for epoch in range(E):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(batch.inputs)             # quant noise is simulated
        loss = task_loss(out, batch.targets)
        loss.backward()                       # STE gradients through quant ops
        optimizer.step()
save_quantized_model()
Real-world examples (so you know this isn’t academic)
- Deploying a chat model on a CPU-based edge device: people often use weight-only PTQ (4–8 bits) with methods like GPTQ/AWQ to cut memory and avoid a latency hit. Activations stay in float, keeping quality decent.
- Serving millions of queries in the cloud to cut cost: consider PTQ first (fast roll-out). If correctness degrades or hallucinations increase, switch to QAT for critical layers.
- Building a tiny transformer for on-device personalization at 4-bit end-to-end: QAT almost always wins for usable accuracy.
Common pitfalls and debugging tips
- If PTQ breaks specific prompts: your calibration set didn’t cover that distribution. Add representative examples.
- If activations have heavy outliers: use clipping strategies (percentile clipping) during PTQ or switch to QAT so the model learns to keep activations tame.
- Mixed precision helps: quantize weights more aggressively than activations; keep embeddings or layer norms in float if needed.
- Watch hardware: quant formats must match inference kernels. A perfect PTQ model is worthless if your runtime doesn’t support that packing.
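The percentile-clipping tip is easy to demonstrate on synthetic heavy-tailed data. This uses a toy 4-bit quantizer, where the effect is most visible; `quant_mse` is an illustrative helper, not a library function:

```python
import numpy as np

def quant_mse(x, clip_val, bits=4):
    """Mean squared error after symmetric quantization with the
    representable range clipped to [-clip_val, +clip_val]."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_val / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return np.mean((q * scale - x) ** 2)

rng = np.random.default_rng(2)
x = rng.standard_t(df=4, size=100_000)   # heavy-tailed, outlier-prone "activations"

mse_minmax = quant_mse(x, np.abs(x).max())                  # range set by worst outlier
mse_clipped = quant_mse(x, np.percentile(np.abs(x), 99.9))  # clip the tail
# at 4 bits, sacrificing the top 0.1% of the range buys much finer resolution
```

The trade is explicit: clipping introduces saturation error on a tiny fraction of values but shrinks the rounding error on everything else, which is usually a net win at low bit widths.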
Final actionable checklist
- Try PTQ first for quick wins — use a well-curated calibration set (link back to 4.14/4.15).
- If you need <8-bit or full int inference with activation quantization, plan for QAT (budget it!).
- Use per-channel quantization for weights when possible; it often reduces error.
- Monitor representative downstream metrics, not just layerwise MSE.
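The per-channel recommendation from the checklist can be sanity-checked on a toy weight matrix whose rows have very different magnitudes (an assumption chosen to mimic real transformer layers):

```python
import numpy as np

def quant_dequant(w, scale, bits=8):
    """Symmetric quantize-dequantize with a given scale (scalar or per-row)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(3)
# rows play the role of output channels with very different magnitudes
w = rng.normal(size=(4, 64)) * np.array([[0.01], [0.1], [1.0], [10.0]])

qmax = 127
per_tensor = quant_dequant(w, np.abs(w).max() / qmax)
per_channel = quant_dequant(w, np.abs(w).max(axis=1, keepdims=True) / qmax)

mse_tensor = np.mean((w - per_tensor) ** 2)
mse_channel = np.mean((w - per_channel) ** 2)   # much smaller for the quiet rows
```

With one tensor-wide scale, the loudest row dictates the grid and the quiet rows collapse toward zero; per-row scales keep every channel well resolved at the cost of storing a few extra floats.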
"Quantization is a negotiation: how much accuracy are you willing to trade for cost and latency? The better your calibration data, the fewer arguments you’ll have to make." — Your future, cheaper inference bill
Version notes: start with PTQ, instrument well, then escalate to QAT for stubborn accuracy problems. Your next experiment: take a 7B LLM, run PTQ (AWQ/GPTQ) with 512 calibration prompts from your curated dataset, measure exact-match and perplexity on a held-out set. If things drop >X% (choose your X), run a short QAT schedule on problematic layers.
Key takeaways: PTQ = fast/cheap; QAT = slow/accurate. Data curation makes both work better. Choose the tool that matches your deployment constraints — and bring snacks; QAT days are long.