Quantization, Pruning, and Compression
Techniques to shrink models and accelerate inference—quantization, pruning, distillation, and end-to-end compression pipelines with attention to accuracy, latency, and hardware support.
5.2 Post-Training Quantization vs. Quantization-Aware Training — The One Where We Shrink Models Without Losing Our Minds
"Quantization is the adulting of model compression: less glamorous than pruning, but it pays rent." — Probably someone who spent a weekend fighting 8-bit round-off errors
You already saw the basics in 5.1: what quantization does (map floating-point weights/activations to fewer bits) and the usual suspects (per-tensor vs per-channel, weight-only vs activation, symmetric vs asymmetric). Now we level up. This section compares two big strategies: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Think of PTQ as the fast haircut and QAT as the months-long hair transplant — both change your looks, but the timelines, costs, and outcomes differ wildly.
TL;DR (Because I know you scrolled)
- PTQ: Fast, cheap, often surprisingly effective for many LLM weight-only quantization schemes (GPTQ, AWQ). Uses calibration data (small curated dataset) to set quant scales. Great when you have limited compute or need immediate deployment.
- QAT: Slower, costlier, but usually more accurate at extreme bitwidths or when activation quantization is required. You train the model to live with quantization noise — like resilience training for neural networks.
Both rely on good data hygiene — remember sections 4.14 and 4.15: choose a representative, high-quality calibration or training dataset. Garbage in = quantized garbage out.
What each method actually does (short version)
Post-Training Quantization (PTQ)
- Quantize a pre-trained float model after training finishes.
- Use calibration data to determine scaling ranges (min/max, percentiles) and sometimes to compute per-channel scales.
- Tools: GPTQ, AWQ, and vendor PTQ toolkits such as Intel Neural Compressor and ONNX Runtime's quantization tooling.
- Pros: fast, cheap, no retraining. Cons: can lose accuracy at low bits, especially if activations have heavy tails.
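To make the PTQ bullets concrete, here is a minimal NumPy sketch of symmetric per-tensor weight quantization. The helpers `quantize_symmetric` and `dequantize` are illustrative, not the API of any particular toolkit:

```python
import numpy as np

def quantize_symmetric(w, bits=8):
    """Toy symmetric per-tensor quantization (illustrative, not a library API)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = np.abs(w).max() / qmax             # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)
q, scale = quantize_symmetric(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()   # rounding error is bounded by half a quantization step
```

The key property: the reconstruction error per weight is at most half a quantization step, so the larger the dynamic range, the coarser the grid and the worse the error. That is exactly why outliers hurt.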
Quantization-Aware Training (QAT)
- Insert fake quantization operations during training so the model learns to operate under quantization noise.
- Backpropagate through the rounding operation with straight-through estimators (STEs) or more sophisticated gradients.
- Pros: better accuracy at low bits, robust activation handling. Cons: expensive, requires training data and compute.
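The fake-quant-plus-STE mechanics can be sketched in a few lines of NumPy. The names `fake_quant` and `ste_grad` are hypothetical; real frameworks wire this into autograd, but the math is the same: the forward pass snaps values to the quantization grid, and the backward pass pretends rounding was the identity, zeroing gradients only where values were clipped:

```python
import numpy as np

def fake_quant(x, scale, bits=8):
    """Quantize-dequantize in one shot; QAT trains the model against this noise."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def ste_grad(x, upstream, scale, bits=8):
    """Straight-through estimator: treat round() as the identity in the
    backward pass, passing gradients through unchanged except where the
    forward value saturated the representable range."""
    qmax = 2 ** (bits - 1) - 1
    inside = (x / scale >= -qmax - 1) & (x / scale <= qmax)
    return upstream * inside

x = np.array([-0.30, 0.004, 2.0])
scale = 0.01                              # in QAT this would be learned or calibrated
y = fake_quant(x, scale)                  # snaps to the grid; 2.0 saturates at 1.27
g = ste_grad(x, np.ones_like(x), scale)   # gradient is zeroed for the clipped value
```

Without the STE trick, the gradient of `round()` is zero almost everywhere and training would stall; with it, the model can still learn while experiencing realistic quantization noise.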
When to pick PTQ vs QAT (practical rules)
- You need speed and low budget: choose PTQ. Especially fine for weight-only quantization (8-bit or even 4-bit using GPTQ/AWQ) where activations remain in higher precision.
- You need the absolute best accuracy at 2–4 bits including activations: choose QAT.
- You have a tiny but representative calibration set and want 8-bit everywhere: try PTQ first. If accuracy is unacceptable, escalate to QAT.
- Hardware constraints: some accelerators only support limited quant formats — match your quant strategy to hardware.
Quick comparison table
| Aspect | PTQ | QAT |
|---|---|---|
| Time to apply | Minutes–hours | Hours–days (retraining) |
| Compute cost | Low | High |
| Need full training data? | No (small calibration set) | Yes (or at least lots of representative data) |
| Best for | Weight-only quant, 8-bit | Activation quant, <8-bit, robust accuracy |
| Typical accuracy loss | Small at 8-bit, grows at ≤4-bit | Small even at very low bits |
Calibration data — your secret weapon (and where data curation pays off)
Remember 4.14/4.15? That work of curating a clean, representative sample matters here more than ever. For PTQ, the calibration set should mirror inference inputs: prompts, length distributions, token statistics. A few hundred to a few thousand samples often suffice.
For QAT, the training set should be more extensive and match the downstream distribution. If you can only afford to curate a small set, prefer PTQ and per-channel scaling.
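As a sketch of how calibration statistics turn into scales, here is a hedged per-channel example. The percentile choice, shapes, and `per_channel_scales` helper are illustrative; real toolkits differ in the details:

```python
import numpy as np

def per_channel_scales(calib_acts, bits=8, percentile=99.9):
    """One scale per channel from calibration activations of shape
    (num_samples, num_channels). Uses a percentile instead of the raw
    max to tame outliers (a common PTQ heuristic)."""
    qmax = 2 ** (bits - 1) - 1
    ranges = np.percentile(np.abs(calib_acts), percentile, axis=0)
    return ranges / qmax

rng = np.random.default_rng(1)
acts = rng.normal(0.0, 1.0, size=(2048, 4))
acts[:, 3] *= 10.0                     # pretend one channel is much "wider"
scales = per_channel_scales(acts)      # the wide channel gets a larger scale
```

Notice that a single per-tensor scale would be dictated by the wide channel, wasting almost all of the integer grid on the three quiet ones; per-channel scaling sidesteps that.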
Mini-pipeline cheatsheet
PTQ pseudocode:
# Pseudocode
model = load_fp32_model()
calib_set = load_calibration_data()
for layer in quantizable_layers(model):
    stats = collect_activation_stats(layer, calib_set)
    scale, zero_point = compute_scale_zero(stats, mode='per-channel')  # or 'per-tensor'
    quantize_weights(layer, scale)
    # optionally run a quick reconstruction/GPTQ pass per layer
save_quantized_model()
QAT pseudocode (high-level):
# Pseudocode
model = load_fp32_model()
insert_fake_quant_ops(model)  # quant-dequant ops on weights/activations
for epoch in range(E):
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(batch.inputs)             # quant noise is simulated
        loss = task_loss(out, batch.targets)
        loss.backward()                       # STE gradients through quant ops
        optimizer.step()
save_quantized_model()
Real-world examples (so you know this isn’t academic)
- Deploying a chat model on a CPU-based edge device: people often use weight-only PTQ (4–8 bits) with methods like GPTQ/AWQ to cut memory and avoid a latency hit. Activations stay in float, keeping quality decent.
- Serving millions of queries in the cloud to cut cost: consider PTQ first (fast roll-out). If correctness degrades or hallucinations increase, switch to QAT for critical layers.
- Building a tiny transformer for on-device personalization at 4-bit end-to-end: QAT almost always wins for usable accuracy.
Common pitfalls and debugging tips
- If PTQ breaks specific prompts: your calibration set didn’t cover that distribution. Add representative examples.
- If activations have heavy outliers: use clipping strategies (percentile clipping) during PTQ or switch to QAT so the model learns to keep activations tame.
- Mixed precision helps: quantize weights more aggressively than activations; keep embeddings or layer norms in float if needed.
- Watch hardware: quant formats must match inference kernels. A perfect PTQ model is worthless if your runtime doesn’t support that packing.
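The percentile-clipping tip is easy to demonstrate on synthetic heavy-tailed data. This uses a toy 4-bit quantizer, where the effect is most visible; `quant_mse` is an illustrative helper, not a library function:

```python
import numpy as np

def quant_mse(x, clip_val, bits=4):
    """Mean squared error after symmetric quantization with the
    representable range clipped to [-clip_val, +clip_val]."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip_val / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return np.mean((q * scale - x) ** 2)

rng = np.random.default_rng(2)
x = rng.standard_t(df=4, size=100_000)   # heavy-tailed, outlier-prone "activations"

mse_minmax = quant_mse(x, np.abs(x).max())                  # range set by worst outlier
mse_clipped = quant_mse(x, np.percentile(np.abs(x), 99.9))  # clip the tail
# at 4 bits, sacrificing the top 0.1% of the range buys much finer resolution
```

The trade is explicit: clipping introduces saturation error on a tiny fraction of values but shrinks the rounding error on everything else, which is usually a net win at low bit widths.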
Final actionable checklist
- Try PTQ first for quick wins — use a well-curated calibration set (link back to 4.14/4.15).
- If you need <8-bit or full int inference with activation quantization, plan for QAT (budget it!).
- Use per-channel quantization for weights when possible; it often reduces error.
- Monitor representative downstream metrics, not just layerwise MSE.
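The per-channel recommendation from the checklist can be sanity-checked on a toy weight matrix whose rows have very different magnitudes (an assumption chosen to mimic real transformer layers):

```python
import numpy as np

def quant_dequant(w, scale, bits=8):
    """Symmetric quantize-dequantize with a given scale (scalar or per-row)."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(3)
# rows play the role of output channels with very different magnitudes
w = rng.normal(size=(4, 64)) * np.array([[0.01], [0.1], [1.0], [10.0]])

qmax = 127
per_tensor = quant_dequant(w, np.abs(w).max() / qmax)
per_channel = quant_dequant(w, np.abs(w).max(axis=1, keepdims=True) / qmax)

mse_tensor = np.mean((w - per_tensor) ** 2)
mse_channel = np.mean((w - per_channel) ** 2)   # much smaller for the quiet rows
```

With one tensor-wide scale, the loudest row dictates the grid and the quiet rows collapse toward zero; per-row scales keep every channel well resolved at the cost of storing a few extra floats.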
"Quantization is a negotiation: how much accuracy are you willing to trade for cost and latency? The better your calibration data, the fewer arguments you’ll have to make." — Your future, cheaper inference bill
Version notes: start with PTQ, instrument well, then escalate to QAT for stubborn accuracy problems. Your next experiment: take a 7B LLM, run PTQ (AWQ/GPTQ) with 512 calibration prompts from your curated dataset, measure exact-match and perplexity on a held-out set. If things drop >X% (choose your X), run a short QAT schedule on problematic layers.
Key takeaways: PTQ = fast/cheap; QAT = slow/accurate. Data curation makes both work better. Choose the tool that matches your deployment constraints — and bring snacks; QAT days are long.