Foundations of Generative AI
Establish how modern LLMs generate text, the role of tokens and probabilities, and the constraints that shape prompt behavior.
Temperature and Top-p Sampling
Temperature and Top-p Sampling — The Flavor Knobs of Creativity
"The model already knows what comes next. Your job is to choose how adventurous its guess should be."
You already know from the previous sections how transformers predict the next token by assigning probabilities over the vocabulary and how tokens are the atomic pieces of text. Now we get to the fun part: how we sample from that probability cocktail to produce text. Two of the most important controls are temperature and top-p (nucleus) sampling. These are the dials that decide whether the model plays it safe or goes rogue with poetic chaos.
Why this matters (quick reminder)
When we talk about next-token prediction, the model calculates a score (logit) for every possible token, converts those scores into probabilities via softmax, and then we pick a token. How we pick it changes the entire personality of the output. Deterministic picks give boring but stable outputs; stochastic picks add variety and creativity.
Think of it this way: the model hands you a ranked list of plausible next tokens with probabilities. Temperature and top-p are different policies that decide how likely we are to choose risky options from that list.
Temperature: the randomness knob
- What it is: A scalar T > 0 applied to logits before softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T), where z_i are the logits.
- Intuition: Lower T concentrates probability on the highest-scoring tokens (less random). Higher T flattens the distribution, making rare tokens more likely (more random).
Analogy: Temperature is like a thermostat for your writing style. At T = 0.1 the model behaves like a stoic librarian; at T = 1.5 it behaves like a caffeinated poet who just discovered metaphors.
Examples:
- T -> 0: approaches greedy (argmax) decoding; deterministic and prone to repetition
- T = 0.7: balanced creativity (common default)
- T >= 1.2: high creativity, risk of nonsense or contradictions
Pro tip: temperature is applied to logits, not to probabilities. Practically you divide logits by T, then run softmax, then sample.
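To make the knob concrete, here is a minimal sketch in plain Python (the logits are hypothetical, just four made-up token scores) showing how dividing logits by T reshapes the softmax distribution:

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then softmax (max-subtracted for numerical stability)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four tokens

cold = softmax_with_temperature(logits, T=0.2)  # sharp: mass piles onto token 0
warm = softmax_with_temperature(logits, T=1.5)  # flat: tail tokens gain mass

# Lower T concentrates probability on the top token; higher T spreads it out.
print(round(cold[0], 3), round(warm[0], 3))  # → 0.993 0.496
```

Same model, same logits: only the sampling policy changed, and the top token went from a near-certainty to a coin flip.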
Top-p (nucleus) sampling: the selective crowd
- What it is: Sort tokens by probability. Take the smallest set of tokens whose cumulative probability >= p. Then normalize those token probabilities and sample from them.
- Intuition: Instead of picking a fixed number of top tokens, top-p picks a dynamic, probability-mass-based window. It keeps enough tokens to cover p of the distribution and ignores the long tail.
Analogy: Imagine picking songs for a party. Top-k is like saying "only play the top 10 hits." Top-p is like saying "play whatever songs account for 90% of my guests' favorite-list votes." Sometimes that set is 3 songs, sometimes 20 — flexible and context-aware.
Why this is handy: it avoids both the tyranny of tiny top-k lists and the chaos of sampling from the full tail. It adapts to how peaky or flat the model's distribution is.
Temperature vs Top-p vs Top-k: quick table
| Feature | Temperature | Top-p | Top-k |
|---|---|---|---|
| Controls randomness | Yes, smooth control | Indirectly, by limiting tail | Hard cutoff on rank |
| Adaptive to distribution | No | Yes | No |
| Risk of nonsensical tokens | Increases with T | Controlled by p | Controlled by k |
| Typical use case | Smoothly tuning creativity | Context-adaptive randomness | Simple, fast control |
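The "adaptive" row is the key difference, and a tiny sketch (with two hypothetical probability vectors) makes it visible: top-p keeps a different number of tokens depending on how peaky the distribution is, while top-k always keeps exactly k:

```python
def nucleus_size(probs, p):
    """How many tokens top-p keeps: the smallest high-probability prefix covering mass p."""
    cumulative, kept = 0.0, 0
    for q in sorted(probs, reverse=True):
        cumulative += q
        kept += 1
        if cumulative >= p:
            break
    return kept

peaky = [0.85, 0.08, 0.04, 0.02, 0.01]  # model is confident
flat  = [0.22, 0.21, 0.20, 0.19, 0.18]  # model is unsure

print(nucleus_size(peaky, p=0.9))  # → 2: only two tokens cover 90% of the mass
print(nucleus_size(flat, p=0.9))   # → 5: all five tokens are needed
# Top-k with k=3 would keep exactly 3 tokens in both cases, regardless of confidence.
```

This is the party-playlist analogy in code: the "90% of the votes" set shrinks when one song dominates and grows when tastes are spread out.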
How they work together (and in what order)
Typical pipeline:
- Compute logits from the model for each token.
- Optionally divide logits by temperature T.
- Convert to probabilities via softmax.
- Apply top-p: determine nucleus set and re-normalize.
- Sample one token from the resulting distribution.
You can combine temperature and top-p. A common pattern: use T around 0.6-1.0 and p around 0.8-0.95.
The same pipeline as runnable Python (standard library only):

```python
import math
import random

def sample_top_p(logits, T=0.7, p=0.9):
    """Temperature-scaled nucleus sampling over a vector of logits."""
    # 1. Scale logits by temperature, then softmax (max-subtracted for stability).
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Select the nucleus: smallest set of tokens with cumulative prob >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # 3. Renormalize over the nucleus and sample one token index.
    nucleus_probs = [probs[i] / cumulative for i in nucleus]
    return random.choices(nucleus, weights=nucleus_probs, k=1)[0]
```
Practical tips and defaults
- Start with T = 0.7 and p = 0.9 for creative writing.
- For factual or instruction-following outputs, lower T (0.0 to 0.5) and/or use greedy decoding.
- If you see repetition or looped phrases, reduce T or use nucleus sampling with a tighter p.
- If the model produces awkward subword artifacts, high T may be interacting with tokenization: rare subword tokens become more likely at high T, and chaining several of them together makes gibberish more probable.
Questions to ask yourself:
- Do I value correctness or diversity more in this prompt?
- Is the task short and precise, or open-ended and creative?
Common failure modes
- High temperature + high p -> plausible sounding but factually wrong hallucinations.
- Low temperature + strict top-p -> overly conservative or repetitive outputs.
- Tokenization artifacts: token-level sampling can break words apart if you land on rare subword tokens. If that matters, consider tightening p, lowering T, or using beam search for strict coherence.
Quick decision guide
- Need precise, reliable answer? Use low T or greedy decoding.
- Want variety but coherent text? Use moderate T (0.6-0.9) with p = 0.8-0.95.
- Want maximum creativity and flavor? Increase T, maybe relax p.
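You can sanity-check these regimes empirically. The sketch below (hypothetical logits, seeded RNG for reproducibility) repeatedly samples from the same distribution at two temperatures and counts how often the top token wins:

```python
import math
import random

def sample(logits, T, rng):
    """Sample one token index after temperature-scaled softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]  # unnormalized is fine for choices()
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [3.0, 2.0, 1.0, 0.0]  # token 0 is the model's favorite
rng = random.Random(0)          # seeded so the experiment is repeatable

low  = sum(sample(logits, T=0.3, rng=rng) == 0 for _ in range(1000))
high = sum(sample(logits, T=2.0, rng=rng) == 0 for _ in range(1000))

# At T=0.3, token 0 wins almost every draw; at T=2.0, other tokens win often.
print(low, high)
```

One pair of numbers, two completely different personalities: that is the whole decision guide in miniature.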
Closing: the creative control panel
Temperature and top-p are not mystical hacks. They are interpretable levers that trade off confidence vs creativity. They sit on top of the probabilities you learned about in the next-token section and operate at the token level shaped by tokenization.
Smart takeaway: treat temperature as the "adventurousness" knob and top-p as the "only let the sensible crowd speak" filter. Use them together to craft model behavior that fits your goal: conservative when you need correctness, playful when you need ideas.
Key takeaways:
- Temperature scales logits and directly adjusts randomness.
- Top-p dynamically selects a probability mass nucleus to sample from.
- Combining both gives flexible, controllable outputs.
Go test it: change T a little, nudge p a fraction, and watch the same model transform from a reliable assistant to a freewheeling novelist. That tiny twist is where prompt engineering turns into an art form.
Tags: beginner, humorous, science