Foundations of Generative AI
Establish how modern LLMs generate text, the role of tokens and probabilities, and the constraints that shape prompt behavior.
Probabilities and Next-Token Prediction
You already know what a token is and how transformers move attention around like a drama queen. Now we ask the machine a slightly less dramatic but more useful question: what is the most likely next token given everything it's seen so far?
Hook: Imagine the model as a nervous fortune teller
You give it a sequence of tokens (remember tokenization from the previous module) and it whispers probabilities for every possible next token. It does not pick a single answer in its head and then pretend nothing else exists — it assigns a probability to each token, like a weather forecast that says 70% chance of sunshine, 20% rain, 10% meteorite. That distribution is the beating heart of generative models.
Why care? Because every generation decision — greedy, random sampling, top-k, top-p, or beam search — comes from this probability distribution. Mess with the distribution, and you change the model's personality: bland, creative, repetitive, or surprisingly poetic.
Quick recap: where probabilities come from (building on the transformer primer)
- The transformer gives each candidate token a score called a logit via a final linear layer applied to the decoder's hidden state.
- Those logits are converted into a probability distribution using the softmax function.
- That distribution is the model saying, for each token t: p(t | context) = probability of t being the next token given everything before it.
This uses the tokenization you already know: the set of candidate tokens is the vocabulary created during tokenization.
The math you actually need (but friendly)
Softmax turns logits into probabilities. If logits are z1, z2, ..., zN then
softmax(zi) = exp(zi) / sum_j exp(zj)
Concrete tiny example:
logits = [2.0, 1.0, 0.1]
exp = [e^2.0, e^1.0, e^0.1] ≈ [7.39, 2.72, 1.11]
sum ≈ 11.22
probs ≈ [0.659, 0.242, 0.099]
So the first token gets about 66% probability, second 24%, third 10%.
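The arithmetic above is easy to reproduce. Here's a minimal pure-Python sketch; the max-subtraction is a standard numerical-stability trick that leaves the resulting probabilities unchanged:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: avoids overflow
    # for large logits and does not change the output probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # ≈ [0.659, 0.242, 0.099]
```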
Temperature changes the mood. Divide logits by temperature T before softmax:
modified_logits = logits / T
- T < 1 -> sharper distribution, more confident, more greedy
- T > 1 -> flatter distribution, more diverse, more creative
Think of temperature as the volume knob on the model's indecisiveness.
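To see the knob in action, here's a small sketch that applies the same softmax at several temperatures to the example logits above:

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
# Low T sharpens the distribution toward the top token;
# high T flattens it toward uniform.
```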
Sampling methods: how we turn probabilities into actual tokens
Why different methods? Because probability alone doesn't dictate how we should convert that distribution into a token. Different strategies trade off creativity, coherence, and computational cost.
Greedy
- Pick the argmax token every time.
- Pros: deterministic, fast. Cons: boring and prone to loops.
Temperature sampling
- Sample from softmax(logits / T).
- Pros: easy, tunable. Cons: can still pick very unlikely tokens when T is high.
Top-k sampling
- Keep only the k highest-probability tokens, renormalize, sample.
- Pros: removes extremely low-probability noise. Cons: fixed k may be awkward for long tails.
Top-p (nucleus) sampling
- Keep the smallest set of tokens whose cumulative probability ≥ p, renormalize, sample.
- Pros: adaptive; keeps enough tokens to reach desired mass. Cons: slightly more compute.
Beam search
- Keep multiple candidate sequences (beams) and expand them, picking highest-scoring final sequences.
- Pros: better for tasks requiring global coherence (translation). Cons: can be too conservative and produce generic outputs.
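The strategies above can each be sketched in a few lines. Here are hedged pure-Python versions of greedy, top-k, and top-p; temperature sampling is just softmax(logits / T) fed to the same weighted draw, and beam search is omitted because it operates on whole sequences rather than single steps:

```python
import math
import random

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    # Always pick the highest-logit token: deterministic, loop-prone.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng=random):
    # Keep the k highest-logit tokens, renormalize, sample one index.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = _softmax([logits[i] for i in kept])
    return rng.choices(kept, weights=weights, k=1)[0]

def top_p_sample(logits, p, rng=random):
    # Keep the smallest prefix of tokens (by descending probability)
    # whose cumulative mass reaches p, renormalize, sample one index.
    probs = _softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Note how top-p adapts: with a peaked distribution the nucleus may be a single token, while a flat distribution keeps many candidates in play.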
Table: quick comparison
| Method | Creativity | Determinism | When to use |
|---|---|---|---|
| Greedy | Low | Deterministic | Short answers, strict constraints |
| Temperature | Medium to high | Stochastic | Story generation with tunable diversity |
| Top-k | Medium | Stochastic | Remove tiny tails, faster sampling |
| Top-p | Medium-high | Stochastic | Best general-purpose sampler |
| Beam | Low-medium | Deterministic-ish | Translation, summarization when quality matters |
Loss, evaluation, and the unpleasant truth: cross entropy and perplexity
During training the model is optimized to assign high probability to the true next token. The standard loss is cross-entropy:
loss = -(1/N) * sum over positions of log(prob assigned to the true next token)
(the one-hot target simply picks out the true token's log-probability at each position)
Perplexity is an exponentiated average loss, roughly "how surprised the model is on average":
perplexity = exp(average negative log-likelihood)
Lower perplexity = better next-token prediction. But remember: low perplexity doesn't mean good creative writing. It means the model is good at matching training distribution.
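As a sanity check on the definitions, here's a tiny sketch: perplexity is the exponential of the mean negative log-probability the model assigned to the true tokens, so a coin-flip model that gives every true token probability 0.5 scores a perplexity of exactly 2:

```python
import math

def perplexity(true_token_probs):
    # true_token_probs: probability the model assigned to each actual next token.
    nll = [-math.log(p) for p in true_token_probs]  # per-token cross-entropy
    return math.exp(sum(nll) / len(nll))            # exp(average NLL)

print(perplexity([0.5, 0.5, 0.5]))   # a coin-flip model: perplexity ≈ 2
print(perplexity([0.9, 0.8, 0.95]))  # a more confident model: much lower
```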
Common misconceptions
- People say the model "knows" the next token. Correction: the model computes a probability distribution over all candidate tokens. Nothing is "chosen" until a decoding step samples or selects from that distribution.
- High confidence (peaked distribution) is not the same as correctness. Models can be confidently wrong.
- Temperature fixes everything. Not true. Temperature reshapes distribution but doesn't fix missing knowledge or biases in training data.
Tiny thought experiment
You ask a model to finish the sentence: "The secret ingredient is"
- If the model has seen tons of recipe text, probabilities will favor foods like "salt" or "love" depending on corpus.
- Lower T yields the safe, likely completion: "salt".
- Higher T might produce cheeky completions like "time travel".
Ask yourself: what does the selection imply about the training data? You're seeing the training distribution leaking into outputs.
Practical tips for prompt engineers
- If you want reliability and repeatability, start with greedy or low-T sampling.
- For creative tasks, use top-p ~ 0.9 with T in 0.7-1.0 range as a default.
- Watch for repetition. If the model loops, decrease T or use top-k/top-p.
- If you need the model to produce something specific, bias logits via prompting or logit adjustments rather than hoping sampling will cooperate.
Pro tip: small changes in prompt often give bigger changes in distribution than fiddling with temperature by 0.1. The prompt writes the prior; decoding policies shape the noise.
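The logit-adjustment idea in the tips above can be sketched directly. This is a hypothetical helper for illustration, not any specific library's interface (hosted APIs typically expose the same idea as a logit-bias parameter):

```python
def apply_logit_bias(logits, bias):
    # bias maps token index -> additive adjustment applied before softmax.
    # Large positive values push a token toward selection;
    # large negative values effectively ban it.
    return [z + bias.get(i, 0.0) for i, z in enumerate(logits)]

biased = apply_logit_bias([2.0, 1.0, 0.1], {2: 100.0})
# After softmax, token 2 would dominate despite its low original logit.
```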
Closing: Why this matters for prompt engineering
Next-token probabilities are the language model's raw intentions. As a prompt engineer, you are designing the context and picking the decoding policy that turns those intentions into outputs. Understand softmax, temperature, and sampling strategies, and you go from being a lucky guesser to a principled creator of outputs.
Key takeaways:
- The transformer produces logits; softmax turns them into probabilities over the tokenized vocabulary.
- Sampling strategy plus temperature determine creativity vs reliability.
- Cross-entropy and perplexity tell you how good the model is at predicting tokens, not at being interesting.
Want to practice? Try a tiny experiment: take a prompt, get logits for next token, manually compute softmax, then sample with different temperatures and top-p values. Seeing the numbers and outcomes side by side will make this all click.
Next up: we can dig into logit manipulation, safety filters, or how to condition distributions via control tokens. Which chaotic rabbit hole would you like to jump into next?