Foundations of Generative AI
Establish how modern LLMs generate text, the role of tokens and probabilities, and the constraints that shape prompt behavior.
Probabilities and Next-Token Prediction
You already know what a token is and how transformers move attention around like a drama queen. Now we ask the machine a slightly less dramatic but more useful question: what is the most likely next token given everything it's seen so far?
Hook: Imagine the model as a nervous fortune teller
You give it a sequence of tokens (remember tokenization from the previous module) and it whispers probabilities for every possible next token. It does not pick a single answer in its head and then pretend nothing else exists — it assigns a probability to each token, like a weather forecast that says 70% chance of sunshine, 20% rain, 10% meteorite. That distribution is the beating heart of generative models.
Why care? Because every generation decision — greedy, random sampling, top-k, top-p, or beam search — comes from this probability distribution. Mess with the distribution, and you change the model's personality: bland, creative, repetitive, or surprisingly poetic.
Quick recap: where probabilities come from (building on the transformer primer)
- The transformer gives each candidate token a score called a logit via a final linear layer applied to the decoder's hidden state.
- Those logits are converted into a probability distribution using the softmax function.
- That distribution is the model saying, for each token t: p(t | context) = probability of t being the next token given everything before it.
This uses the tokenization you already know: the set of candidate tokens is the vocabulary created during tokenization.
The math you actually need (but friendly)
Softmax turns logits into probabilities. If logits are z1, z2, ..., zN then
softmax(zi) = exp(zi) / sum_j exp(zj)
Concrete tiny example:
logits = [2.0, 1.0, 0.1]
exp = [e^2.0, e^1.0, e^0.1] ≈ [7.39, 2.72, 1.11]
sum ≈ 11.22
probs ≈ [0.659, 0.242, 0.099]
So the first token gets about 66% probability, second 24%, third 10%.
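The arithmetic above is easy to reproduce. Here's a minimal pure-Python sketch; the max-subtraction is a standard numerical-stability trick that leaves the resulting probabilities unchanged:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: avoids overflow
    # for large logits and does not change the output probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # ≈ [0.659, 0.242, 0.099]
```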
Temperature changes the mood. Divide logits by temperature T before softmax:
modified_logits = logits / T
- T < 1 -> sharper distribution, more confident, more greedy
- T > 1 -> flatter distribution, more diverse, more creative
Think of temperature as the volume knob on the model's indecisiveness.
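To see the knob in action, here's a small sketch that applies the same softmax at several temperatures to the example logits above:

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
# Low T sharpens the distribution toward the top token;
# high T flattens it toward uniform.
```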
Sampling methods: how we turn probabilities into actual tokens
Why different methods? Because probability alone doesn't dictate how we should convert that distribution into a token. Different strategies trade off creativity, coherence, and computational cost.
Greedy
- Pick the argmax token every time.
- Pros: deterministic, fast. Cons: boring and prone to loops.
Temperature sampling
- Sample from softmax(logits / T).
- Pros: easy, tunable. Cons: can still pick very unlikely tokens when T is high.
Top-k sampling
- Keep only the k highest-probability tokens, renormalize, sample.
- Pros: removes extremely low-probability noise. Cons: fixed k may be awkward for long tails.
Top-p (nucleus) sampling
- Keep the smallest set of tokens whose cumulative probability ≥ p, renormalize, sample.
- Pros: adaptive; keeps enough tokens to reach desired mass. Cons: slightly more compute.
Beam search
- Keep multiple candidate sequences (beams) and expand them, picking highest-scoring final sequences.
- Pros: better for tasks requiring global coherence (translation). Cons: can be too conservative and produce generic outputs.
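The strategies above can each be sketched in a few lines. Here are hedged pure-Python versions of greedy, top-k, and top-p; temperature sampling is just softmax(logits / T) fed to the same weighted draw, and beam search is omitted because it operates on whole sequences rather than single steps:

```python
import math
import random

def _softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(logits):
    # Always pick the highest-logit token: deterministic, loop-prone.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_sample(logits, k, rng=random):
    # Keep the k highest-logit tokens, renormalize, sample one index.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = _softmax([logits[i] for i in kept])
    return rng.choices(kept, weights=weights, k=1)[0]

def top_p_sample(logits, p, rng=random):
    # Keep the smallest prefix of tokens (by descending probability)
    # whose cumulative mass reaches p, renormalize, sample one index.
    probs = _softmax(logits)
    ranked = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]
```

Note how top-p adapts: with a peaked distribution the nucleus may be a single token, while a flat distribution keeps many candidates in play.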
Table: quick comparison
| Method | Creativity | Determinism | When to use |
|---|---|---|---|
| Greedy | Low | Deterministic | Short answers, strict constraints |
| Temperature | Medium to high | Stochastic | Story generation with tunable diversity |
| Top-k | Medium | Stochastic | Remove tiny tails, faster sampling |
| Top-p | Medium-high | Stochastic | Best general-purpose sampler |
| Beam | Low-medium | Deterministic-ish | Translation, summarization when quality matters |
Loss, evaluation, and the unpleasant truth: cross entropy and perplexity
During training the model is optimized to assign high probability to the true next token. The standard loss is cross-entropy:
loss = -(1/N) * sum over positions of log(prob assigned to the true next token)
(the one-hot target simply picks out the true token's log-probability at each position)
Perplexity is an exponentiated average loss, roughly "how surprised the model is on average":
perplexity = exp(average negative log-likelihood)
Lower perplexity = better next-token prediction. But remember: low perplexity doesn't mean good creative writing. It means the model is good at matching training distribution.
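As a sanity check on the definitions, here's a tiny sketch: perplexity is the exponential of the mean negative log-probability the model assigned to the true tokens, so a coin-flip model that gives every true token probability 0.5 scores a perplexity of exactly 2:

```python
import math

def perplexity(true_token_probs):
    # true_token_probs: probability the model assigned to each actual next token.
    nll = [-math.log(p) for p in true_token_probs]  # per-token cross-entropy
    return math.exp(sum(nll) / len(nll))            # exp(average NLL)

print(perplexity([0.5, 0.5, 0.5]))   # a coin-flip model: perplexity ≈ 2
print(perplexity([0.9, 0.8, 0.95]))  # a more confident model: much lower
```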
Common misconceptions
- People say the model "knows" the next token. Correction: the model computes a probability distribution over all candidate tokens. Nothing is "chosen" until a decoding step samples or selects from that distribution.
- High confidence (peaked distribution) is not the same as correctness. Models can be confidently wrong.
- Temperature fixes everything. Not true. Temperature reshapes distribution but doesn't fix missing knowledge or biases in training data.
Tiny thought experiment
You ask a model to finish the sentence: "The secret ingredient is"
- If the model has seen tons of recipe text, probabilities will favor foods like "salt" or "love" depending on corpus.
- Lower T yields the safe, likely completion: "salt".
- Higher T might produce cheeky completions like "time travel".
Ask yourself: what does the selection imply about the training data? You're seeing the training distribution leaking into outputs.
Practical tips for prompt engineers
- If you want reliability and repeatability, start with greedy or low-T sampling.
- For creative tasks, use top-p ~ 0.9 with T in 0.7-1.0 range as a default.
- Watch for repetition. If the model loops, decrease T or use top-k/top-p.
- If you need the model to produce something specific, bias logits via prompting or logit adjustments rather than hoping sampling will cooperate.
Pro tip: small changes in prompt often give bigger changes in distribution than fiddling with temperature by 0.1. The prompt writes the prior; decoding policies shape the noise.
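The logit-adjustment idea in the tips above can be sketched directly. This is a hypothetical helper for illustration, not any specific library's interface (hosted APIs typically expose the same idea as a logit-bias parameter):

```python
def apply_logit_bias(logits, bias):
    # bias maps token index -> additive adjustment applied before softmax.
    # Large positive values push a token toward selection;
    # large negative values effectively ban it.
    return [z + bias.get(i, 0.0) for i, z in enumerate(logits)]

biased = apply_logit_bias([2.0, 1.0, 0.1], {2: 100.0})
# After softmax, token 2 would dominate despite its low original logit.
```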
Closing: Why this matters for prompt engineering
Next-token probabilities are the language model's raw intentions. As a prompt engineer, you are designing the context and picking the decoding policy that turns those intentions into outputs. Understand softmax, temperature, and sampling strategies, and you go from being a lucky guesser to a principled creator of outputs.
Key takeaways:
- The transformer produces logits; softmax turns them into probabilities over the tokenized vocabulary.
- Sampling strategy plus temperature determine creativity vs reliability.
- Cross-entropy and perplexity tell you how good the model is at predicting tokens, not at being interesting.
Want to practice? Try a tiny experiment: take a prompt, get logits for next token, manually compute softmax, then sample with different temperatures and top-p values. Seeing the numbers and outcomes side by side will make this all click.
Next up: we can dig into logit manipulation, safety filters, or how to condition distributions via control tokens. Which chaotic rabbit hole would you like to jump into next?