
Generative AI: Prompt Engineering Basics

Foundations of Generative AI


Establish how modern LLMs generate text, the role of tokens and probabilities, and the constraints that shape prompt behavior.


Probabilities and Next-Token Prediction

Probabilities but Make It Cheeky


Foundations of Generative AI: Probabilities and Next-Token Prediction

You already know what a token is and how transformers move attention around like a drama queen. Now we ask the machine a slightly less dramatic but more useful question: what is the most likely next token given everything it's seen so far?


Hook: Imagine the model as a nervous fortune teller

You give it a sequence of tokens (remember tokenization from the previous module) and it whispers probabilities for every possible next token. It does not pick a single answer in its head and then pretend nothing else exists — it assigns a probability to each token, like a weather forecast that says 70% chance of sunshine, 20% rain, 10% meteorite. That distribution is the beating heart of generative models.

Why care? Because every generation decision — greedy, random sampling, top-k, top-p, or beam search — comes from this probability distribution. Mess with the distribution, and you change the model's personality: bland, creative, repetitive, or surprisingly poetic.


Quick recap: where probabilities come from (building on the transformer primer)

  • The transformer gives each candidate token a score called a logit via a final linear layer applied to the decoder's hidden state.
  • Those logits are converted into a probability distribution using the softmax function.
  • That distribution is the model saying, for each token t: p(t | context) = probability of t being the next token given everything before it.

This uses the tokenization you already know: the set of candidate tokens is the vocabulary created during tokenization.


The math you actually need (but friendly)

Softmax turns logits into probabilities. If logits are z1, z2, ..., zN then

softmax(zi) = exp(zi) / sum_j exp(zj)

Concrete tiny example:

logits = [2.0, 1.0, 0.1]
exp = [e^2.0, e^1.0, e^0.1] ≈ [7.39, 2.72, 1.11]
sum ≈ 11.22
probs ≈ [0.659, 0.243, 0.099]

So the first token gets about 66% probability, second 24%, third 10%.
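The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of softmax, not how a real framework implements it (production code works on tensors and in log space):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtract the max logit for numerical stability; it cancels out
    # in the ratio and doesn't change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

Note the max-subtraction trick: without it, large logits overflow `exp` even though the final probabilities are perfectly well behaved.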

Temperature changes the mood. Divide logits by temperature T before softmax:

modified_logits = logits / T
  • T < 1 -> sharper distribution, more confident, more greedy
  • T > 1 -> flatter distribution, more diverse, more creative

Think of temperature as the volume knob on the model's indecisiveness.
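You can watch the volume knob work by rescaling the logits from the earlier example before softmax (a toy sketch; real decoders apply the same idea per step):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens.
    probs = softmax([z / T for z in logits])
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T=0.5 the top token's probability climbs toward certainty; at T=2.0 the distribution flattens and the long tail gets real chances of being sampled.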


Sampling methods: how we turn probabilities into actual tokens

Why different methods? Because probability alone doesn't dictate how we should convert that distribution into a token. Different strategies trade off creativity, coherence, and computational cost.

  1. Greedy

    • Pick the argmax token every time.
    • Pros: deterministic, fast. Cons: boring and prone to loops.
  2. Temperature sampling

    • Sample from softmax(logits / T).
    • Pros: easy, tunable. Cons: can still pick very unlikely tokens when T is high.
  3. Top-k sampling

    • Keep only the k highest-probability tokens, renormalize, sample.
    • Pros: removes extremely low-probability noise. Cons: fixed k may be awkward for long tails.
  4. Top-p (nucleus) sampling

    • Keep the smallest set of tokens whose cumulative probability ≥ p, renormalize, sample.
    • Pros: adaptive; keeps enough tokens to reach desired mass. Cons: slightly more compute.
  5. Beam search

    • Keep multiple candidate sequences (beams) and expand them, picking highest-scoring final sequences.
    • Pros: better for tasks requiring global coherence (translation). Cons: can be too conservative and produce generic outputs.
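Strategies 1, 3, and 4 can be sketched in a few lines over a toy distribution (temperature sampling is just softmax with scaled logits, and beam search needs a full sequence model, so both are omitted here):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def greedy(probs):
    # Always take the single highest-probability token.
    return max(range(len(probs)), key=probs.__getitem__)

def top_k(probs, k, rng=random):
    # Keep the k most likely tokens, renormalize, then sample.
    idx = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in idx)
    return rng.choices(idx, weights=[probs[i] / mass for i in idx])[0]

def top_p(probs, p, rng=random):
    # Keep the smallest high-probability prefix whose cumulative mass >= p.
    idx = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cum = [], 0.0
    for i in idx:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    return rng.choices(kept, weights=[probs[i] / mass for i in kept])[0]

probs = softmax([2.0, 1.0, 0.1, -3.0])
print(greedy(probs))        # always token 0
print(top_k(probs, k=2))    # token 0 or 1
print(top_p(probs, p=0.9))  # drawn from the nucleus
```

Note how top-p adapts: with these logits the nucleus holds three tokens, while the near-zero fourth token is excluded no matter how unlucky the draw.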

Table: quick comparison

| Method      | Creativity     | Determinism        | When to use                              |
|-------------|----------------|--------------------|------------------------------------------|
| Greedy      | Low            | Deterministic      | Short answers, strict constraints        |
| Temperature | Medium to high | Stochastic         | Story generation with tunable diversity  |
| Top-k       | Medium         | Stochastic         | Remove tiny tails, faster sampling       |
| Top-p       | Medium-high    | Stochastic         | Best general-purpose sampler             |
| Beam        | Low-medium     | Deterministic-ish  | Translation, summarization when quality matters |

Loss, evaluation, and the unpleasant truth: cross entropy and perplexity

During training the model is optimized to assign high probability to the true next token. The standard loss is cross-entropy: the negative log of the probability the model assigned to the true next token, averaged over all positions:

loss = -(1/N) * sum_i log p(true_token_i | context_i)

Perplexity is an exponentiated average loss, roughly "how surprised the model is on average":

perplexity = exp(average negative log-likelihood)

Lower perplexity = better next-token prediction. But remember: low perplexity doesn't mean good creative writing. It means the model is good at matching training distribution.
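The two definitions chain together neatly. A tiny sketch, assuming you already have the probability the model gave each true next token:

```python
import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each true next token."""
    # Cross-entropy per token is -log p(true token); perplexity
    # exponentiates the average, turning it back into a "branching factor".
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the true token probability 0.5 is, on average,
# as "surprised" as a fair coin flip: perplexity 2.
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
```

The coin-flip intuition generalizes: perplexity k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.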


Common misconceptions

  • People say the model "knows" the next token. Correction: the model computes a distribution of plausibilities. It does not "choose" until sampling or decoding happens.
  • High confidence (peaked distribution) is not the same as correctness. Models can be confidently wrong.
  • Temperature fixes everything. Not true. Temperature reshapes distribution but doesn't fix missing knowledge or biases in training data.

Tiny thought experiment

You ask a model to finish the sentence: "The secret ingredient is"

  • If the model has seen tons of recipe text, probabilities will favor foods like "salt" or "love" depending on corpus.
  • Lower T yields the safe, likely completion: "salt".
  • Higher T might produce cheeky completions like "time travel".

Ask yourself: what does the selection imply about the training data? You're seeing the training distribution leaking into outputs.


Practical tips for prompt engineers

  • If you want reliability and repeatability, start with greedy or low-T sampling.
  • For creative tasks, use top-p ~ 0.9 with T in 0.7-1.0 range as a default.
  • Watch for repetition. If the model loops, decrease T or use top-k/top-p.
  • If you need the model to produce something specific, bias logits via prompting or logit adjustments rather than hoping sampling will cooperate.

Pro tip: small changes in prompt often give bigger changes in distribution than fiddling with temperature by 0.1. The prompt writes the prior; decoding policies shape the noise.


Closing: Why this matters for prompt engineering

Next-token probabilities are the language model's raw intentions. As a prompt engineer, you are designing the context and picking the decoding policy that turns those intentions into outputs. Understand softmax, temperature, and sampling strategies, and you go from being a lucky guesser to a principled creator of outputs.

Key takeaways:

  • The transformer produces logits; softmax turns them into probabilities over the tokenized vocabulary.
  • Sampling strategy plus temperature determine creativity vs reliability.
  • Cross-entropy and perplexity tell you how good the model is at predicting tokens, not at being interesting.

Want to practice? Try a tiny experiment: take a prompt, get logits for next token, manually compute softmax, then sample with different temperatures and top-p values. Seeing the numbers and outcomes side by side will make this all click.
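If you don't have logits handy, you can still run the experiment with made-up numbers. The vocabulary and logits below are purely hypothetical, chosen to mimic the "secret ingredient" example; a fixed seed makes the runs repeatable:

```python
import math
import random

def softmax(logits, T=1.0):
    # Temperature folds into softmax by scaling the (shifted) logits.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical next-token candidates after "The secret ingredient is",
# with made-up logits -- purely for illustration.
vocab = ["salt", "love", "butter", "time travel"]
logits = [3.1, 2.2, 1.8, -1.0]

rng = random.Random(0)  # fixed seed so the experiment is repeatable
for T in (0.3, 1.0, 1.5):
    probs = softmax(logits, T)
    samples = [rng.choices(vocab, weights=probs)[0] for _ in range(10)]
    print(f"T={T}: {samples}")
```

At T=0.3 you should see "salt" dominate; as T climbs, "time travel" starts sneaking in.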

Up next: we can dig into logit manipulation, safety filters, or how to condition distributions via control tokens. Which chaotic rabbit hole would you like to jump into next?
