Foundations of Generative AI
Establish how modern LLMs generate text, the role of tokens and probabilities, and the constraints that shape prompt behavior.
Tokens and Tokenization
Tokens and Tokenization — The Tiny Building Blocks That Run Large Models
"If transformers are the brains, tokens are the neurons — tiny, weird, and absolutely essential."
You just read about transformer internals and where deep learning sits in the stack (shout-out to the previous modules). Now it’s time to zoom in even closer: how do words, punctuation, and even emojis become something a model can actually compute with? Welcome to the gritty little world of tokens and tokenization.
Hook: Imagine building IKEA furniture without screws
You get a box of parts, but nothing is labeled. Some things look like planks, some like bolts — and you call customer support. That’s a model without tokenization. Tokens are the screws and bolts that let the machine assemble language into meaning.
Why this matters: tokenization determines how input is chopped, how many tokens your prompt costs, how the model generalizes to rare words, and how outputs can get weirdly split. For prompt engineering, tokenization is a silent contract between you and the model.
What is a token? What is tokenization?
- Token: a discrete unit the model uses as input/output. Could be a whole word, part of a word, a punctuation mark, or even a byte sequence.
- Tokenization: the process that maps raw text (human language) into a sequence of tokens.
Big idea: tokens are not the same as words. The word "unbelievable" might be one token, two, or five tokens depending on the tokenizer. That affects both cost (token limits) and performance.
Common tokenization strategies (simple cheat-sheet)
| Type | What it does | Pros | Cons |
|---|---|---|---|
| Character-level | Splits into individual characters | No OOV (out-of-vocab), simple | Long sequences, inefficient |
| Word-level | Splits on whitespace/punctuation | Intuitive, short tokens | Huge vocab, fails on rare words/languages |
| Subword (BPE, WordPiece, Unigram) | Breaks words into common subparts | Compact vocab, handles rare words | Can split inside morphemes, non-intuitive breaks |
| Byte-level | Encodes bytes directly (e.g., UTF-8) | Language-agnostic, robust | Less human-readable tokens |
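The strategies in the table can be sketched with toy splitters. These are illustrative one-liners, not production tokenizers (real subword tokenizers learn their vocabulary from data):

```python
# Toy versions of three tokenization strategies from the table above.
text = "unbelievable results!"

char_tokens = list(text)                  # character-level: one token per character
word_tokens = text.split()                # word-level: split on whitespace
byte_tokens = list(text.encode("utf-8"))  # byte-level: raw UTF-8 bytes

print(len(char_tokens))  # 21
print(len(word_tokens))  # 2
print(len(byte_tokens))  # 21 (pure ASCII: one byte per character)
```

Note how character-level and byte-level counts agree here only because the text is ASCII; an emoji would be 1 character but 4 UTF-8 bytes.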
Quick explainer of subword algorithms
- BPE (Byte-Pair Encoding): start with chars, iteratively merge most frequent pairs into new tokens. Good balance of vocab size vs coverage.
- WordPiece: similar to BPE but optimized differently (used in some BERT models).
- Unigram: probabilistic, chooses token set that maximizes likelihood under a unigram model.
- Byte-level BPE: tokenization over raw bytes so it can represent any unicode without special handling (used by some GPT models).
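The core of BPE training is simple enough to sketch. Here is a toy version of a single merge step (find the most frequent adjacent pair and fuse it); real implementations also track a vocabulary and merge rules for later encoding:

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE training step: merge the most frequent adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b) = max(pairs, key=pairs.get)  # most frequent pair (first seen wins ties)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # fuse the pair into one new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start at character level
for _ in range(2):
    tokens = bpe_merge_step(tokens)
print(tokens)  # 'l'+'o' then 'lo'+'w' have merged into a reusable 'low' token
```

After two merges, the shared stem "low" has become a single token, which is exactly why BPE compresses common word parts so well.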
Real-world analogies (because metaphors stick)
- Tokens are LEGO bricks. Words can be big bricks or tiny bricks. Subword tokenizers give you flexible brick sizes so you can build rare or complex words without an infinite toy box.
- Tokenization is like cutting a loaf of bread. Too thick: you can’t butter evenly. Too thin: you’re chewing forever.
What tokenization looks like (examples)
Input: I'm learning to code 🤖 — and I love it!
Possible tokens (subword/BPE-style): ['I', "'m", ' learning', ' to', ' code', ' 🤖', ' —', ' and', ' I', ' love', ' it', '!']
Token count: ~12 (varies by tokenizer; the emoji and em dash may each split into several byte-level tokens)
A concrete version using the tiktoken library (`pip install tiktoken`); the exact splits shown are for the cl100k_base encoding and will differ for other models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("I'm learning to code 🤖 — and I love it!")

print(len(ids))                        # token count for this encoding
print([enc.decode([i]) for i in ids])  # per-token strings; the emoji and em dash
                                       # may decode as partial-byte fragments
```
Notice how punctuation, emoji, and contractions can be split into separate tokens. That affects generation: the model learns patterns over specific token sequences, so an unexpected split (say, a contraction breaking differently than in training data) can change fluency.
Tokenization and prompt engineering — the tricks you actually need
- Token budget matters: model limits are in tokens, not characters. A dense Unicode string can cost far more tokens than its character count suggests.
- Watch out for surprising splits: long compound words or rare proper nouns might become many tokens. That eats your budget and can harm performance.
- Special tokens: some models reserve special tokens (e.g., `<|endoftext|>` or chat-role markers) for system signals. Know them — they might be counted against your budget or reserved.
- Whitespace is meaningful: many tokenizers treat leading spaces differently, which can change completions. For example, 'hello' vs ' hello' might tokenize differently.
- Language and script effects: tokenizers tuned on English can perform worse on languages with different morphology (e.g., agglutinative languages) unless byte-level or multilingual tokenizers are used.
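A quick budget check can catch oversized prompts before they hit the API. This sketch uses the common (English-only, approximate) ~4-characters-per-token heuristic; the limit and reserve numbers are placeholder assumptions — always verify with the real tokenizer:

```python
# Rough token-budget estimate using the ~4 chars/token heuristic for English.
# This is a sanity check, not a substitute for running the actual tokenizer.

def estimate_tokens(text, chars_per_token=4):
    return max(1, round(len(text) / chars_per_token))

def fits_budget(prompt, max_tokens=4096, reserve_for_output=512):
    """Leave headroom for the model's response, not just the prompt."""
    return estimate_tokens(prompt) <= max_tokens - reserve_for_output

prompt = "Summarize the following report in three bullet points."
print(estimate_tokens(prompt), fits_budget(prompt))
```

The heuristic undercounts badly for emoji-heavy, non-Latin, or code-heavy text, which is exactly when you should reach for the real tokenizer instead.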
Mini case study: Why a single character can explode token count
Consider code snippets or hex dumps. A JSON blob with lots of short keys can tokenize into many small subwords or bytes. That translates into higher cost and latency. When building prompts that include long data, think about compression or summarization before sending.
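Two cheap wins for JSON payloads are dropping fields the model doesn't need and using compact separators. A minimal sketch (the record and field names here are made up for illustration):

```python
import json

# Hypothetical record with fields the model doesn't need to see
record = {"user_id": 12345, "name": "Ada", "active": True,
          "debug_trace": "stack frames...", "internal_flags": [0] * 20}

pretty = json.dumps(record, indent=2)  # what often gets pasted into prompts

# Keep only the fields the task needs, and drop the whitespace
trimmed = {k: v for k, v in record.items() if k in {"user_id", "name", "active"}}
compact = json.dumps(trimmed, separators=(",", ":"))

print(len(pretty), len(compact))  # fewer characters generally means fewer tokens
```

Character count is only a proxy for token count, but for JSON the two usually shrink together.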
Diagnostic moves (How to inspect tokenization)
- Always run your tokenizer on representative prompts and count tokens before sending to the model.
- Use tokenizer.debug/encode methods in SDKs to see how text maps to tokens.
- Try alternate phrasings to reduce token count: for example, 'cannot' often tokenizes to fewer tokens than 'can not'. Small merges add up over long prompts.
Quick checklist:
- Did I include unexpected whitespace or hidden characters? (copy-paste gremlins)
- Are there many rare names or emojis? They cost tokens.
- Do I need byte-level safety for non-Latin scripts?
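The first checklist item — copy-paste gremlins — is easy to automate. A small diagnostic, assuming you just want to flag the usual invisible suspects (non-breaking spaces, zero-width characters, BOMs):

```python
import unicodedata

# Common invisible characters that sneak in via copy-paste
SUSPECTS = {"\u00a0", "\u200b", "\u200e", "\ufeff"}  # nbsp, ZWSP, LRM, BOM

def find_hidden_chars(text):
    """Return (index, codepoint, name) for invisible/format characters."""
    return [(i, hex(ord(c)), unicodedata.name(c, "UNKNOWN"))
            for i, c in enumerate(text)
            if c in SUSPECTS or unicodedata.category(c) == "Cf"]

print(find_hidden_chars("hello\u200bworld"))  # zero-width space at index 5
```

Run this over any prompt that tokenizes to more tokens than you expect — hidden characters are a frequent culprit.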
Expert take: "Tokenization is not just an implementation detail. It's a design decision that shapes model behavior, costs, and fairness across languages."
Closing — TL;DR and Actionable Takeaways
- Tokens are the atoms of language models; tokenization is the chemistry that makes atoms usable.
- Prefer subword/byte-level tokenizers for modern models: they balance vocab size and coverage.
- Always inspect tokenization for your prompts — it can save you money and improve results.
- Be mindful of special tokens, whitespace sensitivity, and multilingual quirks.
Parting challenge: take your favorite prompt and run it through the tokenizer. How many tokens does it produce? Where are the splits? Tweak the text to halve the token count. That tiny exercise will instantly make you a sharper prompt engineer.
Version note: This builds on the transformer internals you saw earlier (attention needs sequence indices, and tokens are the sequence). Next up: how token embeddings convert tokens into vectors the transformer can actually reason about.