Reasoning and Decomposition Techniques
Elicit better thinking with outline-first strategies, hypothesis testing, and verification-first prompting.
Hypothesis Generation — The Detective Work of Prompt Engineering
"You don't need a magic model. You need better guesses."
If Self-Ask and Subquestioning taught you how to interrogate a problem like a polite but relentless lawyer, and Rationale-Lite gave you the economical shorthand for why an answer made sense, then Hypothesis Generation is the moment you become the detective: you generate plausible explanations, rank them, and design small experiments (prompts) to see which one survives interrogation.
This lesson builds on Structuring Outputs and Formats: once you generate hypotheses, you'll want to express them in strict schemas so your model's answers can be parsed, tested, and scored automatically.
Why hypothesis generation matters (and why humans still beat magic)
- Models can spit plausible-sounding answers. Hypotheses force us to consider alternatives instead of accepting the first shiny thing.
- Hypotheses make reasoning testable. Instead of "the model said X", you get "Hypothesis A predicts outcome Y; run the test; measure Z."
- Hypothesis-driven prompts reduce confirmation bias: they make your prompt a little scientific method instead of a wish.
Think of it like debugging code: you don't randomly change lines hoping for the best. You form hypotheses about what might be broken, then run targeted tests. Prompt engineering is the same, but with words.
Types of hypotheses you'll use (quick table)
| Type | What it looks like | When to use |
|---|---|---|
| Causal | 'If prompt lacks context, model hallucinates' | Model gives wrong facts or invents sources |
| Correlational | 'Short prompts tend to return generic answers' | You want to decide prompt length tradeoffs |
| Heuristic | 'Asking for steps reduces missing substeps' | Designing task decomposition prompts |
| Edge-case | 'Dates near DST confuse the model' | Robustness and QA |
A practical workflow: From observation to tested hypothesis
- Observe: gather failing examples or behaviors (low precision, hallucination, missing steps).
- Generate 5 candidate hypotheses (fast, sloppy, creative). Use Rationale-Lite to attach a 1-2 line reason for each.
- Prioritize by plausibility and measurability.
- Design micro-tests (prompts + output schema) to distinguish hypotheses.
- Run tests on batches, parse outputs, score by metrics.
- Iterate: refine hypotheses or decompose them into subhypotheses using Self-Ask.
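The generate-and-prioritize steps above can be sketched in Python. The scoring formula (plausibility times measurability) and the example hypotheses are illustrative assumptions, not a fixed rule from this lesson:

```python
# Candidate hypotheses with rough scores attached (invented for the demo).
hypotheses = [
    {"id": "H1", "text": "Prompt lacks context, so the model hallucinates",
     "plausibility": 0.8, "measurability": 0.9},
    {"id": "H2", "text": "Short prompts tend to return generic answers",
     "plausibility": 0.6, "measurability": 0.7},
    {"id": "H3", "text": "The model is having a bad day",
     "plausibility": 0.2, "measurability": 0.1},
]

def priority(h):
    # Combine how likely the hypothesis is with how cheaply it can be tested.
    return h["plausibility"] * h["measurability"]

# Highest-priority hypotheses come first; test those first.
ranked = sorted(hypotheses, key=priority, reverse=True)
```

Untestable guesses like H3 sink to the bottom on measurability alone, which is exactly the triage you want.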
Example: model keeps inventing sources
- Observation: answers include fake citations.
- Hypotheses:
- H1: The prompt doesn't request source format (causal).
- H2: The model hallucinates when the knowledge cutoff isn't specified (heuristic).
- H3: Asking for 'no made-up sources' is ambiguous and ignored (correlational).
- H4: Short prompts are missing a constraint token (edge-case).
- Tests: design 4 prompts each targeting one hypothesis, keep response schema strict (see below).
Prompt templates for hypothesis generation
Quick generator: "List 5 hypotheses for why the model [observed behavior]. For each, give a 1-sentence rationale and a 1-line test you can run."
Example prompt you can drop into a model to brainstorm hypotheses:
You saw that model X frequently invents references. Generate 5 possible hypotheses explaining this. For each hypothesis include:
- Hypothesis: short sentence
- Rationale (rationale-lite): 1 sentence
- Test Prompt: one short prompt to run that would confirm or disconfirm this hypothesis
Return as a JSON array of objects.
Note: tie this to a JSON schema (below) for easy parsing and scoring.
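Parsing the model's reply might look like this sketch; the raw string and the simplified field names (`Hypothesis`, `Rationale`, `Test Prompt`) are stand-ins for a real API response:

```python
import json

# Hypothetical raw model reply to the brainstorm prompt above;
# in practice this string comes back from your model API call.
raw = '''[
  {"Hypothesis": "The prompt never requests a source format",
   "Rationale": "Without a format, the model fills the gap creatively",
   "Test Prompt": "Explain T. Cite sources as [Author, Year] or say none."}
]'''

hypotheses = json.loads(raw)
for h in hypotheses:
    # Fail loudly if the model dropped a required field.
    assert {"Hypothesis", "Rationale", "Test Prompt"} <= h.keys()
```

Each parsed object is now directly runnable as a micro-test: feed its `Test Prompt` back to the model and check the outcome.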
Output schema: make hypotheses machine-actionable
You already learned to enforce structure. Here’s a minimal schema you can use when asking the model to generate hypotheses:
[{
  "id": "H1",
  "hypothesis": "string",
  "rationale_lite": "string",
  "test_prompt": "string",
  "expected_outcome": "string",
  "priority": "low|medium|high"
}]
Using a schema means you can automatically run the test_prompt, parse the result, and compute whether expected_outcome occurred. This closes the loop from ideation to evaluation.
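A minimal validator for that schema might look like the sketch below; the field names come from the schema above, while the example hypothesis is invented for illustration:

```python
REQUIRED_FIELDS = {"id", "hypothesis", "rationale_lite",
                   "test_prompt", "expected_outcome", "priority"}
PRIORITIES = {"low", "medium", "high"}

def validate_hypothesis(obj: dict) -> list:
    """Return a list of schema violations (an empty list means valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS - obj.keys()]
    if obj.get("priority") not in PRIORITIES:
        errors.append("priority must be low|medium|high")
    return errors

# An invented-but-valid hypothesis object for the demo.
good = {"id": "H1",
        "hypothesis": "Prompt lacks a source-format constraint",
        "rationale_lite": "No format means the model improvises",
        "test_prompt": "Explain T; cite as [Author, Year] or say 'no sources'",
        "expected_outcome": "Zero fabricated citations",
        "priority": "high"}
```

Run the validator before queuing any test: a malformed hypothesis is cheaper to reject at ideation time than after a batch of model calls.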
How to design tests that actually distinguish hypotheses
- Keep tests minimal: change only the variable implicated by the hypothesis.
- Use structured outputs so automated checks are possible. For example, instruct the model to return JSON with fields 'sources' (array) and 'confidence' (0-1).
- Use control prompts: run the same base prompt with and without the hypothesized change.
Example micro-test (pseudo):
Base prompt: Explain topic T and provide up to 3 sources.
Test 1 (H1): Add 'Provide only real sources; if none known, answer "no sources"'.
Compare counts of fabricated sources across runs.
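Scoring that micro-test could be sketched as follows, assuming each parsed run exposes a `sources` array and that fabricated sources can be flagged against a known-source allowlist (both assumptions for the demo):

```python
# Invented allowlist standing in for whatever ground truth you have.
KNOWN_SOURCES = {"Smith 2019", "Jones 2021"}

def fabricated_count(run: dict) -> int:
    # Count sources in one run that are not in the allowlist.
    return sum(1 for s in run["sources"] if s not in KNOWN_SOURCES)

# Fake parsed outputs: control = base prompt, treatment = base + H1's constraint.
control_runs = [{"sources": ["Smith 2019", "Totally Real Journal 2031"]},
                {"sources": ["Imaginary et al. 2099"]}]
treatment_runs = [{"sources": ["Smith 2019"]},
                  {"sources": []}]

control_score = sum(fabricated_count(r) for r in control_runs)
treatment_score = sum(fabricated_count(r) for r in treatment_runs)
```

If `treatment_score` is consistently lower than `control_score` across batches, H1 survives; if the two are indistinguishable, discard H1 and test the next hypothesis.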
Decomposition & Self-Ask: when a hypothesis is too big
If a hypothesis is broad ("the model hallucinated because of prompt ambiguity"), decompose it:
- Use Self-Ask to list subquestions that must be true for the hypothesis to hold.
- Convert subquestions into test prompts.
Example subquestions:
- Did the prompt include an explicit phrase forbidding invention?
- Did the model list any sources with URL patterns?
- Was the question time-bounded (post-cutoff)?
Answer each subquestion with short, structured outputs — Rationale-Lite works excellently here.
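Those subquestions can be answered mechanically with a sketch like this one; the prompt text, answer fields, and regexes are all illustrative assumptions:

```python
import re

def subquestion_answers(prompt: str, answer: dict) -> dict:
    """Turn each Self-Ask subquestion into a yes/no check."""
    return {
        # Did the prompt include an explicit phrase forbidding invention?
        "forbids_invention": "no made-up sources" in prompt.lower(),
        # Did the model list any sources with URL patterns?
        "has_url_sources": any(re.search(r"https?://", s)
                               for s in answer.get("sources", [])),
        # Was the question time-bounded (a post-2020 year appears)?
        "time_bounded": bool(re.search(r"\b20[2-9]\d\b", prompt)),
    }

checks = subquestion_answers(
    "Summarize events of 2026. Use no made-up sources.",
    {"sources": ["https://example.org/report"]},
)
```

Each boolean becomes evidence for or against the parent hypothesis, which is what lets you decompose "prompt ambiguity" into something you can actually score.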
Common pitfalls and how to avoid them
- Confirmation bias: don’t just craft tests that confirm your favorite hypothesis. Design discriminative tests.
- Overgeneration: many hypotheses are useless. Use priority scoring (impact x ease) to triage.
- Vagueness: 'because the model is dumb' is not a hypothesis. Make it testable.
- Schema drift: if the model keeps returning malformed JSON, include schema enforcement and a validator step.
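For the schema-drift pitfall, a validator step with retries might be sketched like this; `call_model` is a stand-in for your actual API call:

```python
import json

def parse_with_retry(call_model, prompt: str, max_tries: int = 3):
    """Re-prompt until the reply parses as JSON, then return the parsed value."""
    for _ in range(max_tries):
        reply = call_model(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError:
            # Tighten the instruction and try again.
            prompt += "\nReturn ONLY valid JSON, no prose."
    raise ValueError("model never produced valid JSON")

# Fake model for the demo: fails once, then complies.
replies = iter(["Sure! Here you go: {oops", '{"sources": []}'])
result = parse_with_retry(lambda p: next(replies), "Explain T as JSON.")
```

Combine this with the schema validator above and malformed outputs stop silently corrupting your pass-rate metrics.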
A tiny pseudocode experiment runner
for hypothesis in hypotheses:
    outputs = [run_model(hypothesis["test_prompt"]) for _ in range(N)]  # run the test prompt N times
    parsed = [parse_with_schema(o) for o in outputs]                    # enforce the output schema
    hits = sum(matches(p, hypothesis["expected_outcome"]) for p in parsed)
    hypothesis["pass_rate"] = hits / N                                  # record the pass rate
ranked = sorted(hypotheses, key=lambda h: h["pass_rate"], reverse=True) # best-supported first
This is your experimental loop. Repeat, refine, and don't be afraid to throw away hypotheses that don't survive.
Closing: Why this matters for prompt engineering
Hypothesis generation turns prompt work from artisanal guesswork into a repeatable method. When combined with Rationale-Lite (quick why notes), Self-Ask (decompose tests), and strict output schemas (structuring outputs and formats), you get a robust pipeline:
- Brainstorm plausible causes
- Attach lightweight rationales
- Design structured, testable prompts
- Run, parse, score, and iterate
Final thought: models will always be probabilistic storytellers. Your job is to be a skeptical editor — propose competing stories, choose the most falsifiable, and let data (and the model's behavior) decide. That’s where progress lives.
Key takeaways
- Generate multiple, testable hypotheses, not just one favored explanation.
- Use Rationale-Lite so each hypothesis carries a compact justification.
- Make tests minimal and outputs structured; automate parsing and scoring.
- Decompose big hypotheses with Self-Ask into concrete subtests.
Go forth like a charmingly cranky detective: make bold guesses, demand proof, and never trust a source without a JSON schema.