Evaluation, Metrics, and Quality Control
Measure output quality with human and automated methods, track performance, and close the loop with monitoring.
Cost, Latency, and Quality Tradeoffs — The Tricky Three-Body Problem of Prompt Engineering
"You can have it cheap, fast, or perfect. Pick two — but also try not to pick the one that explodes."
We already covered accuracy, fluency, and coverage, plus safety and harms, and you learned how to iterate, test, and red-team prompts. Now we get to the ruthless practicality layer: when your prompt works on paper, but the real world demands budgets, deadlines, and user patience. This lesson helps you make principled tradeoffs between cost, latency, and quality, and gives you experiments and patterns to move confidently between them.
Why this matters (quick recap)
- From prior modules: you know how to measure accuracy, fluency, and coverage, and how to assess safety and harms.
- From iteration and debugging: you have a workflow for experiments, versioning, and red-teaming.
Now think of tradeoffs like tuning a three-way seesaw: push for higher quality and cost rises; push for lower latency and quality can fall. Your job is to decide which levers to pull, when, and how to measure the change so your choice is defensible.
The metrics you need to log (and why)
Cost
- Tokens per request: prompt_tokens + completion_tokens
- Price per 1k tokens (from your provider)
- Cost per request = (tokens / 1000) * price_per_1k
- Monthly cost estimate = cost_per_request * expected_requests_per_month
Latency
- p50, p95, p99 response times (end-to-end, including network)
- Cold start vs warm response
- Breakdown: network + model inference + post-processing
Quality
- Task-specific metrics (accuracy, BLEU/ROUGE when applicable, exact match)
- Human-rated fluency, relevance, safety checks
- Coverage and failure-mode counts from red-team tests
Operational
- Throughput (requests/sec)
- Error rates and retries
Pro tip: Log tokens and latency per request. These are the smallest atoms you will use to trade off cost and speed.
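The logging above can be sketched in a few lines. This is a minimal illustration with an in-memory list and a nearest-rank p95; the record shape and a 0.03-per-1k price are assumptions for the example, and a real system would write to a metrics store.

```python
import math

LOG = []  # per-request records: the smallest atoms for cost/latency tradeoffs

def record(prompt_tokens, completion_tokens, latency_ms):
    """Log one request's token usage and end-to-end latency."""
    LOG.append({"tokens": prompt_tokens + completion_tokens,
                "latency_ms": latency_ms})

def p95_latency():
    """Nearest-rank 95th percentile of logged latencies."""
    vals = sorted(r["latency_ms"] for r in LOG)
    return vals[max(0, math.ceil(0.95 * len(vals)) - 1)]

def avg_cost_per_request(price_per_1k):
    """Average cost per request from logged token counts."""
    total_tokens = sum(r["tokens"] for r in LOG)
    return total_tokens / len(LOG) / 1000 * price_per_1k

# Ten illustrative requests: 400 prompt + 100 completion tokens each.
for lat in [90, 95, 100, 102, 105, 110, 115, 120, 430, 980]:
    record(400, 100, lat)
```

Note how the two slow outliers (430 ms, 980 ms) dominate p95 while barely moving the average, which is why the checklist below insists on percentiles, not means.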
Simple math example
Suppose model A charges 0.03 per 1k tokens and model B charges 0.003 per 1k tokens. Typical request uses 500 tokens total.
- Cost per request, A: (500 / 1000) * 0.03 = 0.015
- Cost per request, B: (500 / 1000) * 0.003 = 0.0015
If model A yields 95% task accuracy and model B yields 85%, ask: are the extra 10 percentage points of accuracy worth 10x the cost? That depends on business impact.
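A better unit for this comparison is cost per *correct* answer, which folds accuracy into the price. A minimal sketch, using the numbers above:

```python
def cost_per_request(total_tokens, price_per_1k):
    """Cost of one request given total tokens and the provider's per-1k price."""
    return total_tokens / 1000 * price_per_1k

def cost_per_correct_answer(total_tokens, price_per_1k, accuracy):
    """Expected spend to obtain one correct answer."""
    return cost_per_request(total_tokens, price_per_1k) / accuracy

a = cost_per_correct_answer(500, 0.03, 0.95)   # model A
b = cost_per_correct_answer(500, 0.003, 0.85)  # model B
```

Even after adjusting for accuracy, model B is roughly 9x cheaper per correct answer here; the question becomes what each *incorrect* answer costs you, which this simple metric does not capture.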
Common tradeoff strategies (patterns you can use)
1) Cascade or tiered pipelines
- First pass: cheap, fast model or filters (small model, heuristics).
- Rerank or escalate: call the expensive model only when the cheap model is uncertain.
When to use: high throughput with occasional need for high fidelity.
Example: user question -> small model generates candidates -> classifier estimates confidence -> if confidence < threshold -> call big model for final answer.
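The escalation flow above can be sketched as a small routing function. The stub models and the length-based confidence heuristic are placeholders for illustration; in practice the models would be provider API calls and the confidence estimate would come from a classifier or logprobs.

```python
def answer_with_cascade(question, small_model, confidence_fn, big_model,
                        threshold=0.8):
    """Try the cheap model first; escalate only when confidence is low."""
    draft = small_model(question)
    if confidence_fn(question, draft) >= threshold:
        return draft, "small"
    return big_model(question), "big"

# Stubs for illustration only (assumed, not a real provider API):
small = lambda q: f"draft answer to {q}"
big = lambda q: f"careful answer to {q}"
confidence = lambda q, d: 0.9 if len(q) < 40 else 0.3  # placeholder heuristic
```

The threshold is your main tuning knob: lower it and cost drops but more uncertain answers ship; raise it and the big model absorbs more traffic.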
2) Reranking instead of generating
- Use an inexpensive candidate generator + expensive reranker (or vice versa).
- Reranker can be smaller/larger depending on latency tolerance.
When to use: creative outputs where top-n diversity matters.
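A reranking pipeline is structurally simple: the generator proposes, the scorer disposes. This sketch uses stub functions (assumed for illustration; real ones would be model calls, and the scorer is where you spend your quality budget):

```python
def rerank(question, generate_candidates, score, top_k=1):
    """Cheap generator proposes candidates; an expensive scorer picks the best."""
    candidates = generate_candidates(question)
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:top_k]

# Stubs for illustration only:
gen = lambda q: ["maybe", "good", "great"]
score_fn = lambda q, c: {"maybe": 0.2, "good": 0.7, "great": 0.9}[c]
```

Latency cost scales with candidate count, so the number of candidates is the lever to tune against your latency budget.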
3) Distillation and fine-tuning
- Train a smaller model on outputs from a larger one to capture behavior cheaply.
- Adds upfront cost but reduces per-request cost and latency long-term.
When to use: stable task with many requests and acceptable initial investment.
4) Caching and memoization
- Cache complete answers or partial computations for repeated prompts.
- Use normalization and keys for prompt templates.
When to use: high repetition scenarios (FAQ-like).
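Normalization is what makes caching pay off: "What is RAG?" and "  what is rag?  " should hit the same entry. A minimal in-memory sketch (the template name, dict cache, and call counter are assumptions for the example; production systems would use a shared store like Redis with TTLs):

```python
import hashlib

def cache_key(template_name, **slots):
    """Normalize slot values so equivalent prompts share one key."""
    normalized = "|".join(f"{k}={str(v).strip().lower()}"
                          for k, v in sorted(slots.items()))
    return hashlib.sha256(f"{template_name}:{normalized}".encode()).hexdigest()

_cache = {}
calls = {"n": 0}  # counts simulated paid model calls

def cached_answer(question):
    key = cache_key("faq_v1", question=question)
    if key not in _cache:
        calls["n"] += 1  # stands in for a paid model call
        _cache[key] = f"answer to {question.strip().lower()}"
    return _cache[key]
```

Log the hit rate: a cache that serves even 30% of traffic for free often changes which model you can afford for the rest.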
5) Streaming and early stopping
- Stream partial answers to users as tokens arrive; stop generation when confident.
- Early-stopping signals: token-level confidence scores or termination rules (e.g., stop sequences).
When to use: user-experience-focused applications where perceived latency matters.
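The streaming loop can be sketched as a generator that forwards tokens as they arrive and cuts off on a stop sequence or token budget. The blank-line terminator and the 256-token default are assumptions for the example:

```python
def stream_with_early_stop(token_stream, stop_on=("\n\n",), max_tokens=256):
    """Yield tokens as they arrive; stop on a terminator or a token budget."""
    emitted = []
    for i, token in enumerate(token_stream):
        emitted.append(token)
        yield token  # the user sees this immediately
        text = "".join(emitted)
        if i + 1 >= max_tokens or any(s in text for s in stop_on):
            break  # stop paying for tokens you no longer need
```

Perceived latency is governed by time-to-first-token, not total generation time, which is why streaming helps even when the full answer takes just as long.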
6) Prompt engineering to reduce tokens
- Compress context: summarize long histories, remove low-value tokens, use slot filling.
- Use few-shot wisely: sometimes 1-3 examples provide most benefit; beyond that you pay heavily in tokens.
When to use: long conversations and chain-of-thought contexts.
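The simplest form of context compression is a sliding window that keeps the system prompt and drops all but the most recent turns. A minimal sketch, assuming OpenAI-style `role`/`content` message dicts; a fuller version would summarize the dropped turns rather than discard them:

```python
def trim_history(messages, keep_last=4):
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

Every turn you drop is tokens you stop paying for on *every* subsequent request, so trimming compounds over long conversations.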
7) Parallelization and batching
- Batch multiple requests to the model if supported; parallelize independent tasks.
When to use: backend jobs and asynchronous workflows.
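For backend jobs, batching and parallelism can be sketched with the standard library alone. This example chunks requests and processes chunks on a thread pool; `call_model` is a stand-in for your real model call, and the batch/worker sizes are illustrative defaults to tune against provider rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def run_batched(requests, call_model, batch_size=8, workers=4):
    """Split requests into batches and process the batches in parallel.

    pool.map preserves input order, so results line up with requests.
    """
    batches = [requests[i:i + batch_size]
               for i in range(0, len(requests), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batch_results = pool.map(lambda b: [call_model(r) for r in b], batches)
    return [item for batch in batch_results for item in batch]
```

Threads suffice here because model calls are I/O-bound; if your provider offers a true batch endpoint, prefer it, since it usually costs less than parallel single calls.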
Decision framework: pick your strategy
Ask these questions in order:
- Is latency user-perceived and critical? If yes -> prioritize small models, streaming, caching.
- Is quality impact directly measurable in revenue or safety? If yes -> prioritize larger models, human review, stricter testing.
- What is request volume? High volume favors upfront investments like distillation and caching.
- What are failure costs (safety/regulatory)? High failure cost favors conservative pipelines with reranking and verification.
Experiment recipes (build on your iteration workflow)
- A/B test model swaps with controlled traffic splits.
- Log tokens, latency, and quality metrics per variant. Plot cost per successful outcome.
- Red-team the cheaper cascaded path to ensure safety thresholds are still met.
- Run sensitivity analysis: vary prompt length, example count, and temperature. Track marginal cost vs marginal accuracy.
Example experiment: 10k requests split across A (large model) and B (cheap cascade). Measure p95 latency, cost per correct answer, and unsafe output rate. Use statistical significance tests to choose the winner.
Quick comparison table
| Goal priority | Typical approach | Pros | Cons |
|---|---|---|---|
| Minimize cost | Small model, caching, distillation | Cheap, scalable | Lower top-tier quality |
| Minimize latency | Small model, streaming, short prompts | Fast UX | May sacrifice coverage |
| Maximize quality | Large model, human review, multi-stage QA | Best accuracy and safety | Expensive, slower |
Final checklist before deployment
- Are token counts controlled and logged?
- Did you measure p95 and p99, not just average?
- Is there a fallback for model failures and safety violations?
- Have you run cost-vs-quality experiments and documented results?
- Do you have a plan for model versioning and rollbacks?
Closing note
Tradeoffs are not moral failures — they are constraints. The artistry of prompt engineering is learning to turn constraints into levered advantages. Build small experiments, measure the real costs (money and human attention), and design flows that escalate only when necessary. Make the machine do the cheap grunt work and call in the heavy artillery only when it matters.
If you remember one thing: measure everything that moves. When you can quantify cost, latency, and quality on the same axis, tradeoffs stop being guesswork and start being strategy.