Fundamentals of Machine Learning
Understand the core principles of machine learning, a subset of AI, and how it enables computers to learn from data.
Reinforcement Learning
Reinforcement Learning — The Tiny Tyrant That Teaches Itself
Imagine training a dog that only learns by getting treats after doing a trick, but the dog also has to decide which tricks to invent. Welcome to Reinforcement Learning (RL).
You already met Supervised Learning (the teacher gives you labeled flashcards) and Unsupervised Learning (you stare at a bunch of photos and try to find patterns). RL is the rowdy cousin who learns by doing, getting feedback from the world, and improvising under pressure.
Why RL matters (and why you should care)
RL is the framework for problems where decisions cause consequences over time. This is huge — games, robotics, recommendation systems that adapt, automated trading, even optimizing traffic lights. Unlike supervised learning, RL deals directly with sequences of actions and delayed rewards.
Quick reminder: Supervised learning solves mapping from inputs to labels. Unsupervised learning finds structure. RL optimizes behavior—a sequence of choices—to maximize cumulative rewards.
The RL cast of characters (aka core components)
- Agent: the learner or decision maker (our dog, robot, or algorithm).
- Environment: everything outside the agent; the simulator, the game, the real world.
- State (s): a representation of the current situation the agent observes.
- Action (a): a choice the agent can make.
- Reward (r): scalar feedback from the environment after an action. Think of it as applause or a slap.
- Policy (π): the agent's strategy, mapping states to action probabilities or actions.
- Value function (V or Q): estimates of expected cumulative reward from states or state-action pairs.
- Model (optional): a predictor of environment dynamics (used in model-based RL).
The magic in RL is not instant feedback; it is credit assignment — discovering which actions led to which future rewards.
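To make "cumulative reward" concrete, here is a minimal sketch (plain Python, no RL library assumed) of computing a discounted return, the quantity that value functions estimate. Even when the only reward arrives at the very end of a trajectory, discounting spreads credit back to earlier steps:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a reward sequence, weighting later rewards by powers of gamma."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g

# A trajectory where the only reward comes at the end:
# earlier steps still "see" it, shrunk by the discount factor.
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # 0.9**3 ≈ 0.729
```

With `gamma` close to 1 the agent is far-sighted; with `gamma` near 0 it only chases immediate reward.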
The learning loop — simple pseudocode
```
Initialize policy π and value estimates (if any)
loop:
    observe state s
    choose action a ~ π(.|s)        (explore or exploit)
    execute a, receive reward r and next state s'
    update policy / value estimates using (s, a, r, s')
    s <- s'
end loop
```
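The loop above can be sketched as runnable Python. The `step_stub` environment and its reward rule are invented purely for illustration; a real agent would also replace the uniform-random action choice with its policy:

```python
import random

def step_stub(state, action):
    """Hypothetical stand-in environment: reward 1 when the action matches the state's parity."""
    reward = 1.0 if action == state % 2 else 0.0
    next_state = (state + 1) % 4  # cycle through 4 toy states
    return reward, next_state

state = 0
transitions = []
for _ in range(8):                      # "loop:" from the pseudocode
    action = random.choice([0, 1])      # choose a ~ π(.|s); here π is uniform (pure exploration)
    reward, next_state = step_stub(state, action)        # execute a, receive r and s'
    transitions.append((state, action, reward, next_state))  # material for the update step
    state = next_state                  # s <- s'

print(len(transitions))  # 8 transition tuples, ready for a learning update
```

The `(s, a, r, s')` tuples collected here are exactly what the update step of any RL algorithm consumes.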
Two crucial tensions in that loop:
- Exploration vs Exploitation — Do you try something new or stick with the winning move?
- Short-term vs Long-term — Some actions give immediate reward, others pay off later.
Families of algorithms (the neighborhoods)
| Type | What it optimizes | Pros | Cons |
|---|---|---|---|
| Value-based (e.g., Q-learning, DQN) | Estimates Q(s,a) and picks best action | Often stable, simple policy derivation | Can struggle with continuous actions |
| Policy-based (e.g., REINFORCE) | Directly optimizes policy πθ | Handles continuous actions naturally | High variance, often needs lots of samples |
| Actor-Critic | Actor = policy, Critic = value estimate | Best of both worlds: lower variance | More complex, tuning needed |
| Model-based | Learns environment model and plans | Sample-efficient, interpretable | Model errors can hurt performance |
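To give the policy-based row some flesh, here is a minimal REINFORCE sketch on a two-armed bandit. The bandit, its reward values, and the hyperparameters are toy assumptions for illustration, not a reference implementation:

```python
import math, random

random.seed(0)
theta = [0.0, 0.0]            # one logit per arm; the policy is softmax(theta)
alpha = 0.1                   # step size
true_reward = [0.0, 1.0]      # arm 1 is strictly better (toy setup)

def softmax(logits):
    m = max(logits)           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(500):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1   # sample an arm from the policy
    r = true_reward[a]
    # REINFORCE: theta_i += alpha * r * d(log pi(a)) / d(theta_i)
    # For a softmax policy, that gradient is (1{i == a} - probs[i]).
    for i in range(2):
        grad_log = (1.0 - probs[i]) if i == a else -probs[i]
        theta[i] += alpha * r * grad_log

print(softmax(theta)[1])  # close to 1: the policy learned to prefer the better arm
```

Note the high variance the table warns about: every update uses a single sampled reward, which is why actor-critic methods add a value estimate as a baseline.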
A taste of an update rule
Q-learning update (tabular):
Q(s,a) <- Q(s,a) + α [r + γ max_{a'} Q(s',a') - Q(s,a)]
- α is the learning rate, γ is the discount factor. This is RL taking a sip of the future and adjusting its expectations.
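That update rule translates almost line for line into code. A minimal sketch using a dict as the Q-table (the state and action names here are made up for the example):

```python
def q_update(q, s, a, r, s2, actions, alpha=0.5, gamma=0.9):
    """Tabular Q-learning: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + gamma * max(q.get((s2, a2), 0.0) for a2 in actions)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (target - q.get((s, a), 0.0))
    return q[(s, a)]

# Suppose from state "A", action "right" leads to state "B" with no immediate reward,
# but "B" already has a promising action worth 1.0. Value flows backward:
q = {("B", "left"): 0.0, ("B", "right"): 1.0}
new_val = q_update(q, "A", "right", 0.0, "B", actions=["left", "right"])
print(new_val)  # 0.45 = 0.5 * (0 + 0.9 * 1.0 - 0)
```

This backward flow of value, one step at a time, is exactly how Q-learning solves the credit assignment problem.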
Historical and cultural context (because nerd history is fun)
- RL ideas trace to psychology and behaviorism (reward and punishment).
- The multi-armed bandit problem formalized exploration vs exploitation.
- Key ML milestones: Temporal Difference (TD) learning and Sutton & Barto formalizing RL theory.
- Deep RL resurgence began when DeepMind combined deep neural nets with Q-learning (DQN) and crushed Atari games, followed by AlphaGo defeating human Go champions.
Real-world examples — not just games
- Games: Atari, Go, StarCraft — RL showed superhuman performance, which made headlines.
- Robotics: learning locomotion or manipulation policies via trial-and-error in simulation then transferring to the real world.
- Recommendation systems: learn to maximize long-term engagement rather than immediate clicks.
- Healthcare: treatment policies that consider patient outcomes over time (caution: ethical and safety-critical).
- Operations: inventory control, scheduling, and traffic signal optimization.
Ask yourself: if decisions affect future states and rewards, could RL help? If yes, consider the cost of exploration and safety first.
Contrasting perspectives and pitfalls
- Model-free vs Model-based: Model-free methods skip learning a world model and directly learn value or policy — simpler but sample-hungry. Model-based learning is sample-efficient but brittle if your model is wrong.
- Reward specification: design the wrong reward and your agent will find creative ways to maximize it (reward hacking).
- Sample efficiency: RL often needs massive interaction data; in the real world, that can be expensive or dangerous.
- Safety and ethics: agents may take harmful shortcuts; safety constraints must be integrated.
A common misunderstanding: "RL just learns like humans do." Not quite. Humans bring huge priors, social knowledge, language, and safety instincts. Most RL agents are blank slates and will exploit loopholes unless guided.
Quick quiz for your brain (5 seconds each)
- Is supervised learning enough to teach a robot to plan across time? Why not?
- What's the tradeoff in exploration vs exploitation? Give an example from daily life.
- Name one reason you might prefer model-based RL over model-free in a healthcare setting.
(Answers: no — supervised learning maps inputs to labels but not sequences of decisions; exploration trades short-term reward for information; model-based may need fewer risky real-world trials.)
Final roundup — TL;DR and cheat sheet
- RL = learning by interaction with an environment to maximize cumulative reward.
- Core pieces: agent, environment, state, action, reward, policy, value.
- Main challenges: exploration vs exploitation, credit assignment, sample efficiency, and reward design.
- Big algorithm families: value-based, policy-based, actor-critic, model-based.
Parting thought: RL is a powerful way to get systems to act, not just predict. But with power comes responsibility — designing rewards and constraints thoughtfully is the difference between a helpful assistant and an opportunistic exploit machine.
If you enjoyed this, next up you can dive into a hands-on tutorial: implement tabular Q-learning on a small grid world and watch the agent learn a path home. That's where the theoretical sparks turn into satisfying, sometimes infuriating, actual learning.