Foundations of Probability and Graphical Models
Core probability concepts and the rationale for representing distributions with graphs.
Probability Axioms: The Laws Your Graphs Secretly Obey
If your probabilities do not follow the axioms, a Bayesian network will find you and quietly refuse to factorize.
You want to build Bayesian networks and decision graphs that do not collapse like a bad soufflé? Cool. Then we start with the foundations: the probability axioms. These are the three simple rules that keep all of probability coherent, computable, and unchaotic. Think of them as constitutional law for uncertainty. Everything else — conditional probability, Bayes, independence, d-separation, the whole graphical drama — descends from here.
What Are We Even Axiomatizing?
We deal with a triple (Omega, F, P):
- Omega (the sample space): all possible outcomes. Roll a die: {1,2,3,4,5,6}. Toss coins forever: infinite sequences of H/T.
- F (a sigma-algebra): a collection of events (subsets of Omega) that is closed under complements and countable unions. Yes, countable. Your intuition might protest, but your integrals will thank you.
- P: F -> [0,1], a probability measure assigning numbers to events.
The Kolmogorov Axioms
1) Non-negativity: For any event A in F, P(A) >= 0.
2) Normalization: P(Omega) = 1.
3) Countable additivity: If A1, A2, ... are pairwise disjoint,
P(Union_i Ai) = Sum_i P(Ai).
That is it. Three lines. Yet they power everything from weather forecasts to your favorite recommender system insisting you want four nearly identical water bottles.
Axioms are not conclusions; they are the rules of the game. Play by them, or probability politely declines to exist.
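Here is a minimal Python sketch (the event sets and the `prob` helper are invented for illustration) that verifies all three axioms by brute force on a die:

```python
from fractions import Fraction

# Uniform distribution over a die: Omega = {1,...,6}.
dist = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(event):
    """P(A) for an event A (a set of outcomes): sum the point masses."""
    return sum(dist[w] for w in event)

omega = set(dist)

# Axiom 1: non-negativity of every point mass, hence of every event.
assert all(p >= 0 for p in dist.values())

# Axiom 2: normalization, P(Omega) = 1.
assert prob(omega) == 1

# Axiom 3 (finite case): additivity over disjoint events,
# e.g. P({2,4,6} union {1,3}) = P({2,4,6}) + P({1,3}).
A, B = {2, 4, 6}, {1, 3}
assert prob(A | B) == prob(A) + prob(B)
print("axioms hold for the die; P(A union B) =", prob(A | B))  # 5/6
```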
Why These Axioms Matter for Graphical Models
Bayesian networks are probability distributions that factor according to a directed acyclic graph (DAG). To even talk about conditional distributions P(Xi | Parents(Xi)) or to prove that the product over nodes defines a valid joint distribution, you need a probability measure that behaves. Countable additivity ensures consistency across partitions and refinements; normalization keeps the joint from exploding; non-negativity means no negative beliefs lurking in your CPTs like gremlins.
If your numbers violate the axioms, the joint you build may double-count mass, fail to normalize, or contradict itself across equivalent partitions. The graph is a vibes-based visualization; the axioms are the math-based reality check.
Derived Properties You Will Use Constantly
The axioms are minimal. The following are not extra assumptions; they fall straight out of the axioms like corollaries sliding down a water slide.
- P(empty set) = 0. Proof sketch: Omega and the empty set are disjoint with union Omega, so 1 = P(Omega) = P(Omega) + P(empty set), forcing P(empty set) = 0.
- Complement rule: P(A^c) = 1 - P(A), because A and A^c are disjoint and their union is Omega.
- Monotonicity: If A subset B, then P(A) <= P(B), because B = A union (B \ A) with disjoint parts.
- Inclusion-exclusion (two events): P(A union B) = P(A) + P(B) - P(A intersect B). (Finite additivity is the special case where A and B are disjoint and the overlap term vanishes.)
- Union bound (Boole's inequality): P(union_i Ai) <= sum_i P(Ai). Inclusion-exclusion, but we stop early to get a quick-and-dirty bound.
- Continuity of measure:
  - From below: If A1 subset A2 subset ..., then P(union_i Ai) = lim_n P(An).
  - From above: If A1 superset A2 superset ... and P(A1) is finite (here it is 1), then P(intersection_i Ai) = lim_n P(An).
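None of these need to be taken on faith: on a finite sample space you can check each identity exactly. A minimal sketch, with events chosen arbitrarily:

```python
from fractions import Fraction

# Uniform measure on a die: Omega = {1,...,6}, P({w}) = 1/6.
OMEGA = frozenset(range(1, 7))
P = lambda event: Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})   # even roll
B = frozenset({1, 2, 3})   # low roll

# Complement rule: P(A^c) = 1 - P(A)
assert P(OMEGA - A) == 1 - P(A)

# Monotonicity: {2} subset A implies P({2}) <= P(A)
assert P(frozenset({2})) <= P(A)

# Inclusion-exclusion: P(A union B) = P(A) + P(B) - P(A intersect B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Union bound: P(A union B) <= P(A) + P(B)
assert P(A | B) <= P(A) + P(B)
print("every derived identity checks out; P(A union B) =", P(A | B))  # 5/6
```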
Handy Cheatsheet
| Identity | Use it when | One-liner memory |
|---|---|---|
| P(empty) = 0 | sanity check | there is no probability in the void |
| P(A^c) = 1 - P(A) | complements | the rest of the pie |
| P(A union B) = P(A)+P(B)-P(A intersect B) | overlapping events | do not double-count |
| P(union Ai) <= sum P(Ai) | bounding a mess | pessimism is safe |
| A subset B => P(A) <= P(B) | nesting | bigger set, bigger chance |
Conditional Probability, Total Probability, Bayes: Not Axioms, But Close Friends
We define conditional probability whenever P(B) > 0:
P(A | B) = P(A intersect B) / P(B).
From there:
Law of total probability (for a partition B1, ..., Bk of Omega):
P(A) = sum_j P(A | Bj) P(Bj).
Bayes rule:
P(H | E) = [P(E | H) P(H)] / P(E).
These are definitions plus algebra; the axioms guarantee consistency. In Bayesian networks, CPTs are literally conditional probabilities that glue together into a valid joint via repeated use of the chain rule and normalization.
Bayes is the cool trick; the axioms are the reason the trick is legal.
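As a worked illustration, here is the classic rare-condition test computed straight from these definitions. The prior, sensitivity, and false positive rate below are invented for the example:

```python
# Hypothetical numbers: a condition H with prior 1%, a test E with
# 95% sensitivity P(E|H) and a 5% false positive rate P(E|not H).
p_H = 0.01
p_E_given_H = 0.95
p_E_given_notH = 0.05

# Law of total probability: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)

# Bayes rule: P(H|E) = P(E|H)P(H) / P(E)
p_H_given_E = p_E_given_H * p_H / p_E

print(f"P(E)   = {p_E:.4f}")          # 0.0590
print(f"P(H|E) = {p_H_given_E:.4f}")  # ~0.1610: still probably fine
```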
Independence and Conditional Independence (Graphical Model Catnip)
- Independence: A and B are independent if P(A intersect B) = P(A) P(B).
- Conditional independence: A and B are conditionally independent given C if P(A intersect B | C) = P(A | C) P(B | C) whenever P(C) > 0.
These are not axioms. They are properties that may or may not hold in a given world. In a DAG, d-separation tells you which conditional independences must hold. Your CPTs plus the axioms make all the algebra behave; d-separation tells you where you can factor or simplify without lying.
Common confusion to retire today:
- Mutually exclusive vs independent: If A and B are mutually exclusive and both nontrivial, they are not independent (unless one has probability 0). Exclusivity kills intersections; independence scales them.
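A short sketch makes the distinction concrete on a die (event names invented for illustration): even and {1, 2} are independent, while {1,2,3} and {4,5,6} are mutually exclusive and decidedly not independent:

```python
from fractions import Fraction

# Uniform die; events are sets of outcomes.
OMEGA = frozenset(range(1, 7))
P = lambda ev: Fraction(len(ev & OMEGA), len(OMEGA))

even = frozenset({2, 4, 6})   # P = 1/2
small = frozenset({1, 2})     # P = 1/3
low = frozenset({1, 2, 3})    # P = 1/2
high = frozenset({4, 5, 6})   # P = 1/2

# Independent: P(even intersect small) = 1/6 = P(even) * P(small).
assert P(even & small) == P(even) * P(small)

# Mutually exclusive but NOT independent:
# P(low intersect high) = 0, while P(low) * P(high) = 1/4.
assert P(low & high) == 0
assert P(low & high) != P(low) * P(high)
print("exclusivity kills intersections; independence scales them")
```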
Examples That Will Live Rent-Free In Your Brain
Dice hygiene: Let A be the event of rolling an even number and B the event of rolling a 1, 2, or 3. Then P(A) = 3/6 and P(B) = 3/6, while P(A union B) = 5/6. Inclusion-exclusion gives 3/6 + 3/6 - 1/6 = 5/6. The axioms make that identity compulsory.
Infinite patience: Keep flipping a fair coin until the first head. Define An = the event that a head occurs by flip n. The An increase to the event A = union_n An. By continuity from below:
P(A) = lim P(An) = lim (1 - 2^{-n}) = 1.
Yet the event of never seeing a head has probability 0 without being impossible in an abstract sense; it is just null. Welcome to almost-sure land.
Union bound in the wild: You deploy 5 anomaly detectors, each with false positive rate 0.01, and assume nothing about dependence. The chance that at least one pings is at most 0.05 by the union bound. If the detectors are independent, it is actually 1 - 0.99^5, about 0.049. The bound is conservative but safe.
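The detector arithmetic is easy to reproduce. A minimal Monte Carlo sketch, assuming the detector count and false positive rate above and independence for the simulation:

```python
import random

random.seed(0)
n_detectors, fp_rate, trials = 5, 0.01, 200_000

exact = 1 - (1 - fp_rate) ** n_detectors   # ~0.0490 under independence
bound = n_detectors * fp_rate              # 0.05, no assumptions needed

# Monte Carlo: fraction of trials in which at least one detector pings.
hits = sum(
    any(random.random() < fp_rate for _ in range(n_detectors))
    for _ in range(trials)
)
print(f"simulated: {hits / trials:.4f}  exact: {exact:.4f}  bound: {bound:.4f}")
```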
How Axioms Power Factorization in Bayesian Networks
For variables X1,...,Xn arranged in a DAG, the joint is
P(x1,...,xn) = product_i P(xi | parents(xi)).
Why does this integrate (or sum) to 1? Inductively, because each conditional is normalized for every parent configuration, and countable additivity ensures that when you sum over all values of a leaf variable you get P of the rest. Non-negativity is obvious. The normalization axiom at the root level propagates through marginalization. Without these axioms, the claim that the local CPTs synthesize a coherent global distribution would be hand-waving.
Also, inference relies on inclusion-exclusion style identities: marginalize (sum/integrate) over hidden variables, apply union-like expansions, and keep everything non-negative. Markov blankets work because conditional probabilities are normalized slices.
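To see the factorization argument in action, here is a minimal sketch of a three-node chain A -> B -> C; the structure and CPT numbers are invented for illustration. The product of normalized CPTs sums to 1, and marginalizing out the leaf leaves a valid distribution:

```python
from itertools import product

# Hypothetical CPTs for a chain A -> B -> C, all variables binary.
# Each conditional distribution is normalized for every parent value.
p_A = {0: 0.6, 1: 0.4}
p_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_B_given_A[a][b]
p_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_C_given_B[b][c]

def joint(a, b, c):
    """Factorized joint: P(a, b, c) = P(a) P(b|a) P(c|b)."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# Normalization: the factorized joint sums to 1 over all assignments.
total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12

# Marginalization: summing out C recovers a valid distribution over (A, B).
p_AB = {(a, b): sum(joint(a, b, c) for c in [0, 1])
        for a, b in product([0, 1], repeat=2)}
assert abs(sum(p_AB.values()) - 1.0) < 1e-12
print("joint normalizes; marginal over (A, B):", p_AB)
```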
Why People Keep Misunderstanding This
Thinking probability 1 means guaranteed. In discrete worlds, sure. In continuous worlds, events with probability 1 can still fail in principle; they just fail on a set of measure 0. The axioms are built for both worlds.
Confusing additivity with independence. Additivity only applies to disjoint events. Independence is a multiplicative statement about intersections.
Ignoring countable additivity. Finite additivity looks tempting, until you meet infinite processes, limits, or continuous variables. Then countable additivity is the difference between math and mayhem.
Continuous probability needs countable additivity the way bridges need rivets. You do not notice it when it works; you only notice when it fails.
Quick Proof Sketches You Can Do On a Napkin
P(A union B) formula: A union B decomposes into three disjoint parts: A \ B, B \ A, and A intersect B. Additivity gives P(A union B) as the sum of those three pieces; since P(A) = P(A \ B) + P(A intersect B), and likewise for B, substituting yields the subtract-the-overlap formula.
Union bound: Induct, or use indicators: 1{union Ai} <= sum_i 1{Ai} pointwise. Take expectations on both sides; the expectation of an indicator is the probability of its event.
Monotone continuity: For increasing sets, define disjoint increments; apply countable additivity to the union decomposition.
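The indicator argument can itself be checked numerically. A tiny Monte Carlo sketch of the pointwise inequality, with events chosen arbitrarily:

```python
import random

random.seed(1)
trials = 100_000

# Two overlapping events on a die roll: A = even, B = {1, 2, 3}.
A = {2, 4, 6}
B = {1, 2, 3}

lhs = 0  # running sum of 1{A union B}
rhs = 0  # running sum of 1{A} + 1{B}
for _ in range(trials):
    w = random.randint(1, 6)
    # Pointwise inequality: 1{A union B}(w) <= 1{A}(w) + 1{B}(w).
    lhs += (w in A) or (w in B)
    rhs += (w in A) + (w in B)

# Averaging both sides approximates P(A union B) <= P(A) + P(B).
print(f"P(A union B) ~ {lhs / trials:.3f} <= {rhs / trials:.3f} ~ P(A) + P(B)")
```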
Micro-glossary for Surviving Textbooks
Sigma-algebra: the menu of legal events you can ask P about; closed under complement and countable unions. No sneaking in weird sets mid-proof.
Almost surely: an event holds with probability 1, but not necessarily in every imaginable outcome.
Measure: fancy word for a size assignment that plays nice with limits; probability is a measure with total size 1.
Wrap-up: The Takeaways That Let You Sleep At Night
- The three axioms — non-negativity, normalization, countable additivity — are the entire skeleton of probability.
- Everything else (complements, union formulas, conditional probability, Bayes, independence) descends from them cleanly.
- Graphical models lean on these axioms to ensure local conditionals assemble into a global distribution and to make marginalization and inference legit.
- If your numbers violate the axioms, your model is not contrarian; it is incoherent. A Dutch book awaits.
One last thought: probabilistic reasoning is a long conversation between local pieces and global totals. The axioms make sure all those voices add up to a single story — the one distribution your graph is trying to tell.