Probability Fundamentals
Machine learning is fundamentally about making predictions under uncertainty. We never have perfect information—our training data is finite, our models are approximations, and the world itself is stochastic. Probability theory provides the mathematical language for reasoning about uncertainty, and mastering it is essential for understanding everything from classification to generative models.
What is Probability?
At its simplest, probability quantifies how likely something is to happen. We write $P(A)$ for "the probability of event $A$" and require that:

$$0 \leq P(A) \leq 1$$
A probability of 0 means impossible; 1 means certain. But what does a probability of 0.7 actually mean?
The frequentist view: Probability is the long-run frequency of an event. If we say $P(\text{heads}) = 0.5$, we mean that flipping a fair coin many times yields heads about half the time. This interpretation works well for repeatable experiments.
The Bayesian view: Probability represents a degree of belief. When a weather forecast says "70% chance of rain," it's expressing confidence, not claiming that 70% of identical days have rain. This interpretation allows us to assign probabilities to one-time events ("probability that this patient has cancer") and to update beliefs as we gather evidence.
Both views are mathematically equivalent—they follow the same axioms—but lead to different approaches in machine learning. Bayesian methods treat model parameters as random variables with probability distributions, while frequentist methods treat them as fixed but unknown quantities.
The Axioms of Probability
Modern probability theory rests on three axioms, formalized by Kolmogorov in 1933:
Axiom 1 (Non-negativity): For any event $A$, $P(A) \geq 0$.
Axiom 2 (Normalization): The probability of the entire sample space is 1: $P(\Omega) = 1$.
Axiom 3 (Additivity): For mutually exclusive events $A$ and $B$ (events that cannot both occur), $P(A \cup B) = P(A) + P(B)$. More generally, the axiom requires this for any countable sequence of pairwise disjoint events.
From these three axioms, we can derive all other probability rules. For example, $P(\text{not } A) = 1 - P(A)$ follows from the fact that $A$ and "not $A$" are mutually exclusive and together cover the entire sample space.
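Both derived rules are easy to check by simulation. Below is a minimal sketch (the die-roll events $A$ and $B$ are my own illustration, not from the text); it verifies the complement rule and the addition rule that appears in the summary table at the end of this section:

```python
import numpy as np

# Minimal sketch: empirically check two rules derived from the axioms.
# Illustrative events: A = "roll >= 5", B = "roll is even" on a fair die.
np.random.seed(42)
rolls = np.random.randint(1, 7, size=100000)

A = rolls >= 5
B = rolls % 2 == 0

# Complement rule: P(not A) = 1 - P(A)
print(np.mean(~A), 1 - np.mean(A))                                # both ~ 2/3

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(np.mean(A | B), np.mean(A) + np.mean(B) - np.mean(A & B))   # both ~ 2/3
```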
Sample Spaces and Events
The sample space $\Omega$ is the set of all possible outcomes of an experiment:
- Coin flip: $\Omega = \{\text{heads}, \text{tails}\}$
- Die roll: $\Omega = \{1, 2, 3, 4, 5, 6\}$
- Temperature reading: $\Omega = \mathbb{R}$ (any real number)
An event is a subset of the sample space—a collection of outcomes we're interested in:
- "Roll an even number": <!--MATHBLOCK24-->
- "Temperature above 30°C": <!--MATHBLOCK25-->
For finite sample spaces with equally likely outcomes, probability is simply counting:

$$P(A) = \frac{|A|}{|\Omega|}$$
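For a fair die, this counting can be done directly with Python sets; a small sketch (the event chosen is illustrative):

```python
# Counting sketch: P(A) = |A| / |Omega| for equally likely outcomes.
omega = {1, 2, 3, 4, 5, 6}   # sample space of a fair die
A = {2, 4, 6}                # event: roll an even number

p_A = len(A) / len(omega)
print(p_A)                   # 0.5
```

The simulation below estimates the same kind of probabilities by sampling instead of counting: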
```python
import numpy as np

def probability_basics():
    """Demonstrate basic probability through simulation."""
    np.random.seed(42)

    # Die roll: theoretical vs simulated probabilities
    n_rolls = 100000
    rolls = np.random.randint(1, 7, size=n_rolls)

    # Event A: roll a 6
    p_six_sim = np.mean(rolls == 6)
    p_six_theory = 1/6

    # Event B: roll an even number
    p_even_sim = np.mean(rolls % 2 == 0)
    p_even_theory = 3/6

    print("Die Roll Probabilities")
    print("-" * 45)
    print(f"{'Event':<25} {'Simulated':>10} {'Theory':>10}")
    print("-" * 45)
    print(f"{'P(roll = 6)':<25} {p_six_sim:>10.4f} {p_six_theory:>10.4f}")
    print(f"{'P(roll is even)':<25} {p_even_sim:>10.4f} {p_even_theory:>10.4f}")

    # Verify the law of large numbers: the estimate converges as n grows
    print("\nLaw of Large Numbers:")
    for n in [10, 100, 1000, 10000, 100000]:
        p_estimate = np.mean(rolls[:n] == 6)
        error = abs(p_estimate - p_six_theory)
        print(f"  n = {n:>6}: P(6) = {p_estimate:.4f}, error = {error:.4f}")

probability_basics()
```

Conditional Probability
Often we want to know the probability of an event given that another event has occurred. The conditional probability of $A$ given $B$ is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
This formula captures a simple intuition: to find $P(A|B)$, we restrict our attention to outcomes where $B$ occurred, then ask what fraction of those also have $A$.
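That restriction view translates directly into code. A minimal sketch (the die-roll events are chosen for illustration) computes $P(A|B)$ both from the formula and by literally restricting attention to the outcomes where $B$ occurred:

```python
import numpy as np

np.random.seed(42)
rolls = np.random.randint(1, 7, size=100000)

A = rolls == 6        # event A: roll a 6
B = rolls % 2 == 0    # event B: roll an even number

# Formula: P(A|B) = P(A and B) / P(B)
p_formula = np.mean(A & B) / np.mean(B)

# Restriction: keep only the outcomes where B occurred, then measure A
p_restrict = np.mean(A[B])

print(p_formula, p_restrict)   # both ~ 1/3
```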
Example: A medical test is 95% accurate—it correctly identifies 95% of sick patients and 95% of healthy patients. If 1% of the population has the disease, what's the probability that a positive test means you're actually sick?
Intuition says the answer should be around 95%, but it is actually far lower. We need Bayes' theorem to work it out properly.
Bayes' Theorem
Bayes' theorem relates conditional probabilities in both directions:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

In ML contexts, we often write this as:

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}$$

Or, using standard notation for a hypothesis $H$ and data $D$:

$$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$$

- Posterior $P(H|D)$: Probability of the hypothesis after seeing the data
- Likelihood $P(D|H)$: Probability of the data if the hypothesis is true
- Prior $P(H)$: Probability of the hypothesis before seeing the data
- Evidence $P(D)$: Total probability of observing the data
This is the foundation of Bayesian machine learning: we start with prior beliefs about model parameters, observe data, and update to posterior beliefs.
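As a concrete sketch of that update loop, here is a Beta-Binomial coin example (my own illustration, not the medical test from the text): a Beta prior on the coin's heads probability is conjugate to the Bernoulli likelihood, so observing `heads` successes and `tails` failures simply adds to the prior's parameters.

```python
import numpy as np

# Hypothetical Beta-Binomial sketch of Bayesian updating.
alpha, beta = 1.0, 1.0               # Beta(1, 1): uniform prior on P(heads)

np.random.seed(42)
flips = np.random.random(50) < 0.7   # 50 flips of a coin with true P(heads) = 0.7

heads = int(np.sum(flips))
tails = len(flips) - heads

# Conjugacy: the posterior is Beta(alpha + heads, beta + tails)
post_alpha = alpha + heads
post_beta = beta + tails
posterior_mean = post_alpha / (post_alpha + post_beta)

print(f"Observed {heads} heads in {len(flips)} flips")
print(f"Posterior mean of P(heads) = {posterior_mean:.3f}")
```

The code below applies the same theorem to the medical-test question posed earlier, with just two discrete hypotheses (disease or healthy):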
```python
import numpy as np

def bayes_medical_test():
    """
    Classic medical testing example demonstrating Bayes' theorem.

    A disease affects 1% of the population. A test has 95% sensitivity
    (true positive rate) and 95% specificity (true negative rate).
    If you test positive, what's the probability you have the disease?
    """
    # Given probabilities
    p_disease = 0.01              # Prior: P(disease)
    p_healthy = 1 - p_disease     # P(healthy)
    sensitivity = 0.95            # P(positive | disease)
    specificity = 0.95            # P(negative | healthy)

    # Calculate P(positive) via the law of total probability
    p_positive_given_disease = sensitivity
    p_positive_given_healthy = 1 - specificity   # False positive rate
    p_positive = (p_positive_given_disease * p_disease +
                  p_positive_given_healthy * p_healthy)

    # Apply Bayes' theorem
    p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

    print("Medical Test Example (Bayes' Theorem)")
    print("=" * 55)
    print("\nGiven:")
    print(f"  P(disease) = {p_disease:.2%} (prevalence)")
    print(f"  P(positive | disease) = {sensitivity:.0%} (sensitivity)")
    print(f"  P(negative | healthy) = {specificity:.0%} (specificity)")
    print("\nCalculations:")
    print(f"  P(positive | healthy) = {p_positive_given_healthy:.0%} (false positive rate)")
    print(f"  P(positive) = {p_positive:.4f}")
    print("\nResult:")
    print(f"  P(disease | positive) = {p_disease_given_positive:.2%}")
    print("\nInterpretation:")
    print("  Despite 95% test accuracy, a positive result means only")
    print(f"  ~{p_disease_given_positive:.0%} chance of disease!")
    print("  This is because the disease is rare (low prior).")

bayes_medical_test()
```

Independence
Two events are independent if knowing one tells you nothing about the other:

$$P(A \cap B) = P(A)P(B)$$
Equivalently, $P(A|B) = P(A)$—the probability of $A$ is unchanged by knowing $B$.
Example: Successive coin flips are independent. Knowing the first flip was heads doesn't change the probability that the second is heads.
Example: Drawing cards without replacement is not independent. Drawing an ace first lowers the probability that the second card is an ace from $4/52 \approx 7.7\%$ to $3/51 \approx 5.9\%$.
Independence is crucial in ML:
- We often assume training examples are independent and identically distributed (i.i.d.)
- This assumption simplifies the math and lets us multiply probabilities (see the sketch after this list)
- When the assumption fails (time series, spatial data), we need more sophisticated models
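In practice, multiplying many probabilities underflows floating point, so implementations sum log-probabilities instead. A minimal sketch under the i.i.d. assumption (Bernoulli data chosen for illustration):

```python
import numpy as np

np.random.seed(42)
x = (np.random.random(1000) < 0.7).astype(float)   # i.i.d. Bernoulli(0.7) samples
theta = 0.7                                        # model parameter: P(x_i = 1)

# Under i.i.d., the joint likelihood factorizes into per-example terms...
likelihood = np.prod(np.where(x == 1, theta, 1 - theta))

# ...but the product underflows for large n, so we sum log-probabilities.
log_likelihood = np.sum(np.where(x == 1, np.log(theta), np.log(1 - theta)))

print(likelihood)        # 0.0 -- the product has underflowed
print(log_likelihood)    # a finite negative number
```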
Conditional Independence
Events $A$ and $B$ are conditionally independent given $C$ if:

$$P(A \cap B|C) = P(A|C)P(B|C)$$
Conditional independence neither implies nor is implied by ordinary independence. In particular, $A$ and $B$ might be dependent overall but become independent once we condition on $C$.
Example: Whether two people carry umbrellas ($A$ and $B$) is not independent—both are more likely on rainy days. But conditioned on the weather ($C$), they become independent: knowing you have an umbrella tells me nothing about your colleague's umbrella if I already know it's raining.
Conditional independence is the key assumption behind Naive Bayes classifiers and many graphical models.
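To make that concrete, here is a toy Naive Bayes calculation (the spam-filter features and all numbers are invented for illustration): with the features assumed conditionally independent given the class, the joint likelihood factorizes into per-feature terms.

```python
# Toy Naive Bayes sketch with invented numbers.
p_spam = 0.3   # class prior P(spam)

# Per-feature likelihoods, assumed conditionally independent given the class
p_word1_given_spam, p_word1_given_ham = 0.8, 0.1
p_word2_given_spam, p_word2_given_ham = 0.6, 0.2

# Observed: both words present. Conditional independence lets us multiply.
score_spam = p_spam * p_word1_given_spam * p_word2_given_spam
score_ham = (1 - p_spam) * p_word1_given_ham * p_word2_given_ham

# Normalize via the law of total probability
p_spam_given_words = score_spam / (score_spam + score_ham)
print(f"P(spam | word1, word2) = {p_spam_given_words:.3f}")
```

The simulation below then checks both kinds of independence empirically in the umbrella example: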
```python
import numpy as np

def independence_example():
    """Demonstrate independence and conditional independence."""
    np.random.seed(42)
    n_samples = 100000

    # Generate weather (30% chance of rain)
    rain = np.random.random(n_samples) < 0.3

    # Person A takes umbrella: 80% if rain, 10% if no rain
    umbrella_a = np.where(rain,
                          np.random.random(n_samples) < 0.8,
                          np.random.random(n_samples) < 0.1)

    # Person B takes umbrella: same logic, independent given the weather
    umbrella_b = np.where(rain,
                          np.random.random(n_samples) < 0.8,
                          np.random.random(n_samples) < 0.1)

    # Marginal probabilities
    p_a = np.mean(umbrella_a)
    p_b = np.mean(umbrella_b)
    p_ab = np.mean(umbrella_a & umbrella_b)

    print("Independence vs Conditional Independence")
    print("=" * 55)
    print("\nMarginal probabilities:")
    print(f"  P(A has umbrella) = {p_a:.3f}")
    print(f"  P(B has umbrella) = {p_b:.3f}")
    print("\nTest for independence:")
    print(f"  P(A ∩ B) = {p_ab:.3f}")
    print(f"  P(A) × P(B) = {p_a * p_b:.3f}")
    print(f"  {'Independent' if abs(p_ab - p_a * p_b) < 0.01 else 'NOT independent'}")

    # Conditional independence given rain: restrict to rainy samples
    p_a_given_rain = np.mean(umbrella_a[rain])
    p_b_given_rain = np.mean(umbrella_b[rain])
    p_ab_given_rain = np.mean(umbrella_a[rain] & umbrella_b[rain])

    print("\nConditional on rain:")
    print(f"  P(A|rain) = {p_a_given_rain:.3f}")
    print(f"  P(B|rain) = {p_b_given_rain:.3f}")
    print(f"  P(A ∩ B|rain) = {p_ab_given_rain:.3f}")
    print(f"  P(A|rain) × P(B|rain) = {p_a_given_rain * p_b_given_rain:.3f}")

    # Check conditional independence numerically rather than asserting it
    cond_indep = abs(p_ab_given_rain - p_a_given_rain * p_b_given_rain) < 0.01
    print(f"  Conditionally independent given rain: {'Yes' if cond_indep else 'No'}")

independence_example()
```

The Law of Total Probability
If events $B_1, B_2, \ldots, B_n$ partition the sample space (they're mutually exclusive and exhaustive), then:

$$P(A) = \sum_{i=1}^{n} P(A|B_i)P(B_i)$$
This "marginalizes out" the $B_i$'s, computing the overall probability of $A$ by considering all ways it could happen.
In ML: When computing $P(\text{data})$ in Bayes' theorem, we sum over all possible hypotheses:

$$P(\text{data}) = \sum_i P(\text{data}|H_i)P(H_i)$$
This appears constantly in probabilistic models as the "marginal likelihood" or "evidence."
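This is exactly how $P(\text{positive})$ was computed in the medical-test code above. Here is the same pattern in isolation (the three hypotheses and their numbers are invented for illustration):

```python
import numpy as np

# Three mutually exclusive, exhaustive hypotheses with prior probabilities...
priors = np.array([0.5, 0.3, 0.2])        # P(H_i); must sum to 1
# ...and the probability of the observed data under each hypothesis.
likelihoods = np.array([0.9, 0.4, 0.1])   # P(data | H_i)

# Law of total probability: P(data) = sum_i P(data | H_i) P(H_i)
evidence = np.sum(likelihoods * priors)
print(f"P(data) = {evidence:.3f}")        # 0.45 + 0.12 + 0.02 = 0.59

# The posteriors then follow from Bayes' theorem
posteriors = likelihoods * priors / evidence
print(posteriors)                         # sums to 1
```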
Key Probability Rules Summary
| Rule | Formula | When to Use |
|------|---------|-------------|
| Complement | $P(\bar{A}) = 1 - P(A)$ | Easier to compute "not $A$" |
| Addition | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Either $A$ or $B$ |
| Multiplication | $P(A \cap B) = P(A) \cdot P(B|A)$ | Both $A$ and $B$ |
| Bayes | $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ | Reverse a conditional |
| Total Probability | $P(A) = \sum_i P(A|B_i)P(B_i)$ | Marginalize over a partition |
Key Takeaways
- Probability quantifies uncertainty, interpreted as either long-run frequency or degree of belief
- Conditional probability $P(A|B)$ restricts the sample space to where $B$ occurred
- Bayes' theorem relates $P(A|B)$ to $P(B|A)$—essential for updating beliefs with evidence
- Independence means $P(A \cap B) = P(A)P(B)$; conditional independence is independence given additional information
- The i.i.d. assumption (independent and identically distributed) underlies most ML algorithms
- These rules form the foundation for understanding random variables, distributions, and statistical inference