
Chapter 3: Probability and Statistics

Probability distributions, estimation, and statistical inference for machine learning.

Libraries covered: NumPy, Pandas, SciPy

Learning Objectives

["Understand probability fundamentals", "Apply Bayes theorem", "Perform maximum likelihood estimation"]


3.1 Probability Fundamentals

Machine learning is fundamentally about making predictions under uncertainty. We never have perfect information—our training data is finite, our models are approximations, and the world itself is stochastic. Probability theory provides the mathematical language for reasoning about uncertainty, and mastering it is essential for understanding everything from classification to generative models.

What is Probability?

At its simplest, probability quantifies how likely something is to happen. We write $P(A)$ for "the probability of event $A$" and require that:

$$0 \leq P(A) \leq 1$$

A probability of 0 means impossible; 1 means certain. But what does a probability of 0.7 actually mean?

The frequentist view: Probability is the long-run frequency of an event. If we say $P(\text{heads}) = 0.5$, we mean that flipping a fair coin many times yields heads about half the time. This interpretation works well for repeatable experiments.

The Bayesian view: Probability represents a degree of belief. When a weather forecast says "70% chance of rain," it's expressing confidence, not claiming that 70% of identical days have rain. This interpretation allows us to assign probabilities to one-time events ("probability that this patient has cancer") and to update beliefs as we gather evidence.

Both views are mathematically equivalent—they follow the same axioms—but lead to different approaches in machine learning. Bayesian methods treat model parameters as random variables with probability distributions, while frequentist methods treat them as fixed but unknown quantities.

The Axioms of Probability

Modern probability theory rests on three axioms, formalized by Kolmogorov in 1933:

Axiom 1 (Non-negativity): For any event $A$, $P(A) \geq 0$.

Axiom 2 (Normalization): The probability of the entire sample space is 1: $P(\Omega) = 1$.

Axiom 3 (Additivity): For mutually exclusive events $A$ and $B$ (events that cannot both occur), $P(A \cup B) = P(A) + P(B)$. This extends to any countable collection of mutually exclusive events.

From these three axioms, we can derive all other probability rules. For example, $P(\text{not } A) = 1 - P(A)$ follows from the fact that $A$ and "not $A$" are mutually exclusive and together cover the entire sample space.

Sample Spaces and Events

The sample space $\Omega$ is the set of all possible outcomes of an experiment:

  • Coin flip: $\Omega = \{\text{heads}, \text{tails}\}$
  • Die roll: $\Omega = \{1, 2, 3, 4, 5, 6\}$
  • Temperature reading: $\Omega = \mathbb{R}$ (any real number)

An event is a subset of the sample space—a collection of outcomes we're interested in:

  • "Roll an even number": <!--MATHBLOCK24-->
  • "Temperature above 30°C": <!--MATHBLOCK25-->

For finite sample spaces with equally likely outcomes, probability is simply counting:

$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{favorable outcomes}}{\text{total outcomes}}$$

PYTHON
import numpy as np

def probability_basics():
    """Demonstrate basic probability through simulation."""
    np.random.seed(42)

    # Die roll: theoretical vs simulated probabilities
    n_rolls = 100000
    rolls = np.random.randint(1, 7, size=n_rolls)

    # Event A: roll a 6
    p_six_sim = np.mean(rolls == 6)
    p_six_theory = 1/6

    # Event B: roll an even number
    p_even_sim = np.mean(rolls % 2 == 0)
    p_even_theory = 3/6

    print("Die Roll Probabilities")
    print("-" * 45)
    print(f"{'Event':<25} {'Simulated':>10} {'Theory':>10}")
    print("-" * 45)
    print(f"{'P(roll = 6)':<25} {p_six_sim:>10.4f} {p_six_theory:>10.4f}")
    print(f"{'P(roll is even)':<25} {p_even_sim:>10.4f} {p_even_theory:>10.4f}")

    # Verify law of large numbers
    print("\nLaw of Large Numbers:")
    for n in [10, 100, 1000, 10000, 100000]:
        p_estimate = np.mean(rolls[:n] == 6)
        error = abs(p_estimate - p_six_theory)
        print(f"  n = {n:>6}: P(6) = {p_estimate:.4f}, error = {error:.4f}")

probability_basics()

Conditional Probability

Often we want to know the probability of an event given that another event has occurred. The conditional probability of $A$ given $B$ is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

This formula captures a simple intuition: to find $P(A|B)$, we restrict our attention to outcomes where $B$ occurred, then ask what fraction of those also have $A$.
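
The definition is easy to check by simulation: compute the ratio directly, or restrict attention to the outcomes where $B$ occurred and take the fraction that also have $A$. A minimal sketch with a die (the choice of events is ours):

PYTHON
import numpy as np

# Estimate P(roll = 6 | roll is even) two ways.
# Theory: P(6 | even) = P(6 and even) / P(even) = (1/6) / (1/2) = 1/3.
np.random.seed(0)
rolls = np.random.randint(1, 7, size=100_000)

even = rolls % 2 == 0
p_ratio = np.mean((rolls == 6) & even) / np.mean(even)   # definition as a ratio
p_restrict = np.mean(rolls[even] == 6)                   # restrict, then count

print(f"P(6 | even) via ratio formula: {p_ratio:.4f}")
print(f"P(6 | even) via restriction:   {p_restrict:.4f}")
print("Theory: 1/3 ≈ 0.3333")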

Example: A medical test is 95% accurate—it correctly identifies 95% of sick patients and 95% of healthy patients. If 1% of the population has the disease, what's the probability that a positive test means you're actually sick?

This seems like it should be around 95%, but the answer is surprisingly low. We need Bayes' theorem to work it out properly.

Bayes' Theorem

Bayes' theorem relates conditional probabilities in both directions:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

In ML contexts, we often write this as:

$$P(\text{hypothesis}|\text{data}) = \frac{P(\text{data}|\text{hypothesis}) \cdot P(\text{hypothesis})}{P(\text{data})}$$

Or, writing $H$ for the hypothesis and $D$ for the data:

  • Posterior $P(H|D)$: Probability of hypothesis after seeing data
  • Likelihood $P(D|H)$: Probability of data if hypothesis is true
  • Prior $P(H)$: Probability of hypothesis before seeing data
  • Evidence $P(D)$: Total probability of observing the data

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

This is the foundation of Bayesian machine learning: we start with prior beliefs about model parameters, observe data, and update to posterior beliefs.

PYTHON
import numpy as np

def bayes_medical_test():
    """
    Classic medical testing example demonstrating Bayes' theorem.

    A disease affects 1% of population.
    A test has 95% sensitivity (true positive rate) and 95% specificity (true negative rate).
    If you test positive, what's the probability you have the disease?
    """

    # Given probabilities
    p_disease = 0.01           # Prior: P(disease)
    p_healthy = 1 - p_disease  # P(healthy)
    sensitivity = 0.95         # P(positive | disease)
    specificity = 0.95         # P(negative | healthy)

    # Calculate P(positive)
    p_positive_given_disease = sensitivity
    p_positive_given_healthy = 1 - specificity  # False positive rate

    p_positive = (p_positive_given_disease * p_disease +
                  p_positive_given_healthy * p_healthy)

    # Apply Bayes' theorem
    p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

    print("Medical Test Example (Bayes' Theorem)")
    print("=" * 55)
    print(f"\nGiven:")
    print(f"  P(disease) = {p_disease:.2%} (prevalence)")
    print(f"  P(positive | disease) = {sensitivity:.0%} (sensitivity)")
    print(f"  P(negative | healthy) = {specificity:.0%} (specificity)")

    print(f"\nCalculations:")
    print(f"  P(positive | healthy) = {p_positive_given_healthy:.0%} (false positive rate)")
    print(f"  P(positive) = {p_positive:.4f}")

    print(f"\nResult:")
    print(f"  P(disease | positive) = {p_disease_given_positive:.2%}")

    print(f"\nInterpretation:")
    print(f"  Despite 95% test accuracy, a positive result means only")
    print(f"  ~{p_disease_given_positive:.0%} chance of disease!")
    print(f"  This is because the disease is rare (low prior).")

bayes_medical_test()

Independence

Two events are independent if knowing one tells you nothing about the other:

$$P(A \cap B) = P(A) \cdot P(B)$$

Equivalently, $P(A|B) = P(A)$—the probability of $A$ is unchanged by knowing $B$.

Example: Successive coin flips are independent. Knowing the first flip was heads doesn't change the probability that the second is heads.

Example: Drawing cards without replacement is not independent. Drawing an ace first changes the probability that the second card is an ace.

Independence is crucial in ML:

  • We often assume training examples are independent and identically distributed (i.i.d.)
  • This assumption simplifies the math and lets us multiply probabilities
  • When the assumption fails (time series, spatial data), we need more sophisticated models

Conditional Independence

Events $A$ and $B$ are conditionally independent given $C$ if:

$$P(A \cap B | C) = P(A|C) \cdot P(B|C)$$

This is weaker than independence—$A$ and $B$ might be dependent overall but become independent once we condition on $C$.

Example: Whether two people carry umbrellas ($A$ and $B$) is not independent—both are more likely on rainy days. But conditioned on the weather ($C$), they become independent: knowing you have an umbrella tells me nothing about your colleague's umbrella if I already know it's raining.

Conditional independence is the key assumption behind Naive Bayes classifiers and many graphical models.

PYTHON
import numpy as np

def independence_example():
    """Demonstrate independence and conditional independence."""
    np.random.seed(42)
    n_samples = 100000

    # Generate weather (30% chance of rain)
    rain = np.random.random(n_samples) < 0.3

    # Person A takes umbrella: 80% if rain, 10% if no rain
    umbrella_a = np.where(rain,
                          np.random.random(n_samples) < 0.8,
                          np.random.random(n_samples) < 0.1)

    # Person B takes umbrella: same logic, independent given weather
    umbrella_b = np.where(rain,
                          np.random.random(n_samples) < 0.8,
                          np.random.random(n_samples) < 0.1)

    # Marginal probabilities
    p_a = np.mean(umbrella_a)
    p_b = np.mean(umbrella_b)
    p_ab = np.mean(umbrella_a & umbrella_b)

    print("Independence vs Conditional Independence")
    print("=" * 55)

    print(f"\nMarginal probabilities:")
    print(f"  P(A has umbrella) = {p_a:.3f}")
    print(f"  P(B has umbrella) = {p_b:.3f}")

    print(f"\nTest for independence:")
    print(f"  P(A ∩ B) = {p_ab:.3f}")
    print(f"  P(A) × P(B) = {p_a * p_b:.3f}")
    print(f"  {'Independent' if abs(p_ab - p_a * p_b) < 0.01 else 'NOT independent'}")

    # Conditional independence given rain
    rain_mask = rain
    p_a_given_rain = np.mean(umbrella_a[rain_mask])
    p_b_given_rain = np.mean(umbrella_b[rain_mask])
    p_ab_given_rain = np.mean(umbrella_a[rain_mask] & umbrella_b[rain_mask])

    print(f"\nConditional on rain:")
    print(f"  P(A|rain) = {p_a_given_rain:.3f}")
    print(f"  P(B|rain) = {p_b_given_rain:.3f}")
    print(f"  P(A ∩ B|rain) = {p_ab_given_rain:.3f}")
    print(f"  P(A|rain) × P(B|rain) = {p_a_given_rain * p_b_given_rain:.3f}")
    print(f"  Conditionally independent given rain: Yes")

independence_example()

The Law of Total Probability

If events $B_1, B_2, \ldots, B_n$ partition the sample space (they're mutually exclusive and exhaustive), then:

$$P(A) = \sum_{i=1}^n P(A|B_i) \cdot P(B_i)$$

This "marginalizes out" the $B_i$'s, computing the overall probability of $A$ by considering all ways it could happen.

In ML: When computing $P(\text{data})$ in Bayes' theorem, we sum over all possible hypotheses:

$$P(\text{data}) = \sum_h P(\text{data}|h) \cdot P(h)$$

This appears constantly in probabilistic models as the "marginal likelihood" or "evidence."
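
As a concrete sketch, the evidence term from the medical test example above is exactly a total-probability sum over the {disease, healthy} partition:

PYTHON
# Law of total probability with the partition {disease, healthy}:
# P(positive) = P(positive|disease)·P(disease) + P(positive|healthy)·P(healthy)
p_disease, p_healthy = 0.01, 0.99
p_pos_given_disease = 0.95   # sensitivity
p_pos_given_healthy = 0.05   # 1 - specificity (false positive rate)

p_positive = (p_pos_given_disease * p_disease +
              p_pos_given_healthy * p_healthy)
print(f"P(positive) = {p_positive:.4f}")  # 0.0590, the denominator in Bayes' theorem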

Key Probability Rules Summary

| Rule | Formula | When to Use |
|------|---------|-------------|
| Complement | $P(\bar{A}) = 1 - P(A)$ | Easier to compute "not A" |
| Addition | $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ | Either A or B |
| Multiplication | $P(A \cap B) = P(A) \cdot P(B|A)$ | Both A and B |
| Bayes | $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$ | Reverse a conditional |
| Total Probability | $P(A) = \sum_i P(A|B_i)P(B_i)$ | Marginalize over a partition |

Key Takeaways

  • Probability quantifies uncertainty, interpreted as either long-run frequency or degree of belief
  • Conditional probability $P(A|B)$ restricts the sample space to where $B$ occurred
  • Bayes' theorem relates $P(A|B)$ to $P(B|A)$—essential for updating beliefs with evidence
  • Independence means $P(A \cap B) = P(A) \cdot P(B)$; conditional independence is independence given additional information
  • The i.i.d. assumption (independent and identically distributed) underlies most ML algorithms
  • These rules form the foundation for understanding random variables, distributions, and statistical inference

3.2 Random Variables and Distributions

In probability, we often care about numerical outcomes rather than abstract events. A random variable assigns numbers to outcomes, letting us apply mathematical tools to uncertain quantities. Probability distributions describe how likely each value is, and certain distributions appear so frequently in machine learning that recognizing them becomes second nature.

What is a Random Variable?

A random variable $X$ is a function that assigns a number to each outcome in a sample space. Instead of talking about the event "the die shows 6," we define $X$ = "the number shown" and ask about $P(X = 6)$.

Random variables come in two flavors:

Discrete random variables take on countable values (often integers):

  • Number of heads in 10 coin flips: $X \in \{0, 1, \ldots, 10\}$
  • Number of customers arriving per hour: $X \in \{0, 1, 2, \ldots\}$

Continuous random variables take on uncountably many values (typically real numbers):

  • Height of a randomly selected person: $X \in (0, \infty)$
  • Time until a server request completes: $X \geq 0$

The distinction matters because we describe their probabilities differently: discrete variables have probability mass functions; continuous variables have probability density functions.

Probability Mass Functions (PMFs)

For a discrete random variable $X$, the probability mass function $p(x) = P(X = x)$ gives the probability of each possible value.

Requirements:

  • $p(x) \geq 0$ for all $x$
  • $\sum_x p(x) = 1$ (probabilities sum to 1)

Example: For a fair die, $p(x) = 1/6$ for $x \in \{1, 2, 3, 4, 5, 6\}$.

PYTHON
import numpy as np

def pmf_example():
    """Demonstrate probability mass functions."""

    # Binomial distribution: number of heads in n coin flips
    n = 10  # number of flips
    p = 0.5  # probability of heads

    # Calculate PMF
    from math import comb
    x_values = np.arange(0, n + 1)
    pmf = np.array([comb(n, k) * p**k * (1-p)**(n-k) for k in x_values])

    print("Binomial PMF: Number of heads in 10 fair coin flips")
    print("-" * 50)
    print(f"{'k (heads)':<12} {'P(X = k)':<12} {'Cumulative':<12}")
    print("-" * 50)

    cumulative = 0
    for k, prob in zip(x_values, pmf):
        cumulative += prob
        if prob > 0.01:  # Only show non-negligible probabilities
            print(f"{k:<12} {prob:<12.4f} {cumulative:<12.4f}")

    print(f"\nSum of all probabilities: {pmf.sum():.6f} (should be 1.0)")
    print(f"Most likely outcome: {x_values[np.argmax(pmf)]} heads")

pmf_example()

Probability Density Functions (PDFs)

For continuous random variables, we can't assign probability to individual points (there are uncountably many, so each would have probability 0). Instead, we use a probability density function $f(x)$ where:

$$P(a \leq X \leq b) = \int_a^b f(x) \, dx$$

The density $f(x)$ isn't a probability itself—it can even exceed 1. But the area under the curve between any two points gives the probability of landing in that interval.

Requirements:

  • $f(x) \geq 0$ for all $x$
  • $\int_{-\infty}^{\infty} f(x) \, dx = 1$

Key intuition: $f(x) \cdot dx$ is approximately the probability of landing in a tiny interval around $x$.
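
This intuition can be checked numerically: summing $f(x) \cdot \Delta x$ over a fine grid approximates the probability of an interval. A minimal sketch with the standard normal density (written out explicitly rather than taken from a library):

PYTHON
import numpy as np

# Approximate P(-1 <= X <= 1) for X ~ N(0, 1) by summing f(x)·Δx over a grid.
def std_normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

x = np.linspace(-1, 1, 10_001)
dx = x[1] - x[0]
prob = np.sum(std_normal_pdf(x) * dx)

print(f"Sum of f(x)·Δx on [-1, 1]: {prob:.4f}")  # ≈ 0.683, the 68% of the 68-95-99.7 rule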

The Gaussian (Normal) Distribution

The most important distribution in statistics is the Gaussian or normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

We write $X \sim \mathcal{N}(\mu, \sigma^2)$ where:

  • $\mu$ is the mean (center of the bell curve)
  • $\sigma^2$ is the variance (spread)
  • $\sigma$ is the standard deviation

Why is the Gaussian everywhere?

The Central Limit Theorem explains it: the sum of many independent random variables tends toward a Gaussian, regardless of the original distributions. Since many natural quantities result from aggregating many small effects, they end up approximately Gaussian.

In ML, we often assume:

  • Noise is Gaussian (measurement errors, model residuals)
  • Priors are Gaussian (weight initialization, regularization)
  • Posteriors are approximately Gaussian (Laplace approximation)

PYTHON
import numpy as np

def gaussian_distribution():
    """Explore the Gaussian distribution."""

    def gaussian_pdf(x, mu, sigma):
        return (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-0.5 * ((x - mu) / sigma)**2)

    # Standard normal: mean=0, std=1
    x = np.linspace(-4, 4, 1000)
    pdf_standard = gaussian_pdf(x, mu=0, sigma=1)

    print("Gaussian (Normal) Distribution")
    print("=" * 55)

    # Key properties
    print("\nKey properties of N(μ, σ²):")
    print("  • Mean = μ (center)")
    print("  • Variance = σ² (spread)")
    print("  • Standard deviation = σ")
    print("  • Symmetric around μ")
    print("  • 68-95-99.7 rule:")

    # Demonstrate the 68-95-99.7 rule via simulation
    np.random.seed(42)
    samples = np.random.normal(0, 1, 100000)

    within_1_std = np.mean(np.abs(samples) < 1)
    within_2_std = np.mean(np.abs(samples) < 2)
    within_3_std = np.mean(np.abs(samples) < 3)

    print(f"    - Within 1σ: {within_1_std:.1%} (theory: 68.3%)")
    print(f"    - Within 2σ: {within_2_std:.1%} (theory: 95.4%)")
    print(f"    - Within 3σ: {within_3_std:.1%} (theory: 99.7%)")

    # Different Gaussians
    print("\nComparing different Gaussians:")
    print(f"  N(0, 1):  mean=0, std=1 (standard normal)")
    print(f"  N(2, 1):  mean=2, std=1 (shifted right)")
    print(f"  N(0, 4):  mean=0, std=2 (wider spread)")

gaussian_distribution()

Common Discrete Distributions

Bernoulli Distribution: Single binary outcome (coin flip)

$$P(X = 1) = p, \quad P(X = 0) = 1 - p$$

Used for: Binary classification outcomes, click/no-click events

Binomial Distribution: Number of successes in $n$ independent Bernoulli trials

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Used for: Counting successes in fixed trials, modeling proportions

Categorical Distribution: Generalization of Bernoulli to $K$ categories

$$P(X = k) = p_k, \quad \sum_{k=1}^K p_k = 1$$

Used for: Multi-class classification outputs (softmax)

Poisson Distribution: Count of events in a fixed interval

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Used for: Modeling rare events, count data (website visits, word frequencies)

PYTHON
import numpy as np
from math import comb, factorial, exp

def discrete_distributions():
    """Compare common discrete distributions."""

    print("Common Discrete Distributions")
    print("=" * 60)

    # Bernoulli
    p = 0.3
    print(f"\n1. Bernoulli(p={p}):")
    print(f"   P(X=0) = {1-p:.2f}, P(X=1) = {p:.2f}")
    print(f"   Mean = p = {p:.2f}")
    print(f"   Use case: Binary outcomes (spam/not spam)")

    # Binomial
    n, p = 10, 0.3
    print(f"\n2. Binomial(n={n}, p={p}):")
    mean_binom = n * p
    var_binom = n * p * (1 - p)
    print(f"   Mean = np = {mean_binom:.2f}")
    print(f"   Variance = np(1-p) = {var_binom:.2f}")
    print(f"   Use case: Number of successes in n trials")

    # Simulate and show distribution
    samples = np.random.binomial(n, p, 10000)
    print(f"   Sample distribution:")
    for k in range(n + 1):
        count = np.sum(samples == k)
        if count > 100:
            print(f"     P(X={k}) ≈ {count/10000:.3f}")

    # Poisson
    lam = 5
    print(f"\n3. Poisson(λ={lam}):")
    print(f"   Mean = λ = {lam}")
    print(f"   Variance = λ = {lam}")
    print(f"   Use case: Rare events (arrivals, counts)")

    samples = np.random.poisson(lam, 10000)
    print(f"   Sample distribution:")
    for k in range(12):
        count = np.sum(samples == k)
        if count > 100:
            print(f"     P(X={k}) ≈ {count/10000:.3f}")

discrete_distributions()

Common Continuous Distributions

Uniform Distribution: Equal probability across an interval

$$f(x) = \frac{1}{b-a} \quad \text{for } x \in [a, b]$$

Used for: Random initialization, random search

Exponential Distribution: Time until first event (memoryless)

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$$

Used for: Modeling waiting times, decay processes

Beta Distribution: Probability of probabilities (bounded [0, 1])

$$f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$$

Used for: Prior distributions for probabilities, Bayesian A/B testing

Multivariate Gaussian: Gaussian generalized to vectors

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k|\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

Used for: Modeling correlated features, latent spaces, Gaussian processes

PYTHON
import numpy as np

def continuous_distributions():
    """Explore continuous distributions."""

    np.random.seed(42)

    print("Common Continuous Distributions")
    print("=" * 60)

    # Uniform
    print("\n1. Uniform(a=0, b=1):")
    samples = np.random.uniform(0, 1, 10000)
    print(f"   Mean (theoretical): 0.5")
    print(f"   Mean (sample): {samples.mean():.4f}")
    print(f"   Variance (theoretical): 1/12 = {1/12:.4f}")
    print(f"   Variance (sample): {samples.var():.4f}")

    # Exponential
    lam = 2
    print(f"\n2. Exponential(λ={lam}):")
    samples = np.random.exponential(1/lam, 10000)
    print(f"   Mean (theoretical): 1/λ = {1/lam:.4f}")
    print(f"   Mean (sample): {samples.mean():.4f}")
    print(f"   Memoryless property: P(X > s+t | X > s) = P(X > t)")

    # Beta
    alpha, beta = 2, 5
    print(f"\n3. Beta(α={alpha}, β={beta}):")
    samples = np.random.beta(alpha, beta, 10000)
    mean_theory = alpha / (alpha + beta)
    print(f"   Support: [0, 1] (great for probabilities)")
    print(f"   Mean (theoretical): α/(α+β) = {mean_theory:.4f}")
    print(f"   Mean (sample): {samples.mean():.4f}")
    print(f"   Mode (most likely): {(alpha-1)/(alpha+beta-2):.4f}")

    # Multivariate Gaussian
    print(f"\n4. Multivariate Gaussian:")
    mean = np.array([0, 0])
    cov = np.array([[1, 0.8], [0.8, 1]])  # Correlated
    samples = np.random.multivariate_normal(mean, cov, 1000)
    print(f"   Mean vector: {mean}")
    print(f"   Covariance matrix:\n   {cov}")
    print(f"   Correlation: {cov[0,1]:.2f} (strong positive)")
    sample_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
    print(f"   Sample correlation: {sample_corr:.4f}")

continuous_distributions()

The Cumulative Distribution Function (CDF)

The CDF $F(x) = P(X \leq x)$ gives the probability that $X$ is at most $x$. It's defined for both discrete and continuous variables and has useful properties:

  • $F(x)$ is non-decreasing
  • $\lim_{x \to -\infty} F(x) = 0$
  • $\lim_{x \to \infty} F(x) = 1$
  • For continuous variables: $f(x) = \frac{dF(x)}{dx}$

The CDF is useful for computing interval probabilities:

$$P(a < X \leq b) = F(b) - F(a)$$
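
In practice, library CDFs turn interval probabilities into one-liners. A small sketch using SciPy (one of the libraries covered in this chapter):

PYTHON
from scipy import stats

# P(a < X <= b) = F(b) - F(a) for X ~ N(0, 1)
a, b = -1.0, 2.0
prob = stats.norm.cdf(b) - stats.norm.cdf(a)
print(f"P({a} < X <= {b}) = {prob:.4f}")   # ≈ 0.8186

# The same idea works for discrete variables, e.g. Binomial(n=10, p=0.5):
print(f"P(X <= 4) = {stats.binom.cdf(4, 10, 0.5):.4f}")  # ≈ 0.3770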

Transformations of Random Variables

If $X$ has distribution $f_X(x)$ and $Y = g(X)$, what's the distribution of $Y$?

For monotonic transformations, the change of variables formula applies:

$$f_Y(y) = f_X(g^{-1}(y)) \left|\frac{dg^{-1}}{dy}\right|$$
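
As a numerical check of the formula, consider $Y = X^2$ with $X \sim \text{Uniform}(0, 1)$ (our choice of example): here $g^{-1}(y) = \sqrt{y}$, so the formula gives $f_Y(y) = \frac{1}{2\sqrt{y}}$ on $(0, 1)$.

PYTHON
import numpy as np

# Change of variables check: X ~ Uniform(0, 1), Y = X² (monotonic on [0, 1]).
# g⁻¹(y) = √y and |dg⁻¹/dy| = 1/(2√y), so f_Y(y) = 1/(2√y).
np.random.seed(42)
y = np.random.uniform(0, 1, 1_000_000) ** 2

width = 0.01  # estimate density as (fraction in a narrow bin) / (bin width)
for y0 in [0.1, 0.25, 0.5, 0.9]:
    empirical = np.mean(np.abs(y - y0) < width / 2) / width
    theory = 1 / (2 * np.sqrt(y0))
    print(f"f_Y({y0}) ≈ {empirical:.3f}   (formula: {theory:.3f})")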

Common in ML:

  • Log transformation: $Y = \log X$ normalizes heavy-tailed data
  • Softmax: Transforms real numbers to the probability simplex
  • Batch normalization: Transforms activations to standard normal-like distributions

Sampling and Monte Carlo

We can compute probabilities and expectations by sampling—generating random draws from a distribution:

$$E[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(x_i) \quad \text{where } x_i \sim p(x)$$

This Monte Carlo approach works when analytical solutions are intractable. Much of modern ML (variational inference, reinforcement learning) relies heavily on sampling.

PYTHON
import numpy as np

def monte_carlo_example():
    """Demonstrate Monte Carlo estimation."""

    np.random.seed(42)

    # Estimate E[X²] where X ~ N(0, 1)
    # Theoretical: E[X²] = Var(X) + E[X]² = 1 + 0 = 1

    print("Monte Carlo Estimation")
    print("=" * 55)
    print("Estimating E[X²] where X ~ N(0, 1)")
    print("Theoretical value: 1.0")
    print()

    for n in [100, 1000, 10000, 100000]:
        samples = np.random.normal(0, 1, n)
        estimate = np.mean(samples ** 2)
        std_error = np.std(samples ** 2) / np.sqrt(n)
        print(f"n = {n:>6}: E[X²] ≈ {estimate:.4f} ± {std_error:.4f}")

    print("\nNote: Error decreases as 1/√n (law of large numbers)")

monte_carlo_example()

Key Takeaways

  • Random variables map outcomes to numbers; discrete (PMF) vs continuous (PDF)
  • PMFs give point probabilities; PDFs give density (integrate for probabilities)
  • The Gaussian is fundamental due to the Central Limit Theorem
  • Discrete distributions: Bernoulli (binary), Binomial (counts), Categorical (multi-class), Poisson (rare events)
  • Continuous distributions: Uniform, Exponential, Beta, Multivariate Gaussian
  • CDFs give cumulative probabilities and work for both discrete and continuous
  • Monte Carlo estimation approximates expectations through sampling

Understanding these distributions helps you choose appropriate models, interpret outputs, and debug issues when assumptions are violated.

3.3 Expectation and Variance

Raw probability distributions contain complete information about a random variable, but they're often too detailed to work with directly. Summary statistics—expectation, variance, and their generalizations—distill distributions into interpretable numbers. These concepts underpin loss functions, optimization objectives, and uncertainty quantification throughout machine learning.

Expected Value: The Long-Run Average

The expected value (or mean) of a random variable is its long-run average value. For a discrete variable:

$$E[X] = \sum_x x \cdot P(X = x)$$

For a continuous variable:

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

The notation $\mu = E[X]$ is common. We also write $\mathbb{E}[X]$ in some contexts.

Intuition: If you repeated the random experiment infinitely many times and averaged the results, you'd get $E[X]$. It's the "center of mass" of the distribution.

Example: For a fair die, $E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$. You'll never roll 3.5, but it's the average outcome.

Properties of Expectation

Expectation has several useful properties that simplify calculations:

Linearity: For any constants $a, b$ and random variables $X, Y$:

$$E[aX + b] = aE[X] + b$$
$$E[X + Y] = E[X] + E[Y]$$

Linearity holds even when $X$ and $Y$ are dependent! This makes expectation easier to work with than many other statistics.

Expectation of a Function: For $Y = g(X)$:

$$E[g(X)] = \sum_x g(x) \cdot P(X = x) \quad \text{(discrete)}$$
$$E[g(X)] = \int g(x) \cdot f(x) \, dx \quad \text{(continuous)}$$

Warning: In general, $E[g(X)] \neq g(E[X])$. Equality is guaranteed only when $g$ is linear.

PYTHON
import numpy as np

def expectation_demo():
    """Demonstrate properties of expected value."""

    np.random.seed(42)
    n = 100000

    # Die roll simulation
    die_rolls = np.random.randint(1, 7, n)
    print("Expected Value Demonstrations")
    print("=" * 55)

    print("\n1. Fair Die:")
    theoretical = 3.5
    simulated = die_rolls.mean()
    print(f"   E[X] theoretical: {theoretical}")
    print(f"   E[X] simulated:   {simulated:.4f}")

    # Linearity
    print("\n2. Linearity: E[2X + 3] = 2E[X] + 3")
    transformed = 2 * die_rolls + 3
    print(f"   E[2X + 3] theoretical: {2 * 3.5 + 3}")
    print(f"   E[2X + 3] simulated:   {transformed.mean():.4f}")

    # Sum of random variables
    print("\n3. Sum: E[X + Y] = E[X] + E[Y]")
    die1 = np.random.randint(1, 7, n)
    die2 = np.random.randint(1, 7, n)
    print(f"   E[X + Y] theoretical: {3.5 + 3.5}")
    print(f"   E[X + Y] simulated:   {(die1 + die2).mean():.4f}")

    # E[g(X)] ≠ g(E[X]) in general
    print("\n4. Non-linearity: E[X²] ≠ (E[X])²")
    print(f"   E[X²] simulated:   {(die_rolls ** 2).mean():.4f}")
    print(f"   (E[X])² simulated: {(die_rolls.mean()) ** 2:.4f}")

expectation_demo()

Variance: Measuring Spread

While expectation tells us where the distribution is centered, variance measures how spread out it is:

$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

The second form (computational formula) is often easier to calculate. Variance is always non-negative, and $\text{Var}(X) = 0$ only if $X$ is constant.

Standard deviation $\sigma = \sqrt{\text{Var}(X)}$ has the same units as $X$, making it more interpretable for comparing to the mean.

Intuition: Variance is the average squared deviation from the mean. Large variance means outcomes are frequently far from the expected value.

Properties of Variance

Scaling: $\text{Var}(aX + b) = a^2 \text{Var}(X)$

Constants don't add variance (shifting doesn't spread data), but scaling by $a$ scales variance by $a^2$.

Sum of Independent Variables: If $X$ and $Y$ are independent:

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$$

This only holds for independent variables! For dependent variables, we need covariance.

PYTHON
import numpy as np

def variance_demo():
    """Demonstrate variance properties."""

    np.random.seed(42)
    n = 100000

    # Standard normal
    X = np.random.normal(0, 1, n)

    print("Variance Demonstrations")
    print("=" * 55)

    print("\n1. Standard Normal N(0,1):")
    print(f"   Var(X) theoretical: 1.0")
    print(f"   Var(X) simulated:   {X.var():.4f}")

    # Scaling property
    print("\n2. Scaling: Var(3X + 2) = 9·Var(X)")
    Y = 3 * X + 2
    print(f"   Var(3X + 2) theoretical: {9 * 1.0}")
    print(f"   Var(3X + 2) simulated:   {Y.var():.4f}")
    print(f"   Note: Adding 2 didn't change variance")

    # Sum of independent variables
    print("\n3. Sum of Independent: Var(X + Y) = Var(X) + Var(Y)")
    X1 = np.random.normal(0, 2, n)  # Var = 4
    X2 = np.random.normal(0, 3, n)  # Var = 9
    print(f"   Var(X₁) = {X1.var():.4f} (theoretical: 4)")
    print(f"   Var(X₂) = {X2.var():.4f} (theoretical: 9)")
    print(f"   Var(X₁ + X₂) = {(X1 + X2).var():.4f} (theoretical: 13)")

    # Average reduces variance
    print("\n4. Averaging Reduces Variance:")
    print("   Var(X̄) = Var(X) / n (for n i.i.d. samples)")
    for num_samples in [1, 10, 100, 1000]:
        means = []
        for _ in range(10000):
            sample = np.random.normal(0, 1, num_samples)
            means.append(sample.mean())
        print(f"   n={num_samples:4d}: Var(X̄) = {np.var(means):.4f} "
              f"(theoretical: {1.0/num_samples:.4f})")

variance_demo()

Covariance: Measuring Joint Variability

Covariance measures how two variables vary together:

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$

  • $\text{Cov}(X, Y) > 0$: When $X$ is above average, $Y$ tends to be above average
  • $\text{Cov}(X, Y) < 0$: When $X$ is above average, $Y$ tends to be below average
  • $\text{Cov}(X, Y) = 0$: No linear relationship (but the variables may still be dependent!)

Variance of sums (general case):

$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
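
A quick simulation confirms the general identity; the covariance value of 0.8 below is arbitrary:

PYTHON
import numpy as np

# Verify Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y) with correlated Gaussians.
np.random.seed(42)
cov = [[1.0, 0.8], [0.8, 1.0]]   # Var(X) = Var(Y) = 1, Cov(X, Y) = 0.8
X, Y = np.random.multivariate_normal([0, 0], cov, 100_000).T

lhs = np.var(X + Y)
rhs = np.var(X) + np.var(Y) + 2 * np.cov(X, Y)[0, 1]
print(f"Var(X + Y)                    = {lhs:.4f}")
print(f"Var(X) + Var(Y) + 2·Cov(X, Y) = {rhs:.4f}")
print("Theory: 1 + 1 + 2(0.8) = 3.6")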

Correlation: Standardized Covariance

Correlation normalizes covariance to lie in $[-1, 1]$:

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

  • $\rho = 1$: Perfect positive linear relationship
  • $\rho = -1$: Perfect negative linear relationship
  • $\rho = 0$: No linear relationship

Correlation is scale-invariant: multiplying $X$ or $Y$ by constants doesn't change $\rho$.

Warning: Correlation measures linear relationships only. Variables can have $\rho = 0$ yet be perfectly dependent (e.g., $Y = X^2$ when $X$ is symmetric around 0).

PYTHON
import numpy as np

def covariance_correlation_demo():
    """Demonstrate covariance and correlation."""

    np.random.seed(42)
    n = 10000

    print("Covariance and Correlation")
    print("=" * 55)

    # Generate correlated data
    mean = [0, 0]

    # Positive correlation
    cov_pos = [[1, 0.8], [0.8, 1]]
    data_pos = np.random.multivariate_normal(mean, cov_pos, n)

    # Negative correlation
    cov_neg = [[1, -0.6], [-0.6, 1]]
    data_neg = np.random.multivariate_normal(mean, cov_neg, n)

    # No correlation
    cov_zero = [[1, 0], [0, 1]]
    data_zero = np.random.multivariate_normal(mean, cov_zero, n)

    print("\n1. Positive Correlation (ρ = 0.8):")
    corr = np.corrcoef(data_pos[:, 0], data_pos[:, 1])[0, 1]
    print(f"   Sample correlation: {corr:.4f}")
    print("   Interpretation: When X is high, Y tends to be high")

    print("\n2. Negative Correlation (ρ = -0.6):")
    corr = np.corrcoef(data_neg[:, 0], data_neg[:, 1])[0, 1]
    print(f"   Sample correlation: {corr:.4f}")
    print("   Interpretation: When X is high, Y tends to be low")

    print("\n3. Zero Correlation (ρ = 0):")
    corr = np.corrcoef(data_zero[:, 0], data_zero[:, 1])[0, 1]
    print(f"   Sample correlation: {corr:.4f}")
    print("   Interpretation: X and Y are linearly unrelated")

    # Zero correlation ≠ independence
    print("\n4. Zero Correlation ≠ Independence:")
    X = np.random.uniform(-1, 1, n)
    Y = X ** 2  # Y = X², perfectly dependent
    corr = np.corrcoef(X, Y)[0, 1]
    print(f"   Y = X², correlation: {corr:.4f} (≈ 0)")
    print("   But Y is completely determined by X!")

covariance_correlation_demo()

The Covariance Matrix

For a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)^T$, the covariance matrix $\Sigma$ has entries:

$$\Sigma_{ij} = \text{Cov}(X_i, X_j)$$

Diagonal entries are variances; off-diagonal entries are covariances. The matrix is symmetric and positive semi-definite.

In ML, covariance matrices appear in:

  • Multivariate Gaussian distributions
  • Principal Component Analysis (PCA)
  • Mahalanobis distance
  • Gaussian Processes

PYTHON
import numpy as np

def covariance_matrix_demo():
    """Demonstrate covariance matrices."""

    np.random.seed(42)

    # Generate 3D correlated data
    true_cov = np.array([
        [1.0, 0.5, 0.2],
        [0.5, 2.0, -0.3],
        [0.2, -0.3, 0.5]
    ])
    mean = [0, 0, 0]

    data = np.random.multivariate_normal(mean, true_cov, 10000)

    # Estimate covariance matrix
    sample_cov = np.cov(data.T)

    print("Covariance Matrix")
    print("=" * 55)

    print("\nTrue Covariance Matrix:")
    print(true_cov)

    print("\nSample Covariance Matrix (n=10000):")
    print(np.round(sample_cov, 3))

    print("\nInterpretation:")
    print(f"  Var(X₁) = {sample_cov[0,0]:.3f}")
    print(f"  Var(X₂) = {sample_cov[1,1]:.3f}")
    print(f"  Var(X₃) = {sample_cov[2,2]:.3f}")
    print(f"  Cov(X₁, X₂) = {sample_cov[0,1]:.3f} (positive)")
    print(f"  Cov(X₂, X₃) = {sample_cov[1,2]:.3f} (negative)")

covariance_matrix_demo()

Expectations in Machine Learning

Expected value appears throughout ML in various guises:

Loss Functions: Training minimizes expected loss over the data distribution:

$$\min_\theta E_{(x,y) \sim P}[L(f_\theta(x), y)]$$

In practice, we approximate with the empirical average over training data.

Risk vs Empirical Risk:

  • True risk: $E_{(x,y) \sim P}[L(f_\theta(x), y)]$ (what we want to minimize)
  • Empirical risk: $\frac{1}{n}\sum_{i=1}^n L(f_\theta(x_i), y_i)$ (what we compute)

The gap between them is the generalization error.

Bias-Variance Decomposition: For squared error loss:

$$E[(f_\theta(x) - y)^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

This shows that model error comes from systematic bias, sensitivity to training data, and inherent noise.

PYTHON
import numpy as np

def bias_variance_demo():
    """Demonstrate the bias-variance tradeoff."""

    np.random.seed(42)

    # True function: f(x) = sin(x)
    def true_f(x):
        return np.sin(x)

    # Generate multiple training sets and fit polynomials
    n_datasets = 200
    n_samples = 20
    test_x = 2.0  # Point to evaluate predictions

    predictions = {1: [], 3: [], 9: []}  # Different polynomial degrees

    for _ in range(n_datasets):
        # Generate noisy training data
        X = np.random.uniform(0, 4, n_samples)
        y = true_f(X) + np.random.normal(0, 0.3, n_samples)

        # Fit polynomials of different degrees
        for degree in predictions.keys():
            coeffs = np.polyfit(X, y, degree)
            pred = np.polyval(coeffs, test_x)
            predictions[degree].append(pred)

    print("Bias-Variance Tradeoff")
    print("=" * 55)
    print(f"True f({test_x}) = sin({test_x}) = {true_f(test_x):.4f}")
    print()

    print(f"{'Degree':<8} {'Mean Pred':<12} {'Bias²':<10} {'Variance':<10} {'MSE':<10}")
    print("-" * 55)

    for degree in [1, 3, 9]:
        preds = np.array(predictions[degree])
        mean_pred = preds.mean()
        bias_sq = (mean_pred - true_f(test_x)) ** 2
        variance = preds.var()
        mse = bias_sq + variance

        print(f"{degree:<8} {mean_pred:<12.4f} {bias_sq:<10.4f} {variance:<10.4f} {mse:<10.4f}")

    print()
    print("Low degree (1):  High bias, low variance (underfitting)")
    print("High degree (9): Low bias, high variance (overfitting)")
    print("Medium (3):      Balanced, lowest MSE")

bias_variance_demo()

Sample Statistics vs Population Parameters

It's crucial to distinguish between:

  • Population parameters: True values for the entire distribution ($\mu$, $\sigma^2$)
  • Sample statistics: Estimates computed from finite data ($\bar{x}$, $s^2$)

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is an unbiased estimator of $\mu$

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ is unbiased for $\sigma^2$

The $n-1$ (Bessel's correction) accounts for using the sample mean instead of the true mean.
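
The correction is easy to see by simulation: average each estimator over many small samples and compare against the true $\sigma^2$. A minimal sketch:

PYTHON
import numpy as np

# Compare the biased (1/n) and unbiased (1/(n-1)) variance estimators.
# Samples come from N(0, 1), so the true variance is σ² = 1.
np.random.seed(42)
n, trials = 5, 100_000
samples = np.random.normal(0, 1, size=(trials, n))

mean_biased = samples.var(axis=1, ddof=0).mean()    # divide by n
mean_unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel)

print(f"1/n estimator averages:     {mean_biased:.4f}  (theory: (n-1)/n = {(n-1)/n:.4f})")
print(f"1/(n-1) estimator averages: {mean_unbiased:.4f}  (theory: 1.0000)")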

Key Takeaways

  • Expected value $E[X]$ is the long-run average; linear in its argument
  • Variance $\text{Var}(X) = E[(X - \mu)^2]$ measures spread; affected by scaling but not shifting
  • Covariance measures joint variability; zero covariance doesn't imply independence
  • Correlation normalizes covariance to $[-1, 1]$; measures linear relationships only
  • Covariance matrices encode all pairwise relationships in a random vector
  • ML loss functions are expectations; we approximate with sample averages
  • Bias-variance tradeoff: Model error decomposes into systematic bias and sensitivity to training data
  • Sample statistics estimate population parameters with inherent uncertainty

3.4 Maximum Likelihood Estimation

How do we find the best parameters for a probabilistic model? Maximum Likelihood Estimation (MLE) provides a principled answer: choose parameters that make the observed data most probable. This simple idea underlies most of machine learning, from linear regression to deep neural networks.

The Core Idea

Suppose we have data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ and a parametric model $p(x|\theta)$ that describes how the data is generated. Different values of $\theta$ make the observed data more or less probable.

The likelihood function treats the probability as a function of parameters:

$$L(\theta) = p(\mathcal{D}|\theta) = p(x_1, x_2, \ldots, x_n|\theta)$$

If the data points are independent and identically distributed (i.i.d.), this factors:

$$L(\theta) = \prod_{i=1}^n p(x_i|\theta)$$

The Maximum Likelihood Estimate (MLE) is the parameter value that maximizes the likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta L(\theta)$$

Intuition: MLE asks "which parameter setting would have made this data most likely to occur?" and picks that setting.

Why Log-Likelihood?

Products are numerically unstable (many small probabilities multiplied together underflow to zero) and difficult to differentiate. Taking the logarithm converts products to sums:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log p(x_i|\theta)$$

Since $\log$ is monotonically increasing, maximizing $\ell(\theta)$ gives the same answer as maximizing $L(\theta)$. The log-likelihood is:

  • Numerically stable
  • A sum rather than a product
  • Easier to differentiate

In ML, we often minimize negative log-likelihood (NLL) instead, which is equivalent to maximizing likelihood:

$$\hat{\theta} = \arg\min_\theta \left[-\sum_{i=1}^n \log p(x_i|\theta)\right]$$
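
The numerical-stability point is easy to demonstrate: for even a moderately sized dataset, the raw likelihood underflows to zero in double precision, while the log-likelihood remains an ordinary number. A minimal sketch with Bernoulli data:

PYTHON
import numpy as np

# Raw likelihood vs log-likelihood for 2000 Bernoulli(0.3) observations.
np.random.seed(42)
theta = 0.3
x = np.random.binomial(1, theta, 2000)
probs = np.where(x == 1, theta, 1 - theta)   # p(x_i | θ) for each observation

print(f"L(θ) = Π p(x_i|θ)     = {np.prod(probs)}")               # underflows to 0.0
print(f"ℓ(θ) = Σ log p(x_i|θ) = {np.sum(np.log(probs)):.2f}")    # finite, usable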

MLE for the Gaussian Distribution

Consider estimating the mean and variance of a Gaussian distribution from data. If $x_i \sim \mathcal{N}(\mu, \sigma^2)$, the log-likelihood is:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$

Taking derivatives and setting to zero:

$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

$$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$$

The MLE for the mean is the sample mean—a reassuring result. Note that the variance MLE divides by $n$, not $n-1$; it is biased downward for finite samples, but the bias vanishes as $n \to \infty$, and the estimator is consistent.

PYTHON
import numpy as np

def gaussian_mle_demo():
    """Demonstrate MLE for Gaussian parameters."""

    np.random.seed(42)

    # True parameters
    true_mu = 5.0
    true_sigma = 2.0

    print("MLE for Gaussian Distribution")
    print("=" * 55)
    print(f"True parameters: μ = {true_mu}, σ = {true_sigma}")
    print()

    for n in [10, 100, 1000, 10000]:
        # Generate data
        data = np.random.normal(true_mu, true_sigma, n)

        # MLE estimates
        mu_mle = np.mean(data)
        sigma2_mle = np.mean((data - mu_mle) ** 2)  # Divides by n
        sigma_mle = np.sqrt(sigma2_mle)

        # Unbiased estimate for comparison
        sigma2_unbiased = np.var(data, ddof=1)  # Divides by n-1

        print(f"n = {n:>5}: μ̂ = {mu_mle:.4f}, σ̂_MLE = {sigma_mle:.4f}, "
              f"σ̂_unbiased = {np.sqrt(sigma2_unbiased):.4f}")

    print()
    print("Note: MLE variance is biased (divides by n, not n-1)")
    print("      but bias vanishes as n → ∞")

gaussian_mle_demo()

MLE for Bernoulli Distribution

For binary data $x_i \in \{0, 1\}$ with $p(x_i = 1) = \theta$, the likelihood is:

$$L(\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}$$

Log-likelihood:

$$\ell(\theta) = \left(\sum_{i=1}^n x_i\right) \log\theta + \left(n - \sum_{i=1}^n x_i\right) \log(1-\theta)$$

Taking the derivative and solving:

$$\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$

The MLE is simply the sample proportion—the fraction of 1's in the data.

PYTHON
import numpy as np

def bernoulli_mle_demo():
    """Demonstrate MLE for Bernoulli parameter."""

    np.random.seed(42)

    # True probability
    true_p = 0.7

    print("MLE for Bernoulli Distribution")
    print("=" * 55)
    print(f"True parameter: p = {true_p}")
    print()

    for n in [10, 50, 100, 500, 1000]:
        # Generate binary data
        data = np.random.binomial(1, true_p, n)

        # MLE: sample proportion
        p_mle = np.mean(data)

        # Standard error
        se = np.sqrt(p_mle * (1 - p_mle) / n)

        print(f"n = {n:>4}: p̂ = {p_mle:.4f} ± {se:.4f} "
              f"(true in [{p_mle-1.96*se:.3f}, {p_mle+1.96*se:.3f}]? "
              f"{'✓' if abs(p_mle - true_p) < 1.96*se else '✗'})")

bernoulli_mle_demo()

MLE as Cross-Entropy Minimization

For classification, MLE is equivalent to minimizing cross-entropy. Given data $(x_i, y_i)$ and model predictions $p_\theta(y|x)$, the negative log-likelihood is:

$$-\sum_{i=1}^n \log p_\theta(y_i|x_i)$$

For binary classification with sigmoid output $\hat{p} = \sigma(f_\theta(x))$:

$$-\sum_{i=1}^n \left[y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i)\right]$$

This is exactly the binary cross-entropy loss. For multi-class with softmax:

$$-\sum_{i=1}^n \log p_\theta(y_i|x_i)$$

This is the categorical cross-entropy loss.

So when you train a classifier with cross-entropy loss, you're doing maximum likelihood estimation!

PYTHON
import numpy as np

def cross_entropy_mle_connection():
    """Show that cross-entropy loss is negative log-likelihood."""

    np.random.seed(42)

    # Simulated binary classification
    n = 1000
    X = np.random.randn(n, 2)
    true_w = np.array([1.5, -1.0])
    true_b = 0.5

    # True probabilities and labels
    logits = X @ true_w + true_b
    probs = 1 / (1 + np.exp(-logits))
    y = (np.random.rand(n) < probs).astype(float)

    # Model predictions (slightly wrong parameters)
    w = np.array([1.0, -0.5])
    b = 0.0
    pred_logits = X @ w + b
    pred_probs = 1 / (1 + np.exp(-pred_logits))
    pred_probs = np.clip(pred_probs, 1e-7, 1 - 1e-7)  # Numerical stability

    # Compute negative log-likelihood
    nll = -np.mean(y * np.log(pred_probs) + (1 - y) * np.log(1 - pred_probs))

    # This is exactly binary cross-entropy!
    bce = -np.mean(y * np.log(pred_probs) + (1 - y) * np.log(1 - pred_probs))

    print("Cross-Entropy = Negative Log-Likelihood")
    print("=" * 55)
    print(f"\nNegative log-likelihood: {nll:.4f}")
    print(f"Binary cross-entropy:    {bce:.4f}")
    print(f"\nThey're identical! Training with cross-entropy loss")
    print("is maximum likelihood estimation for classification.")

cross_entropy_mle_connection()

MLE for Linear Regression

In linear regression, we assume:

$$y_i = \mathbf{w}^T\mathbf{x}_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

This means $y_i | \mathbf{x}_i \sim \mathcal{N}(\mathbf{w}^T\mathbf{x}_i, \sigma^2)$. The log-likelihood is:

$$\ell(\mathbf{w}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

The terms involving $\sigma$ don't affect the optimal $\mathbf{w}$, so maximizing likelihood is equivalent to minimizing:

$$\sum_{i=1}^n (y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

This is the sum of squared errors, i.e., the mean squared error up to a constant factor. So least squares regression is MLE under Gaussian noise.

PYTHON
import numpy as np

def linear_regression_mle():
    """Show that least squares is MLE for Gaussian noise."""

    np.random.seed(42)

    # Generate data: y = 2x + 1 + noise
    n = 100
    X = np.random.randn(n, 1)
    X_design = np.hstack([np.ones((n, 1)), X])  # Add intercept
    true_w = np.array([1.0, 2.0])  # [intercept, slope]
    noise_std = 0.5

    y = X_design @ true_w + np.random.randn(n) * noise_std

    # MLE solution (same as least squares)
    w_mle = np.linalg.lstsq(X_design, y, rcond=None)[0]

    # Closed-form OLS solution
    w_ols = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

    print("Linear Regression: MLE = Least Squares")
    print("=" * 55)
    print(f"\nTrue weights: {true_w}")
    print(f"MLE weights:  {w_mle}")
    print(f"OLS weights:  {w_ols}")
    print(f"\nDifference:   {np.abs(w_mle - w_ols).max():.2e}")
    print("\nConclusion: Under Gaussian noise assumption,")
    print("            MLE gives the same answer as OLS.")

linear_regression_mle()

Properties of MLE

Consistency: As $n \to \infty$, $\hat{\theta}_{MLE} \to \theta_{true}$ (it converges in probability to the truth, assuming the model is well specified).

Asymptotic Normality: For large $n$, $\hat{\theta}_{MLE}$ is approximately Gaussian:

$$\hat{\theta}_{MLE} \approx \mathcal{N}\left(\theta_{true}, \frac{1}{nI(\theta)}\right)$$
where $I(\theta)$ is the Fisher information.

Asymptotic Efficiency: MLE asymptotically attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can achieve.

Invariance: If $\hat{\theta}_{MLE}$ is the MLE for $\theta$, then $g(\hat{\theta}_{MLE})$ is the MLE for $g(\theta)$.

These properties make MLE the go-to method when you have enough data and a well-specified model.
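
Asymptotic normality can be checked by simulation. For a Bernoulli($\theta$) model the Fisher information is $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the theory predicts that $\hat{\theta}_{MLE}$ has standard deviation roughly $\sqrt{\theta(1-\theta)/n}$. A minimal sketch:

PYTHON
import numpy as np

# Sampling distribution of the Bernoulli MLE over many repeated datasets.
# Theory: θ̂ ≈ N(θ, θ(1-θ)/n), since I(θ) = 1/(θ(1-θ)).
np.random.seed(42)
theta, n, trials = 0.7, 500, 20_000

data = np.random.binomial(1, theta, size=(trials, n))
theta_hat = data.mean(axis=1)    # MLE (sample proportion) for each dataset

print(f"Mean of θ̂:     {theta_hat.mean():.4f}   (true θ = {theta})")
print(f"Std of θ̂:      {theta_hat.std():.4f}")
print(f"Predicted std: {np.sqrt(theta * (1 - theta) / n):.4f}   (= √(θ(1-θ)/n))")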

Limitations of MLE

Overfitting with Limited Data: MLE can overfit, especially with complex models and small datasets. The model may fit noise rather than signal.

Sensitivity to Outliers: Since we multiply likelihoods, a single outlier with very low probability can dominate the estimate.

Requires Model Specification: MLE finds the best parameters within the assumed model family. If the model is wrong, so are the estimates.

Point Estimates Only: MLE gives a single "best" value without quantifying uncertainty. For uncertainty, we need Bayesian methods or confidence intervals.

PYTHON
import numpy as np

def mle_overfitting_demo():
    """Demonstrate MLE overfitting with limited data."""

    np.random.seed(42)

    # True model: y = sin(x) + noise
    def true_function(x):
        return np.sin(x)

    # Generate small dataset
    n = 10
    X = np.linspace(0, 2*np.pi, n)
    y = true_function(X) + np.random.randn(n) * 0.3

    print("MLE Overfitting Demonstration")
    print("=" * 55)
    print(f"Data points: {n}")
    print()

    # Fit polynomials of increasing degree
    test_X = np.linspace(0, 2*np.pi, 100)

    for degree in [1, 3, 9]:
        coeffs = np.polyfit(X, y, degree)
        train_pred = np.polyval(coeffs, X)
        test_pred = np.polyval(coeffs, test_X)
        true_test = true_function(test_X)

        train_mse = np.mean((y - train_pred) ** 2)
        test_mse = np.mean((test_pred - true_test) ** 2)

        print(f"Degree {degree}: Train MSE = {train_mse:.4f}, "
              f"Test MSE = {test_mse:.4f}")

    print()
    print("Higher degree → Lower train error, but higher test error!")
    print("MLE maximally fits the training data, including noise.")

mle_overfitting_demo()

Regularization: Constraining MLE

To combat overfitting, we add regularization—a penalty on parameter complexity:

$$\hat{\theta} = \arg\max_\theta \left[\ell(\theta) - \lambda R(\theta)\right]$$

Common regularizers:

  • L2 (Ridge): $R(\theta) = \|\theta\|_2^2$ — equivalent to a Gaussian prior (see Bayesian section)
  • L1 (Lasso): $R(\theta) = \|\theta\|_1$ — encourages sparsity

Regularized MLE sits between pure MLE and Bayesian inference: it incorporates prior beliefs (complexity should be limited) without full Bayesian machinery.

Key Takeaways

  • MLE finds parameters that maximize the probability of observed data
  • Log-likelihood is preferred for numerical stability and easier optimization
  • Common losses are MLEs: cross-entropy (classification), MSE (regression with Gaussian noise)
  • MLE is consistent and efficient asymptotically
  • MLE can overfit with limited data—regularization helps
  • MLE gives point estimates; for uncertainty quantification, use Bayesian methods
  • Understanding MLE connects probabilistic modeling to everyday loss functions

3.5 Bayesian Inference Basics

Maximum likelihood treats parameters as fixed unknowns to be estimated. Bayesian inference takes a fundamentally different view: parameters are random variables with probability distributions. This perspective naturally incorporates prior knowledge and provides uncertainty estimates, though at the cost of additional complexity.

The Bayesian Framework

Instead of finding a single "best" parameter, Bayesian inference maintains a probability distribution over possible parameter values. Given data $\mathcal{D}$, we update our beliefs using Bayes' theorem:

$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) \cdot p(\theta)}{p(\mathcal{D})}$$

The components are:

  • Prior $p(\theta)$: Our beliefs about $\theta$ before seeing data
  • Likelihood $p(\mathcal{D}|\theta)$: Probability of data given parameters (same as in MLE)
  • Posterior $p(\theta|\mathcal{D})$: Updated beliefs after seeing data
  • Evidence $p(\mathcal{D})$: Normalizing constant

The evidence is often intractable to compute, but for many purposes we only need the posterior up to a constant:

$$p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \cdot p(\theta)$$
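
When no closed form is available, a low-dimensional posterior can simply be evaluated on a grid and normalized numerically. A minimal sketch for a coin's bias (the data and the uniform prior are illustrative choices):

PYTHON
import numpy as np

# Grid approximation of p(θ|D) ∝ p(D|θ)·p(θ) for a coin's bias θ.
# Data: 7 heads in 10 flips; prior: uniform on [0, 1].
heads, flips = 7, 10
theta = np.linspace(0, 1, 1001)
prior = np.ones_like(theta)
likelihood = theta**heads * (1 - theta)**(flips - heads)

unnormalized = likelihood * prior
dx = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dx)   # numerical normalization

post_mean = np.sum(theta * posterior) * dx
print(f"Posterior mean:    {post_mean:.4f}   (exact Beta(8, 4) mean: {8/12:.4f})")
print(f"MAP (grid argmax): {theta[np.argmax(posterior)]:.4f}   (exact mode: 0.7000)")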

Prior Distributions: Encoding Beliefs

The prior $p(\theta)$ encodes what we believe before seeing data. Choosing a prior is both powerful and controversial—it's where domain knowledge enters, but also where subjective bias can creep in.

Informative priors incorporate strong domain knowledge:

  • "Weights should be small" → Gaussian prior centered at 0
  • "This probability is likely around 0.3" → Beta(3, 7) prior

Weakly informative priors gently constrain without dominating:

  • "Parameters are probably not astronomically large" → Wide Gaussian

Uninformative priors try to "let the data speak":

  • Uniform over valid range
  • Jeffreys prior (invariant under reparameterization)

No prior is truly uninformative—even a uniform prior encodes assumptions. The key is being transparent about your choices.

PYTHON
import numpy as np

def prior_effect_demo():
    """Demonstrate how priors affect posterior inference."""

    np.random.seed(42)

    # Scenario: Estimating a coin's bias
    # True bias: 0.7
    # We flip 5 times, get 4 heads

    n_heads = 4
    n_flips = 5

    # Different priors
    priors = {
        'Uniform (no info)': (1, 1),       # Beta(1,1) = Uniform
        'Slight heads bias': (2, 1),        # Beta(2,1)
        'Fair coin belief': (10, 10),       # Beta(10,10) - strong prior at 0.5
    }

    print("Effect of Prior on Posterior")
    print("=" * 60)
    print(f"Data: {n_heads} heads in {n_flips} flips")
    print(f"MLE: {n_heads/n_flips:.2f}")
    print()

    theta = np.linspace(0, 1, 1000)

    for name, (a, b) in priors.items():
        # Prior: Beta(a, b)
        # Posterior: Beta(a + n_heads, b + n_flips - n_heads)
        a_post = a + n_heads
        b_post = b + (n_flips - n_heads)

        # Posterior mean
        post_mean = a_post / (a_post + b_post)

        # Posterior mode (MAP estimate)
        if a_post > 1 and b_post > 1:
            post_mode = (a_post - 1) / (a_post + b_post - 2)
        else:
            post_mode = float('nan')

        # Prior mean
        prior_mean = a / (a + b)

        print(f"{name}:")
        print(f"  Prior mean: {prior_mean:.3f}")
        print(f"  Posterior mean: {post_mean:.3f}")
        print(f"  Posterior mode (MAP): {post_mode:.3f}")
        print()

    print("Note: Strong prior (Fair coin) pulls estimate toward 0.5")
    print("      Weak prior (Uniform) yields estimate close to MLE")

prior_effect_demo()

Conjugate Priors

Computing the posterior analytically is often impossible. However, for certain likelihood-prior pairs, the posterior has the same functional form as the prior—these are called conjugate priors.

| Likelihood | Conjugate Prior | Posterior |
|------------|-----------------|-----------|
| Bernoulli/Binomial | Beta | Beta |
| Gaussian (known variance) | Gaussian | Gaussian |
| Poisson | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |

Conjugacy makes computation tractable and provides interpretable updates.

Example: Beta-Binomial conjugacy

Prior: $\theta \sim \text{Beta}(\alpha, \beta)$
Likelihood: $k$ successes in $n$ trials
Posterior: $\theta|\text{data} \sim \text{Beta}(\alpha + k, \beta + n - k)$

The prior parameters $\alpha$ and $\beta$ act like "pseudo-counts" of prior observations.

PYTHON
import numpy as np
from scipy import stats

def conjugate_prior_demo():
    """Demonstrate Beta-Binomial conjugacy."""

    print("Beta-Binomial Conjugacy")
    print("=" * 55)

    # Prior: Beta(2, 2) - slight preference for fair coin
    alpha_prior, beta_prior = 2, 2

    print(f"Prior: Beta({alpha_prior}, {beta_prior})")
    print(f"Prior mean: {alpha_prior/(alpha_prior+beta_prior):.3f}")
    print()

    # Sequential updating as we observe more data
    np.random.seed(42)
    true_p = 0.7

    alpha, beta = alpha_prior, beta_prior

    print(f"{'Flips':<8} {'Heads':<8} {'Posterior':<20} {'Mean':<8} {'95% CI':<20}")
    print("-" * 65)

    for n in [0, 1, 5, 10, 50, 100]:
        if n > 0:
            # Generate data
            data = np.random.binomial(1, true_p, n)
            heads = data.sum()

            # Update posterior
            alpha = alpha_prior + heads
            beta = beta_prior + (n - heads)

        else:
            heads = 0

        mean = alpha / (alpha + beta)
        ci_low, ci_high = stats.beta.ppf([0.025, 0.975], alpha, beta)

        print(f"{n:<8} {heads:<8} Beta({alpha:.0f}, {beta:.0f}){'':<8} "
              f"{mean:<8.3f} [{ci_low:.3f}, {ci_high:.3f}]")

    print()
    print(f"True probability: {true_p}")
    print("As data increases, posterior concentrates around true value")

conjugate_prior_demo()

MAP Estimation: Bridging MLE and Bayes

The Maximum A Posteriori (MAP) estimate is the mode of the posterior:

$$\hat{\theta}_{MAP} = \arg\max_\theta p(\theta|\mathcal{D}) = \arg\max_\theta \left[\log p(\mathcal{D}|\theta) + \log p(\theta)\right]$$

MAP sits between MLE and full Bayesian inference:

  • Simpler than full posterior (single point estimate)
  • Incorporates prior (unlike MLE)
  • Equivalent to regularized MLE

Connection to regularization:

  • Gaussian prior → L2 regularization (Ridge)
  • Laplace prior → L1 regularization (Lasso)
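
To make the L2 connection concrete, place a zero-mean Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$ on the parameters. Then $\log p(\theta) = -\frac{\|\theta\|_2^2}{2\tau^2} + \text{const}$, and maximizing the posterior is the same as minimizing a penalized loss:

$$\hat{\theta}_{MAP} = \arg\min_\theta \left[-\log p(\mathcal{D}|\theta) + \frac{1}{2\tau^2}\|\theta\|_2^2\right]$$

For linear regression with Gaussian noise of variance $\sigma^2$, this is exactly Ridge regression with $\lambda = \sigma^2/\tau^2$: the tighter the prior, the stronger the regularization.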

PYTHON
import numpy as np

def map_regularization_connection():
    """Show that MAP with Gaussian prior = L2 regularization."""

    np.random.seed(42)

    # Linear regression setup
    n = 50
    d = 20  # Many features, some irrelevant

    # True weights: only first 5 features matter
    true_w = np.zeros(d)
    true_w[:5] = np.array([3, -2, 1, 0.5, -1])

    X = np.random.randn(n, d)
    y = X @ true_w + np.random.randn(n) * 0.5

    # MLE solution (ordinary least squares)
    w_mle = np.linalg.lstsq(X, y, rcond=None)[0]

    # MAP with a zero-mean Gaussian prior on w = Ridge regression:
    # equivalent to minimizing ||y - Xw||² + λ||w||², with λ = σ²/τ²
    lambda_reg = 1.0
    w_map = np.linalg.solve(X.T @ X + lambda_reg * np.eye(d), X.T @ y)

    print("MAP Estimation = Regularized MLE")
    print("=" * 55)
    print(f"Features: {d}, Samples: {n}")
    print(f"True non-zero weights: first 5 features")
    print()

    print(f"{'Method':<15} {'MSE to true':<15} {'||w||₂':<15}")
    print("-" * 45)
    print(f"{'True':<15} {0.0:<15.4f} {np.linalg.norm(true_w):<15.4f}")
    print(f"{'MLE':<15} {np.mean((w_mle - true_w)**2):<15.4f} "
          f"{np.linalg.norm(w_mle):<15.4f}")
    print(f"{'MAP (λ=1)':<15} {np.mean((w_map - true_w)**2):<15.4f} "
          f"{np.linalg.norm(w_map):<15.4f}")

    print()
    print("MAP (with Gaussian prior) shrinks weights toward zero,")
    print("reducing overfitting when features > samples")

map_regularization_connection()
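
The Laplace-prior case has no closed form, so the sketch below uses iterative soft-thresholding (ISTA) to compute the L1-penalized (Lasso) solution; `map_laplace_prior_demo`, the step size, and the iteration count are illustrative choices, not a prescribed recipe. The soft-thresholding step is what the Laplace prior contributes: it drives small weights exactly to zero.

PYTHON
import numpy as np

def map_laplace_prior_demo():
    """Sketch: MAP with a Laplace prior = L1 (Lasso), solved by ISTA."""
    np.random.seed(42)

    # Same setup as above: 20 features, only the first 5 matter
    n, d = 50, 20
    true_w = np.zeros(d)
    true_w[:5] = np.array([3, -2, 1, 0.5, -1])
    X = np.random.randn(n, d)
    y = X @ true_w + np.random.randn(n) * 0.5

    lam = 1.0
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    step = 1.0 / np.linalg.norm(X.T @ X, 2)

    w = np.zeros(d)
    for _ in range(2000):
        grad = X.T @ (X @ w - y)  # gradient of 0.5 * ||y - Xw||^2
        w = w - step * grad
        # Proximal step: soft-thresholding implements the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0)

    print(f"Non-zero weights (|w| > 0.01): {np.sum(np.abs(w) > 0.01)} of {d}")
    print(f"MSE to true weights: {np.mean((w - true_w)**2):.4f}")

map_laplace_prior_demo()

Unlike the Gaussian-prior (Ridge) estimate, which shrinks all weights but leaves them non-zero, the L1 solution is sparse, mirroring the Laplace prior's sharp peak at zero.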

Posterior Predictive Distribution

Instead of predicting with a single parameter estimate, Bayesian inference averages predictions over all plausible parameter values:

$$p(y_{new}|x_{new}, \mathcal{D}) = \int p(y_{new}|x_{new}, \theta) p(\theta|\mathcal{D}) d\theta$$

This posterior predictive distribution naturally incorporates parameter uncertainty into predictions: where we are uncertain about the parameters, the predictive distribution is correspondingly wider.

Benefits:

  • Predictions include uncertainty from both noise and parameter uncertainty
  • More calibrated confidence intervals
  • Graceful handling of limited data
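
For the Gaussian-mean setting used in the demo below (known noise variance $\sigma^2$, Gaussian posterior $\mu \mid \mathcal{D} \sim \mathcal{N}(\mu_{post}, \sigma_{post}^2)$), the integral has a closed form: the predictive distribution is Gaussian and the two sources of uncertainty simply add,

$$p(y_{new} \mid \mathcal{D}) = \mathcal{N}(y_{new} \mid \mu_{post},\ \sigma^2 + \sigma_{post}^2)$$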

PYTHON
import numpy as np

def posterior_predictive_demo():
    """Demonstrate posterior predictive distribution."""

    np.random.seed(42)

    # Simple scenario: estimate mean of Gaussian with known variance
    true_mu = 5.0
    sigma = 2.0  # Known
    n_obs = 10

    # Generate observations
    data = np.random.normal(true_mu, sigma, n_obs)
    data_mean = data.mean()

    # Prior: mu ~ N(0, 10²)
    prior_mean = 0
    prior_var = 100

    # Posterior: mu | data ~ N(posterior_mean, posterior_var)
    posterior_var = 1 / (n_obs/sigma**2 + 1/prior_var)
    posterior_mean = posterior_var * (n_obs * data_mean / sigma**2 + prior_mean / prior_var)
    posterior_std = np.sqrt(posterior_var)

    # Predictive distribution for new observation
    # y_new ~ N(posterior_mean, sigma² + posterior_var)
    predictive_var = sigma**2 + posterior_var
    predictive_std = np.sqrt(predictive_var)

    print("Posterior Predictive Distribution")
    print("=" * 55)
    print(f"True μ = {true_mu}, σ = {sigma} (known)")
    print(f"Observed: {n_obs} points, sample mean = {data_mean:.3f}")
    print()

    print("Parameter Uncertainty:")
    print(f"  Posterior: μ ~ N({posterior_mean:.3f}, {posterior_std:.3f}²)")
    print(f"  95% CI for μ: [{posterior_mean - 1.96*posterior_std:.3f}, "
          f"{posterior_mean + 1.96*posterior_std:.3f}]")
    print()

    print("Predictive Uncertainty (for new observation):")
    print(f"  y_new ~ N({posterior_mean:.3f}, {predictive_std:.3f}²)")
    print(f"  95% PI: [{posterior_mean - 1.96*predictive_std:.3f}, "
          f"{posterior_mean + 1.96*predictive_std:.3f}]")
    print()

    print("Note: Predictive uncertainty > Parameter uncertainty")
    print("      because it includes observation noise")

posterior_predictive_demo()

Approximate Inference

For complex models, the posterior is intractable—we can't compute it analytically. Several approximation strategies exist:

Markov Chain Monte Carlo (MCMC): Generate samples from the posterior by constructing a Markov chain whose stationary distribution is the posterior. Algorithms include Metropolis-Hastings, Gibbs sampling, and Hamiltonian Monte Carlo.

Variational Inference: Approximate the posterior with a simpler distribution $q(\theta)$ by minimizing the KL divergence $\text{KL}(q \| p)$. Much faster than MCMC, but the restricted family and reverse-KL objective tend to underestimate posterior variance and can miss modes.

Laplace Approximation: Approximate the posterior with a Gaussian centered at the MAP estimate. Simple but can be inaccurate for non-Gaussian posteriors.

These methods enable Bayesian deep learning, Gaussian processes, and other sophisticated probabilistic models.

PYTHON
import numpy as np

def mcmc_demo():
    """Simple Metropolis-Hastings MCMC demonstration."""

    np.random.seed(42)

    # Target: posterior over the mean of a Gaussian
    # Prior: mu ~ N(0, 10²)
    # Likelihood: data ~ N(mu, 1²)
    data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
    n = len(data)
    data_mean = data.mean()

    prior_mean, prior_std = 0, 10
    likelihood_std = 1

    def log_posterior(mu):
        # Up to constants in mu; the sample mean is a sufficient statistic
        # for a Gaussian likelihood with known variance
        log_prior = -0.5 * ((mu - prior_mean) / prior_std) ** 2
        log_likelihood = -0.5 * n * ((data_mean - mu) / likelihood_std) ** 2
        return log_prior + log_likelihood

    # Metropolis-Hastings
    n_samples = 10000
    samples = np.zeros(n_samples)
    current = 0.0
    proposal_std = 0.5

    accepted = 0
    for i in range(n_samples):
        # Propose new value
        proposed = current + np.random.normal(0, proposal_std)

        # Accept/reject
        log_alpha = log_posterior(proposed) - log_posterior(current)
        if np.log(np.random.random()) < log_alpha:
            current = proposed
            accepted += 1

        samples[i] = current

    # Discard burn-in
    samples = samples[1000:]

    print("MCMC (Metropolis-Hastings) Demo")
    print("=" * 55)
    print(f"Data: {data}")
    print(f"Acceptance rate: {accepted/n_samples:.1%}")
    print()

    # Compare to analytical posterior
    post_var = 1 / (n/likelihood_std**2 + 1/prior_std**2)
    post_mean = post_var * (n * data_mean / likelihood_std**2 + prior_mean / prior_std**2)
    post_std = np.sqrt(post_var)

    print(f"Analytical posterior: N({post_mean:.4f}, {post_std:.4f}²)")
    print(f"MCMC estimate:        N({samples.mean():.4f}, {samples.std():.4f}²)")
    print()
    print("MCMC successfully approximates the true posterior!")

mcmc_demo()
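
As a companion to the MCMC demo, here is a minimal sketch of the Laplace approximation mentioned above, applied to the same posterior; `laplace_demo` is an illustrative name and the finite-difference curvature estimate is one simple choice. It locates the MAP numerically, then fits a Gaussian whose variance is the inverse of the curvature of the negative log posterior at the mode.

PYTHON
import numpy as np
from scipy import optimize

def laplace_demo():
    """Sketch: Laplace approximation to the posterior from mcmc_demo."""
    data = np.array([4.2, 5.1, 4.8, 5.5, 4.9])
    n, data_mean = len(data), data.mean()
    prior_mean, prior_std = 0, 10
    likelihood_std = 1

    def neg_log_posterior(mu):
        log_prior = -0.5 * ((mu - prior_mean) / prior_std) ** 2
        log_likelihood = -0.5 * n * ((data_mean - mu) / likelihood_std) ** 2
        return -(log_prior + log_likelihood)

    # Step 1: find the MAP (posterior mode) numerically
    mu_map = optimize.minimize_scalar(neg_log_posterior).x

    # Step 2: curvature at the mode via central finite differences;
    # the Gaussian approximation uses variance = 1 / curvature
    h = 1e-4
    curvature = (neg_log_posterior(mu_map + h) - 2 * neg_log_posterior(mu_map)
                 + neg_log_posterior(mu_map - h)) / h**2
    laplace_std = np.sqrt(1 / curvature)

    print(f"Laplace approximation: N({mu_map:.4f}, {laplace_std:.4f}²)")

laplace_demo()

Because this particular posterior is exactly Gaussian, the Laplace approximation matches the analytical result; for skewed or multimodal posteriors it can be far less accurate.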

When to Use Bayesian Methods

Use Bayesian inference when:

  • You have prior knowledge worth incorporating
  • Uncertainty quantification is important
  • Data is limited
  • You need calibrated probabilities
  • Model comparison or selection is needed

Use MLE/frequentist methods when:

  • You have lots of data (priors wash out)
  • Computational resources are limited
  • You need simple, fast solutions
  • Interpretability of point estimates is sufficient

Key Takeaways

  • Bayesian inference treats parameters as random variables with probability distributions
  • Bayes' theorem updates prior beliefs to posterior beliefs given data
  • Priors encode domain knowledge; they can help or hurt depending on quality
  • Conjugate priors yield tractable posterior computations
  • MAP estimation is the posterior mode; equivalent to regularized MLE
  • Posterior predictive distributions incorporate parameter uncertainty into predictions
  • MCMC and variational inference enable Bayesian inference for complex models
  • Bayesian methods excel at uncertainty quantification and working with limited data