Advanced/Expert · 120 min read

Chapter 20: Diffusion Models

DDPM, latent diffusion, conditioning, and flow matching.

Libraries covered: PyTorch, Diffusers

Learning Objectives

["Understand diffusion process", "Apply guidance techniques", "Use Diffusers library"]


20.1 Image Representation and Processing (Beginner)

Computer vision begins with understanding how machines perceive and manipulate visual information. Unlike human vision, which effortlessly interprets complex scenes through biological neural networks refined over millions of years of evolution, computer vision systems must work with discrete numerical representations of continuous visual phenomena. This foundational understanding of image representation forms the bedrock upon which all modern computer vision algorithms, from classical image processing to deep learning, are constructed.

Digital Image Fundamentals

A digital image is fundamentally a discrete sampling of a continuous visual signal, represented as a multidimensional array of numerical values called pixels (picture elements). Each pixel encodes the light intensity at a specific spatial location, and the collection of all pixels forms a grid that approximates the original scene. The resolution of an image, expressed as width × height, determines the level of spatial detail captured, while the bit depth determines how many distinct intensity levels each pixel can represent.

For grayscale images, each pixel contains a single value representing luminance, typically ranging from 0 (black) to 255 (white) in 8-bit representations. Color images extend this concept by using multiple channels, with each channel representing a different component of the color spectrum. The most common representation uses three channels for Red, Green, and Blue (RGB), where each pixel is described by a triplet of values that, when combined, produce the perceived color through additive color mixing.

PYTHON
import numpy as np
import torch
import torchvision.transforms as transforms
from PIL import Image
import matplotlib.pyplot as plt

# Creating a simple grayscale image from scratch
grayscale_image = np.zeros((100, 100), dtype=np.uint8)
grayscale_image[25:75, 25:75] = 255  # White square in center

# Creating an RGB image with different colored regions
rgb_image = np.zeros((100, 100, 3), dtype=np.uint8)
rgb_image[0:50, 0:50] = [255, 0, 0]    # Red quadrant (top-left)
rgb_image[0:50, 50:100] = [0, 255, 0]   # Green quadrant (top-right)
rgb_image[50:100, 0:50] = [0, 0, 255]   # Blue quadrant (bottom-left)
rgb_image[50:100, 50:100] = [255, 255, 0]  # Yellow quadrant (bottom-right)

# Understanding image dimensions and data types
print(f"Grayscale shape: {grayscale_image.shape}")  # (height, width)
print(f"RGB shape: {rgb_image.shape}")  # (height, width, channels)
print(f"Data type: {rgb_image.dtype}")
print(f"Value range: [{rgb_image.min()}, {rgb_image.max()}]")

# Converting between NumPy arrays and PIL Images
pil_image = Image.fromarray(rgb_image)
back_to_numpy = np.array(pil_image)

# PyTorch expects (C, H, W) format, not (H, W, C)
tensor_image = torch.from_numpy(rgb_image).permute(2, 0, 1).float() / 255.0
print(f"PyTorch tensor shape: {tensor_image.shape}")  # (channels, height, width)

The distinction between image formats is crucial for deep learning. NumPy and PIL use the convention (Height, Width, Channels), which matches how images are typically stored in files and displayed. However, PyTorch and most deep learning frameworks expect (Channels, Height, Width) ordering, which aligns better with how convolutional operations are implemented for computational efficiency. Understanding these conventions prevents subtle bugs that can be difficult to diagnose.
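
As a small illustration of these conventions, the helpers below convert a single image between the (H, W, C) layout used by NumPy/PIL and the (C, H, W) layout PyTorch expects; the function names are illustrative, not part of any library.

PYTHON
import numpy as np
import torch

def to_tensor_layout(image_hwc: np.ndarray) -> torch.Tensor:
    """(H, W, C) uint8 array -> (C, H, W) float tensor in [0, 1]."""
    return torch.from_numpy(image_hwc).permute(2, 0, 1).float() / 255.0

def to_display_layout(image_chw: torch.Tensor) -> np.ndarray:
    """(C, H, W) float tensor in [0, 1] -> (H, W, C) uint8 array for PIL/matplotlib."""
    array = (image_chw.clamp(0, 1) * 255).round().byte()
    return array.permute(1, 2, 0).numpy()

# Round trip reproduces the original pixels exactly
original = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
restored = to_display_layout(to_tensor_layout(original))
assert np.array_equal(original, restored)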

Color Spaces and Transformations

While RGB is the dominant color space for display and storage, alternative color spaces often prove more useful for specific computer vision tasks. The choice of color space can significantly impact algorithm performance, as different representations emphasize different aspects of visual information.

The HSV (Hue, Saturation, Value) color space separates chromatic content from intensity information. Hue represents the pure color as an angle on the color wheel (0-360 degrees), saturation measures the purity or intensity of the color (0-100%), and value indicates the brightness (0-100%). This separation makes HSV particularly useful for color-based segmentation, as objects can be identified by their hue regardless of lighting conditions that primarily affect the value channel.

PYTHON
import cv2
import numpy as np
import torch

def explore_color_spaces(image_rgb):
    """
    Demonstrate conversion between different color spaces
    and their properties for computer vision tasks.
    """
    # Ensure image is in correct format for OpenCV (BGR)
    image_bgr = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2BGR)

    # Convert to various color spaces
    image_hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    image_lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    image_gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    image_ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)

    # HSV allows easy color-based filtering
    # Example: Create mask for red objects
    lower_red1 = np.array([0, 100, 100])
    upper_red1 = np.array([10, 255, 255])
    lower_red2 = np.array([160, 100, 100])
    upper_red2 = np.array([180, 255, 255])

    mask1 = cv2.inRange(image_hsv, lower_red1, upper_red1)
    mask2 = cv2.inRange(image_hsv, lower_red2, upper_red2)
    red_mask = mask1 | mask2

    # LAB color space: L=lightness, A=green-red, B=blue-yellow
    # Useful for perceptual color differences
    L, A, B = cv2.split(image_lab)

    return {
        'hsv': image_hsv,
        'lab': image_lab,
        'gray': image_gray,
        'ycrcb': image_ycrcb,
        'red_mask': red_mask
    }

# Color space conversion for neural network preprocessing
class ColorSpaceTransform:
    """Custom transform for PyTorch that handles color space conversion."""

    def __init__(self, target_space='rgb'):
        self.target_space = target_space

    def __call__(self, image):
        if isinstance(image, torch.Tensor):
            # Assume (C, H, W) format, convert to numpy (H, W, C)
            image_np = image.permute(1, 2, 0).numpy()
            image_np = (image_np * 255).astype(np.uint8)
        else:
            image_np = np.array(image)

        if self.target_space == 'gray':
            result = cv2.cvtColor(image_np, cv2.COLOR_RGB2GRAY)
            result = np.expand_dims(result, axis=-1)
        elif self.target_space == 'hsv':
            result = cv2.cvtColor(image_np, cv2.COLOR_RGB2HSV)
        elif self.target_space == 'lab':
            result = cv2.cvtColor(image_np, cv2.COLOR_RGB2LAB)
        else:
            result = image_np

        # Convert back to tensor (C, H, W)
        tensor = torch.from_numpy(result).float() / 255.0
        if tensor.dim() == 2:
            tensor = tensor.unsqueeze(0)
        else:
            tensor = tensor.permute(2, 0, 1)
        return tensor

The LAB color space, designed to approximate human vision, represents colors in terms of perceptual lightness (L) and two chromatic components (A for green-red and B for blue-yellow). This space is particularly valuable when computing color differences, as Euclidean distance in LAB space correlates well with perceived color similarity, unlike RGB where equal numerical differences may not appear equally different to human observers.
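
As a quick sketch of that point (assuming OpenCV's uint8 LAB encoding, where L is rescaled to 0-255 and A/B are offset by 128), the snippet below compares the Euclidean distance between two colors measured in RGB and in LAB.

PYTHON
import cv2
import numpy as np

def lab_distance(rgb1, rgb2):
    """Approximate perceptual distance: Euclidean distance in OpenCV's LAB space."""
    pixels = np.array([[rgb1, rgb2]], dtype=np.uint8)  # shape (1, 2, 3)
    lab = cv2.cvtColor(pixels, cv2.COLOR_RGB2LAB).astype(np.float32)
    return float(np.linalg.norm(lab[0, 0] - lab[0, 1]))

def rgb_distance(rgb1, rgb2):
    """Naive Euclidean distance directly in RGB space."""
    return float(np.linalg.norm(np.float32(rgb1) - np.float32(rgb2)))

# Two color pairs with identical RGB distance can have different LAB distances,
# reflecting that equal RGB steps are not perceived as equal color changes
dark_pair = ((20, 20, 20), (60, 20, 20))            # shift applied to a dark color
bright_pair = ((200, 200, 200), (240, 200, 200))    # same shift applied to a light color
for name, (c1, c2) in [("dark", dark_pair), ("bright", bright_pair)]:
    print(f"{name}: RGB distance={rgb_distance(c1, c2):.1f}, "
          f"LAB distance={lab_distance(c1, c2):.1f}")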

Convolution and Filtering Operations

Convolution is the fundamental operation underlying both classical image processing and modern convolutional neural networks. In image processing, convolution applies a small matrix called a kernel or filter to every position in an image, computing a weighted sum of the pixel values covered by the kernel. This operation enables edge detection, blurring, sharpening, and countless other transformations depending on the kernel values.

The mathematical definition of 2D convolution for a kernel $K$ applied to an image $I$ at position $(i, j)$ is:

$$(I * K)[i, j] = \sum_{m} \sum_{n} I[i-m, j-n] \cdot K[m, n]$$

In practice, deep learning frameworks implement cross-correlation rather than true convolution (which would require flipping the kernel), but the distinction rarely matters since learned kernels can adapt to either convention.
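
A minimal check of that claim: flipping a kernel along both spatial axes turns PyTorch's cross-correlation into true convolution. For a symmetric kernel such as the Gaussian blur the two operations coincide; for an antisymmetric kernel such as Sobel they differ only by a sign.

PYTHON
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 8, 8)
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
gaussian = torch.tensor([[1., 2., 1.],
                         [2., 4., 2.],
                         [1., 2., 1.]]).view(1, 1, 3, 3) / 16

def true_convolution(img, kernel):
    """Convolution per the formula above = cross-correlation with a flipped kernel."""
    return F.conv2d(img, torch.flip(kernel, dims=[-2, -1]), padding=1)

# Symmetric kernel: cross-correlation and convolution give identical outputs
assert torch.allclose(F.conv2d(image, gaussian, padding=1),
                      true_convolution(image, gaussian))

# Antisymmetric kernel: the two results differ only by a sign flip
assert torch.allclose(F.conv2d(image, sobel_x, padding=1),
                      -true_convolution(image, sobel_x))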

PYTHON
import torch
import torch.nn.functional as F
import numpy as np

def demonstrate_convolution_operations():
    """
    Show how convolution kernels transform images
    and their relationship to neural network convolutions.
    """
    # Create a simple test image
    test_image = torch.zeros(1, 1, 8, 8)
    test_image[0, 0, 2:6, 2:6] = 1.0  # White square

    # Define classic image processing kernels
    kernels = {
        'identity': torch.tensor([[0, 0, 0],
                                   [0, 1, 0],
                                   [0, 0, 0]], dtype=torch.float32),

        'edge_detect': torch.tensor([[-1, -1, -1],
                                      [-1,  8, -1],
                                      [-1, -1, -1]], dtype=torch.float32),

        'sobel_x': torch.tensor([[-1, 0, 1],
                                  [-2, 0, 2],
                                  [-1, 0, 1]], dtype=torch.float32),

        'sobel_y': torch.tensor([[-1, -2, -1],
                                  [ 0,  0,  0],
                                  [ 1,  2,  1]], dtype=torch.float32),

        'gaussian_blur': torch.tensor([[1, 2, 1],
                                        [2, 4, 2],
                                        [1, 2, 1]], dtype=torch.float32) / 16,

        'sharpen': torch.tensor([[ 0, -1,  0],
                                  [-1,  5, -1],
                                  [ 0, -1,  0]], dtype=torch.float32)
    }

    # Apply each kernel using PyTorch convolution
    results = {}
    for name, kernel in kernels.items():
        # Reshape kernel to (out_channels, in_channels, H, W)
        kernel_4d = kernel.view(1, 1, 3, 3)
        # Apply convolution with same padding
        output = F.conv2d(test_image, kernel_4d, padding=1)
        results[name] = output
        print(f"{name}: output range [{output.min():.3f}, {output.max():.3f}]")

    return results

def edge_detection_pipeline(image_tensor):
    """
    Complete edge detection using Sobel operators,
    demonstrating gradient magnitude and direction computation.
    """
    # Sobel kernels for x and y gradients
    sobel_x = torch.tensor([[-1, 0, 1],
                            [-2, 0, 2],
                            [-1, 0, 1]], dtype=torch.float32).view(1, 1, 3, 3)

    sobel_y = torch.tensor([[-1, -2, -1],
                            [ 0,  0,  0],
                            [ 1,  2,  1]], dtype=torch.float32).view(1, 1, 3, 3)

    # Compute gradients
    if image_tensor.dim() == 3:
        image_tensor = image_tensor.unsqueeze(0)  # Add batch dimension

    # Convert to grayscale if RGB
    if image_tensor.shape[1] == 3:
        gray = 0.299 * image_tensor[:, 0:1] + 0.587 * image_tensor[:, 1:2] + 0.114 * image_tensor[:, 2:3]
    else:
        gray = image_tensor

    grad_x = F.conv2d(gray, sobel_x, padding=1)
    grad_y = F.conv2d(gray, sobel_y, padding=1)

    # Gradient magnitude: sqrt(Gx^2 + Gy^2)
    magnitude = torch.sqrt(grad_x ** 2 + grad_y ** 2)

    # Gradient direction: atan2(Gy, Gx)
    direction = torch.atan2(grad_y, grad_x)

    return magnitude, direction, grad_x, grad_y

# Demonstrate the relationship between convolution and pooling
class ConvolutionalFeatureExtractor(torch.nn.Module):
    """
    Simple feature extractor showing how convolution
    and pooling work together in neural networks.
    """
    def __init__(self):
        super().__init__()
        # Learnable convolution (unlike fixed kernels above)
        self.conv1 = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = torch.nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = torch.nn.MaxPool2d(2, 2)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        # First conv block: (B, 3, H, W) -> (B, 16, H/2, W/2)
        x = self.relu(self.conv1(x))
        x = self.pool(x)

        # Second conv block: (B, 16, H/2, W/2) -> (B, 32, H/4, W/4)
        x = self.relu(self.conv2(x))
        x = self.pool(x)

        return x

Understanding these classical operations provides crucial intuition for deep learning: the early layers of trained CNNs often learn filters resembling Sobel operators and Gabor filters, while deeper layers compose these simple features into increasingly abstract representations. The key insight is that neural networks learn these kernels from data rather than requiring manual design.
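
To see this for yourself, the sketch below inspects the first convolutional layer of a pretrained ResNet-18 from torchvision (this assumes torchvision >= 0.13 for the weights argument and network access to download the checkpoint); many of the 64 filters resemble oriented edge detectors and color blobs.

PYTHON
import torchvision.models as models
import matplotlib.pyplot as plt

# Load an ImageNet-pretrained ResNet-18 and grab its first-layer filters
resnet = models.resnet18(weights="DEFAULT")
filters = resnet.conv1.weight.detach().clone()   # shape (64, 3, 7, 7)

# Normalize each filter independently to [0, 1] for display
f_min = filters.amin(dim=(1, 2, 3), keepdim=True)
f_max = filters.amax(dim=(1, 2, 3), keepdim=True)
filters = (filters - f_min) / (f_max - f_min + 1e-8)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, kernel in zip(axes.flat, filters):
    ax.imshow(kernel.permute(1, 2, 0).numpy())   # (3, 7, 7) -> (7, 7, 3)
    ax.axis('off')
plt.suptitle("ResNet-18 first-layer filters: edge- and color-selective kernels")
plt.show()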

Image Preprocessing for Deep Learning

Proper preprocessing is essential for training effective computer vision models. Raw images exhibit enormous variation in size, scale, color distribution, and lighting conditions. Preprocessing normalizes these variations, creating consistent inputs that enable neural networks to learn robust features rather than memorizing dataset-specific artifacts.

PYTHON
import torch
import torchvision.transforms as T
from torchvision.transforms import functional as TF
import numpy as np
from PIL import Image

def create_training_transforms(image_size=224):
    """
    Standard preprocessing pipeline for training vision models,
    including data augmentation for improved generalization.
    """
    train_transform = T.Compose([
        # Resize to consistent size (with some random cropping)
        T.RandomResizedCrop(image_size, scale=(0.8, 1.0)),

        # Geometric augmentations
        T.RandomHorizontalFlip(p=0.5),
        T.RandomRotation(degrees=15),

        # Color augmentations
        T.ColorJitter(
            brightness=0.2,
            contrast=0.2,
            saturation=0.2,
            hue=0.1
        ),

        T.ToTensor(),  # Converts to (C, H, W) and scales to [0, 1]

        # Normalize using ImageNet statistics (common standard)
        T.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        ),

        # Random erasing for regularization (operates on tensors, so it comes after ToTensor)
        T.RandomErasing(p=0.1)
    ])

    # Validation/test transforms (no augmentation)
    val_transform = T.Compose([
        T.Resize(int(image_size * 1.14)),  # Slightly larger
        T.CenterCrop(image_size),           # Then crop center
        T.ToTensor(),
        T.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    return train_transform, val_transform

class AdvancedPreprocessing:
    """
    Custom preprocessing with techniques beyond standard transforms.
    """

    @staticmethod
    def histogram_equalization(image_tensor):
        """
        Enhance contrast using histogram equalization.
        Useful for images with poor lighting.
        """
        # Convert to numpy for processing
        img_np = (image_tensor.permute(1, 2, 0).numpy() * 255).astype(np.uint8)

        # Apply CLAHE (Contrast Limited Adaptive Histogram Equalization)
        import cv2
        lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)
        l, a, b = cv2.split(lab)

        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        l_enhanced = clahe.apply(l)

        enhanced_lab = cv2.merge([l_enhanced, a, b])
        enhanced_rgb = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2RGB)

        return torch.from_numpy(enhanced_rgb).permute(2, 0, 1).float() / 255.0

    @staticmethod
    def mixup(image1, label1, image2, label2, alpha=0.2):
        """
        MixUp augmentation: blend two images and their labels.
        Improves model robustness and calibration.
        """
        lambda_param = np.random.beta(alpha, alpha)
        mixed_image = lambda_param * image1 + (1 - lambda_param) * image2
        mixed_label = lambda_param * label1 + (1 - lambda_param) * label2
        return mixed_image, mixed_label

    @staticmethod
    def cutmix(image1, label1, image2, label2, alpha=1.0):
        """
        CutMix augmentation: replace region with patch from another image.
        Helps model focus on multiple discriminative regions.
        """
        lambda_param = np.random.beta(alpha, alpha)

        _, H, W = image1.shape

        # Calculate cut dimensions
        cut_ratio = np.sqrt(1 - lambda_param)
        cut_h = int(H * cut_ratio)
        cut_w = int(W * cut_ratio)

        # Random center point
        cy = np.random.randint(H)
        cx = np.random.randint(W)

        # Bounding box
        y1 = np.clip(cy - cut_h // 2, 0, H)
        y2 = np.clip(cy + cut_h // 2, 0, H)
        x1 = np.clip(cx - cut_w // 2, 0, W)
        x2 = np.clip(cx + cut_w // 2, 0, W)

        # Create mixed image
        mixed_image = image1.clone()
        mixed_image[:, y1:y2, x1:x2] = image2[:, y1:y2, x1:x2]

        # Adjust lambda based on actual cut area
        actual_lambda = 1 - (y2 - y1) * (x2 - x1) / (H * W)
        mixed_label = actual_lambda * label1 + (1 - actual_lambda) * label2

        return mixed_image, mixed_label

# Demonstrate normalization importance
def visualize_normalization_effect():
    """
    Show why proper normalization matters for neural networks.
    """
    # Simulated image batch with different intensity ranges
    batch = torch.rand(4, 3, 32, 32)
    batch[0] *= 0.3  # Very dark image
    batch[1] = batch[1] * 0.5 + 0.5  # Mid-range
    batch[2] = batch[2] * 0.8 + 0.2  # Bright image (values shifted toward 1.0)
    batch[3] = batch[3]  # Normal range

    # Before normalization: inconsistent statistics
    print("Before normalization:")
    for i in range(4):
        print(f"  Image {i}: mean={batch[i].mean():.3f}, std={batch[i].std():.3f}")

    # After standard normalization
    normalize = T.Normalize(mean=[0.485, 0.456, 0.406],
                           std=[0.229, 0.224, 0.225])
    normalized = torch.stack([normalize(img) for img in batch])

    print("\nAfter ImageNet normalization:")
    for i in range(4):
        print(f"  Image {i}: mean={normalized[i].mean():.3f}, std={normalized[i].std():.3f}")

The normalization using ImageNet statistics (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) has become a de facto standard, even for models trained on different datasets. This convention emerged because many pretrained models were initially trained on ImageNet, and maintaining consistent statistics enables effective transfer learning. When training from scratch on significantly different data, computing dataset-specific statistics may yield better results.
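
When you do want dataset-specific statistics, a simple recipe is to stream the training set once and accumulate per-channel sums, as in the sketch below (it assumes a dataset yielding unnormalized tensors in [0, 1], i.e. with only ToTensor applied).

PYTHON
import torch
from torch.utils.data import DataLoader

def compute_channel_stats(dataset, batch_size=64):
    """Estimate per-channel mean and std for a dataset of (3, H, W) tensors in [0, 1]."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=2)
    channel_sum = torch.zeros(3)
    channel_sq_sum = torch.zeros(3)
    num_pixels = 0

    for images, _ in loader:                       # images: (B, 3, H, W)
        b, _, h, w = images.shape
        num_pixels += b * h * w
        channel_sum += images.sum(dim=(0, 2, 3))
        channel_sq_sum += (images ** 2).sum(dim=(0, 2, 3))

    mean = channel_sum / num_pixels
    std = (channel_sq_sum / num_pixels - mean ** 2).sqrt()
    return mean, std

# Usage sketch (my_dataset is a placeholder for your own dataset object):
# mean, std = compute_channel_stats(my_dataset)
# normalize = T.Normalize(mean=mean.tolist(), std=std.tolist())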

Image Data Loading and Batching

Efficient data loading is critical for training deep learning models, as GPU computation often outpaces data preparation. PyTorch's DataLoader architecture enables parallel data loading, preprocessing, and batching, keeping the GPU fully utilized during training.

PYTHON
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets
import os
from PIL import Image
from pathlib import Path

class CustomImageDataset(Dataset):
    """
    Custom dataset class demonstrating best practices
    for loading and preprocessing images.
    """

    def __init__(self, image_dir, transform=None, file_extensions=('.jpg', '.png', '.jpeg')):
        self.image_dir = Path(image_dir)
        self.transform = transform

        # Find all images recursively
        self.image_paths = []
        self.labels = []

        # Assume directory structure: image_dir/class_name/image.jpg
        class_dirs = sorted(d for d in self.image_dir.iterdir() if d.is_dir())
        for class_idx, class_dir in enumerate(class_dirs):
            for img_path in class_dir.iterdir():
                if img_path.suffix.lower() in file_extensions:
                    self.image_paths.append(img_path)
                    self.labels.append(class_idx)

        self.classes = [d.name for d in class_dirs]
        print(f"Found {len(self.image_paths)} images in {len(self.classes)} classes")

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]

        # Load image and convert to RGB (handles grayscale/RGBA)
        image = Image.open(img_path).convert('RGB')

        if self.transform:
            image = self.transform(image)

        return image, label

def create_data_loaders(train_dir, val_dir, batch_size=32, num_workers=4):
    """
    Create optimized data loaders for training and validation.
    """
    train_transform, val_transform = create_training_transforms()

    train_dataset = CustomImageDataset(train_dir, transform=train_transform)
    val_dataset = CustomImageDataset(val_dir, transform=val_transform)

    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,           # Randomize order each epoch
        num_workers=num_workers, # Parallel data loading
        pin_memory=True,        # Faster GPU transfer
        drop_last=True,         # Consistent batch size
        prefetch_factor=2       # Prefetch batches
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,          # Keep order for reproducibility
        num_workers=num_workers,
        pin_memory=True
    )

    return train_loader, val_loader

# Using torchvision's built-in datasets
def load_standard_datasets():
    """
    Load common benchmark datasets with appropriate transforms.
    """
    transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

    # CIFAR-10: 60,000 32x32 color images in 10 classes
    cifar10_train = datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform
    )

    # MNIST: 70,000 28x28 grayscale handwritten digits
    mnist_transform = T.Compose([
        T.ToTensor(),
        T.Normalize((0.1307,), (0.3081,))  # MNIST-specific stats
    ])
    mnist_train = datasets.MNIST(
        root='./data', train=True, download=True, transform=mnist_transform
    )

    return cifar10_train, mnist_train

Key Takeaways

Image representation and processing form the essential foundation for all computer vision work. Digital images exist as multidimensional arrays where spatial organization encodes visual structure and numerical values encode color and intensity information. Different color spaces offer distinct advantages for various tasks, with RGB dominating storage and display while HSV and LAB provide better separation of color and intensity for processing algorithms. Convolution operations, whether hand-designed filters like Sobel operators or learned kernels in neural networks, transform images by computing local weighted sums that detect patterns and features. Proper preprocessing through resizing, normalization, and augmentation creates consistent inputs that enable models to learn robust representations invariant to irrelevant variations. Efficient data loading through parallel processing and GPU memory optimization is crucial for training performance. These fundamental concepts appear throughout modern computer vision, from classical algorithms to state-of-the-art deep learning architectures.

20.2 CNN Architectures for Vision (Intermediate)

The history of convolutional neural network architectures charts the evolution of deep learning from theoretical promise to practical dominance in computer vision. Each landmark architecture introduced novel concepts that addressed fundamental challenges in training deep networks, improving feature extraction, or balancing accuracy with computational efficiency. Understanding these architectures provides not only historical context but also practical design principles that inform modern network construction.

The LeNet Era: Foundations of Convolutional Networks

LeNet-5, developed by Yann LeCun and colleagues in 1998, established the foundational architecture for convolutional neural networks. Designed for handwritten digit recognition on the MNIST dataset, LeNet demonstrated that networks could learn hierarchical feature representations directly from pixels without hand-engineered features. While modest by modern standards, its core architectural pattern of alternating convolutional and pooling layers followed by fully connected layers remains the template for modern designs.

PYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNet5(nn.Module):
    """
    Original LeNet-5 architecture (1998).
    Input: 32x32 grayscale images.
    Notable features: First successful deep CNN for image classification.
    """

    def __init__(self, num_classes=10):
        super(LeNet5, self).__init__()
        # Layer C1: 6 feature maps of size 28x28
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5)
        # Layer S2: Subsampling (pooling) to 14x14
        self.pool1 = nn.AvgPool2d(kernel_size=2, stride=2)
        # Layer C3: 16 feature maps of size 10x10
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        # Layer S4: Subsampling to 5x5
        self.pool2 = nn.AvgPool2d(kernel_size=2, stride=2)
        # Layer C5: 120 feature maps of size 1x1
        self.conv3 = nn.Conv2d(16, 120, kernel_size=5)
        # Fully connected layers
        self.fc1 = nn.Linear(120, 84)
        self.fc2 = nn.Linear(84, num_classes)

    def forward(self, x):
        # Conv -> Activation -> Pool pattern
        x = torch.tanh(self.conv1(x))  # Original used tanh
        x = self.pool1(x)
        x = torch.tanh(self.conv2(x))
        x = self.pool2(x)
        x = torch.tanh(self.conv3(x))
        # Flatten and fully connected
        x = x.view(x.size(0), -1)
        x = torch.tanh(self.fc1(x))
        x = self.fc2(x)
        return x

# Demonstrate feature map dimensions
def trace_lenet_dimensions():
    model = LeNet5()
    x = torch.randn(1, 1, 32, 32)  # Batch of 1, 1 channel, 32x32

    print("LeNet-5 Feature Map Dimensions:")
    print(f"Input: {x.shape}")

    x = model.conv1(x)
    print(f"After conv1 (5x5, 6 filters): {x.shape}")

    x = model.pool1(x)
    print(f"After pool1 (2x2): {x.shape}")

    x = model.conv2(x)
    print(f"After conv2 (5x5, 16 filters): {x.shape}")

    x = model.pool2(x)
    print(f"After pool2 (2x2): {x.shape}")

    x = model.conv3(x)
    print(f"After conv3 (5x5, 120 filters): {x.shape}")

trace_lenet_dimensions()

The key insight of LeNet was that spatial hierarchies matter: early layers detect simple patterns like edges and corners, while deeper layers compose these into increasingly complex features. This hierarchical feature learning, combined with weight sharing through convolution and translation invariance through pooling, remains the core principle of all CNN architectures.

AlexNet: The Deep Learning Revolution

AlexNet, which won the 2012 ImageNet competition by a significant margin, marked the beginning of the deep learning revolution in computer vision. Its success came from combining several innovations: deeper networks enabled by GPU training, ReLU activations for faster training, dropout for regularization, and data augmentation for improved generalization. The network demonstrated that scale—more data, more parameters, more compute—could achieve previously impossible results.

PYTHON
class AlexNet(nn.Module):
    """
    AlexNet architecture (2012).
    Input: 227x227 RGB images.
    Key innovations: ReLU, dropout, GPU training, data augmentation.
    """

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()

        self.features = nn.Sequential(
            # Layer 1: 96 kernels of size 11x11, stride 4
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),

            # Layer 2: 256 kernels of size 5x5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),

            # Layer 3: 384 kernels of size 3x3
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            # Layer 4: 384 kernels of size 3x3
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            # Layer 5: 256 kernels of size 3x3
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )

        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x

# AlexNet's impact was in demonstrating scale
def count_parameters(model):
    """Count trainable parameters in a model."""
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total

model = AlexNet()
print(f"AlexNet parameters: {count_parameters(model):,}")  # ~62 million

AlexNet's use of ReLU activations instead of tanh or sigmoid was particularly significant. ReLU's simple formulation ($f(x) = \max(0, x)$) greatly mitigated the vanishing gradient problem that had plagued earlier deep networks, since saturating activations like tanh and sigmoid have near-zero derivatives over much of their input range, enabling the training of networks with many more layers. This seemingly minor change was transformative for the field.
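
To make the saturation argument concrete, the short check below uses autograd to evaluate the derivatives of sigmoid, tanh, and ReLU over a range of inputs; the numbers show why stacks of saturating activations attenuate gradients while ReLU passes them through unchanged for positive inputs.

PYTHON
import torch

x = torch.linspace(-6, 6, 13, requires_grad=True)

# Derivative of each activation at the sampled points
for name, fn in [("sigmoid", torch.sigmoid), ("tanh", torch.tanh), ("relu", torch.relu)]:
    y = fn(x)
    grad, = torch.autograd.grad(y.sum(), x)
    print(f"{name:7s} derivative: " + " ".join(f"{g:.2f}" for g in grad.tolist()))

# sigmoid's derivative never exceeds 0.25 and collapses toward 0 once |x| > 4,
# so a product of many such per-layer factors shrinks exponentially with depth.
# ReLU's derivative is exactly 1 for every positive input, so active units pass
# the gradient backward unattenuated regardless of network depth.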

VGGNet: Simplicity Through Depth

VGGNet (2014) from the Visual Geometry Group at Oxford demonstrated that network depth was crucial for representational power. By using only 3×3 convolutional filters throughout—the smallest size that captures the notion of left/right, up/down, center—VGGNet showed that stacking small filters achieved the same receptive field as larger filters while introducing more non-linearities and requiring fewer parameters.

PYTHON
class VGGBlock(nn.Module):
    """A VGG-style block: multiple 3x3 convolutions followed by pooling."""

    def __init__(self, in_channels, out_channels, num_convs):
        super(VGGBlock, self).__init__()
        layers = []
        for i in range(num_convs):
            layers.append(
                nn.Conv2d(
                    in_channels if i == 0 else out_channels,
                    out_channels,
                    kernel_size=3,
                    padding=1
                )
            )
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)


class VGG16(nn.Module):
    """
    VGG-16 architecture (2014).
    Input: 224x224 RGB images.
    Key insight: Depth matters; use only 3x3 convolutions.
    """

    def __init__(self, num_classes=1000):
        super(VGG16, self).__init__()

        # Configuration: (out_channels, num_convs)
        self.features = nn.Sequential(
            VGGBlock(3, 64, 2),      # 224 -> 112
            VGGBlock(64, 128, 2),    # 112 -> 56
            VGGBlock(128, 256, 3),   # 56 -> 28
            VGGBlock(256, 512, 3),   # 28 -> 14
            VGGBlock(512, 512, 3),   # 14 -> 7
        )

        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# Why 3x3 convolutions? Two 3x3 convs have same receptive field as one 5x5
# but with more non-linearity and fewer parameters
def compare_receptive_fields():
    """
    Demonstrate that stacked small filters
    achieve large receptive fields efficiently.
    """
    # Single 5x5 conv: 5x5 = 25 parameters (per channel pair)
    # Two 3x3 convs: 3x3 + 3x3 = 18 parameters (per channel pair)

    # Receptive field calculation
    # After one 3x3: 3x3 region
    # After two 3x3: 5x5 region (each position sees 3x3 of previous layer)
    # After three 3x3: 7x7 region

    print("Receptive Field Comparison:")
    print("One 5x5 conv: RF=5x5, params=25 per channel")
    print("Two 3x3 conv: RF=5x5, params=18 per channel, 2 ReLUs")
    print("Three 3x3 conv: RF=7x7, params=27 per channel, 3 ReLUs")
    print("One 7x7 conv: RF=7x7, params=49 per channel, 1 ReLU")

compare_receptive_fields()

VGGNet's elegant simplicity made it highly influential and remains popular for feature extraction. However, with 138 million parameters (mostly in fully connected layers), it demonstrated the limitations of simply adding more layers: training becomes increasingly difficult, and returns diminish.
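
A quick check of that claim, reusing the VGG16 class defined above: count parameters separately in the convolutional features and the fully connected classifier.

PYTHON
model = VGG16()
feature_params = sum(p.numel() for p in model.features.parameters())
classifier_params = sum(p.numel() for p in model.classifier.parameters())
total = feature_params + classifier_params

print(f"Convolutional features:     {feature_params / 1e6:.1f}M parameters")
print(f"Fully connected classifier: {classifier_params / 1e6:.1f}M parameters")
print(f"Classifier share of total:  {100 * classifier_params / total:.0f}%")
# The three FC layers hold roughly 120M of the ~138M parameters, which is why
# later architectures replace them with global average pooling.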

GoogLeNet/Inception: Width and Efficiency

GoogLeNet (2014) introduced the Inception module, which processes input through multiple parallel pathways with different filter sizes, then concatenates the results. This multi-scale processing captures features at various spatial resolutions simultaneously. Crucially, GoogLeNet used 1×1 convolutions for dimensionality reduction, dramatically reducing computational cost while maintaining accuracy.

PYTHON
class InceptionModule(nn.Module):
    """
    Inception module from GoogLeNet.
    Processes input through multiple parallel pathways.
    """

    def __init__(self, in_channels, ch1x1, ch3x3_reduce, ch3x3,
                 ch5x5_reduce, ch5x5, pool_proj):
        super(InceptionModule, self).__init__()

        # 1x1 convolution branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True)
        )

        # 1x1 -> 3x3 convolution branch
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3_reduce, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3_reduce, ch3x3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )

        # 1x1 -> 5x5 convolution branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5_reduce, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5_reduce, ch5x5, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )

        # Max pool -> 1x1 convolution branch
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        branch1_out = self.branch1(x)
        branch2_out = self.branch2(x)
        branch3_out = self.branch3(x)
        branch4_out = self.branch4(x)

        # Concatenate along channel dimension
        return torch.cat([branch1_out, branch2_out, branch3_out, branch4_out], dim=1)


def inception_dimensionality_reduction():
    """
    Demonstrate how 1x1 convolutions reduce computation.
    """
    # Without 1x1 reduction:
    # 256 channels -> 5x5 conv -> 256 channels
    # Operations: 256 * 5 * 5 * 256 = 1,638,400 per spatial position

    # With 1x1 reduction:
    # 256 channels -> 1x1 conv -> 64 channels -> 5x5 conv -> 256 channels
    # Operations: 256 * 1 * 1 * 64 + 64 * 5 * 5 * 256 = 16,384 + 409,600 = 425,984

    print("Computational Savings from 1x1 Convolutions:")
    without_reduction = 256 * 5 * 5 * 256
    with_reduction = (256 * 1 * 1 * 64) + (64 * 5 * 5 * 256)
    savings = (1 - with_reduction / without_reduction) * 100

    print(f"Without 1x1: {without_reduction:,} operations/position")
    print(f"With 1x1 bottleneck: {with_reduction:,} operations/position")
    print(f"Reduction: {savings:.1f}%")

inception_dimensionality_reduction()

The Inception module's design philosophy—processing features at multiple scales and using bottleneck layers for efficiency—influenced many subsequent architectures. Later versions (Inception v2, v3, v4) refined these ideas with batch normalization, factorized convolutions, and grid reduction strategies.
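
One of those refinements, factorized convolutions, replaces an n×n kernel with a 1×n convolution followed by an n×1 convolution. The sketch below compares parameter counts and confirms that the factorized pair preserves the output shape.

PYTHON
import torch
import torch.nn as nn

channels = 64

# Standard 3x3 convolution
full = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

# Factorized version: 1x3 followed by 3x1 (same 3x3 receptive field)
factorized = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1), bias=False),
    nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0), bias=False),
)

full_params = sum(p.numel() for p in full.parameters())
fact_params = sum(p.numel() for p in factorized.parameters())
print(f"3x3 conv:   {full_params:,} parameters")
print(f"1x3 + 3x1:  {fact_params:,} parameters ({100 * fact_params / full_params:.0f}%)")

# Output spatial dimensions are preserved in both cases
x = torch.randn(1, channels, 32, 32)
assert full(x).shape == factorized(x).shape == (1, channels, 32, 32)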

ResNet: Skip Connections and Deep Networks

ResNet (2015) introduced skip connections that allow information to flow directly through the network, bypassing convolutional layers. This simple but profound change enabled training networks of unprecedented depth—up to 152 layers in the original paper, and over 1000 in subsequent experiments. The key insight was reformulating the learning problem: instead of learning a direct mapping $H(x)$, residual blocks learn the residual $F(x) = H(x) - x$, making it easy for layers to learn identity mappings when needed.

PYTHON
class ResidualBlock(nn.Module):
    """
    Basic residual block with skip connection.
    Core building block of ResNet-18 and ResNet-34.
    """

    expansion = 1  # ResNet's final fc layer reads block.expansion; basic blocks keep width

    def __init__(self, in_channels, out_channels, stride=1):
        super(ResidualBlock, self).__init__()

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Skip connection with projection if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)

        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        # Add skip connection
        out += identity
        out = F.relu(out)
        return out


class BottleneckBlock(nn.Module):
    """
    Bottleneck residual block with 1x1-3x3-1x1 structure.
    Used in ResNet-50, ResNet-101, ResNet-152.
    """

    expansion = 4  # Output channels = 4 * base channels

    def __init__(self, in_channels, base_channels, stride=1):
        super(BottleneckBlock, self).__init__()

        out_channels = base_channels * self.expansion

        # 1x1 reduce
        self.conv1 = nn.Conv2d(in_channels, base_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(base_channels)

        # 3x3 convolution
        self.conv2 = nn.Conv2d(base_channels, base_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(base_channels)

        # 1x1 expand
        self.conv3 = nn.Conv2d(base_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                         stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        identity = self.shortcut(x)

        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))

        out += identity
        out = F.relu(out)
        return out


class ResNet(nn.Module):
    """
    Full ResNet implementation supporting various depths.
    """

    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.in_channels = 64

        # Initial convolution
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Residual layers
        self.layer1 = self._make_layer(block, 64, layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Global average pooling and classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, base_channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            if block == ResidualBlock:
                out_channels = base_channels
                layers.append(block(self.in_channels, out_channels, stride))
                self.in_channels = out_channels
            else:
                layers.append(block(self.in_channels, base_channels, stride))
                self.in_channels = base_channels * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x


# Factory functions for different ResNet variants
def resnet18(num_classes=1000):
    return ResNet(ResidualBlock, [2, 2, 2, 2], num_classes)

def resnet34(num_classes=1000):
    return ResNet(ResidualBlock, [3, 4, 6, 3], num_classes)

def resnet50(num_classes=1000):
    return ResNet(BottleneckBlock, [3, 4, 6, 3], num_classes)

def resnet101(num_classes=1000):
    return ResNet(BottleneckBlock, [3, 4, 23, 3], num_classes)

# Compare ResNet variants
for name, factory in [('ResNet-18', resnet18), ('ResNet-34', resnet34),
                      ('ResNet-50', resnet50), ('ResNet-101', resnet101)]:
    model = factory()
    params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {params:,} parameters")

ResNet's skip connections solve the degradation problem where deeper networks paradoxically performed worse than shallower ones. By providing a direct gradient path through the network, skip connections enable efficient training of very deep networks. This architectural innovation became standard in subsequent CNN designs.
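
A deliberately simple toy (not a faithful ResNet) illustrates the gradient highway: apply the same weak layer thirty times, once plainly and once with a skip connection, and compare the gradient that reaches the input. The identity path keeps the backward signal at a useful scale even when the learned transformation contributes little.

PYTHON
import torch
import torch.nn as nn

width = 16
torch.manual_seed(0)

# A single weak layer, reused at every depth for simplicity
layer = nn.Linear(width, width)
with torch.no_grad():
    layer.weight *= 0.1

x = torch.randn(4, width, requires_grad=True)

# Plain stack: the gradient to the input shrinks multiplicatively with depth
h = x
for _ in range(30):
    h = torch.relu(layer(h))
h.sum().backward()
plain_grad = x.grad.norm().item()

# Residual stack: the identity path carries the gradient straight back to the input
x.grad = None
h = x
for _ in range(30):
    h = h + torch.relu(layer(h))
h.sum().backward()
residual_grad = x.grad.norm().item()

print(f"Plain stack:    grad wrt input = {plain_grad:.2e}")
print(f"Residual stack: grad wrt input = {residual_grad:.2e}")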

EfficientNet: Neural Architecture Search and Scaling

EfficientNet (2019) used neural architecture search (NAS) to discover an optimal base architecture, then introduced compound scaling to systematically scale depth, width, and resolution together. This principled approach achieved state-of-the-art accuracy with significantly fewer parameters and computations than previous models.

PYTHON
class SEBlock(nn.Module):
    """
    Squeeze-and-Excitation block for channel attention.
    Key component of EfficientNet's MBConv blocks.
    """

    def __init__(self, channels, reduction=4):
        super(SEBlock, self).__init__()
        reduced = max(1, channels // reduction)
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excitation = nn.Sequential(
            nn.Linear(channels, reduced, bias=False),
            nn.SiLU(),  # Swish activation
            nn.Linear(reduced, channels, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Squeeze: global average pooling
        se = self.squeeze(x).view(b, c)
        # Excitation: FC -> SiLU -> FC -> Sigmoid
        se = self.excitation(se).view(b, c, 1, 1)
        # Scale: multiply original features by channel weights
        return x * se


class MBConvBlock(nn.Module):
    """
    Mobile Inverted Bottleneck Conv block.
    Core building block of EfficientNet.
    """

    def __init__(self, in_channels, out_channels, expand_ratio, stride,
                 kernel_size=3, se_ratio=0.25):
        super(MBConvBlock, self).__init__()
        self.stride = stride
        self.use_residual = (stride == 1 and in_channels == out_channels)

        hidden_dim = in_channels * expand_ratio

        layers = []

        # Expansion phase (if expand_ratio > 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden_dim, 1, bias=False),
                nn.BatchNorm2d(hidden_dim),
                nn.SiLU()
            ])

        # Depthwise convolution
        layers.extend([
            nn.Conv2d(hidden_dim, hidden_dim, kernel_size,
                     stride=stride, padding=kernel_size // 2,
                     groups=hidden_dim, bias=False),
            nn.BatchNorm2d(hidden_dim),
            nn.SiLU()
        ])

        self.conv = nn.Sequential(*layers)

        # Squeeze-and-excitation
        self.se = SEBlock(hidden_dim, int(1 / se_ratio))

        # Projection phase
        self.project = nn.Sequential(
            nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        identity = x

        out = self.conv(x)
        out = self.se(out)
        out = self.project(out)

        if self.use_residual:
            out = out + identity
        return out


class EfficientNetB0(nn.Module):
    """
    EfficientNet-B0 base architecture.
    Discovered via Neural Architecture Search.
    """

    def __init__(self, num_classes=1000):
        super(EfficientNetB0, self).__init__()

        # Stage configuration: [expand_ratio, channels, num_blocks, stride, kernel]
        config = [
            [1, 16, 1, 1, 3],   # Stage 1
            [6, 24, 2, 2, 3],   # Stage 2
            [6, 40, 2, 2, 5],   # Stage 3
            [6, 80, 3, 2, 3],   # Stage 4
            [6, 112, 3, 1, 5],  # Stage 5
            [6, 192, 4, 2, 5],  # Stage 6
            [6, 320, 1, 1, 3],  # Stage 7
        ]

        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.SiLU()
        )

        # Build stages
        stages = []
        in_channels = 32
        for expand_ratio, out_channels, num_blocks, stride, kernel in config:
            for i in range(num_blocks):
                stages.append(MBConvBlock(
                    in_channels,
                    out_channels,
                    expand_ratio,
                    stride if i == 0 else 1,
                    kernel
                ))
                in_channels = out_channels

        self.stages = nn.Sequential(*stages)

        # Head
        self.head = nn.Sequential(
            nn.Conv2d(320, 1280, 1, bias=False),
            nn.BatchNorm2d(1280),
            nn.SiLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(1280, num_classes)
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.stages(x)
        x = self.head(x)
        return x


# EfficientNet compound scaling
def efficientnet_scaling():
    """
    Demonstrate EfficientNet's compound scaling formula.
    """
    # Compound scaling coefficients
    # depth: d = alpha^phi
    # width: w = beta^phi
    # resolution: r = gamma^phi
    # Constraint: alpha * beta^2 * gamma^2 ≈ 2 (doubles FLOPS)

    alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet coefficients

    print("EfficientNet Compound Scaling:")
    print(f"Base coefficients: alpha={alpha}, beta={beta}, gamma={gamma}")
    print(f"Constraint: {alpha * beta**2 * gamma**2:.3f} ≈ 2")
    print()

    for phi, name in enumerate(['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7']):
        depth_mult = alpha ** phi
        width_mult = beta ** phi
        res_mult = gamma ** phi
        base_res = 224
        resolution = int(base_res * res_mult)
        print(f"EfficientNet-{name}: depth={depth_mult:.2f}x, "
              f"width={width_mult:.2f}x, resolution={resolution}")

efficientnet_scaling()

EfficientNet's contribution was demonstrating that scaling efficiency matters as much as raw accuracy. By carefully balancing network dimensions, it achieved better accuracy with fewer resources than ad-hoc scaling approaches.

Modern Architectural Patterns

Contemporary CNN architectures combine insights from all previous designs. Common patterns include residual connections, batch normalization, depthwise separable convolutions for efficiency, attention mechanisms for focusing on relevant features, and careful initialization and regularization strategies.
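
To make the efficiency point concrete, the sketch below compares a standard 3×3 convolution with a depthwise separable one (a per-channel 3×3 followed by a pointwise 1×1), counting parameters for the same channel widths.

PYTHON
import torch
import torch.nn as nn

in_ch, out_ch = 128, 256

# Standard convolution: every output channel mixes all input channels spatially
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable: per-channel 3x3 (groups=in_ch), then 1x1 channel mixing
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
)

std_params = sum(p.numel() for p in standard.parameters())
sep_params = sum(p.numel() for p in separable.parameters())
print(f"Standard 3x3:        {std_params:,} parameters")
print(f"Depthwise separable: {sep_params:,} parameters "
      f"({100 * sep_params / std_params:.0f}% of standard)")

# Both produce the same output shape
x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == separable(x).shape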

PYTHON
import torchvision.models as models

def load_pretrained_models():
    """
    Instantiate state-of-the-art architectures from torchvision.
    Loaded with ImageNet weights (weights="DEFAULT"), they transfer well to other tasks.
    """
    # Modern architectures available in torchvision
    architectures = {
        'ResNet-50': models.resnet50,
        'EfficientNet-B0': models.efficientnet_b0,
        'EfficientNet-B4': models.efficientnet_b4,
        'ConvNeXt-Tiny': models.convnext_tiny,
        'ConvNeXt-Base': models.convnext_base,
        'RegNetY-8GF': models.regnet_y_8gf,
    }

    print("Modern CNN Architectures:")
    print("-" * 60)

    for name, model_fn in architectures.items():
        # weights=None builds the architecture only (sufficient for counting parameters);
        # pass weights="DEFAULT" to download the ImageNet-pretrained checkpoint
        model = model_fn(weights=None)
        params = sum(p.numel() for p in model.parameters())
        print(f"{name}: {params / 1e6:.1f}M parameters")

    return architectures


class ConvNeXtBlock(nn.Module):
    """
    ConvNeXt block (2022): Modernized ConvNet inspired by Vision Transformers.
    Key changes: larger kernels, fewer activations, LayerNorm instead of BatchNorm.
    """

    def __init__(self, dim, drop_path=0., layer_scale_init=1e-6):
        super().__init__()
        # Note: drop_path (stochastic depth) is accepted but not applied in this simplified block
        # Depthwise convolution with large kernel (7x7)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)

        # Inverted bottleneck (expand by 4x)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

        # Layer scale for training stability
        self.gamma = nn.Parameter(
            layer_scale_init * torch.ones(dim),
            requires_grad=True
        ) if layer_scale_init > 0 else None

    def forward(self, x):
        identity = x

        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) -> (B, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)

        if self.gamma is not None:
            x = self.gamma * x

        x = x.permute(0, 3, 1, 2)  # (B, H, W, C) -> (B, C, H, W)

        return identity + x

Key Takeaways

The evolution of CNN architectures reflects the field's growing understanding of what makes neural networks effective. From LeNet's foundational patterns through AlexNet's scale-enabled breakthroughs, VGGNet's exploration of depth, GoogLeNet's multi-scale processing, ResNet's skip connections, and EfficientNet's principled scaling, each generation contributed design principles that remain relevant today. Skip connections enable very deep networks by providing gradient highways. Batch normalization stabilizes training and acts as a regularizer. 1×1 convolutions efficiently adjust channel dimensions and add non-linearity. Depthwise separable convolutions dramatically reduce computation while maintaining representational power. Modern architectures like ConvNeXt demonstrate that CNNs can match transformer-based vision models when modernized with similar design choices. Understanding these architectural patterns enables practitioners to choose appropriate models for their applications and design custom architectures when needed.

20.3 Object Detection (Intermediate)

Object detection extends image classification by not only identifying what objects appear in an image but also localizing where each object exists through bounding boxes. This dual requirement of classification and localization makes object detection significantly more challenging than simple classification, requiring architectures that can efficiently propose candidate regions, extract features at multiple scales, and handle objects of varying sizes and aspect ratios. Object detection serves as the foundation for countless applications including autonomous driving, security surveillance, medical imaging, and retail analytics.

The Object Detection Task

Object detection requires a model to output a variable number of predictions, each consisting of a bounding box (typically four coordinates: x, y, width, height) and a class label with associated confidence score. The challenge lies in handling images with anywhere from zero to hundreds of objects, objects at vastly different scales, significant occlusion, and cluttered backgrounds. The model must also distinguish between objects that should be detected and background regions that should be ignored.

PYTHON
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import box_iou, nms

def understand_detection_output():
    """
    Demonstrate the structure of object detection predictions.
    """
    # Detection output format: each detection has
    # - bounding box: [x_min, y_min, x_max, y_max] (or [x, y, w, h])
    # - class scores: probability for each class
    # - confidence: objectness score

    # Example: 3 detections in an image
    boxes = torch.tensor([
        [100, 150, 200, 300],  # Person bounding box
        [300, 100, 450, 250],  # Car bounding box
        [50, 200, 120, 350],   # Dog bounding box
    ], dtype=torch.float32)

    # Class probabilities (assuming 3 classes: person, car, dog)
    scores = torch.tensor([
        [0.95, 0.02, 0.03],  # 95% confident it's a person
        [0.01, 0.92, 0.07],  # 92% confident it's a car
        [0.05, 0.03, 0.92],  # 92% confident it's a dog
    ])

    # Objectness confidence (is this a valid detection?)
    confidences = torch.tensor([0.98, 0.95, 0.87])

    return boxes, scores, confidences


def compute_iou(box1, box2):
    """
    Compute Intersection over Union (IoU) between two boxes.
    IoU is the fundamental metric for evaluating detection quality.

    Args:
        box1, box2: Tensors of shape (4,) in [x1, y1, x2, y2] format
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area (0 if no overlap)
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0


# Demonstrate IoU calculation
box_a = torch.tensor([100, 100, 200, 200])  # Ground truth
box_b = torch.tensor([120, 110, 220, 210])  # Prediction (good overlap)
box_c = torch.tensor([300, 300, 400, 400])  # Prediction (no overlap)

print(f"IoU (good detection): {compute_iou(box_a, box_b):.3f}")
print(f"IoU (no overlap): {compute_iou(box_a, box_c):.3f}")

Two-Stage Detectors: R-CNN Family

Two-stage detectors separate detection into region proposal and classification stages. This approach achieves high accuracy by first identifying candidate regions that might contain objects, then classifying each region independently. While slower than single-stage methods, two-stage detectors often achieve superior accuracy, especially for small objects.

The R-CNN (Regions with CNN features) family progressed from R-CNN through Fast R-CNN to Faster R-CNN, each iteration addressing computational bottlenecks of its predecessor. Faster R-CNN introduced the Region Proposal Network (RPN), which shares convolutional features with the detection network, making the entire pipeline end-to-end trainable.

PYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import RoIPool, RoIAlign

class RegionProposalNetwork(nn.Module):
    """
    Region Proposal Network (RPN) from Faster R-CNN.
    Generates object proposals from feature maps.
    """

    def __init__(self, in_channels, num_anchors=9):
        super().__init__()
        # Shared convolution for all anchors
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)

        # Classification head: object vs background
        self.cls_head = nn.Conv2d(512, num_anchors * 2, kernel_size=1)

        # Regression head: bounding box deltas
        self.reg_head = nn.Conv2d(512, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        """
        Args:
            features: Feature map from backbone, shape (B, C, H, W)

        Returns:
            objectness: Shape (B, num_anchors * 2, H, W)
            bbox_deltas: Shape (B, num_anchors * 4, H, W)
        """
        x = F.relu(self.conv(features))
        objectness = self.cls_head(x)
        bbox_deltas = self.reg_head(x)
        return objectness, bbox_deltas


def generate_anchors(feature_size, image_size, scales=[128, 256, 512],
                     ratios=[0.5, 1.0, 2.0]):
    """
    Generate anchor boxes for each spatial position in feature map.
    Anchors define the prior boxes that RPN refines.
    """
    H, W = feature_size
    img_H, img_W = image_size

    # Stride: how many image pixels per feature map cell
    stride_h = img_H / H
    stride_w = img_W / W

    anchors = []
    for i in range(H):
        for j in range(W):
            # Center of this anchor position in image coordinates
            cy = (i + 0.5) * stride_h
            cx = (j + 0.5) * stride_w

            for scale in scales:
                for ratio in ratios:
                    # Anchor dimensions
                    h = scale * (ratio ** 0.5)
                    w = scale / (ratio ** 0.5)

                    # Anchor box [x1, y1, x2, y2]
                    x1 = cx - w / 2
                    y1 = cy - h / 2
                    x2 = cx + w / 2
                    y2 = cy + h / 2

                    anchors.append([x1, y1, x2, y2])

    return torch.tensor(anchors, dtype=torch.float32)


class FasterRCNN(nn.Module):
    """
    Simplified Faster R-CNN architecture.
    Two-stage detector: RPN proposes regions, then ROI head classifies them.
    """

    def __init__(self, backbone, num_classes):
        super().__init__()
        self.backbone = backbone
        self.rpn = RegionProposalNetwork(in_channels=512, num_anchors=9)

        # ROI feature extraction
        self.roi_align = RoIAlign(output_size=(7, 7), spatial_scale=1/16,
                                  sampling_ratio=2)

        # Classification and regression heads
        self.fc1 = nn.Linear(512 * 7 * 7, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.cls_head = nn.Linear(1024, num_classes)
        self.reg_head = nn.Linear(1024, num_classes * 4)

    def forward(self, images, targets=None):
        # Extract features from backbone
        features = self.backbone(images)

        # Generate proposals from RPN
        objectness, rpn_deltas = self.rpn(features)

        # During training, sample proposals and compute losses
        # During inference, take top proposals and apply NMS
        proposals = self._get_proposals(objectness, rpn_deltas, images.shape)

        # Extract ROI features using ROI Align
        # proposals format: list of tensors, each (N, 4) for batch element
        roi_features = self.roi_align(features, proposals)
        roi_features = roi_features.flatten(start_dim=1)

        # Classification and regression
        x = F.relu(self.fc1(roi_features))
        x = F.relu(self.fc2(x))
        class_logits = self.cls_head(x)
        box_regression = self.reg_head(x)

        return class_logits, box_regression, proposals

    def _get_proposals(self, objectness, rpn_deltas, image_shape):
        # Simplified: in practice this applies the predicted deltas to anchors,
        # clips the boxes to the image, runs NMS, and keeps the top-k proposals,
        # returning a list with one (N, 4) tensor per image
        raise NotImplementedError


class RoIHeadWithFPN(nn.Module):
    """
    ROI head that handles multi-scale features from Feature Pyramid Network.
    Assigns ROIs to appropriate pyramid level based on size.
    """

    def __init__(self, in_channels, num_classes, roi_size=7):
        super().__init__()
        # Pooling parameters; the spatial scale differs per FPN level, so the
        # functional roi_align is called with each level's scale in forward()
        self.roi_size = roi_size
        self.sampling_ratio = 2

        hidden_dim = 1024
        self.fc1 = nn.Linear(in_channels * roi_size * roi_size, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.cls_score = nn.Linear(hidden_dim, num_classes)
        self.bbox_pred = nn.Linear(hidden_dim, num_classes * 4)

    def assign_rois_to_levels(self, rois, min_level=2, max_level=5):
        """
        Assign each ROI to the appropriate FPN level based on its size.
        Smaller objects use higher-resolution (lower-level) features.
        """
        # ROI sizes
        widths = rois[:, 2] - rois[:, 0]
        heights = rois[:, 3] - rois[:, 1]
        areas = widths * heights

        # Level assignment formula from the FPN paper:
        # k = floor(k0 + log2(sqrt(area) / 224))
        k0 = 4  # Base level for 224x224 objects
        levels = torch.floor(k0 + torch.log2(torch.sqrt(areas) / 224 + 1e-6))
        levels = torch.clamp(levels, min=min_level, max=max_level).long()

        return levels

    def forward(self, feature_pyramid, rois):
        # Assign ROIs to pyramid levels
        roi_levels = self.assign_rois_to_levels(rois)

        # Pool features from the appropriate level for each ROI
        # (a full implementation would also restore the original ROI order)
        pooled_features = []
        for level in [2, 3, 4, 5]:
            mask = roi_levels == level
            if mask.sum() > 0:
                level_rois = rois[mask]
                level_features = feature_pyramid[f'p{level}']
                scale = 1.0 / (2 ** level)
                pooled = roi_align(level_features, [level_rois],
                                   output_size=(self.roi_size, self.roi_size),
                                   spatial_scale=scale,
                                   sampling_ratio=self.sampling_ratio)
                pooled_features.append(pooled)

        features = torch.cat(pooled_features, dim=0)
        features = features.flatten(1)

        # Classification and regression
        x = F.relu(self.fc1(features))
        x = F.relu(self.fc2(x))
        scores = self.cls_score(x)
        deltas = self.bbox_pred(x)

        return scores, deltas

ROI Align, introduced in Mask R-CNN, improved upon ROI Pooling by using bilinear interpolation instead of quantization, preserving spatial precision crucial for accurate localization and downstream tasks like instance segmentation.
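
The functional form in torchvision makes those sampling parameters explicit. The snippet below is a minimal sketch using random features at stride 16 and a single box given in image coordinates: roi_align samples a fixed 7x7 grid with bilinear interpolation, while roi_pool quantizes the box to feature-map cells.

PYTHON
import torch
from torchvision.ops import roi_align, roi_pool

features = torch.randn(1, 256, 32, 32)                 # feature map at stride 16
boxes = [torch.tensor([[48.0, 64.0, 211.0, 303.0]])]   # one box per image, image coords

aligned = roi_align(features, boxes, output_size=(7, 7),
                    spatial_scale=1 / 16, sampling_ratio=2)  # bilinear sampling
pooled = roi_pool(features, boxes, output_size=(7, 7),
                  spatial_scale=1 / 16)                      # quantized pooling
print(aligned.shape, pooled.shape)  # both (1, 256, 7, 7)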

One-Stage Detectors: YOLO and SSD

One-stage detectors eliminate the region proposal step, directly predicting bounding boxes and class probabilities from the full image in a single forward pass. This approach trades some accuracy for significant speed improvements, enabling real-time detection.

YOLO (You Only Look Once) divides the image into a grid and predicts bounding boxes and class probabilities for each grid cell simultaneously. Subsequent versions (YOLOv2 through YOLOv8 and beyond) added batch normalization, anchor boxes, multi-scale prediction, and various architectural improvements.

PYTHON
class YOLOv1Head(nn.Module):
    """
    YOLO v1 detection head.
    Divides image into SxS grid, each cell predicts B boxes and C class probs.
    """

    def __init__(self, in_channels, S=7, B=2, C=20):
        super().__init__()
        self.S = S  # Grid size
        self.B = B  # Boxes per cell
        self.C = C  # Number of classes

        # Each cell predicts: B * 5 (x, y, w, h, conf) + C class probs
        output_size = S * S * (B * 5 + C)

        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Dropout(0.5),
            nn.Linear(4096, output_size)
        )

    def forward(self, x):
        # x shape: (batch, channels, S, S)
        batch_size = x.shape[0]
        out = self.fc(x)
        # Reshape to (batch, S, S, B*5 + C)
        out = out.view(batch_size, self.S, self.S, self.B * 5 + self.C)
        return out

    def decode_predictions(self, predictions, conf_threshold=0.5):
        """
        Convert YOLO output to bounding boxes.
        """
        batch_size = predictions.shape[0]
        boxes_list = []

        for b in range(batch_size):
            boxes = []
            for i in range(self.S):
                for j in range(self.S):
                    cell_pred = predictions[b, i, j]

                    # Class probabilities (last C values)
                    class_probs = cell_pred[-self.C:]

                    for box_idx in range(self.B):
                        # Box parameters
                        start = box_idx * 5
                        x, y, w, h, conf = cell_pred[start:start + 5]

                        # x, y are relative to cell, convert to image coords
                        x_img = (j + x) / self.S
                        y_img = (i + y) / self.S

                        # YOLOv1 predicts the square roots of the image-relative
                        # width and height, so square them to recover the box size
                        w_img = w ** 2
                        h_img = h ** 2

                        # Class-specific confidence
                        class_scores = conf * class_probs

                        if class_scores.max() > conf_threshold:
                            class_id = class_scores.argmax()
                            boxes.append({
                                'box': [x_img, y_img, w_img, h_img],
                                'confidence': class_scores[class_id].item(),
                                'class': class_id.item()
                            })

            boxes_list.append(boxes)

        return boxes_list


class SSDHead(nn.Module):
    """
    Single Shot MultiBox Detector (SSD) head.
    Predicts at multiple feature map scales for detecting objects of various sizes.
    """

    def __init__(self, feature_channels, num_classes, num_anchors_per_cell):
        super().__init__()
        self.num_classes = num_classes

        # Separate predictors for each feature map scale
        self.class_predictors = nn.ModuleList()
        self.box_predictors = nn.ModuleList()

        for in_ch, num_anchors in zip(feature_channels, num_anchors_per_cell):
            self.class_predictors.append(
                nn.Conv2d(in_ch, num_anchors * num_classes, kernel_size=3, padding=1)
            )
            self.box_predictors.append(
                nn.Conv2d(in_ch, num_anchors * 4, kernel_size=3, padding=1)
            )

    def forward(self, feature_maps):
        """
        Args:
            feature_maps: List of tensors from different backbone levels

        Returns:
            class_preds: (batch, total_anchors, num_classes)
            box_preds: (batch, total_anchors, 4)
        """
        class_preds = []
        box_preds = []

        for feat, cls_pred, box_pred in zip(feature_maps,
                                            self.class_predictors,
                                            self.box_predictors):
            batch_size = feat.shape[0]

            # Class predictions
            cls = cls_pred(feat)  # (B, num_anchors * num_classes, H, W)
            cls = cls.permute(0, 2, 3, 1).reshape(batch_size, -1, self.num_classes)
            class_preds.append(cls)

            # Box predictions
            box = box_pred(feat)  # (B, num_anchors * 4, H, W)
            box = box.permute(0, 2, 3, 1).reshape(batch_size, -1, 4)
            box_preds.append(box)

        # Concatenate predictions from all scales
        class_preds = torch.cat(class_preds, dim=1)
        box_preds = torch.cat(box_preds, dim=1)

        return class_preds, box_preds


class YOLOv3Neck(nn.Module):
    """
    Feature Pyramid Network-style neck for YOLOv3.
    Combines features from multiple scales with upsampling and concatenation.
    """

    def __init__(self, in_channels_list):
        super().__init__()
        # in_channels_list: channels from backbone at different scales
        # e.g., [256, 512, 1024] for scales 52x52, 26x26, 13x13

        self.lateral_convs = nn.ModuleList()
        self.upsample_convs = nn.ModuleList()

        for i, in_ch in enumerate(in_channels_list):
            # 1x1 conv to reduce channels
            self.lateral_convs.append(
                nn.Conv2d(in_ch, in_ch // 2, kernel_size=1)
            )

            if i < len(in_channels_list) - 1:
                # Conv applied after concatenating this level's lateral features
                # with the upsampled output from the coarser level above
                fused_ch = in_ch // 2 + in_channels_list[i + 1] // 2
                self.upsample_convs.append(
                    nn.Conv2d(fused_ch, in_ch // 2, kernel_size=1)
                )

    def forward(self, features):
        """
        Args:
            features: List of feature maps [large_scale, ..., small_scale]

        Returns:
            List of enhanced feature maps for detection
        """
        outputs = []

        # Start from smallest scale (deepest features)
        x = features[-1]
        x = self.lateral_convs[-1](x)
        outputs.append(x)

        # Progressively upsample and combine with larger scale features
        for i in range(len(features) - 2, -1, -1):
            # Upsample previous output
            x_up = F.interpolate(x, scale_factor=2, mode='nearest')

            # Lateral connection from backbone
            lateral = self.lateral_convs[i](features[i])

            # Combine (YOLOv3 uses concatenation, FPN uses addition)
            x = torch.cat([x_up, lateral], dim=1)
            x = self.upsample_convs[i](x)
            outputs.insert(0, x)

        return outputs

The key innovation of multi-scale detection enables these models to handle objects of vastly different sizes. Small objects are detected at higher resolution feature maps (early in the network), while large objects are detected at lower resolution maps (deeper in the network).

RetinaNet and Focal Loss

RetinaNet addressed a fundamental problem with one-stage detectors: extreme class imbalance. Most spatial locations in an image are background, so easy negative examples vastly outnumber positives and dominate the loss. Focal Loss down-weights the loss from well-classified examples, refocusing training on the hard ones.

PYTHON
class FocalLoss(nn.Module):
    """
    Focal Loss for addressing class imbalance in object detection.
    FL(p) = -alpha * (1 - p)^gamma * log(p)

    gamma > 0 reduces loss for well-classified examples
    alpha balances positive/negative classes
    """

    def __init__(self, alpha=0.25, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction

    def forward(self, predictions, targets):
        """
        Args:
            predictions: (N, num_classes) logits
            targets: (N,) class indices
        """
        ce_loss = F.cross_entropy(predictions, targets, reduction='none')
        p = torch.softmax(predictions, dim=1)
        p_t = p.gather(1, targets.unsqueeze(1)).squeeze()

        # Focal weight: (1 - p_t)^gamma
        focal_weight = (1 - p_t) ** self.gamma

        # Alpha weighting for class balance
        alpha_t = torch.where(targets > 0, self.alpha, 1 - self.alpha)

        loss = alpha_t * focal_weight * ce_loss

        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        return loss


class RetinaNetHead(nn.Module):
    """
    RetinaNet detection head with class and box subnet.
    Uses same architecture for all FPN levels (shared weights).
    """

    def __init__(self, in_channels, num_anchors, num_classes):
        super().__init__()
        self.num_classes = num_classes
        self.num_anchors = num_anchors

        # Classification subnet: 4 conv layers + output
        self.cls_subnet = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.cls_output = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)

        # Box regression subnet: 4 conv layers + output
        self.box_subnet = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.box_output = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

        # Initialize classification bias for rare positive class
        self._init_weights()

    def _init_weights(self):
        # Initialize classification output with prior probability
        # This improves training stability at the start
        prior_prob = 0.01
        bias_value = -torch.log(torch.tensor((1 - prior_prob) / prior_prob))
        nn.init.constant_(self.cls_output.bias, bias_value)

    def forward(self, features):
        """
        Args:
            features: List of FPN feature maps

        Returns:
            class_outputs: List of (B, A*C, H, W) tensors
            box_outputs: List of (B, A*4, H, W) tensors
        """
        class_outputs = []
        box_outputs = []

        for feature in features:
            cls_feat = self.cls_subnet(feature)
            box_feat = self.box_subnet(feature)

            class_outputs.append(self.cls_output(cls_feat))
            box_outputs.append(self.box_output(box_feat))

        return class_outputs, box_outputs
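
A quick numerical check, assuming the FocalLoss module defined above, shows the intended behaviour: a confidently correct (easy) example contributes almost nothing to the loss, while an uncertain (hard) example keeps most of its weight.

PYTHON
focal = FocalLoss(alpha=0.25, gamma=2.0, reduction='none')

# Two foreground examples: one easy (confident logit), one hard (uncertain logit)
logits = torch.tensor([[0.1, 6.0],    # p(class 1) ~ 0.997
                       [0.1, 0.3]])   # p(class 1) ~ 0.55
targets = torch.tensor([1, 1])

print(focal(logits, targets))  # per-example loss: tiny for the easy case, much larger for the hard one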

Anchor-Free Detection

Recent approaches eliminate predefined anchor boxes entirely, instead directly predicting object centers and sizes. This simplifies the architecture and removes hyperparameters like anchor scales and aspect ratios.

PYTHON
class CenterNetHead(nn.Module):
    """
    CenterNet detection head: predicts object centers as heatmap peaks.
    Anchor-free approach that detects objects by their center points.
    """

    def __init__(self, in_channels, num_classes):
        super().__init__()

        # Heatmap head: predicts center point probability for each class
        self.heatmap = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),
            nn.Sigmoid()
        )

        # Width/height head: predicts object size at each center
        self.wh = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 1)  # width, height
        )

        # Offset head: sub-pixel center refinement
        self.offset = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 2, 1)  # x_offset, y_offset
        )

    def forward(self, x):
        return {
            'heatmap': self.heatmap(x),
            'wh': self.wh(x),
            'offset': self.offset(x)
        }

    def decode(self, outputs, K=100):
        """
        Decode predictions to bounding boxes.
        Find top K peaks in heatmap and extract corresponding boxes.
        """
        heatmap = outputs['heatmap']
        wh = outputs['wh']
        offset = outputs['offset']

        batch_size, num_classes, H, W = heatmap.shape

        # Apply NMS via max pooling
        heatmap_pooled = F.max_pool2d(heatmap, 3, stride=1, padding=1)
        keep = (heatmap == heatmap_pooled).float()
        heatmap = heatmap * keep

        # Get top K detections
        heatmap_flat = heatmap.view(batch_size, -1)
        topk_scores, topk_inds = torch.topk(heatmap_flat, K)

        # Convert indices to coordinates
        topk_classes = topk_inds // (H * W)
        topk_spatial_inds = topk_inds % (H * W)
        topk_ys = topk_spatial_inds // W
        topk_xs = topk_spatial_inds % W

        # Get offset and size at peak locations
        # ... (indexing and box construction)

        return topk_scores, topk_classes, topk_xs, topk_ys


class FCOSHead(nn.Module):
    """
    FCOS: Fully Convolutional One-Stage Object Detection.
    Anchor-free detector predicting distance to box boundaries.
    """

    def __init__(self, in_channels, num_classes):
        super().__init__()

        # Classification branch
        self.cls_tower = self._make_tower(in_channels)
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)

        # Regression branch
        self.bbox_tower = self._make_tower(in_channels)
        self.bbox_pred = nn.Conv2d(in_channels, 4, 3, padding=1)  # l, t, r, b

        # Centerness branch (helps suppress low-quality boxes)
        self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)

    def _make_tower(self, in_channels, num_convs=4):
        layers = []
        for _ in range(num_convs):
            layers.extend([
                nn.Conv2d(in_channels, in_channels, 3, padding=1),
                nn.GroupNorm(32, in_channels),
                nn.ReLU(inplace=True)
            ])
        return nn.Sequential(*layers)

    def forward(self, features):
        """
        Predict at each spatial location:
        - Class scores
        - Distances to box boundaries (left, top, right, bottom)
        - Centerness score
        """
        cls_feat = self.cls_tower(features)
        bbox_feat = self.bbox_tower(features)

        cls_score = self.cls_logits(cls_feat)
        bbox_pred = F.relu(self.bbox_pred(bbox_feat))  # distances are positive
        centerness = self.centerness(cls_feat)

        return cls_score, bbox_pred, centerness

DETR: Transformers for Detection

DETR (DEtection TRansformer) revolutionized object detection by framing it as a set prediction problem, using transformers to directly output a fixed set of predictions. This eliminates the need for hand-designed components like anchor generation, NMS, and proposal filtering.

PYTHON
class DETRTransformerDecoder(nn.Module):
    """
    DETR transformer decoder that converts object queries to detections.
    """

    def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100):
        super().__init__()

        # Learnable object queries
        self.query_embed = nn.Embedding(num_queries, d_model)

        # Transformer decoder layers
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=2048,
            dropout=0.1
        )
        self.transformer_decoder = nn.TransformerDecoder(
            decoder_layer, num_layers=num_layers
        )

    def forward(self, memory, pos_embed):
        """
        Args:
            memory: Encoded image features from encoder (seq_len, batch, d_model)
            pos_embed: Positional encodings

        Returns:
            Decoded object queries (num_queries, batch, d_model)
        """
        batch_size = memory.shape[1]

        # Object queries: the same learned embeddings are used for every image.
        # (In full DETR, pos_embed is added to the attention keys at every layer;
        # this simplified decoder leaves the parameter unused.)
        queries = self.query_embed.weight.unsqueeze(1).repeat(1, batch_size, 1)

        # Decode: queries attend to encoded image features
        output = self.transformer_decoder(
            tgt=queries,
            memory=memory,
            memory_key_padding_mask=None
        )

        return output


class DETR(nn.Module):
    """
    Simplified DETR implementation.
    End-to-end object detection with transformers.
    """

    def __init__(self, backbone, num_classes, hidden_dim=256, num_queries=100):
        super().__init__()
        self.backbone = backbone
        self.num_queries = num_queries

        # Project backbone features to transformer dimension
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)

        # Transformer encoder-decoder
        self.transformer = nn.Transformer(
            d_model=hidden_dim,
            nhead=8,
            num_encoder_layers=6,
            num_decoder_layers=6
        )

        # Learnable object queries
        self.query_embed = nn.Embedding(num_queries, hidden_dim)

        # Prediction heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for no object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4)
        )

        # Positional encoding for spatial positions
        self.pos_embed = PositionalEncoding2D(hidden_dim)

    def forward(self, images):
        # Extract features from backbone
        features = self.backbone(images)  # (B, 2048, H/32, W/32)
        features = self.input_proj(features)  # (B, hidden_dim, H, W)

        B, C, H, W = features.shape

        # Flatten spatial dimensions
        features_flat = features.flatten(2).permute(2, 0, 1)  # (HW, B, C)

        # Add positional encoding
        pos = self.pos_embed(features).flatten(2).permute(2, 0, 1)

        # Object queries
        query = self.query_embed.weight.unsqueeze(1).repeat(1, B, 1)

        # Transformer
        output = self.transformer(features_flat + pos, query)

        # Predictions
        output = output.permute(1, 0, 2)  # (B, num_queries, hidden_dim)
        class_logits = self.class_head(output)
        bbox_pred = self.bbox_head(output).sigmoid()  # normalized [0, 1]

        return {'pred_logits': class_logits, 'pred_boxes': bbox_pred}


class PositionalEncoding2D(nn.Module):
    """2D positional encoding for DETR."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.row_embed = nn.Embedding(50, hidden_dim // 2)
        self.col_embed = nn.Embedding(50, hidden_dim // 2)

    def forward(self, x):
        B, C, H, W = x.shape
        i = torch.arange(W, device=x.device)
        j = torch.arange(H, device=x.device)
        x_emb = self.col_embed(i)
        y_emb = self.row_embed(j)
        pos = torch.cat([
            x_emb.unsqueeze(0).repeat(H, 1, 1),
            y_emb.unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(B, 1, 1, 1)
        return pos

DETR uses Hungarian matching during training to find the optimal assignment between predictions and ground truth, allowing the model to learn without the heuristics required by traditional detectors.
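
A minimal sketch of that matching step for a single image, assuming SciPy is available: the cost combines a classification term with an L1 box term (the real DETR matcher also adds a generalized-IoU term), and linear_sum_assignment solves the resulting bipartite assignment. The helper name and cost weights here are illustrative.

PYTHON
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    cost_class=1.0, cost_bbox=5.0):
    """
    pred_logits: (num_queries, num_classes + 1) including the no-object class
    pred_boxes:  (num_queries, 4) normalized boxes
    gt_labels:   (num_gt,) class indices
    gt_boxes:    (num_gt, 4) normalized boxes
    """
    probs = pred_logits.softmax(-1)

    # Classification cost: negative probability of each ground-truth class
    class_cost = -probs[:, gt_labels]                    # (num_queries, num_gt)

    # Box cost: L1 distance between normalized boxes
    bbox_cost = torch.cdist(pred_boxes, gt_boxes, p=1)   # (num_queries, num_gt)

    cost = cost_class * class_cost + cost_bbox * bbox_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)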

Non-Maximum Suppression

Most detection methods produce multiple overlapping predictions for each object. Non-Maximum Suppression (NMS) filters these to keep only the best prediction per object.

PYTHON
import torch
from torchvision.ops import box_iou

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """
    Standard NMS: iteratively select the highest-scoring box
    and remove boxes with high IoU overlap.
    """
    if len(boxes) == 0:
        return torch.tensor([], dtype=torch.long)

    # Sort by score descending
    order = scores.argsort(descending=True)

    keep = []
    while len(order) > 0:
        # Keep highest scoring box
        i = order[0]
        keep.append(i.item())

        if len(order) == 1:
            break

        # Compute IoU of remaining boxes with this box
        ious = box_iou(boxes[order[1:]], boxes[i:i + 1]).squeeze(1)

        # Keep only boxes with IoU below the threshold
        mask = ious < iou_threshold
        order = order[1:][mask]

    return torch.tensor(keep, dtype=torch.long)


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """
    Soft-NMS: instead of removing overlapping boxes,
    decay their scores based on IoU.
    """
    N = boxes.shape[0]
    boxes = boxes.clone()    # work on copies so the caller's tensors are not mutated
    scores = scores.clone()
    indices = torch.arange(N)

    for i in range(N):
        # Find max score among remaining
        max_idx = scores[i:].argmax() + i

        # Swap with current position
        boxes[[i, max_idx]] = boxes[[max_idx, i]]
        scores[[i, max_idx]] = scores[[max_idx, i]]
        indices[[i, max_idx]] = indices[[max_idx, i]]

        # Decay scores of overlapping boxes
        ious = box_iou(boxes[i:i+1], boxes[i+1:]).squeeze()
        decay = torch.exp(-(ious ** 2) / sigma)
        scores[i+1:] *= decay

    # Filter by score threshold
    keep = scores > score_threshold
    return indices[keep]
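
For production code, the compiled implementation in torchvision is the usual choice; it takes boxes in [x1, y1, x2, y2] format and returns the indices of the kept boxes, sorted by score.

PYTHON
from torchvision.ops import nms

boxes = torch.tensor([[100., 100., 210., 210.],
                      [105., 105., 215., 215.],   # near-duplicate of the first box
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.90, 0.80, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the near-duplicate is suppressed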

Key Takeaways

Object detection requires simultaneously solving localization and classification, making it fundamentally more complex than image classification. Two-stage detectors like Faster R-CNN achieve high accuracy by separating region proposal from classification, using ROI Align to extract precise features from candidate regions. One-stage detectors like YOLO and SSD achieve real-time performance by predicting all boxes in a single forward pass, using multi-scale feature maps to handle objects of varying sizes. Focal Loss addresses the severe class imbalance inherent in detection, enabling one-stage detectors to match two-stage accuracy. Anchor-free methods like CenterNet and FCOS simplify architectures by eliminating predefined anchor boxes. DETR represents a paradigm shift, using transformers to treat detection as set prediction and eliminating hand-designed components like NMS. Understanding these diverse approaches enables practitioners to select appropriate methods based on their accuracy, speed, and complexity requirements.

20.4 Image Segmentation Intermediate

Image Segmentation

Image segmentation partitions an image into meaningful regions, assigning labels at the pixel level rather than the image or bounding box level. This fine-grained understanding enables applications ranging from autonomous driving, where distinguishing road from sidewalk from pedestrian is critical, to medical imaging, where precisely delineating tumors from healthy tissue can be life-saving. Segmentation comes in several flavors: semantic segmentation assigns a class to every pixel, instance segmentation distinguishes between individual objects of the same class, and panoptic segmentation combines both to provide complete scene understanding.

Semantic Segmentation Fundamentals

Semantic segmentation produces a dense prediction where every pixel receives a class label. Unlike object detection, which produces sparse bounding boxes, segmentation must classify every single pixel in the image. The output is a tensor with the same spatial dimensions as the input, with each spatial location containing either a class index or class probabilities.

PYTHON
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

def understand_segmentation_task():
    """
    Demonstrate the semantic segmentation task structure.
    """
    # Input: RGB image of shape (B, 3, H, W)
    batch_size, height, width = 1, 256, 256
    image = torch.randn(batch_size, 3, height, width)

    # Output: class logits for each pixel (B, num_classes, H, W)
    num_classes = 21  # e.g., PASCAL VOC has 21 classes
    logits = torch.randn(batch_size, num_classes, height, width)

    # Convert to class predictions
    predictions = logits.argmax(dim=1)  # (B, H, W) with values in [0, num_classes-1]

    # Ground truth: single channel with class indices
    ground_truth = torch.randint(0, num_classes, (batch_size, height, width))

    print(f"Input shape: {image.shape}")
    print(f"Logits shape: {logits.shape}")
    print(f"Predictions shape: {predictions.shape}")
    print(f"Ground truth shape: {ground_truth.shape}")

    return predictions, ground_truth


def pixel_accuracy(predictions, targets):
    """
    Basic pixel accuracy: fraction of correctly classified pixels.
    Simple but can be misleading with class imbalance.
    """
    correct = (predictions == targets).sum()
    total = targets.numel()
    return (correct / total).item()


def mean_iou(predictions, targets, num_classes):
    """
    Mean Intersection over Union (mIoU): standard segmentation metric.
    Computes IoU for each class and averages.
    """
    ious = []

    for cls in range(num_classes):
        pred_mask = predictions == cls
        target_mask = targets == cls

        intersection = (pred_mask & target_mask).sum().float()
        union = (pred_mask | target_mask).sum().float()

        if union > 0:
            iou = intersection / union
            ious.append(iou.item())

    return np.mean(ious) if ious else 0.0


# Example metrics calculation
predictions, ground_truth = understand_segmentation_task()
print(f"\nPixel Accuracy: {pixel_accuracy(predictions, ground_truth):.4f}")
print(f"Mean IoU: {mean_iou(predictions, ground_truth, num_classes=21):.4f}")

The key challenge in semantic segmentation is maintaining spatial resolution through the network. Standard CNNs progressively reduce spatial dimensions through pooling and strided convolutions, losing fine-grained spatial information needed for precise pixel-level predictions.
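
The resolution loss is easy to quantify. As a rough illustration (assuming a recent torchvision with the weights= API and a ResNet-18 used as the backbone), the feature map entering the classifier is 32x smaller than the input, and naive bilinear upsampling back to full resolution cannot recover the lost detail; this is exactly what the decoder designs below address.

PYTHON
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Keep everything up to (but excluding) global pooling and the classifier
backbone = torch.nn.Sequential(*list(resnet18(weights=None).children())[:-2])

x = torch.randn(1, 3, 256, 256)
feat = backbone(x)
print(feat.shape)   # (1, 512, 8, 8): output stride 32

# Naive recovery by bilinear upsampling: right shape, but fine detail is gone
upsampled = F.interpolate(feat, size=(256, 256), mode='bilinear', align_corners=False)
print(upsampled.shape)  # (1, 512, 256, 256)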

Fully Convolutional Networks (FCN)

Fully Convolutional Networks pioneered the encoder-decoder paradigm for segmentation. The encoder extracts hierarchical features through a pretrained classification network, while the decoder upsamples features back to the original resolution. Skip connections from encoder to decoder preserve fine spatial details that would otherwise be lost.

PYTHON
class FCN(nn.Module):
    """
    Fully Convolutional Network for Semantic Segmentation.
    Converts classification network to dense prediction by replacing
    fully connected layers with convolutions.
    """

    def __init__(self, num_classes, backbone='vgg16'):
        super().__init__()

        # Encoder: VGG-style feature extraction
        self.enc1 = self._make_encoder_block(3, 64, 2)      # /2
        self.enc2 = self._make_encoder_block(64, 128, 2)    # /4
        self.enc3 = self._make_encoder_block(128, 256, 3)   # /8
        self.enc4 = self._make_encoder_block(256, 512, 3)   # /16
        self.enc5 = self._make_encoder_block(512, 512, 3)   # /32

        # Fully convolutional layers (replace FC layers)
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7, padding=3)
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)

        # Score layers for multi-scale fusion
        self.score_fr = nn.Conv2d(4096, num_classes, kernel_size=1)
        self.score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)
        self.score_pool3 = nn.Conv2d(256, num_classes, kernel_size=1)

        # Upsampling layers
        self.upscore2 = nn.ConvTranspose2d(num_classes, num_classes,
                                            kernel_size=4, stride=2, padding=1)
        self.upscore4 = nn.ConvTranspose2d(num_classes, num_classes,
                                            kernel_size=4, stride=2, padding=1)
        self.upscore8 = nn.ConvTranspose2d(num_classes, num_classes,
                                            kernel_size=16, stride=8, padding=4)

    def _make_encoder_block(self, in_channels, out_channels, num_convs):
        layers = []
        for i in range(num_convs):
            layers.append(nn.Conv2d(
                in_channels if i == 0 else out_channels,
                out_channels,
                kernel_size=3,
                padding=1
            ))
            layers.append(nn.ReLU(inplace=True))
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        return nn.Sequential(*layers)

    def forward(self, x):
        # Encoder path
        x1 = self.enc1(x)    # 1/2
        x2 = self.enc2(x1)   # 1/4
        x3 = self.enc3(x2)   # 1/8, pool3 features
        x4 = self.enc4(x3)   # 1/16, pool4 features
        x5 = self.enc5(x4)   # 1/32

        # FC layers as convolutions
        x = F.relu(self.fc6(x5))
        x = F.dropout(x, p=0.5, training=self.training)
        x = F.relu(self.fc7(x))
        x = F.dropout(x, p=0.5, training=self.training)

        # FCN-8s: fuse pool3, pool4, and fc7 features
        score_fr = self.score_fr(x)
        score_fr_2x = self.upscore2(score_fr)

        score_pool4 = self.score_pool4(x4)
        score_fused4 = score_fr_2x + score_pool4
        score_fused4_2x = self.upscore4(score_fused4)

        score_pool3 = self.score_pool3(x3)
        score_fused3 = score_fused4_2x + score_pool3

        # Final 8x upsampling
        output = self.upscore8(score_fused3)

        return output

FCN introduced the crucial insight that classification networks could be converted to dense prediction networks by replacing fully connected layers with convolutions and adding upsampling paths. The skip connections in FCN-8s combine coarse semantic information from deep layers with fine spatial information from shallow layers.
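
The "convolutionalization" step can be verified directly: a fully connected layer applied to a 7x7x512 feature map is, up to a reshape of its weight matrix, a 7x7 convolution, and the same weights then slide over larger inputs to produce a spatial grid of scores. A minimal check:

PYTHON
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)

# Reinterpret the FC weights as a 7x7 convolution over 512 channels
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
out_fc = fc(x.flatten(1))          # (1, 4096)
out_conv = conv(x).flatten(1)      # (1, 4096, 1, 1) flattened to (1, 4096)
print(torch.allclose(out_fc, out_conv, atol=1e-5))  # True: same computation

# Unlike the Linear layer, the convolution accepts larger inputs and
# produces a coarse grid of predictions instead of a single vector
print(conv(torch.randn(1, 512, 14, 14)).shape)       # (1, 4096, 8, 8)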

U-Net Architecture

U-Net, originally designed for biomedical image segmentation, features a symmetric encoder-decoder architecture with extensive skip connections. The U-shaped architecture combines context information from contracting layers with localization information from expanding layers, enabling precise segmentation with limited training data.

PYTHON
class DoubleConv(nn.Module):
    """Two consecutive 3x3 convolutions with BatchNorm and ReLU."""

    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        mid_channels = mid_channels or out_channels

        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.double_conv(x)


class Down(nn.Module):
    """Downsampling: MaxPool followed by DoubleConv."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2),
            DoubleConv(in_channels, out_channels)
        )

    def forward(self, x):
        return self.maxpool_conv(x)


class Up(nn.Module):
    """Upsampling with skip connection concatenation."""

    def __init__(self, in_channels, out_channels, bilinear=True):
        super().__init__()

        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                                   align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels,
                                   mid_channels=in_channels // 2)
        else:
            self.up = nn.ConvTranspose2d(in_channels, in_channels // 2,
                                          kernel_size=2, stride=2)
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        """
        x1: features from decoder (to be upsampled)
        x2: features from encoder (skip connection)
        """
        x1 = self.up(x1)

        # Handle size mismatch due to pooling
        diff_y = x2.size()[2] - x1.size()[2]
        diff_x = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diff_x // 2, diff_x - diff_x // 2,
                        diff_y // 2, diff_y - diff_y // 2])

        # Concatenate skip connection
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)


class UNet(nn.Module):
    """
    U-Net: Encoder-decoder with skip connections.
    Original paper: "U-Net: Convolutional Networks for Biomedical Image Segmentation"
    """

    def __init__(self, in_channels=3, num_classes=2, base_features=64, bilinear=True):
        super().__init__()
        self.in_channels = in_channels
        self.num_classes = num_classes
        self.bilinear = bilinear

        # Encoder (contracting path)
        self.inc = DoubleConv(in_channels, base_features)
        self.down1 = Down(base_features, base_features * 2)
        self.down2 = Down(base_features * 2, base_features * 4)
        self.down3 = Down(base_features * 4, base_features * 8)

        factor = 2 if bilinear else 1
        self.down4 = Down(base_features * 8, base_features * 16 // factor)

        # Decoder (expanding path)
        self.up1 = Up(base_features * 16, base_features * 8 // factor, bilinear)
        self.up2 = Up(base_features * 8, base_features * 4 // factor, bilinear)
        self.up3 = Up(base_features * 4, base_features * 2 // factor, bilinear)
        self.up4 = Up(base_features * 2, base_features, bilinear)

        # Output layer
        self.outc = nn.Conv2d(base_features, num_classes, kernel_size=1)

    def forward(self, x):
        # Encoder
        x1 = self.inc(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)

        # Decoder with skip connections
        x = self.up1(x5, x4)
        x = self.up2(x, x3)
        x = self.up3(x, x2)
        x = self.up4(x, x1)

        # Output
        logits = self.outc(x)
        return logits


# Demonstrate U-Net dimensions
def trace_unet_dimensions():
    model = UNet(in_channels=3, num_classes=21, base_features=64)
    x = torch.randn(1, 3, 256, 256)

    # Manual trace through encoder
    x1 = model.inc(x)
    print(f"After inc: {x1.shape}")  # (1, 64, 256, 256)

    x2 = model.down1(x1)
    print(f"After down1: {x2.shape}")  # (1, 128, 128, 128)

    x3 = model.down2(x2)
    print(f"After down2: {x3.shape}")  # (1, 256, 64, 64)

    x4 = model.down3(x3)
    print(f"After down3: {x4.shape}")  # (1, 512, 32, 32)

    x5 = model.down4(x4)
    print(f"After down4 (bottleneck): {x5.shape}")  # (1, 512, 16, 16)

    # Full forward pass
    output = model(x)
    print(f"Output: {output.shape}")  # (1, 21, 256, 256)

trace_unet_dimensions()

U-Net's symmetric structure with concatenation-based skip connections has proven remarkably effective, becoming the de facto architecture for medical image segmentation and influencing numerous subsequent designs.

DeepLab and Dilated Convolutions

DeepLab addressed the resolution loss problem differently by using dilated (atrous) convolutions. Dilated convolutions expand the receptive field without reducing spatial resolution or increasing parameters, enabling dense feature extraction at multiple scales.

PYTHON
class AtrousConv(nn.Module):
    """Atrous (dilated) convolution for expanded receptive field."""

    def __init__(self, in_channels, out_channels, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=padding, dilation=dilation, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


class ASPP(nn.Module):
    """
    Atrous Spatial Pyramid Pooling from DeepLabV3.
    Captures multi-scale context using parallel dilated convolutions.
    """

    def __init__(self, in_channels, out_channels, rates=[6, 12, 18]):
        super().__init__()

        # 1x1 convolution
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

        # Dilated convolutions at different rates
        self.atrous_convs = nn.ModuleList()
        for rate in rates:
            self.atrous_convs.append(
                nn.Sequential(
                    nn.Conv2d(in_channels, out_channels, kernel_size=3,
                             padding=rate, dilation=rate, bias=False),
                    nn.BatchNorm2d(out_channels),
                    nn.ReLU(inplace=True)
                )
            )

        # Global average pooling branch (image-level features)
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

        # Project concatenated features
        num_branches = 2 + len(rates)  # 1x1 + atrous branches + global pool
        self.project = nn.Sequential(
            nn.Conv2d(num_branches * out_channels, out_channels,
                     kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5)
        )

    def forward(self, x):
        H, W = x.shape[2:]

        # 1x1 branch
        feat1x1 = self.conv1x1(x)

        # Atrous branches
        atrous_feats = [conv(x) for conv in self.atrous_convs]

        # Global pooling branch (upsample to match spatial size)
        global_feat = self.global_pool(x)
        global_feat = F.interpolate(global_feat, size=(H, W),
                                     mode='bilinear', align_corners=True)

        # Concatenate all branches
        concat = torch.cat([feat1x1] + atrous_feats + [global_feat], dim=1)

        return self.project(concat)


class DeepLabV3Plus(nn.Module):
    """
    DeepLabV3+ with encoder-decoder structure.
    Combines ASPP for multi-scale context with low-level feature refinement.
    """

    def __init__(self, backbone, num_classes, aspp_channels=256):
        super().__init__()
        self.backbone = backbone
        self.aspp = ASPP(in_channels=2048, out_channels=aspp_channels)

        # Low-level feature processing (from early encoder layer)
        self.low_level_conv = nn.Sequential(
            nn.Conv2d(256, 48, kernel_size=1, bias=False),  # Reduce channels
            nn.BatchNorm2d(48),
            nn.ReLU(inplace=True)
        )

        # Decoder
        self.decoder = nn.Sequential(
            nn.Conv2d(aspp_channels + 48, 256, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1)
        )

    def forward(self, x):
        input_size = x.shape[2:]

        # Backbone feature extraction (modified to output multi-scale features)
        low_level_feat, high_level_feat = self.backbone(x)
        # low_level_feat: 1/4 resolution (e.g., from layer1)
        # high_level_feat: 1/16 resolution (e.g., from layer4)

        # ASPP on high-level features
        aspp_out = self.aspp(high_level_feat)

        # Upsample ASPP output to match low-level features
        aspp_out = F.interpolate(aspp_out,
                                  size=low_level_feat.shape[2:],
                                  mode='bilinear',
                                  align_corners=True)

        # Process low-level features
        low_level_out = self.low_level_conv(low_level_feat)

        # Concatenate and decode
        concat = torch.cat([aspp_out, low_level_out], dim=1)
        decoder_out = self.decoder(concat)

        # Final upsampling to input resolution
        output = F.interpolate(decoder_out,
                                size=input_size,
                                mode='bilinear',
                                align_corners=True)

        return output


def demonstrate_dilation_receptive_field():
    """
    Show how dilated convolutions expand receptive field.
    """
    # Standard 3x3 conv: receptive field = 3
    # Stack of two 3x3: receptive field = 5
    # Stack of three 3x3: receptive field = 7

    # Single 3x3 with dilation=2: receptive field = 5
    # Single 3x3 with dilation=4: receptive field = 9
    # Single 3x3 with dilation=8: receptive field = 17

    print("Receptive Field Comparison:")
    print("3x3, dilation=1: RF=3")
    print("3x3, dilation=2: RF=5 (same params as standard 3x3)")
    print("3x3, dilation=4: RF=9")
    print("3x3, dilation=6: RF=13")

    # Parameters remain the same regardless of dilation
    conv_standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
    conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

    print(f"\nStandard conv params: {sum(p.numel() for p in conv_standard.parameters())}")
    print(f"Dilated conv params: {sum(p.numel() for p in conv_dilated.parameters())}")

demonstrate_dilation_receptive_field()

DeepLab's Atrous Spatial Pyramid Pooling (ASPP) captures context at multiple scales simultaneously, addressing the challenge of segmenting objects that vary significantly in size within the same image.

Instance Segmentation with Mask R-CNN

Instance segmentation distinguishes between individual objects of the same class, producing separate masks for each detected object. Mask R-CNN extends Faster R-CNN by adding a mask prediction branch that operates on each detected region.

PYTHON
class MaskHead(nn.Module):
    """
    Mask prediction head from Mask R-CNN.
    Predicts binary masks for each detected object instance.
    """

    def __init__(self, in_channels, num_classes, hidden_dim=256, mask_size=28):
        super().__init__()
        self.mask_size = mask_size

        # Four conv layers
        self.conv1 = nn.Conv2d(in_channels, hidden_dim, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1)
        self.conv3 = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1)
        self.conv4 = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1)

        # Upsample and predict
        self.deconv = nn.ConvTranspose2d(hidden_dim, hidden_dim,
                                          kernel_size=2, stride=2)
        self.mask_pred = nn.Conv2d(hidden_dim, num_classes, 1)

    def forward(self, x):
        """
        Args:
            x: ROI-aligned features, shape (N, C, H, W) where N is num proposals

        Returns:
            masks: (N, num_classes, 2*H, 2*W) mask logits for each class
        """
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))

        x = F.relu(self.deconv(x))
        masks = self.mask_pred(x)

        return masks


class MaskRCNN(nn.Module):
    """
    Simplified Mask R-CNN architecture.
    Extends Faster R-CNN with instance mask prediction.
    """

    def __init__(self, backbone, num_classes):
        super().__init__()
        self.backbone = backbone

        # Region Proposal Network
        self.rpn = RegionProposalNetwork(in_channels=256, num_anchors=9)

        # ROI feature extraction
        self.roi_align = RoIAlign(output_size=(7, 7), spatial_scale=1/32,
                                  sampling_ratio=2)
        self.roi_align_mask = RoIAlign(output_size=(14, 14), spatial_scale=1/32,
                                        sampling_ratio=2)

        # Box head (classification and regression)
        self.box_head = nn.Sequential(
            nn.Linear(256 * 7 * 7, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1024),
            nn.ReLU()
        )
        self.cls_score = nn.Linear(1024, num_classes)
        self.bbox_pred = nn.Linear(1024, num_classes * 4)

        # Mask head
        self.mask_head = MaskHead(in_channels=256, num_classes=num_classes)

    def forward(self, images, targets=None):
        # Extract features
        features = self.backbone(images)

        # Generate proposals (simplified: assumes an RPN wrapper that decodes
        # anchors into per-image proposal boxes and also returns the RPN losses;
        # the bare RegionProposalNetwork above only outputs objectness and deltas)
        proposals, rpn_losses = self.rpn(features, targets)

        # ROI pooling for box head
        box_features = self.roi_align(features, proposals)
        box_features = box_features.flatten(1)
        box_features = self.box_head(box_features)

        # Classification and regression
        class_logits = self.cls_score(box_features)
        box_regression = self.bbox_pred(box_features)

        # ROI pooling for mask head (higher resolution)
        mask_features = self.roi_align_mask(features, proposals)
        mask_logits = self.mask_head(mask_features)

        return {
            'class_logits': class_logits,
            'box_regression': box_regression,
            'mask_logits': mask_logits
        }


def mask_rcnn_inference_pipeline(model, image, score_threshold=0.5):
    """
    Complete Mask R-CNN inference showing how masks are extracted.
    """
    with torch.no_grad():
        outputs = model(image)

    # Filter by score
    scores = F.softmax(outputs['class_logits'], dim=1)
    max_scores, pred_classes = scores.max(dim=1)

    keep = max_scores > score_threshold

    # Get corresponding masks
    # During inference, we select the mask for the predicted class
    # mask_logits shape: (N, num_classes, H, W)
    # We need mask of shape (N, H, W) for each detection

    results = []
    for i, keep_flag in enumerate(keep):
        if keep_flag:
            pred_class = pred_classes[i]
            mask = outputs['mask_logits'][i, pred_class]  # (H, W)
            mask = (mask.sigmoid() > 0.5).float()

            results.append({
                'class': pred_class.item(),
                'score': max_scores[i].item(),
                'mask': mask
            })

    return results

Mask R-CNN's key insight is that mask prediction is naturally a per-pixel binary classification task within each region of interest. By using ROI Align instead of ROI Pool, it maintains spatial precision crucial for accurate mask boundaries.
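
A minimal sketch of the resulting training signal, assuming the MaskHead output above: for each positive ROI, only the mask channel of its ground-truth class is supervised with per-pixel binary cross-entropy, so mask prediction does not compete across classes.

PYTHON
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """
    mask_logits: (N, num_classes, M, M) output of the mask head
    gt_classes:  (N,) ground-truth class index per positive ROI
    gt_masks:    (N, M, M) binary target masks, resampled to the ROI grid
    """
    N = mask_logits.shape[0]
    # Pick the predicted mask corresponding to each ROI's ground-truth class
    selected = mask_logits[torch.arange(N), gt_classes]   # (N, M, M)
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())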

Panoptic Segmentation

Panoptic segmentation unifies semantic and instance segmentation, providing a complete scene understanding where every pixel is assigned to either a "stuff" class (amorphous regions like sky, road, grass) or a "thing" instance (countable objects like cars, people). This requires combining the strengths of both semantic segmentation (for stuff) and instance segmentation (for things).

PYTHON
class PanopticFPN(nn.Module):
    """
    Panoptic FPN: Unified panoptic segmentation architecture.
    Combines instance branch (things) with semantic branch (stuff + things).
    """

    def __init__(self, backbone, num_things, num_stuff):
        super().__init__()
        self.backbone = backbone
        self.num_things = num_things
        self.num_stuff = num_stuff

        # Instance segmentation branch (Mask R-CNN style)
        self.instance_head = MaskHead(in_channels=256, num_classes=num_things)

        # Semantic segmentation branch
        # Uses FPN features at multiple scales
        self.semantic_head = nn.ModuleList([
            nn.Conv2d(256, 128, 3, padding=1) for _ in range(4)
        ])
        self.semantic_fuse = nn.Conv2d(128 * 4, 128, 3, padding=1)
        self.semantic_pred = nn.Conv2d(128, num_things + num_stuff, 1)

    def forward(self, images):
        # Backbone with FPN
        fpn_features = self.backbone(images)  # Multi-scale features

        # Instance predictions (things)
        instance_outputs = self.process_instances(fpn_features)

        # Semantic predictions (stuff + things)
        semantic_outputs = self.process_semantic(fpn_features)

        return {
            'instances': instance_outputs,
            'semantic': semantic_outputs
        }

    def process_instances(self, fpn_features):
        """
        Mask R-CNN-style instance branch (simplified).
        A full implementation runs an RPN and ROI Align before applying
        self.instance_head to each proposal; that machinery is omitted here.
        """
        raise NotImplementedError

    def process_semantic(self, fpn_features):
        """
        Process FPN features for semantic segmentation.
        Upsamples and fuses multi-scale features.
        """
        target_size = fpn_features[0].shape[2:]  # Highest resolution

        processed = []
        for feat, conv in zip(fpn_features, self.semantic_head):
            feat = F.relu(conv(feat))
            feat = F.interpolate(feat, size=target_size,
                                mode='bilinear', align_corners=False)
            processed.append(feat)

        # Fuse multi-scale features
        fused = torch.cat(processed, dim=1)
        fused = F.relu(self.semantic_fuse(fused))

        # Predict semantic labels
        semantic_logits = self.semantic_pred(fused)

        return semantic_logits


def panoptic_fusion(instance_results, semantic_pred, overlap_threshold=0.5):
    """
    Merge instance and semantic predictions into panoptic output.
    Resolves conflicts between overlapping predictions.
    """
    H, W = semantic_pred.shape[1:]
    panoptic_map = torch.zeros(H, W, dtype=torch.long)
    instance_id = 1

    # First, place instance masks (things have priority)
    for result in sorted(instance_results, key=lambda x: x['score'], reverse=True):
        mask = result['mask']
        category = result['class']

        # Check overlap with existing instances
        existing_mask = panoptic_map > 0
        overlap = (mask.bool() & existing_mask).sum() / mask.sum()

        if overlap < overlap_threshold:
            panoptic_map[mask.bool()] = instance_id
            instance_id += 1

    # Fill remaining pixels with stuff predictions
    # (a full implementation keeps category and instance ids separate, e.g. by
    #  encoding them as category * offset + instance, to avoid id collisions)
    stuff_mask = panoptic_map == 0
    stuff_pred = semantic_pred.argmax(dim=0)
    panoptic_map[stuff_mask] = stuff_pred[stuff_mask]

    return panoptic_map

Loss Functions for Segmentation

Segmentation tasks require specialized loss functions that handle class imbalance and encourage accurate boundary delineation.

PYTHON
class DiceLoss(nn.Module):
    """
    Dice Loss: Optimizes for IoU-like overlap.
    Handles class imbalance better than cross-entropy.
    """

    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, predictions, targets):
        """
        Args:
            predictions: (B, C, H, W) logits
            targets: (B, H, W) class indices
        """
        num_classes = predictions.shape[1]
        predictions = F.softmax(predictions, dim=1)

        # One-hot encode targets
        targets_one_hot = F.one_hot(targets, num_classes)  # (B, H, W, C)
        targets_one_hot = targets_one_hot.permute(0, 3, 1, 2).float()  # (B, C, H, W)

        # Flatten spatial dimensions
        predictions = predictions.flatten(2)  # (B, C, H*W)
        targets_one_hot = targets_one_hot.flatten(2)  # (B, C, H*W)

        # Compute dice score per class
        intersection = (predictions * targets_one_hot).sum(dim=2)
        union = predictions.sum(dim=2) + targets_one_hot.sum(dim=2)

        dice = (2 * intersection + self.smooth) / (union + self.smooth)

        return 1 - dice.mean()


class FocalLossSegmentation(nn.Module):
    """
    Focal Loss adapted for semantic segmentation.
    Addresses class imbalance by down-weighting easy pixels.
    """

    def __init__(self, alpha=0.25, gamma=2.0, ignore_index=255):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.ignore_index = ignore_index

    def forward(self, predictions, targets):
        """
        Args:
            predictions: (B, C, H, W) logits
            targets: (B, H, W) class indices
        """
        # Standard cross-entropy
        ce_loss = F.cross_entropy(predictions, targets,
                                  reduction='none',
                                  ignore_index=self.ignore_index)

        # Compute pt (probability of true class)
        pt = torch.exp(-ce_loss)

        # Focal weight
        focal_weight = self.alpha * (1 - pt) ** self.gamma

        loss = focal_weight * ce_loss

        return loss.mean()


class CombinedSegmentationLoss(nn.Module):
    """
    Combined loss: CE + Dice for robust training.
    """

    def __init__(self, ce_weight=1.0, dice_weight=1.0, ignore_index=255):
        super().__init__()
        self.ce_weight = ce_weight
        self.dice_weight = dice_weight
        self.ignore_index = ignore_index
        self.ce_loss = nn.CrossEntropyLoss(ignore_index=ignore_index)
        self.dice_loss = DiceLoss()

    def forward(self, predictions, targets):
        ce = self.ce_loss(predictions, targets)

        # Dice cannot one-hot encode the ignore label, so map ignored pixels
        # to class 0 first (their cross-entropy contribution is already masked)
        dice_targets = targets.clone()
        dice_targets[targets == self.ignore_index] = 0
        dice = self.dice_loss(predictions, dice_targets)

        return self.ce_weight * ce + self.dice_weight * dice


class BoundaryLoss(nn.Module):
    """
    Boundary-aware loss: emphasizes accurate edge prediction.
    Useful for medical imaging where boundary precision matters.
    """

    def __init__(self):
        super().__init__()
        # Sobel-like kernels for boundary detection
        self.sobel_x = torch.tensor([[-1, 0, 1],
                                      [-2, 0, 2],
                                      [-1, 0, 1]], dtype=torch.float32).view(1, 1, 3, 3)
        self.sobel_y = torch.tensor([[-1, -2, -1],
                                      [ 0,  0,  0],
                                      [ 1,  2,  1]], dtype=torch.float32).view(1, 1, 3, 3)

    def get_boundary(self, mask):
        """Extract boundary pixels from segmentation mask."""
        mask = mask.float().unsqueeze(1)

        sobel_x = self.sobel_x.to(mask.device)
        sobel_y = self.sobel_y.to(mask.device)

        grad_x = F.conv2d(mask, sobel_x, padding=1)
        grad_y = F.conv2d(mask, sobel_y, padding=1)

        boundary = (grad_x.abs() + grad_y.abs()) > 0
        return boundary.squeeze(1)

    def forward(self, predictions, targets):
        """
        Args:
            predictions: (B, C, H, W) logits
            targets: (B, H, W) class indices
        """
        # Boundary mask extracted from the ground truth (no gradients needed here)
        target_boundary = self.get_boundary(targets)

        # Weight the per-pixel cross-entropy by the boundary mask: taking an
        # argmax over the predictions would not be differentiable, so errors
        # are instead penalized specifically on boundary pixels.
        # Typically combined with a standard CE or Dice term.
        ce = F.cross_entropy(predictions, targets, reduction='none')

        if target_boundary.any():
            return ce[target_boundary].mean()
        return ce.mean() * 0.0
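
A quick smoke test (shapes and values below are arbitrary) shows how the losses above are called on a batch of logits and integer masks.

PYTHON
import torch

logits = torch.randn(2, 5, 64, 64)            # (B, C, H, W) raw scores
masks = torch.randint(0, 5, (2, 64, 64))      # (B, H, W) class indices

print(DiceLoss()(logits, masks).item())
print(FocalLossSegmentation()(logits, masks).item())
print(CombinedSegmentationLoss()(logits, masks).item())
print(BoundaryLoss()(logits, masks).item())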

Key Takeaways

Image segmentation provides pixel-level scene understanding, with different variants serving different needs. Semantic segmentation assigns class labels to every pixel, ideal for scene parsing and autonomous driving. Instance segmentation identifies individual object instances, essential for counting and tracking. Panoptic segmentation unifies both approaches for complete scene understanding. The encoder-decoder architecture with skip connections, pioneered by FCN and refined in U-Net, remains the dominant paradigm. Dilated convolutions in DeepLab enable multi-scale context capture without sacrificing resolution. Mask R-CNN extends object detection to instance segmentation by adding a mask prediction branch. Loss functions like Dice loss and boundary-aware losses address class imbalance and emphasize accurate boundaries. Understanding these architectures and their trade-offs enables practitioners to select appropriate methods for applications ranging from medical image analysis to autonomous systems.

20.5 Transfer Learning for Vision Intermediate

Transfer Learning for Vision

Transfer learning has revolutionized computer vision by enabling practitioners to leverage knowledge from large-scale datasets and powerful pretrained models to solve tasks with limited data. Rather than training neural networks from scratch, which requires massive datasets and computational resources, transfer learning initializes networks with weights learned from related tasks, dramatically reducing training time and data requirements while often achieving superior performance. This approach has become the default paradigm for most practical computer vision applications.

Why Transfer Learning Works

Neural networks learn hierarchical feature representations, with early layers detecting simple patterns like edges and textures, and deeper layers combining these into increasingly complex and task-specific features. Crucially, the low-level features learned on one task transfer remarkably well to other tasks because edges, corners, and textures are universal visual building blocks. A network trained to recognize cats and dogs learns edge detectors and texture filters that are equally useful for recognizing medical abnormalities or satellite imagery.

PYTHON
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision import transforms

def visualize_transfer_learning_concept():
    """
    Demonstrate why transfer learning works: early layers learn generic features.
    """
    # Load pretrained ResNet
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    # Layer types and what they typically learn:
    layer_info = {
        'conv1': 'Edges, color gradients, simple textures',
        'layer1': 'Corners, simple shapes, oriented edges',
        'layer2': 'Texture patterns, object parts',
        'layer3': 'Object parts, semantic regions',
        'layer4': 'High-level object representations, scene context',
        'fc': 'Task-specific class predictions (ImageNet 1000 classes)'
    }

    print("ResNet-50 Layer Hierarchy and Feature Types:")
    print("=" * 60)
    for layer, description in layer_info.items():
        print(f"{layer:12} -> {description}")

    print("\n" + "=" * 60)
    print("Transfer Learning Insight:")
    print("- Early layers (conv1, layer1, layer2) are HIGHLY transferable")
    print("- Middle layers (layer3) are moderately transferable")
    print("- Late layers (layer4, fc) are task-specific, often replaced")

visualize_transfer_learning_concept()


def compare_random_vs_pretrained():
    """
    Compare randomly initialized vs pretrained weights.
    """
    # Random initialization
    model_random = models.resnet50(weights=None)

    # Pretrained on ImageNet
    model_pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

    # Compare first conv layer weights
    conv1_random = model_random.conv1.weight.data
    conv1_pretrained = model_pretrained.conv1.weight.data

    print("First Conv Layer Statistics:")
    print(f"Random init - mean: {conv1_random.mean():.4f}, std: {conv1_random.std():.4f}")
    print(f"Pretrained - mean: {conv1_pretrained.mean():.4f}, std: {conv1_pretrained.std():.4f}")

    # Pretrained weights have learned meaningful patterns
    # that approximate Gabor filters and edge detectors

The effectiveness of transfer learning depends on the similarity between source and target domains. When domains are closely related (e.g., natural images to natural images), even deep features transfer well. For more distant domains (e.g., natural images to medical scans), earlier features transfer better than later ones.
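
One way to act on this intuition is to tie how much of the backbone stays frozen to how similar the target domain is to the pretraining data. The helper below is a hypothetical sketch for a torchvision ResNet; the cut points are arbitrary illustrations, not tuned recommendations.

PYTHON
import torchvision.models as models

def freeze_by_domain_similarity(model, similarity='high'):
    """
    Hypothetical helper: freeze more of the pretrained backbone when the
    target domain closely resembles ImageNet, less when it is distant.
    """
    # ResNet stages ordered from generic (early) to task-specific (late)
    stages = ['conv1', 'bn1', 'layer1', 'layer2', 'layer3', 'layer4']
    freeze_up_to = {'high': 6, 'medium': 4, 'low': 2}[similarity]

    for name in stages[:freeze_up_to]:
        for param in getattr(model, name).parameters():
            param.requires_grad = False
    return model

# e.g., medical scans are distant from ImageNet: freeze only the early stem
model = freeze_by_domain_similarity(
    models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2),
    similarity='low'
)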

Loading Pretrained Models

PyTorch's torchvision provides pretrained models trained on ImageNet, the de facto standard for vision pretraining. These models have learned rich visual representations from 1.2 million images across 1000 categories.

PYTHON
import torchvision.models as models
from torchvision import transforms
from torchvision.models import ResNet50_Weights, EfficientNet_B0_Weights

def load_pretrained_models_comparison():
    """
    Load various pretrained models and compare their characteristics.
    """
    # Modern API with explicit weights
    resnet50 = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    efficientnet = models.efficientnet_b0(weights=EfficientNet_B0_Weights.IMAGENET1K_V1)
    convnext = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

    # Compare model sizes
    models_info = {
        'ResNet-50': resnet50,
        'EfficientNet-B0': efficientnet,
        'ConvNeXt-Tiny': convnext
    }

    print("Pretrained Model Comparison:")
    print("=" * 50)
    for name, model in models_info.items():
        params = sum(p.numel() for p in model.parameters())
        print(f"{name:20} Parameters: {params/1e6:.1f}M")

    return models_info


def get_pretrained_transforms():
    """
    Get the preprocessing transforms used during pretraining.
    Using matching transforms is crucial for transfer learning.
    """
    # ResNet family expects specific normalization
    resnet_transforms = ResNet50_Weights.IMAGENET1K_V2.transforms()

    # EfficientNet has different preprocessing
    efficientnet_transforms = EfficientNet_B0_Weights.IMAGENET1K_V1.transforms()

    # Manual equivalent for understanding
    manual_transforms = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    print("ImageNet Normalization Statistics:")
    print(f"Mean: [0.485, 0.456, 0.406] (RGB)")
    print(f"Std:  [0.229, 0.224, 0.225] (RGB)")
    print("\nThese values MUST be used when using ImageNet pretrained models!")

    return resnet_transforms, manual_transforms
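
As a quick sanity check, the preset bundled with the weights can be applied directly to a PIL image. The snippet below uses a blank placeholder image just to confirm both pipelines yield a (3, 224, 224) float tensor; the bundled V2 preset may use a slightly different resize size than the manual version.

PYTHON
from PIL import Image

resnet_preset, manual_preset = get_pretrained_transforms()

# Any RGB image works; a blank placeholder is enough to check shapes and dtypes
img = Image.new('RGB', (500, 375))

x_bundled = resnet_preset(img)
x_manual = manual_preset(img)

print(x_bundled.shape, x_bundled.dtype)   # torch.Size([3, 224, 224]) torch.float32
print(x_manual.shape, x_manual.dtype)     # torch.Size([3, 224, 224]) torch.float32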

Fine-Tuning Strategies

Different fine-tuning strategies trade off between computational cost, data requirements, and final performance. The optimal choice depends on how similar the target task is to ImageNet classification and how much labeled data is available.

PYTHON
import torch.nn as nn
import torch.optim as optim

class TransferLearningStrategies:
    """
    Different approaches to adapting pretrained models.
    """

    @staticmethod
    def feature_extraction(model, num_classes):
        """
        Strategy 1: Feature Extraction (Frozen Backbone)
        - Freeze all pretrained layers
        - Only train new classification head
        - Best when: Very limited data, or target task very similar to ImageNet

        Pros: Fast training, prevents overfitting with small datasets
        Cons: Limited adaptation to target domain
        """
        # Freeze all parameters
        for param in model.parameters():
            param.requires_grad = False

        # Replace classifier with new head
        num_features = model.fc.in_features
        model.fc = nn.Linear(num_features, num_classes)

        # Only classifier parameters will be trained
        params_to_train = model.fc.parameters()
        return model, params_to_train

    @staticmethod
    def fine_tune_classifier(model, num_classes, layers_to_unfreeze=1):
        """
        Strategy 2: Fine-tune classifier + last few layers
        - Freeze early layers, fine-tune late layers
        - Best when: Moderate data, similar but not identical domain

        Pros: Better adaptation while preserving low-level features
        Cons: Risk of overfitting late layers
        """
        # First freeze everything
        for param in model.parameters():
            param.requires_grad = False

        # Replace classifier
        num_features = model.fc.in_features
        model.fc = nn.Linear(num_features, num_classes)

        # Unfreeze last layer(s) of backbone
        # For ResNet, layers are: layer1, layer2, layer3, layer4
        layers = [model.layer4, model.layer3, model.layer2, model.layer1]
        for i in range(min(layers_to_unfreeze, len(layers))):
            for param in layers[i].parameters():
                param.requires_grad = True

        # Get trainable parameters
        params_to_train = [p for p in model.parameters() if p.requires_grad]
        return model, params_to_train

    @staticmethod
    def full_fine_tuning(model, num_classes, lower_lr_factor=0.1):
        """
        Strategy 3: Full Fine-tuning with Discriminative Learning Rates
        - Train all layers but with different learning rates
        - Pretrained layers get lower LR, new head gets higher LR
        - Best when: Sufficient data, need maximum adaptation

        Pros: Maximum flexibility and adaptation
        Cons: Computationally expensive, risk of catastrophic forgetting
        """
        # Replace classifier
        num_features = model.fc.in_features
        model.fc = nn.Linear(num_features, num_classes)

        # Create parameter groups with different learning rates
        # Backbone parameters get lower LR
        backbone_params = []
        for name, param in model.named_parameters():
            if 'fc' not in name:
                backbone_params.append(param)

        # New classifier parameters get higher LR
        classifier_params = list(model.fc.parameters())

        param_groups = [
            {'params': backbone_params, 'lr': 1e-4 * lower_lr_factor},
            {'params': classifier_params, 'lr': 1e-4}
        ]

        return model, param_groups


def create_custom_classifier(in_features, num_classes, dropout=0.5):
    """
    Create a more sophisticated classifier head.
    Often better than single linear layer for complex tasks.
    """
    return nn.Sequential(
        nn.Linear(in_features, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(512, 256),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(256, num_classes)
    )


# Example: Complete fine-tuning setup
def setup_transfer_learning(num_classes, strategy='fine_tune'):
    """
    Complete setup for transfer learning experiment.
    """
    # Load pretrained model
    model = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

    strategies = TransferLearningStrategies()

    if strategy == 'feature_extraction':
        model, params = strategies.feature_extraction(model, num_classes)
        optimizer = optim.Adam(params, lr=1e-3)

    elif strategy == 'fine_tune':
        model, params = strategies.fine_tune_classifier(model, num_classes,
                                                         layers_to_unfreeze=2)
        optimizer = optim.Adam(params, lr=1e-4)

    elif strategy == 'full':
        model, param_groups = strategies.full_fine_tuning(model, num_classes)
        optimizer = optim.Adam(param_groups)

    # Count trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())

    print(f"Strategy: {strategy}")
    print(f"Trainable parameters: {trainable_params:,} / {total_params:,}")
    print(f"Percentage trainable: {100*trainable_params/total_params:.1f}%")

    return model, optimizer

Gradual Unfreezing and Learning Rate Scheduling

Progressive unfreezing starts with only the classifier trainable, then gradually unfreezes deeper layers as training progresses. This approach, combined with careful learning rate scheduling, helps prevent catastrophic forgetting while allowing deep adaptation.

PYTHON
import math

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR, CosineAnnealingWarmRestarts

class GradualUnfreezing:
    """
    Progressively unfreeze layers during training.
    """

    def __init__(self, model, unfreeze_schedule):
        """
        Args:
            model: The neural network
            unfreeze_schedule: Dict mapping epoch -> layers to unfreeze
                              e.g., {0: ['fc'], 3: ['layer4'], 6: ['layer3']}
        """
        self.model = model
        self.schedule = unfreeze_schedule
        self.currently_unfrozen = set()

        # Initially freeze everything
        for param in model.parameters():
            param.requires_grad = False

    def step(self, epoch):
        """Call at the start of each epoch."""
        if epoch in self.schedule:
            layers_to_unfreeze = self.schedule[epoch]

            for layer_name in layers_to_unfreeze:
                layer = getattr(self.model, layer_name)
                for param in layer.parameters():
                    param.requires_grad = True
                self.currently_unfrozen.add(layer_name)

            print(f"Epoch {epoch}: Unfroze {layers_to_unfreeze}")
            print(f"Currently trainable: {self.currently_unfrozen}")

    def get_trainable_params(self):
        """Get list of currently trainable parameters."""
        return [p for p in self.model.parameters() if p.requires_grad]


def setup_discriminative_lr_schedule(model, base_lr=1e-4, lr_mult=0.9):
    """
    Create optimizer with discriminative learning rates.
    Earlier layers get progressively smaller learning rates.
    """
    # Group parameters by their depth in the network
    param_groups = []

    # ResNet structure: conv1 -> layer1 -> layer2 -> layer3 -> layer4 -> fc
    layers = ['conv1', 'bn1', 'layer1', 'layer2', 'layer3', 'layer4', 'fc']

    for i, layer_name in enumerate(layers):
        layer = getattr(model, layer_name, None)
        if layer is not None:
            # Calculate learning rate for this layer
            # Deeper layers get higher LR (closer to base_lr)
            depth_factor = lr_mult ** (len(layers) - i - 1)
            layer_lr = base_lr * depth_factor

            param_groups.append({
                'params': layer.parameters(),
                'lr': layer_lr,
                'name': layer_name
            })

    # Print learning rate schedule
    print("Discriminative Learning Rates:")
    for group in param_groups:
        print(f"  {group['name']:10} -> LR: {group['lr']:.2e}")

    optimizer = optim.Adam(param_groups)
    return optimizer


def create_lr_scheduler_for_finetuning(optimizer, num_epochs, steps_per_epoch):
    """
    Learning rate schedulers suited for fine-tuning.
    """
    # Option 1: OneCycleLR - Fast convergence with warmup and annealing
    one_cycle = OneCycleLR(
        optimizer,
        max_lr=1e-3,
        epochs=num_epochs,
        steps_per_epoch=steps_per_epoch,
        pct_start=0.3,  # 30% warmup
        anneal_strategy='cos'
    )

    # Option 2: Cosine annealing with warm restarts
    cosine_restarts = CosineAnnealingWarmRestarts(
        optimizer,
        T_0=10,  # Restart every 10 epochs
        T_mult=2,  # Double period after each restart
        eta_min=1e-6
    )

    # Return the OneCycle schedule by default; swap in cosine_restarts if preferred
    return one_cycle


class WarmupScheduler:
    """
    Linear warmup followed by decay.
    Crucial for stable fine-tuning of large models.
    """

    def __init__(self, optimizer, warmup_epochs, total_epochs, warmup_factor=0.1):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.warmup_factor = warmup_factor
        self.base_lrs = [group['lr'] for group in optimizer.param_groups]

    def step(self, epoch):
        if epoch < self.warmup_epochs:
            # Linear warmup
            factor = self.warmup_factor + (1 - self.warmup_factor) * epoch / self.warmup_epochs
        else:
            # Cosine decay after warmup
            progress = (epoch - self.warmup_epochs) / (self.total_epochs - self.warmup_epochs)
            factor = 0.5 * (1 + math.cos(math.pi * progress))

        for i, group in enumerate(self.optimizer.param_groups):
            group['lr'] = self.base_lrs[i] * factor

        return factor
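
The utilities above are typically combined as in the sketch below; the schedule, learning rates, and train_one_epoch callback are placeholders, and a torchvision ResNet-style model (with fc, layer4, layer3 attributes) is assumed. The key detail is rebuilding the optimizer whenever new layers become trainable so it actually tracks them.

PYTHON
import torch.optim as optim

def finetune_with_gradual_unfreezing(model, train_one_epoch, num_epochs=12):
    """Sketch: progressive unfreezing with a warmup + cosine LR schedule."""
    unfreezer = GradualUnfreezing(
        model, unfreeze_schedule={0: ['fc'], 3: ['layer4'], 6: ['layer3']}
    )
    optimizer, scheduler = None, None

    for epoch in range(num_epochs):
        rebuild = optimizer is None or epoch in unfreezer.schedule
        unfreezer.step(epoch)

        # Rebuild the optimizer when the set of trainable parameters grows,
        # otherwise newly unfrozen layers would never receive updates
        if rebuild:
            optimizer = optim.Adam(unfreezer.get_trainable_params(), lr=1e-4)
            scheduler = WarmupScheduler(optimizer, warmup_epochs=2, total_epochs=num_epochs)

        scheduler.step(epoch)
        train_one_epoch(model, optimizer)   # placeholder training callback

    return model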

Data Augmentation for Transfer Learning

Appropriate data augmentation is crucial for preventing overfitting, especially when fine-tuning with limited data. The augmentation strategy should match the target domain while preserving the semantic content of images.

PYTHON
import torchvision.transforms as T
from torchvision.transforms import AutoAugment, AutoAugmentPolicy
import torch

def create_transfer_learning_augmentation(strategy='moderate'):
    """
    Data augmentation strategies for transfer learning.
    """
    # Base transforms (must match pretrained model)
    normalize = T.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )

    if strategy == 'minimal':
        # Minimal augmentation for feature extraction
        # When using frozen backbone, minimal augmentation often works well
        train_transform = T.Compose([
            T.Resize(256),
            T.CenterCrop(224),
            T.ToTensor(),
            normalize
        ])

    elif strategy == 'moderate':
        # Standard augmentation for fine-tuning
        train_transform = T.Compose([
            T.RandomResizedCrop(224, scale=(0.8, 1.0)),
            T.RandomHorizontalFlip(),
            T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1),
            T.ToTensor(),
            normalize
        ])

    elif strategy == 'aggressive':
        # Strong augmentation when more adaptation is needed
        train_transform = T.Compose([
            T.RandomResizedCrop(224, scale=(0.5, 1.0)),
            T.RandomHorizontalFlip(),
            # Add T.RandomVerticalFlip() here only if vertical flips are plausible for the domain
            T.RandomRotation(30),
            T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
            T.RandomGrayscale(p=0.1),
            T.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
            T.ToTensor(),
            normalize,
            T.RandomErasing(p=0.2)
        ])

    elif strategy == 'autoaugment':
        # Learned augmentation policies
        train_transform = T.Compose([
            T.RandomResizedCrop(224),
            T.RandomHorizontalFlip(),
            AutoAugment(AutoAugmentPolicy.IMAGENET),
            T.ToTensor(),
            normalize
        ])

    # Validation transform (no augmentation)
    val_transform = T.Compose([
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        normalize
    ])

    return train_transform, val_transform


class MixupCutmixAugmentation:
    """
    MixUp and CutMix augmentation for transfer learning.
    Particularly effective when fine-tuning with limited data.
    """

    def __init__(self, mixup_alpha=0.2, cutmix_alpha=1.0, prob=0.5):
        self.mixup_alpha = mixup_alpha
        self.cutmix_alpha = cutmix_alpha
        self.prob = prob

    def __call__(self, images, labels):
        """
        Apply either MixUp or CutMix to a batch.

        Args:
            images: (B, C, H, W) tensor
            labels: (B,) tensor of class indices

        Returns:
            Mixed images and soft labels
        """
        batch_size = images.shape[0]

        # Decide whether to apply augmentation; return the same (a, b, lam)
        # label format either way so the loss can be applied unconditionally
        if torch.rand(1) > self.prob:
            return images, (labels, labels, 1.0)

        # Choose between MixUp and CutMix
        use_cutmix = torch.rand(1) > 0.5

        # Generate mixing coefficient
        if use_cutmix:
            lam = torch.distributions.Beta(self.cutmix_alpha, self.cutmix_alpha).sample()
        else:
            lam = torch.distributions.Beta(self.mixup_alpha, self.mixup_alpha).sample()

        # Random permutation for mixing pairs
        index = torch.randperm(batch_size)

        if use_cutmix:
            # CutMix: replace rectangular region
            images, lam = self._cutmix(images, images[index], lam)
        else:
            # MixUp: linear interpolation
            images = lam * images + (1 - lam) * images[index]

        # Create soft labels
        labels_a, labels_b = labels, labels[index]

        return images, (labels_a, labels_b, lam)

    def _cutmix(self, img1, img2, lam):
        """Apply CutMix by pasting rectangular region."""
        _, _, H, W = img1.shape

        # Calculate cut dimensions
        cut_ratio = torch.sqrt(1 - lam)
        cut_h = int(H * cut_ratio)
        cut_w = int(W * cut_ratio)

        # Random center
        cy = torch.randint(H, (1,))
        cx = torch.randint(W, (1,))

        # Bounding box
        y1 = torch.clamp(cy - cut_h // 2, 0, H).item()
        y2 = torch.clamp(cy + cut_h // 2, 0, H).item()
        x1 = torch.clamp(cx - cut_w // 2, 0, W).item()
        x2 = torch.clamp(cx + cut_w // 2, 0, W).item()

        # Apply cut
        img1[:, :, y1:y2, x1:x2] = img2[:, :, y1:y2, x1:x2]

        # Adjust lambda based on actual cut area
        lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)

        return img1, lam


def mixup_criterion(criterion, pred, labels_tuple):
    """
    Loss function for MixUp/CutMix training.
    """
    labels_a, labels_b, lam = labels_tuple
    return lam * criterion(pred, labels_a) + (1 - lam) * criterion(pred, labels_b)
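
Putting these together, one training step might look like the sketch below; the model, optimizer, and data loader are assumed to exist, and the batch is mixed before the forward pass so the paired criterion can be applied.

PYTHON
import torch
import torch.nn as nn

def train_step_with_mixup(model, images, labels, optimizer, device='cuda'):
    """One training step using MixUp/CutMix (sketch; model and optimizer assumed)."""
    criterion = nn.CrossEntropyLoss()
    mixer = MixupCutmixAugmentation(mixup_alpha=0.2, cutmix_alpha=1.0, prob=0.5)

    images, labels = images.to(device), labels.to(device)
    mixed_images, mixed_labels = mixer(images, labels)

    optimizer.zero_grad()
    outputs = model(mixed_images)
    # mixed_labels is a (labels_a, labels_b, lam) tuple (lam = 1.0 when no mixing was applied)
    loss = mixup_criterion(criterion, outputs, mixed_labels)
    loss.backward()
    optimizer.step()
    return loss.item()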

Modern Transfer Learning: Self-Supervised Pretraining

Recent advances in self-supervised learning have produced pretrained models that often outperform ImageNet-supervised pretraining. Methods like DINO, MAE, and CLIP learn representations without manual labels, enabling pretraining on vastly larger datasets.

PYTHON
import torch
import torch.nn as nn

def load_modern_pretrained_models():
    """
    Load models with modern pretraining approaches.
    """
    # CLIP: Contrastive Language-Image Pre-training
    # Learns visual representations aligned with text
    # Excellent zero-shot and few-shot transfer

    # Note: Requires openai/clip or transformers library
    # from transformers import CLIPModel, CLIPProcessor

    # DINOv2: Self-supervised vision transformer
    # State-of-the-art visual features without labels
    dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')

    # MAE: Masked Autoencoder
    # Reconstructs masked image patches
    # from transformers import ViTMAEModel

    print("Modern pretrained models available:")
    print("- CLIP: Best for vision-language tasks and zero-shot")
    print("- DINOv2: Best general-purpose visual features")
    print("- MAE: Good for dense prediction tasks")

    return dinov2


class DINOv2FeatureExtractor(nn.Module):
    """
    Use DINOv2 as feature extractor for downstream tasks.
    DINOv2 produces high-quality features without ImageNet pretraining.
    """

    def __init__(self, num_classes, freeze_backbone=True):
        super().__init__()

        # Load DINOv2 backbone
        self.backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        self.feature_dim = self.backbone.embed_dim  # 384 for ViT-S

        if freeze_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

        # Classification head
        self.classifier = nn.Sequential(
            nn.Linear(self.feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        # DINOv2 outputs class token features
        features = self.backbone(x)  # (B, embed_dim)
        return self.classifier(features)


class CLIPZeroShot:
    """
    Zero-shot classification using CLIP.
    No gradient-based training required - just provide class names!

    Sketch assuming the openai `clip` package
    (pip install git+https://github.com/openai/CLIP.git).
    """

    def __init__(self, device='cpu'):
        import clip  # local import so the rest of the file works without it
        self.clip = clip
        self.device = device
        self.model, self.preprocess = clip.load("ViT-B/32", device=device)

    def classify(self, image, class_names):
        """
        Zero-shot classification using text descriptions.

        Args:
            image: PIL Image
            class_names: List of class names, e.g., ["cat", "dog", "bird"]

        Returns:
            Predicted class name and per-class probabilities
        """
        # Create text prompts and tokenize them
        text_prompts = [f"a photo of a {name}" for name in class_names]
        text_tokens = self.clip.tokenize(text_prompts).to(self.device)
        image_input = self.preprocess(image).unsqueeze(0).to(self.device)

        with torch.no_grad():
            # Encode image and text into the shared embedding space
            image_features = self.model.encode_image(image_input)
            text_features = self.model.encode_text(text_tokens)

            # Cosine similarity (normalize, then dot product), softmax over classes
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
            probs = (100.0 * image_features @ text_features.T).softmax(dim=-1).squeeze(0)

        return class_names[probs.argmax().item()], probs

    def few_shot_classify(self, support_images, support_labels, query_images):
        """
        Few-shot classification by averaging support-set features into one
        prototype per class, then matching queries to the nearest prototype.
        Labels are assumed to be hashable (ints or strings).
        """
        def encode(img):
            feat = self.model.encode_image(self.preprocess(img).unsqueeze(0).to(self.device))
            return feat / feat.norm(dim=-1, keepdim=True)

        with torch.no_grad():
            # Group encoded support images by class label
            class_features = {}
            for img, label in zip(support_images, support_labels):
                class_features.setdefault(label, []).append(encode(img))

            # Average to get a prototype per class
            labels = list(class_features.keys())
            prototypes = torch.cat([torch.cat(class_features[l]).mean(0, keepdim=True)
                                    for l in labels])

            # Classify each query image by its most similar prototype
            predictions = []
            for img in query_images:
                sims = encode(img) @ prototypes.T
                predictions.append(labels[sims.argmax().item()])

        return predictions
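
A brief usage sketch of the frozen-backbone extractor above; the class count and input sizes are illustrative, and torch.hub downloads the DINOv2 weights on first use.

PYTHON
import torch

# Hypothetical 5-class downstream task with a frozen DINOv2 backbone
model = DINOv2FeatureExtractor(num_classes=5, freeze_backbone=True)
model.eval()

# ViT-S/14 expects spatial sizes divisible by the patch size 14 (224 = 16 * 14)
dummy = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    logits = model(dummy)

print(logits.shape)   # torch.Size([2, 5])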

Domain Adaptation

When source and target domains differ significantly, domain adaptation techniques help bridge the gap. These methods align feature distributions or learn domain-invariant representations.

PYTHON
class DomainAdaptationLoss(nn.Module):
    """
    Maximum Mean Discrepancy (MMD) loss for domain adaptation.
    Minimizes distribution difference between source and target features.
    """

    def __init__(self, kernel_type='rbf'):
        super().__init__()
        self.kernel_type = kernel_type

    def gaussian_kernel(self, x, y, sigma=1.0):
        """RBF kernel for MMD computation."""
        diff = x.unsqueeze(1) - y.unsqueeze(0)
        dist_sq = (diff ** 2).sum(dim=2)
        return torch.exp(-dist_sq / (2 * sigma ** 2))

    def forward(self, source_features, target_features):
        """
        Compute MMD between source and target feature distributions.
        """
        # Compute kernel matrices
        K_ss = self.gaussian_kernel(source_features, source_features)
        K_tt = self.gaussian_kernel(target_features, target_features)
        K_st = self.gaussian_kernel(source_features, target_features)

        # MMD^2 = E[K(s,s)] + E[K(t,t)] - 2*E[K(s,t)]
        n_s = source_features.shape[0]
        n_t = target_features.shape[0]

        mmd = (K_ss.sum() / (n_s * n_s) +
               K_tt.sum() / (n_t * n_t) -
               2 * K_st.sum() / (n_s * n_t))

        return mmd


class DomainAdversarialNetwork(nn.Module):
    """
    Domain Adversarial Neural Network (DANN).
    Learns domain-invariant features through adversarial training.
    """

    def __init__(self, backbone, num_classes, feature_dim=2048):
        super().__init__()
        self.backbone = backbone
        self.feature_dim = feature_dim

        # Task classifier
        self.classifier = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

        # Domain discriminator (tries to distinguish source vs target)
        self.domain_classifier = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 2)  # Source or Target
        )

    def forward(self, x, alpha=1.0):
        """
        Args:
            x: Input images
            alpha: Gradient reversal strength (increases during training)
        """
        # Extract features
        features = self.backbone(x)
        features = features.view(features.size(0), -1)

        # Task prediction
        class_output = self.classifier(features)

        # Domain prediction with gradient reversal
        # During backprop, gradients are reversed, making backbone
        # learn domain-invariant features
        reversed_features = GradientReversal.apply(features, alpha)
        domain_output = self.domain_classifier(reversed_features)

        return class_output, domain_output


class GradientReversal(torch.autograd.Function):
    """
    Gradient Reversal Layer for domain adversarial training.
    """
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse gradients with scaling factor alpha
        return -ctx.alpha * grad_output, None
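
A minimal training sketch ties the pieces together; the loaders, batch handling, and the alpha ramp (the logistic schedule from the DANN paper) are illustrative assumptions. Source batches contribute both the task loss and a domain loss with label 0, while unlabeled target batches contribute only a domain loss with label 1.

PYTHON
import math
import torch
import torch.nn as nn

def train_dann_epoch(model, source_loader, target_loader, optimizer,
                     epoch, num_epochs, device='cuda'):
    """One DANN epoch (sketch; assumes both loaders yield (images, labels) pairs)."""
    task_criterion = nn.CrossEntropyLoss()
    domain_criterion = nn.CrossEntropyLoss()
    steps = min(len(source_loader), len(target_loader))

    for step, ((src_x, src_y), (tgt_x, _)) in enumerate(zip(source_loader, target_loader)):
        src_x, src_y, tgt_x = src_x.to(device), src_y.to(device), tgt_x.to(device)

        # Ramp the gradient-reversal strength from 0 toward 1 over training
        p = (epoch * steps + step) / (num_epochs * steps)
        alpha = 2.0 / (1.0 + math.exp(-10 * p)) - 1.0

        # Source batch: task loss + domain loss (domain label 0)
        class_out, domain_src = model(src_x, alpha)
        loss = task_criterion(class_out, src_y)
        loss = loss + domain_criterion(
            domain_src, torch.zeros(src_x.size(0), dtype=torch.long, device=device))

        # Target batch: domain loss only (domain label 1); no target labels needed
        _, domain_tgt = model(tgt_x, alpha)
        loss = loss + domain_criterion(
            domain_tgt, torch.ones(tgt_x.size(0), dtype=torch.long, device=device))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()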

Complete Transfer Learning Pipeline

PYTHON
def complete_transfer_learning_pipeline(
    train_dataset,
    val_dataset,
    num_classes,
    num_epochs=30,
    batch_size=32,
    strategy='fine_tune'
):
    """
    Complete pipeline for transfer learning on a custom dataset.
    """
    # 1. Set up data loaders with appropriate augmentation
    train_transform, val_transform = create_transfer_learning_augmentation('moderate')

    train_dataset.transform = train_transform
    val_dataset.transform = val_transform

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, num_workers=4
    )
    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False, num_workers=4
    )

    # 2. Load pretrained model and set up for transfer learning
    model = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)

    strategies = TransferLearningStrategies()
    if strategy == 'feature_extraction':
        model, params = strategies.feature_extraction(model, num_classes)
        optimizer = optim.Adam(params, lr=1e-3)
    elif strategy == 'fine_tune':
        model, params = strategies.fine_tune_classifier(model, num_classes, 2)
        optimizer = optim.Adam(params, lr=1e-4)
    else:
        model, param_groups = strategies.full_fine_tuning(model, num_classes)
        optimizer = optim.Adam(param_groups)

    # 3. Set up learning rate scheduler
    scheduler = OneCycleLR(
        optimizer,
        max_lr=1e-3,
        epochs=num_epochs,
        steps_per_epoch=len(train_loader)
    )

    # 4. Loss function
    criterion = nn.CrossEntropyLoss()

    # 5. Training loop
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    best_val_acc = 0.0

    for epoch in range(num_epochs):
        # Training
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()

            train_loss += loss.item()
            _, predicted = outputs.max(1)
            train_total += labels.size(0)
            train_correct += predicted.eq(labels).sum().item()

        # Validation
        model.eval()
        val_correct = 0
        val_total = 0

        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = outputs.max(1)
                val_total += labels.size(0)
                val_correct += predicted.eq(labels).sum().item()

        train_acc = 100 * train_correct / train_total
        val_acc = 100 * val_correct / val_total

        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"  Train Loss: {train_loss/len(train_loader):.4f}, Acc: {train_acc:.2f}%")
        print(f"  Val Acc: {val_acc:.2f}%")

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best_transfer_model.pth')

    print(f"\nBest Validation Accuracy: {best_val_acc:.2f}%")
    return model

Key Takeaways

Transfer learning has become the default approach for computer vision because neural networks learn hierarchical features that transfer across tasks. Early layers learn universal visual primitives like edges and textures, while deeper layers learn increasingly task-specific representations. The choice of fine-tuning strategy depends on data availability and domain similarity: feature extraction works well with limited data and similar domains, while full fine-tuning with discriminative learning rates maximizes adaptation when more data is available. Proper preprocessing using the same normalization as pretraining is crucial for success. Modern self-supervised methods like CLIP and DINOv2 often outperform ImageNet-supervised pretraining, especially for tasks beyond natural image classification. Domain adaptation techniques help when source and target domains differ significantly. Understanding these principles enables practitioners to effectively leverage the billions of parameters and computational resources invested in training large vision models, achieving strong performance even with limited task-specific data.