Beginner 120 min read

Chapter 4: Python Data Science Stack

NumPy, Pandas, Matplotlib, and Seaborn for data manipulation and visualization.

Learning Objectives

["Master NumPy array operations", "Manipulate data with Pandas", "Create visualizations with Matplotlib and Seaborn"]


4.1 NumPy Fundamentals Beginner

NumPy Fundamentals

NumPy (Numerical Python) is the foundation of the entire Python data science ecosystem. Every major ML library—Pandas, Scikit-learn, TensorFlow, PyTorch—is built on top of NumPy's array abstraction. Understanding NumPy deeply will make you more effective with all these tools and help you write faster, more memory-efficient code.

Why NumPy?

Python lists are flexible but slow for numerical computation. They can hold mixed types, so Python must check types at runtime. They store pointers to objects scattered in memory, causing cache misses. And operations require explicit loops.

NumPy arrays solve these problems:

Homogeneous types: All elements have the same type (e.g., all float64), eliminating runtime type checking.

Contiguous memory: Elements are stored in adjacent memory locations, enabling efficient CPU cache utilization and SIMD (Single Instruction, Multiple Data) operations.

Vectorized operations: Operations apply to entire arrays without explicit loops, pushing computation into optimized C code.

Broadcasting: Arrays of different shapes can interact through automatic, memory-efficient expansion.

The performance difference is dramatic—NumPy operations can be 100x faster than equivalent Python loops.

PYTHON
import numpy as np
import time

def numpy_vs_python_speed():
    """Compare NumPy and Python list performance."""
    n = 1_000_000

    # Python list
    py_list = list(range(n))
    start = time.time()
    py_result = [x * 2 for x in py_list]
    py_time = time.time() - start

    # NumPy array
    np_array = np.arange(n)
    start = time.time()
    np_result = np_array * 2
    np_time = time.time() - start

    print("Performance Comparison: NumPy vs Python Lists")
    print("-" * 50)
    print(f"Operation: Multiply {n:,} elements by 2")
    print(f"Python list: {py_time*1000:.2f} ms")
    print(f"NumPy array: {np_time*1000:.2f} ms")
    print(f"Speedup: {py_time/np_time:.1f}x")

numpy_vs_python_speed()

Creating Arrays

NumPy provides many ways to create arrays, each suited to different situations:

PYTHON
import numpy as np

# From Python lists
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Specify data type explicitly
floats = np.array([1, 2, 3], dtype=np.float32)

# Zeros and ones (common for initialization)
zeros = np.zeros((3, 4))       # 3x4 matrix of zeros
ones = np.ones((2, 3))         # 2x3 matrix of ones
empty = np.empty((2, 2))       # Uninitialized (faster; contents are arbitrary)

# Sequences
range_arr = np.arange(0, 10, 2)        # [0, 2, 4, 6, 8] - like Python range
linspace = np.linspace(0, 1, 5)        # [0, 0.25, 0.5, 0.75, 1] - evenly spaced

# Identity matrix (useful in linear algebra)
eye = np.eye(3)                # 3x3 identity matrix

# Random arrays (crucial for ML initialization)
uniform = np.random.rand(3, 3)          # Uniform [0, 1)
normal = np.random.randn(3, 3)          # Standard normal
integers = np.random.randint(0, 10, (3, 3))  # Random integers

print("Array Creation Examples")
print("-" * 50)
print(f"1D array: {arr}")
print(f"2D matrix:\n{matrix}")
print(f"Zeros shape {zeros.shape}:\n{zeros}")
print(f"Linspace: {linspace}")
print(f"Random normal:\n{np.round(normal, 3)}")

Array Attributes

Every NumPy array has attributes that describe its structure:

PYTHON
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print("Array Attributes")
print("-" * 50)
print(f"Array:\n{arr}\n")
print(f"shape: {arr.shape}")       # (3, 4) - 3 rows, 4 columns
print(f"ndim: {arr.ndim}")         # 2 - number of dimensions
print(f"size: {arr.size}")         # 12 - total elements
print(f"dtype: {arr.dtype}")       # int64 - data type
print(f"itemsize: {arr.itemsize}") # 8 bytes per element
print(f"nbytes: {arr.nbytes}")     # 96 total bytes (12 * 8)

Shape is the most important attribute. It's a tuple giving the size along each dimension. A (3, 4) array has 3 rows and 4 columns. A (2, 3, 4) array is a "stack" of 2 matrices, each 3×4.
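
A quick sketch of that idea with a throwaway array: reshaping 24 consecutive integers into shape (2, 3, 4) gives a stack of two 3×4 matrices.

PYTHON
import numpy as np

stack = np.arange(24).reshape(2, 3, 4)   # 2 "pages", each a 3x4 matrix

print(f"shape: {stack.shape}")           # (2, 3, 4)
print(f"first 3x4 matrix:\n{stack[0]}")
print(f"second 3x4 matrix:\n{stack[1]}")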

Indexing and Slicing

NumPy extends Python's slicing syntax to multiple dimensions:

PYTHON
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(f"Array:\n{arr}\n")

# Single element
print(f"arr[0, 0] = {arr[0, 0]}")    # First element
print(f"arr[2, 3] = {arr[2, 3]}")    # Last element
print(f"arr[-1, -1] = {arr[-1, -1]}")  # Same as above (negative indexing)

# Slicing: arr[row_slice, col_slice]
print(f"\narr[0, :] = {arr[0, :]}  # First row")
print(f"arr[:, 0] = {arr[:, 0]}  # First column")
print(f"arr[0:2, 1:3] =\n{arr[0:2, 1:3]}  # Submatrix")

# Boolean indexing (very powerful!)
print(f"\narr[arr > 5] = {arr[arr > 5]}  # Elements > 5")

# Fancy indexing (index with arrays)
rows = np.array([0, 2])
cols = np.array([1, 3])
print(f"arr[rows, cols] = {arr[rows, cols]}  # Elements at (0,1) and (2,3)")

Important: NumPy slices are views, not copies. Modifying a slice modifies the original array:

PYTHON
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
slice_view = arr[1:4]
slice_view[0] = 99

print("Views vs Copies")
print("-" * 50)
print(f"After modifying slice: arr = {arr}")
print("The slice is a view—changes affect original!")

# To get a copy:
arr = np.array([1, 2, 3, 4, 5])
slice_copy = arr[1:4].copy()
slice_copy[0] = 99
print(f"After modifying copy: arr = {arr}")
print("The copy is independent.")

Reshaping Arrays

Reshaping changes an array's dimensions without changing its data:

PYTHON
import numpy as np

arr = np.arange(12)
print(f"Original: {arr} (shape: {arr.shape})")

# Reshape to 2D
reshaped = arr.reshape(3, 4)
print(f"\nReshaped to (3, 4):\n{reshaped}")

# Use -1 to auto-calculate one dimension
auto_rows = arr.reshape(-1, 4)  # NumPy figures out rows = 3
print(f"\nreshape(-1, 4) auto-calculates rows:\n{auto_rows}")

# Flatten back to 1D
flat = reshaped.flatten()  # Returns a copy
ravel = reshaped.ravel()   # Returns a view when possible (faster)
print(f"\nFlattened: {flat}")

# Transpose (swap axes)
transposed = reshaped.T
print(f"\nTransposed (4, 3):\n{transposed}")

# Add dimension (useful for broadcasting)
col_vector = arr.reshape(-1, 1)  # Shape: (12, 1)
row_vector = arr.reshape(1, -1)  # Shape: (1, 12)
# Or use np.newaxis
col_vector = arr[:, np.newaxis]  # Same result

Broadcasting

Broadcasting is NumPy's way of performing operations on arrays with different shapes. Instead of copying data, NumPy "broadcasts" smaller arrays across larger ones.

Broadcasting rules:

  1. If arrays have different numbers of dimensions, prepend 1s to the smaller shape
  2. Arrays are compatible along a dimension if they have the same size or one of them is 1
  3. The result shape is the maximum along each dimension

PYTHON
import numpy as np

# Scalar broadcast: adds 10 to every element
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = arr + 10
print("Scalar broadcasting:")
print(f"arr + 10 =\n{result}\n")

# Row vector broadcast: adds [1, 2, 3] to each row
row = np.array([1, 2, 3])
result = arr + row
print("Row vector broadcasting:")
print(f"arr (2,3) + row (3,) =\n{result}\n")

# Column vector broadcast: adds [[10], [20]] to each column
col = np.array([[10], [20]])
result = arr + col
print("Column vector broadcasting:")
print(f"arr (2,3) + col (2,1) =\n{result}\n")

# Outer product via broadcasting
a = np.array([1, 2, 3])
b = np.array([10, 20])
outer = a[:, np.newaxis] * b[np.newaxis, :]
print("Outer product via broadcasting:")
print(f"a (3,1) * b (1,2) =\n{outer}")

Broadcasting eliminates loops and temporary arrays, making code both faster and more readable.
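
To see the rules concretely, here is a minimal sketch with two arbitrary shape pairings, one compatible and one not:

PYTHON
import numpy as np

a = np.ones((2, 3))
b = np.ones(3)   # (3,) is treated as (1, 3): compatible with (2, 3)
c = np.ones(4)   # (4,) is treated as (1, 4): 4 vs 3 is incompatible

print((a + b).shape)  # (2, 3)

try:
    a + c
except ValueError as e:
    print(f"Broadcasting error: {e}")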

Vectorized Operations

Vectorization means expressing operations on entire arrays rather than individual elements. NumPy applies operations element-wise:

PYTHON
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Vectorized Operations")
print("-" * 50)
print(f"a = {a}")
print(f"b = {b}\n")

# Arithmetic
print(f"a + b = {a + b}")
print(f"a * b = {a * b}")
print(f"a ** 2 = {a ** 2}")

# Comparisons (return boolean arrays)
print(f"a > 2 = {a > 2}")
print(f"a == b/10 = {a == b/10}")

# Universal functions (ufuncs)
print(f"\nnp.sqrt(a) = {np.sqrt(a)}")
print(f"np.exp(a) = {np.round(np.exp(a), 2)}")
print(f"np.sin(a) = {np.round(np.sin(a), 3)}")
print(f"np.log(a) = {np.round(np.log(a), 3)}")

Aggregation Functions

NumPy provides functions to compute statistics across arrays:

PYTHON
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"Array:\n{arr}\n")

# Global aggregations
print(f"sum: {arr.sum()}")
print(f"mean: {arr.mean()}")
print(f"std: {arr.std():.4f}")
print(f"min: {arr.min()}, max: {arr.max()}")

# Aggregation along axes
print(f"\nSum along axis=0 (columns): {arr.sum(axis=0)}")
print(f"Sum along axis=1 (rows): {arr.sum(axis=1)}")
print(f"Mean along axis=0: {arr.mean(axis=0)}")

# Useful for ML
print(f"\nargmax (index of max): {arr.argmax()}")
print(f"argmax per row: {arr.argmax(axis=1)}")

# Cumulative operations
print(f"\nCumulative sum: {np.cumsum(arr.flatten())}")

The axis parameter is crucial: axis=0 collapses the rows, giving one result per column; axis=1 collapses the columns, giving one result per row.
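
The distinction is easiest to see on a non-square array; here is a small sketch with made-up values:

PYTHON
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # shape (2, 4)

print(arr.sum(axis=0))  # [ 6  8 10 12] -> one result per column, shape (4,)
print(arr.sum(axis=1))  # [10 26]       -> one result per row, shape (2,)
print(arr.sum(axis=1, keepdims=True).shape)  # (2, 1): keep the reduced axis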

Linear Algebra Operations

NumPy provides essential linear algebra operations for ML:

PYTHON
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([1, 2])

print("Linear Algebra Operations")
print("-" * 50)

# Matrix multiplication (NOT element-wise *)
print(f"A @ B (matrix multiply):\n{A @ B}\n")
print(f"A @ v (matrix-vector):\n{A @ v}\n")

# Also: np.dot(A, B), np.matmul(A, B)

# Transpose
print(f"A.T (transpose):\n{A.T}\n")

# Determinant and inverse
print(f"det(A) = {np.linalg.det(A):.4f}")
print(f"inv(A):\n{np.linalg.inv(A)}\n")

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors:\n{eigenvectors}\n")

# Solve linear system Ax = b
b = np.array([5, 11])
x = np.linalg.solve(A, b)
print(f"Solution to Ax = b: x = {x}")
print(f"Verify: A @ x = {A @ x}")

Random Number Generation

ML relies heavily on randomness for initialization, sampling, and stochastic algorithms:

PYTHON
import numpy as np

# Set seed for reproducibility
np.random.seed(42)

print("Random Number Generation")
print("-" * 50)

# Basic distributions
uniform = np.random.rand(5)           # Uniform [0, 1)
normal = np.random.randn(5)           # Standard normal N(0, 1)
integers = np.random.randint(0, 10, 5)  # Integers in [0, 10)

print(f"Uniform [0,1): {np.round(uniform, 3)}")
print(f"Standard normal: {np.round(normal, 3)}")
print(f"Random integers [0,10): {integers}")

# Parameterized distributions
custom_normal = np.random.normal(loc=5, scale=2, size=5)  # N(5, 2²)
custom_uniform = np.random.uniform(low=-1, high=1, size=5)

print(f"\nN(5, 2²): {np.round(custom_normal, 3)}")
print(f"Uniform [-1, 1): {np.round(custom_uniform, 3)}")

# Shuffling and sampling
arr = np.arange(10)
np.random.shuffle(arr)  # In-place shuffle
print(f"\nShuffled array: {arr}")

choices = np.random.choice([1, 2, 3, 4, 5], size=3, replace=False)
print(f"Random choices (no replacement): {choices}")

# For ML: Xavier (Glorot) initialization
fan_in, fan_out = 100, 50
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
weights = np.random.randn(fan_in, fan_out) * xavier_std
print(f"\nXavier init weights std: {weights.std():.4f} (target: {xavier_std:.4f})")

Memory Efficiency Tips

Understanding memory layout helps write faster code:

PYTHON
import numpy as np

# Contiguous memory: C (row-major) vs F (column-major) order
arr_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')  # Default
arr_f = np.array([[1, 2, 3], [4, 5, 6]], order='F')

print("Memory Layout")
print("-" * 50)
print(f"C order (row-major): {arr_c.flags['C_CONTIGUOUS']}")
print(f"F order (col-major): {arr_f.flags['F_CONTIGUOUS']}")

# Views vs copies
arr = np.arange(10)
view = arr[::2]      # View (shares memory)
copy = arr[::2].copy()  # Copy (separate memory)

print(f"\nView shares memory: {np.shares_memory(arr, view)}")
print(f"Copy shares memory: {np.shares_memory(arr, copy)}")

# In-place operations save memory
arr = np.arange(1000000, dtype=np.float64)
arr += 1          # In-place: no new array
arr = arr + 1     # Creates new array (more memory)

# Use appropriate dtypes
float64_arr = np.zeros(1000000, dtype=np.float64)
float32_arr = np.zeros(1000000, dtype=np.float32)
print(f"\nfloat64 array: {float64_arr.nbytes / 1e6:.1f} MB")
print(f"float32 array: {float32_arr.nbytes / 1e6:.1f} MB")

Key Takeaways

  • NumPy arrays are faster than Python lists due to contiguous memory and vectorization
  • Broadcasting enables operations on different-shaped arrays without copying data
  • Vectorization replaces loops with array operations for dramatic speedups
  • Axis parameter in aggregations: axis=0 for columns, axis=1 for rows
  • @ operator for matrix multiplication (not *)
  • Views vs copies: slices are views; use .copy() when needed
  • Reproducibility: always set random seeds for experiments
  • Memory efficiency: use appropriate dtypes, in-place operations, and views when possible

4.2 Pandas for Data Manipulation Beginner

Pandas for Data Manipulation

While NumPy excels at numerical computation, real-world data comes with labels, mixed types, and missing values. Pandas builds on NumPy to provide data structures that handle these complexities elegantly. It's the standard tool for data loading, cleaning, exploration, and preparation in Python ML workflows.

Series and DataFrames

Pandas has two primary data structures:

Series: A one-dimensional labeled array. Think of it as a NumPy array with an index.

DataFrame: A two-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

PYTHON
import pandas as pd
import numpy as np

# Creating a Series
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print("Series:")
print(s)
print(f"\nIndex: {s.index.tolist()}")
print(f"Values: {s.values}")

# Creating a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales']
})
print("\nDataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")

Loading Data

Pandas can read data from many formats. CSV is most common:

PYTHON
import pandas as pd

# Reading CSV (most common)
# df = pd.read_csv('data.csv')

# Common parameters:
# pd.read_csv('data.csv',
#     sep=',',              # Delimiter
#     header=0,             # Row number for column names (None if no header)
#     index_col=0,          # Column to use as index
#     usecols=['a', 'b'],   # Only load specific columns
#     dtype={'a': float},   # Specify column types
#     na_values=['NA', ''], # Additional strings to recognize as NA
#     nrows=1000,           # Only read first n rows
#     parse_dates=['date'], # Parse date columns
# )

# Other formats:
# pd.read_excel('data.xlsx')
# pd.read_json('data.json')
# pd.read_sql('SELECT * FROM table', connection)
# pd.read_parquet('data.parquet')  # Fast columnar format

# Create sample DataFrame for demonstration
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'product': ['A', 'B', 'A', 'C', 'B'],
    'sales': [100, 150, 120, 80, 200],
    'region': ['North', 'South', 'North', 'East', 'South']
})

print("Sample DataFrame:")
print(df)
print(f"\nData types:\n{df.dtypes}")

Inspecting Data

Always start by understanding your data:

PYTHON
import pandas as pd
import numpy as np

# Create sample data with some issues
np.random.seed(42)
df = pd.DataFrame({
    'id': range(100),
    'age': np.random.randint(18, 65, 100),
    'income': np.random.normal(50000, 15000, 100),
    'category': np.random.choice(['A', 'B', 'C'], 100),
    'score': np.random.random(100) * 100
})
df.loc[5, 'income'] = np.nan  # Add some missing values
df.loc[10, 'income'] = np.nan

print("Data Inspection Methods")
print("=" * 50)

# First/last rows
print("df.head():")
print(df.head())

print("\ndf.tail(3):")
print(df.tail(3))

# Basic info
print("\ndf.info():")
df.info()

# Statistics
print("\ndf.describe():")
print(df.describe())

# Check for missing values
print(f"\nMissing values:\n{df.isnull().sum()}")

# Value counts for categorical
print(f"\nCategory distribution:\n{df['category'].value_counts()}")

# Unique values
print(f"\nUnique categories: {df['category'].unique()}")

Selecting Data

Pandas provides multiple ways to select data. Understanding the difference between loc and iloc is essential:

PYTHON
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
}, index=['a', 'b', 'c', 'd', 'e'])

print("DataFrame:")
print(df)
print()

# Column selection
print(f"df['name'] (Series):\n{df['name']}\n")
print(f"df[['name', 'age']] (DataFrame):\n{df[['name', 'age']]}\n")

# loc: Label-based selection
print("loc (label-based):")
print(f"df.loc['a'] (row 'a'):\n{df.loc['a']}\n")
print(f"df.loc['a':'c', 'name':'age'] (slice by labels):\n{df.loc['a':'c', 'name':'age']}\n")

# iloc: Integer position-based selection
print("iloc (position-based):")
print(f"df.iloc[0] (first row):\n{df.iloc[0]}\n")
print(f"df.iloc[0:2, 0:2] (slice by position):\n{df.iloc[0:2, 0:2]}\n")

# Boolean indexing (very common in ML)
print("Boolean indexing:")
print(f"df[df['age'] > 28]:\n{df[df['age'] > 28]}\n")

# Combining conditions (use & for AND, | for OR, ~ for NOT)
mask = (df['age'] > 25) & (df['city'] == 'NYC')
print(f"Age > 25 AND city == NYC:\n{df[mask]}")

Key distinction:

  • loc: Uses labels (inclusive of end)
  • iloc: Uses integer positions (exclusive of end, like Python); see the sketch just below
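
A minimal sketch of that difference, using a throwaway labeled DataFrame:

PYTHON
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

print(df.loc['a':'c'])  # rows 'a', 'b', AND 'c' - label slice includes the end
print(df.iloc[0:2])     # rows 'a' and 'b' only - position slice excludes the end
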
Handling Missing Data

Real datasets almost always have missing values. Pandas uses NaN (Not a Number) for missing data:

PYTHON
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, 2, 3, 4, 5]
})

print("DataFrame with missing values:")
print(df)
print()

# Detect missing values
print(f"Is null:\n{df.isnull()}\n")
print(f"Count nulls per column:\n{df.isnull().sum()}\n")

# Drop missing values
print(f"dropna() (drop rows with any null):\n{df.dropna()}\n")
print(f"dropna(subset=['A']) (only check column A):\n{df.dropna(subset=['A'])}\n")

# Fill missing values
print(f"fillna(0):\n{df.fillna(0)}\n")
print(f"fillna(method='ffill') (forward fill):\n{df.fillna(method='ffill')}\n")
print(f"fillna(df.mean()) (fill with column means):\n{df.fillna(df.mean())}\n")

# Interpolation
print(f"interpolate():\n{df.interpolate()}")

Strategy depends on context:

  • Drop rows if missing data is rare and random
  • Fill with mean/median for numerical features (destroys variance)
  • Fill with mode for categorical features
  • Use advanced imputation (KNN, MICE) for important features
Data Transformation

Pandas makes common transformations straightforward:

PYTHON
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000, 60000, 70000],
    'bonus_pct': [0.1, 0.15, 0.12]
})

print("Original DataFrame:")
print(df)
print()

# Add new columns
df['bonus'] = df['salary'] * df['bonus_pct']
df['total_comp'] = df['salary'] + df['bonus']
print("After adding columns:")
print(df)
print()

# Apply functions
df['name_upper'] = df['name'].str.upper()  # String methods
df['salary_log'] = np.log(df['salary'])    # NumPy functions
df['salary_rank'] = df['salary'].rank()    # Built-in methods
print("After transformations:")
print(df)
print()

# Apply custom function
def categorize_salary(x):
    if x < 55000:
        return 'Low'
    elif x < 65000:
        return 'Medium'
    else:
        return 'High'

df['salary_cat'] = df['salary'].apply(categorize_salary)
print("After custom function:")
print(df[['name', 'salary', 'salary_cat']])

Grouping and Aggregation

groupby is one of Pandas' most powerful features—it enables split-apply-combine operations:

PYTHON
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Sales', 'Sales', 'Eng', 'Sales'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'salary': [70000, 80000, 60000, 65000, 75000, 55000],
    'years': [3, 5, 2, 4, 1, 6]
})

print("DataFrame:")
print(df)
print()

# Basic groupby
print("Mean salary by department:")
print(df.groupby('department')['salary'].mean())
print()

# Multiple aggregations
print("Multiple stats by department:")
print(df.groupby('department')['salary'].agg(['mean', 'min', 'max', 'count']))
print()

# Multiple columns, multiple aggregations
print("Multiple columns and aggregations:")
agg_result = df.groupby('department').agg({
    'salary': ['mean', 'sum'],
    'years': ['mean', 'max']
})
print(agg_result)
print()

# Transform: apply function and return same shape
df['salary_dept_mean'] = df.groupby('department')['salary'].transform('mean')
df['salary_vs_dept'] = df['salary'] - df['salary_dept_mean']
print("After transform (salary vs department mean):")
print(df[['name', 'department', 'salary', 'salary_dept_mean', 'salary_vs_dept']])

Merging and Joining

Combining DataFrames is essential for working with relational data:

PYTHON
import pandas as pd

# Two related tables
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'dept_id': [101, 102, 101, 103]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 104],
    'dept_name': ['Engineering', 'Marketing', 'Finance']
})

print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
print()

# Inner join (only matching keys)
inner = pd.merge(employees, departments, on='dept_id', how='inner')
print("Inner join:")
print(inner)
print()

# Left join (all from left, matching from right)
left = pd.merge(employees, departments, on='dept_id', how='left')
print("Left join:")
print(left)
print()

# Outer join (all from both)
outer = pd.merge(employees, departments, on='dept_id', how='outer')
print("Outer join:")
print(outer)

Pivot Tables and Reshaping

Reshape data for analysis and visualization:

PYTHON
import pandas as pd
import numpy as np

# Sales data
df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'] * 2,
    'region': ['North', 'South'] * 4,
    'product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'sales': [100, 150, 120, 180, 80, 90, 100, 110]
})

print("Sales data:")
print(df)
print()

# Pivot table
pivot = pd.pivot_table(df, values='sales', index='region',
                       columns='product', aggfunc='sum')
print("Pivot table (regions x products):")
print(pivot)
print()

# Melt (unpivot): wide to long format
wide_df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'math': [90, 85],
    'english': [88, 92],
    'science': [95, 80]
})
print("Wide format:")
print(wide_df)

long_df = pd.melt(wide_df, id_vars=['name'], var_name='subject', value_name='score')
print("\nLong format (melted):")
print(long_df)

Time Series Operations

Pandas has excellent support for time series data:

PYTHON
import pandas as pd
import numpy as np

# Create time series
dates = pd.date_range('2024-01-01', periods=30, freq='D')
ts = pd.Series(np.random.randn(30).cumsum(), index=dates)

print("Time Series:")
print(ts.head(10))
print()

# Date-based indexing
print(f"January 2024: {len(ts['2024-01'])} days")
print(f"Specific date: {ts['2024-01-15']:.4f}")
print()

# Resampling (downsampling to weekly)
weekly = ts.resample('W').mean()
print("Weekly averages:")
print(weekly)
print()

# Rolling windows
ts_df = ts.to_frame(name='value')
ts_df['rolling_mean_7d'] = ts_df['value'].rolling(window=7).mean()
ts_df['rolling_std_7d'] = ts_df['value'].rolling(window=7).std()
print("With rolling statistics:")
print(ts_df.tail(10))

Performance Tips

Pandas can be slow with large datasets. Here are optimization strategies:

PYTHON
import pandas as pd
import numpy as np

# 1. Use appropriate dtypes
df = pd.DataFrame({
    'id': range(100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000),
    'value': np.random.random(100000)
})

print("Memory Optimization")
print("-" * 50)
print(f"Before: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")

# Convert to categorical (huge savings for low-cardinality strings)
df['category'] = df['category'].astype('category')
print(f"After categorical: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")

# 2. Use vectorized operations, not loops
# Bad: for i, row in df.iterrows(): ...
# Good: df['new_col'] = df['col1'] * df['col2']

# 3. Use query() for complex filtering
# df.query('age > 25 and city == "NYC"') is often faster than boolean indexing

# 4. Consider chunking for very large files
# for chunk in pd.read_csv('large.csv', chunksize=10000):
#     process(chunk)

print("\nKey performance tips:")
print("• Use categorical dtype for low-cardinality strings")
print("• Avoid iterrows(); use vectorized operations")
print("• Use query() for complex filtering")
print("• Read large files in chunks")

Key Takeaways

  • Series (1D) and DataFrame (2D) are Pandas' core structures
  • Use loc for label-based and iloc for position-based selection
  • Handle missing data explicitly: drop, fill, or interpolate
  • groupby enables powerful split-apply-combine operations
  • merge joins DataFrames like SQL joins
  • Pivot tables reshape data for analysis
  • Datetime index enables powerful time series operations
  • Optimize memory with categorical types and vectorized operations

4.3 Data Visualization Beginner

Data Visualization

Visualization transforms numbers into insight. A well-crafted plot can reveal patterns, outliers, and relationships that tables of numbers obscure. In machine learning, visualization guides feature engineering, helps diagnose model problems, and communicates results to stakeholders.

The Python Visualization Landscape

Python offers several visualization libraries, each with strengths:

Matplotlib: The foundation—low-level, highly customizable, but verbose. Other libraries build on it.

Seaborn: Statistical visualization built on Matplotlib. Beautiful defaults, great for exploring data.

Plotly: Interactive plots for dashboards and presentations. Supports 3D and web embedding.

Altair: Declarative visualization using a grammar of graphics. Concise syntax, good for exploration.

For ML work, Matplotlib and Seaborn cover most needs. We'll focus on these.

Matplotlib Fundamentals

Matplotlib organizes plots hierarchically: Figure (the window) contains Axes (individual plots). Understanding this structure helps create complex visualizations:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

# Basic plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(10, 4))
plt.plot(x, y, label='sin(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simple Line Plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

The object-oriented interface gives more control:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

# Object-oriented approach
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

x = np.linspace(0, 10, 100)

# First subplot
axes[0].plot(x, np.sin(x), 'b-', label='sin(x)')
axes[0].plot(x, np.cos(x), 'r--', label='cos(x)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Trigonometric Functions')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Second subplot
axes[1].plot(x, np.exp(-x/5) * np.sin(x), 'g-', linewidth=2)
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title('Damped Oscillation')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Essential Plot Types for ML

Different data relationships call for different plot types:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# 1. Line plot - trends over time or sequence
x = np.arange(50)
y = np.cumsum(np.random.randn(50))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Line Plot: Trends')
axes[0, 0].set_xlabel('Time')

# 2. Scatter plot - relationships between variables
x = np.random.randn(100)
y = 2*x + np.random.randn(100)*0.5
axes[0, 1].scatter(x, y, alpha=0.6)
axes[0, 1].set_title('Scatter Plot: Relationships')
axes[0, 1].set_xlabel('Feature X')
axes[0, 1].set_ylabel('Feature Y')

# 3. Histogram - distributions
data = np.random.randn(1000)
axes[0, 2].hist(data, bins=30, edgecolor='black', alpha=0.7)
axes[0, 2].set_title('Histogram: Distributions')
axes[0, 2].set_xlabel('Value')
axes[0, 2].set_ylabel('Frequency')

# 4. Bar plot - categorical comparisons
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
axes[1, 0].bar(categories, values, color='steelblue')
axes[1, 0].set_title('Bar Plot: Comparisons')
axes[1, 0].set_ylabel('Value')

# 5. Box plot - distribution summaries
data = [np.random.randn(100) + i for i in range(4)]
axes[1, 1].boxplot(data, labels=['A', 'B', 'C', 'D'])
axes[1, 1].set_title('Box Plot: Distribution Summary')
axes[1, 1].set_ylabel('Value')

# 6. Heatmap - matrix visualization
matrix = np.random.randn(5, 5)
im = axes[1, 2].imshow(matrix, cmap='coolwarm')
axes[1, 2].set_title('Heatmap: Matrix Data')
plt.colorbar(im, ax=axes[1, 2])

plt.tight_layout()
plt.show()

Seaborn for Statistical Visualization

Seaborn builds on Matplotlib with better defaults and statistical plot types:

PYTHON
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Use built-in dataset for demonstration
np.random.seed(42)
n = 200
df = pd.DataFrame({
    'feature1': np.random.randn(n),
    'feature2': np.random.randn(n),
    'target': np.random.choice(['Class A', 'Class B', 'Class C'], n),
    'value': np.random.randn(n) * 10 + 50
})
df['feature2'] = df['feature1'] * 0.5 + df['feature2'] * 0.5  # Add correlation

# Set style
sns.set_style('whitegrid')

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Distribution plot with KDE
sns.histplot(data=df, x='value', kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Distribution with KDE')

# 2. Scatter with hue (color by category)
sns.scatterplot(data=df, x='feature1', y='feature2', hue='target', ax=axes[0, 1])
axes[0, 1].set_title('Scatter Plot by Class')

# 3. Box plot by category
sns.boxplot(data=df, x='target', y='value', ax=axes[1, 0])
axes[1, 0].set_title('Distribution by Class')

# 4. Violin plot (combines box plot and KDE)
sns.violinplot(data=df, x='target', y='value', ax=axes[1, 1])
axes[1, 1].set_title('Violin Plot by Class')

plt.tight_layout()
plt.show()

Visualizing Distributions

Understanding data distributions is crucial for choosing models and detecting issues:

PYTHON
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)

# Create data with different distributions
normal = np.random.randn(1000)
skewed = np.random.exponential(2, 1000)
bimodal = np.concatenate([np.random.randn(500) - 2, np.random.randn(500) + 2])

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Histograms
axes[0, 0].hist(normal, bins=30, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Normal Distribution')

axes[0, 1].hist(skewed, bins=30, alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Skewed (Exponential)')

axes[0, 2].hist(bimodal, bins=30, alpha=0.7, edgecolor='black')
axes[0, 2].set_title('Bimodal Distribution')

# Q-Q plots (check normality)
from scipy import stats

for i, (data, name) in enumerate([(normal, 'Normal'), (skewed, 'Skewed'), (bimodal, 'Bimodal')]):
    stats.probplot(data, dist="norm", plot=axes[1, i])
    axes[1, i].set_title(f'Q-Q Plot: {name}')

plt.tight_layout()
plt.show()

print("Q-Q Plot Interpretation:")
print("• Points on diagonal line → data is normally distributed")
print("• Curved pattern → data is skewed")
print("• S-shape → data has heavy tails")

Visualizing Relationships

Understand how features relate to each other and to the target:

PYTHON
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.random.seed(42)

# Create correlated features
n = 300
df = pd.DataFrame({
    'x1': np.random.randn(n),
    'x2': np.random.randn(n),
})
df['x3'] = df['x1'] * 0.8 + np.random.randn(n) * 0.3
df['x4'] = -df['x2'] * 0.6 + np.random.randn(n) * 0.5
df['target'] = df['x1'] + df['x2'] * 0.5 + np.random.randn(n) * 0.3

# Correlation heatmap
plt.figure(figsize=(8, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

# Pair plot (scatter matrix); sns.pairplot creates its own figure
sns.pairplot(df[['x1', 'x2', 'x3', 'target']], diag_kind='kde',
             plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot: Feature Relationships', y=1.02)
plt.show()

Visualizing Model Performance

After training models, visualization helps assess and communicate results:

PYTHON
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Simulated model outputs
np.random.seed(42)
y_true = np.random.randint(0, 2, 200)
y_prob = np.clip(y_true * 0.6 + np.random.randn(200) * 0.3, 0, 1)
y_pred = (y_prob > 0.5).astype(int)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Confusion Matrix
cm = confusion_matrix(y_true, y_pred)
im = axes[0].imshow(cm, cmap='Blues')
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['Predicted 0', 'Predicted 1'])
axes[0].set_yticklabels(['Actual 0', 'Actual 1'])
for i in range(2):
    for j in range(2):
        axes[0].text(j, i, str(cm[i, j]), ha='center', va='center', fontsize=20)
axes[0].set_title('Confusion Matrix')
plt.colorbar(im, ax=axes[0])

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)
axes[1].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# 3. Prediction Distribution
axes[2].hist(y_prob[y_true == 0], bins=20, alpha=0.5, label='Actual 0', density=True)
axes[2].hist(y_prob[y_true == 1], bins=20, alpha=0.5, label='Actual 1', density=True)
axes[2].axvline(x=0.5, color='r', linestyle='--', label='Threshold')
axes[2].set_xlabel('Predicted Probability')
axes[2].set_ylabel('Density')
axes[2].set_title('Prediction Distribution')
axes[2].legend()

plt.tight_layout()
plt.show()

Visualizing Training Progress

Monitor model training to detect issues:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

# Simulated training history
epochs = 100
train_loss = 2 * np.exp(-np.arange(epochs) / 20) + 0.1 + np.random.randn(epochs) * 0.05
val_loss = 2 * np.exp(-np.arange(epochs) / 25) + 0.2 + np.random.randn(epochs) * 0.05
val_loss[60:] += np.arange(40) * 0.01  # Simulate overfitting

train_acc = 1 - train_loss / 3 + np.random.randn(epochs) * 0.02
val_acc = 1 - val_loss / 3 + np.random.randn(epochs) * 0.02

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
axes[0].plot(train_loss, label='Training Loss', linewidth=2)
axes[0].plot(val_loss, label='Validation Loss', linewidth=2)
axes[0].axvline(x=60, color='r', linestyle='--', alpha=0.5, label='Overfitting starts')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy curves
axes[1].plot(train_acc, label='Training Accuracy', linewidth=2)
axes[1].plot(val_acc, label='Validation Accuracy', linewidth=2)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training and Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Signs of overfitting:")
print("• Training loss decreases while validation loss increases")
print("• Gap between training and validation metrics grows")
print("• Consider early stopping, regularization, or more data")

Visualization Best Practices

Effective visualization follows principles that enhance understanding:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

# Example: Before and After applying best practices
np.random.seed(42)
x = np.arange(5)
y1 = np.random.randint(20, 80, 5)
y2 = np.random.randint(20, 80, 5)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# BAD: Cluttered, unclear
axes[0].bar(x - 0.2, y1, 0.4, color='red')
axes[0].bar(x + 0.2, y2, 0.4, color='blue')
axes[0].set_title('BAD: Default Everything')
# No labels, no legend, poor colors

# GOOD: Clear, informative
bars1 = axes[1].bar(x - 0.2, y1, 0.4, label='Method A', color='#2ecc71', edgecolor='black')
bars2 = axes[1].bar(x + 0.2, y2, 0.4, label='Method B', color='#3498db', edgecolor='black')
axes[1].set_xlabel('Category', fontsize=12)
axes[1].set_ylabel('Performance Score', fontsize=12)
axes[1].set_title('GOOD: Clear Labels and Legend', fontsize=14)
axes[1].set_xticks(x)
axes[1].set_xticklabels(['Cat A', 'Cat B', 'Cat C', 'Cat D', 'Cat E'])
axes[1].legend(loc='upper right')
axes[1].set_ylim(0, 100)
axes[1].grid(True, axis='y', alpha=0.3)

# Add value labels on bars
for bar in bars1:
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
                 f'{bar.get_height():.0f}', ha='center', fontsize=9)
for bar in bars2:
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
                 f'{bar.get_height():.0f}', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

print("\nVisualization Best Practices:")
print("• Always label axes with units")
print("• Include legends when showing multiple series")
print("• Use colorblind-friendly palettes")
print("• Start y-axis at 0 for bar charts (usually)")
print("• Avoid 3D charts—they distort perception")
print("• Keep it simple—remove chartjunk")
print("• Use appropriate plot types for your data")

Saving Figures

Save publication-quality figures:

PYTHON
import matplotlib.pyplot as plt
import numpy as np

# Create a figure
fig, ax = plt.subplots(figsize=(8, 6))
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='sin(x)')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Example Figure')
ax.legend()

# Save in different formats
# fig.savefig('figure.png', dpi=300, bbox_inches='tight')  # Raster, good for web
# fig.savefig('figure.pdf', bbox_inches='tight')           # Vector, good for papers
# fig.savefig('figure.svg', bbox_inches='tight')           # Vector, good for editing

print("Saving figures:")
print("• PNG (dpi=300): Web, presentations")
print("• PDF: Publications, scalable")
print("• SVG: Web, editable in vector software")
print("• bbox_inches='tight': Removes excess whitespace")

plt.show()

Key Takeaways

  • Matplotlib provides low-level control; Seaborn adds statistical plots and better defaults
  • Choose plot types based on data: scatter for relationships, histogram for distributions, line for trends
  • Correlation heatmaps and pair plots reveal feature relationships
  • Training curves diagnose overfitting and convergence issues
  • Confusion matrices and ROC curves assess classifier performance
  • Follow best practices: label axes, include legends, use appropriate colors
  • Save figures at high DPI for publications; use vector formats when possible

4.4 Data Preprocessing Intermediate

Data Preprocessing

Raw data is rarely suitable for machine learning algorithms. It contains missing values, inconsistent scales, categorical variables that need encoding, and outliers that can derail training. Data preprocessing transforms messy real-world data into the clean numerical format that algorithms expect.

Preprocessing decisions significantly impact model performance—sometimes more than algorithm choice. This section covers the essential techniques you'll apply to nearly every ML project.

The Preprocessing Pipeline

A typical preprocessing workflow:

  1. Handle missing values: Drop, impute, or flag
  2. Encode categorical variables: Convert strings to numbers
  3. Scale numerical features: Normalize or standardize
  4. Handle outliers: Detect and address extreme values
  5. Feature engineering: Create new informative features

Scikit-learn's Pipeline and ColumnTransformer help organize these steps systematically.

Handling Missing Values

Missing data requires explicit handling. The right strategy depends on why data is missing:

Missing Completely at Random (MCAR): Missingness unrelated to any values. Safe to drop.

Missing at Random (MAR): Missingness related to observed data. Imputation often works.

Missing Not at Random (MNAR): Missingness related to the missing value itself. Most problematic.

PYTHON
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Create data with missing values
np.random.seed(42)
df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan, 55, 30, np.nan],
    'income': [50000, 60000, np.nan, 80000, 45000, np.nan, 55000, 70000],
    'category': ['A', 'B', np.nan, 'A', 'B', 'A', np.nan, 'B']
})

print("Original data:")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")

# Strategy 1: Drop rows with any missing values
df_dropped = df.dropna()
print(f"\nAfter dropping: {len(df_dropped)} rows remain")

# Strategy 2: Simple imputation
imputer_mean = SimpleImputer(strategy='mean')
imputer_mode = SimpleImputer(strategy='most_frequent')

df_imputed = df.copy()
df_imputed[['age', 'income']] = imputer_mean.fit_transform(df[['age', 'income']])
df_imputed[['category']] = imputer_mode.fit_transform(df[['category']])
print("\nAfter simple imputation:")
print(df_imputed)

# Strategy 3: KNN imputation (uses similar samples)
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = df.copy()
df_knn[['age', 'income']] = knn_imputer.fit_transform(df[['age', 'income']])
print("\nAfter KNN imputation:")
print(df_knn[['age', 'income']])

Guidelines:

  • Drop rows if <5% missing and MCAR
  • Mean/median imputation is simple but reduces variance
  • KNN imputation preserves relationships but is slower
  • Consider adding a "missing" indicator feature (sketched just below)
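
A minimal sketch of the indicator idea, using a hypothetical single-column DataFrame (the column names are illustrative):

PYTHON
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [50000, np.nan, 80000, np.nan]})

# Record which rows were missing before imputing, then fill with the median
df['income_missing'] = df['income'].isnull().astype(int)
df['income'] = df['income'].fillna(df['income'].median())

print(df)
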
Encoding Categorical Variables

ML algorithms need numerical inputs. Categorical encoding converts strings to numbers:

PYTHON
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red'],
    'size': ['small', 'medium', 'large', 'small', 'large'],
    'price': [10, 20, 30, 15, 25]
})

print("Original data:")
print(df)

# 1. Label Encoding: Assigns integers to categories
# Use for: ordinal categories, target variable, tree-based models
le = LabelEncoder()
df['color_label'] = le.fit_transform(df['color'])
print("\nLabel encoding (color):")
print(f"Mapping: {dict(zip(le.classes_, range(len(le.classes_))))}")
print(df[['color', 'color_label']])

# 2. One-Hot Encoding: Creates binary columns for each category
# Use for: nominal categories, linear models, neural networks
ohe = OneHotEncoder(sparse_output=False)
color_ohe = ohe.fit_transform(df[['color']])
print("\nOne-hot encoding (color):")
print(pd.DataFrame(color_ohe, columns=ohe.get_feature_names_out(['color'])))

# 3. Ordinal Encoding: For categories with natural order
# Use for: ordinal categories like size, education level
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_ordinal'] = oe.fit_transform(df[['size']])
print("\nOrdinal encoding (size: small=0, medium=1, large=2):")
print(df[['size', 'size_ordinal']])

When to use each:

  • One-hot: Nominal categories with <15 unique values
  • Label/Ordinal: Ordinal categories, tree-based models, or target encoding
  • Target encoding: High-cardinality categories (but watch for leakage!)

PYTHON
import pandas as pd
import numpy as np

# Target encoding example (mean of target per category)
np.random.seed(42)
df = pd.DataFrame({
    'city': ['NYC'] * 20 + ['LA'] * 20 + ['Chicago'] * 10,
    'target': np.concatenate([
        np.random.binomial(1, 0.7, 20),  # NYC: 70% positive
        np.random.binomial(1, 0.3, 20),  # LA: 30% positive
        np.random.binomial(1, 0.5, 10)   # Chicago: 50% positive
    ])
})

# Compute target encoding (careful: do this only on training data!)
target_means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(target_means)

print("Target Encoding:")
print(f"City means: {target_means.to_dict()}")
print(df.head(10))
print("\nWarning: Use only training data means to avoid leakage!")

Feature Scaling

Many algorithms (linear models, neural networks, KNN, SVM) are sensitive to feature scales. Scaling puts features on comparable ranges:

PYTHON
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Features with very different scales
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(18, 80, 100),           # Range: ~18-80
    'income': np.random.randint(20000, 200000, 100), # Range: 20k-200k
    'score': np.random.random(100)                    # Range: 0-1
})

print("Original statistics:")
print(df.describe().round(2))

# StandardScaler: mean=0, std=1
# Use for: most algorithms, normally distributed data
scaler_standard = StandardScaler()
df_standard = pd.DataFrame(
    scaler_standard.fit_transform(df),
    columns=[f'{c}_standard' for c in df.columns]
)
print("\nStandardScaler (z-score):")
print(df_standard.describe().round(2))

# MinMaxScaler: scales to [0, 1]
# Use for: neural networks, algorithms requiring bounded input
scaler_minmax = MinMaxScaler()
df_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(df),
    columns=[f'{c}_minmax' for c in df.columns]
)
print("\nMinMaxScaler [0, 1]:")
print(df_minmax.describe().round(2))

# RobustScaler: uses median and IQR (robust to outliers)
# Use for: data with outliers
scaler_robust = RobustScaler()
df_robust = pd.DataFrame(
    scaler_robust.fit_transform(df),
    columns=[f'{c}_robust' for c in df.columns]
)
print("\nRobustScaler (median, IQR):")
print(df_robust.describe().round(2))

Important: Always fit scalers on training data only, then transform both train and test:

PYTHON
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Fit on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit AND transform
X_test_scaled = scaler.transform(X_test)        # transform only (no fit!)

print("Correct scaling workflow:")
print(f"Train mean: {X_train_scaled.mean(axis=0).round(4)}")  # Should be ~0
print(f"Test mean: {X_test_scaled.mean(axis=0).round(4)}")    # Might not be 0

Handling Outliers

Outliers can disproportionately influence models. Detection and handling strategies:

PYTHON
import numpy as np
import pandas as pd

np.random.seed(42)

# Data with outliers
data = np.random.randn(100) * 10 + 50
data[95:] = [150, 160, 170, 180, 200]  # Add outliers

df = pd.DataFrame({'value': data})

# Detection method 1: Z-score
z_scores = (df['value'] - df['value'].mean()) / df['value'].std()
outliers_zscore = np.abs(z_scores) > 3
print(f"Z-score method (|z| > 3): {outliers_zscore.sum()} outliers")

# Detection method 2: IQR
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (df['value'] < lower_bound) | (df['value'] > upper_bound)
print(f"IQR method: {outliers_iqr.sum()} outliers")
print(f"Bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")

# Handling strategies
print("\nHandling strategies:")

# 1. Remove outliers
df_removed = df[~outliers_iqr]
print(f"1. Remove: {len(df_removed)} rows remain")

# 2. Cap/clip outliers (winsorization)
df_capped = df.copy()
df_capped['value'] = df['value'].clip(lower_bound, upper_bound)
print(f"2. Cap: max value now {df_capped['value'].max():.2f}")

# 3. Transform (log, sqrt)
df_log = df.copy()
df_log['value_log'] = np.log1p(df['value'])  # log(1+x) handles zeros
print(f"3. Log transform: reduces impact of large values")

# 4. Use robust methods (median, MAD, robust scalers)
print(f"4. Use robust statistics or RobustScaler")

Building Preprocessing Pipelines

Scikit-learn's Pipeline chains preprocessing steps, ensuring consistent application:

PYTHON
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Sample dataset
np.random.seed(42)
df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, 55, 30, np.nan, 40],
    'income': [50000, 60000, np.nan, 80000, 90000, 55000, 70000, np.nan],
    'category': ['A', 'B', 'A', np.nan, 'B', 'A', 'B', 'A'],
    'target': [0, 1, 0, 1, 1, 0, 1, 0]
})

X = df.drop('target', axis=1)
y = df['target']

# Define transformations for different column types
numeric_features = ['age', 'income']
categorical_features = ['category']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Split and transform
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Pipeline Result:")
print(f"Original columns: {X.columns.tolist()}")
print(f"Processed shape: {X_train_processed.shape}")
print(f"\nProcessed features (train):\n{X_train_processed}")

Full Pipeline with Model

The best practice is to include preprocessing in your model pipeline:

PYTHON
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd

# Create full pipeline including model
np.random.seed(42)

# Sample data
n = 200
df = pd.DataFrame({
    'num1': np.random.randn(n) * 10 + 50,
    'num2': np.random.randn(n) * 100 + 500,
    'cat1': np.random.choice(['A', 'B', 'C'], n),
    'target': np.random.binomial(1, 0.5, n)
})

# Add some missing values
df.loc[np.random.choice(n, 10), 'num1'] = np.nan
df.loc[np.random.choice(n, 10), 'cat1'] = np.nan

X = df.drop('target', axis=1)
y = df['target']

# Full pipeline
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler())
    ]), ['num1', 'num2']),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OneHotEncoder(drop='first', sparse_output=False))
    ]), ['cat1'])
])

full_pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('model', LogisticRegression(random_state=42))
])

# Cross-validation (preprocessing applied correctly to each fold!)
scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
print("Full Pipeline with Cross-Validation")
print("-" * 50)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")
print("\nAdvantages of pipelines:")
print("• Preprocessing fits on training fold only")
print("• No data leakage between folds")
print("• Easy to serialize and deploy")
print("• Reproducible and maintainable")

Common Preprocessing Mistakes

Avoid these pitfalls:

PYTHON
print("Common Preprocessing Mistakes")
print("=" * 50)

print("""
1. DATA LEAKAGE
   Wrong: Scale entire dataset, then split
   Right: Split first, fit scaler on train only

2. INCONSISTENT ENCODING
   Wrong: Fit encoder separately on train and test
   Right: Fit on train, transform both

3. IGNORING TEST CATEGORIES
   Wrong: Assume test has same categories as train
   Right: Use handle_unknown='ignore' in OneHotEncoder

4. SCALING AFTER ENCODING
   Wrong: Scale one-hot encoded columns (already 0/1)
   Right: Scale only numerical columns

5. LEAKING TARGET INFO
   Wrong: Use target encoding with all data
   Right: Compute means on training data only

6. DROPPING TOO MUCH
   Wrong: Drop all rows with any missing value
   Right: Consider imputation, especially if many rows affected

7. FORGETTING TO PERSIST
   Wrong: Fit new preprocessors for inference
   Right: Save fitted preprocessors with the model
""")

Key Takeaways

  • Missing data: Choose strategy based on amount and mechanism (drop, impute, flag)
  • Categorical encoding: One-hot for nominal, ordinal for ordered, target for high cardinality
  • Scaling: StandardScaler for most cases, RobustScaler for outliers, MinMaxScaler for bounded algorithms
  • Outliers: Detect with IQR or z-score; handle by removing, capping, or transforming
  • Pipelines: Use sklearn Pipeline and ColumnTransformer for reproducible, leak-free preprocessing
  • Golden rule: Always fit preprocessors on training data only

4.5 Exploratory Data Analysis Beginner

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of understanding your data before building models. It reveals patterns, detects anomalies, tests assumptions, and guides feature engineering decisions. Rushing past EDA often leads to poor models and wasted time debugging issues that early exploration would have caught.

The EDA Mindset

EDA is detective work. You're asking questions:

  • What does each feature represent?
  • What are the distributions?
  • How do features relate to each other and to the target?
  • What's unusual, missing, or wrong?
  • What transformations might help?

Good EDA is iterative—answers lead to new questions.

Step 1: First Look at the Data

Always start with basic inspection:

PYTHON
import pandas as pd
import numpy as np

# Create sample dataset (simulating customer data)
np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'tenure_months': np.random.randint(1, 120, n),
    'num_products': np.random.choice([1, 2, 3, 4], n, p=[0.5, 0.3, 0.15, 0.05]),
    'has_credit_card': np.random.choice([0, 1], n, p=[0.3, 0.7]),
    'is_active': np.random.choice([0, 1], n, p=[0.2, 0.8]),
    'country': np.random.choice(['USA', 'UK', 'Germany', 'France'], n, p=[0.4, 0.3, 0.2, 0.1]),
    'churned': np.random.choice([0, 1], n, p=[0.8, 0.2])
})

# Add some missing values
df.loc[np.random.choice(n, 50), 'income'] = np.nan
df.loc[np.random.choice(n, 30), 'tenure_months'] = np.nan

print("=" * 60)
print("STEP 1: FIRST LOOK")
print("=" * 60)

# Basic info
print("\n1.1 Shape and structure:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n1.2 Data types:")
print(df.dtypes)

print("\n1.3 First few rows:")
print(df.head())

print("\n1.4 Basic statistics:")
print(df.describe())

print("\n1.5 Missing values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'count': missing, 'percent': missing_pct})[missing > 0])

Step 2: Understanding Distributions

Examine each feature's distribution:

PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Continue with our dataset
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'tenure_months': np.random.randint(1, 120, n),
    'num_products': np.random.choice([1, 2, 3, 4], n, p=[0.5, 0.3, 0.15, 0.05]),
    'country': np.random.choice(['USA', 'UK', 'Germany', 'France'], n, p=[0.4, 0.3, 0.2, 0.1]),
    'churned': np.random.choice([0, 1], n, p=[0.8, 0.2])
})

print("=" * 60)
print("STEP 2: DISTRIBUTIONS")
print("=" * 60)

# Numerical features
print("\n2.1 Numerical feature statistics:")
numerical_cols = df.select_dtypes(include=[np.number]).columns
for col in ['age', 'income', 'tenure_months']:
    series = df[col]
    print(f"\n{col}:")
    print(f"  Range: [{series.min():.2f}, {series.max():.2f}]")
    print(f"  Mean: {series.mean():.2f}, Median: {series.median():.2f}")
    print(f"  Std: {series.std():.2f}")
    print(f"  Skewness: {series.skew():.2f}")

# Check for skewness
print("\n2.2 Skewness interpretation:")
print("  |skew| < 0.5: approximately symmetric")
print("  0.5 < |skew| < 1: moderately skewed")
print("  |skew| > 1: highly skewed (consider transformation)")

# Categorical features
print("\n2.3 Categorical value counts:")
for col in ['num_products', 'country']:
    print(f"\n{col}:")
    print(df[col].value_counts())
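
When a feature is highly skewed (|skew| > 1), a log transform is a common remedy. A minimal sketch using the same simulated income column (whether to transform is ultimately a modeling decision):

PYTHON
import numpy as np
import pandas as pd

# Income was drawn from a lognormal, so it is strongly right-skewed;
# log1p (log(1 + x)) pulls it back toward symmetry and handles zeros safely.
np.random.seed(42)
income = pd.Series(np.random.lognormal(10.5, 0.5, 1000), name='income')

print(f"Skewness before:      {income.skew():.2f}")
print(f"Skewness after log1p: {np.log1p(income).skew():.2f}")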

Visualize distributions:

PYTHON
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Sample data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'tenure_months': np.random.randint(1, 120, n),
    'churned': np.random.choice([0, 1], n, p=[0.8, 0.2])
})

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Histograms with KDE
for i, col in enumerate(['age', 'income', 'tenure_months']):
    sns.histplot(df[col], kde=True, ax=axes[0, i])
    axes[0, i].set_title(f'Distribution of {col}')
    axes[0, i].axvline(df[col].mean(), color='r', linestyle='--', label='Mean')
    axes[0, i].axvline(df[col].median(), color='g', linestyle='--', label='Median')
    axes[0, i].legend()

# Box plots
for i, col in enumerate(['age', 'income', 'tenure_months']):
    sns.boxplot(y=df[col], ax=axes[1, i])
    axes[1, i].set_title(f'Box plot of {col}')

plt.tight_layout()
plt.show()

print("What to look for:")
print("• Shape: normal, skewed, bimodal?")
print("• Outliers: points beyond whiskers in box plots")
print("• Mean vs Median: large gap indicates skewness")

Step 3: Target Variable Analysis

Understanding the target is crucial for model selection:

PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 1000
df = pd.DataFrame({
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'churned': np.random.choice([0, 1], n, p=[0.8, 0.2])
})

print("=" * 60)
print("STEP 3: TARGET ANALYSIS")
print("=" * 60)

# For classification: class distribution
print("\n3.1 Class distribution (classification):")
target_counts = df['churned'].value_counts()
target_pcts = df['churned'].value_counts(normalize=True) * 100
print(pd.DataFrame({'count': target_counts, 'percent': target_pcts.round(2)}))

# Check for imbalance
imbalance_ratio = target_counts.max() / target_counts.min()
print(f"\nImbalance ratio: {imbalance_ratio:.2f}")
if imbalance_ratio > 3:
    print("⚠ Warning: Class imbalance detected!")
    print("  Consider: SMOTE, class weights, or different metrics")
else:
    print("✓ Classes are reasonably balanced")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
df['churned'].value_counts().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title('Target Class Distribution')
axes[0].set_xlabel('Churned')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No (0)', 'Yes (1)'], rotation=0)

# For regression targets: distribution
continuous_target = df['income']  # Using income as example
sns.histplot(continuous_target, kde=True, ax=axes[1])
axes[1].set_title('Continuous Target Distribution')
axes[1].axvline(continuous_target.mean(), color='r', linestyle='--', label='Mean')

plt.tight_layout()
plt.show()
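
If the imbalance warning above fires, class weights are often the simplest first remedy. A minimal sketch with scikit-learn's helper (labels simulated with the same 80/20 split; many estimators also accept class_weight='balanced' directly):

PYTHON
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are inversely proportional to class frequencies:
# weight = n_samples / (n_classes * count_of_class)
np.random.seed(42)
y = np.random.choice([0, 1], 1000, p=[0.8, 0.2])

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
for cls, w in zip(classes, weights):
    print(f"  class {cls}: weight {w:.3f}")  # minority class gets the larger weight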

Step 4: Feature-Target Relationships

How do features relate to the target?

PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 1000

# Create data with actual relationships
age = np.random.randint(18, 70, n)
income = np.random.lognormal(10.5, 0.5, n)
tenure = np.random.randint(1, 120, n)

# Churn probability depends on features
churn_prob = 0.1 + 0.3 * (age < 30) + 0.2 * (tenure < 24) - 0.1 * (income > 50000)
churn_prob = np.clip(churn_prob, 0.05, 0.95)
churned = np.random.binomial(1, churn_prob)

df = pd.DataFrame({
    'age': age,
    'income': income,
    'tenure_months': tenure,
    'churned': churned
})

print("=" * 60)
print("STEP 4: FEATURE-TARGET RELATIONSHIPS")
print("=" * 60)

# Numerical features vs binary target
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, col in enumerate(['age', 'income', 'tenure_months']):
    sns.boxplot(x='churned', y=col, data=df, ax=axes[i])
    axes[i].set_title(f'{col} by Churn Status')
    axes[i].set_xticklabels(['Not Churned', 'Churned'])

plt.tight_layout()
plt.show()

# Statistical comparison
print("\n4.1 Feature means by target class:")
print(df.groupby('churned')[['age', 'income', 'tenure_months']].mean().round(2))

# Calculate effect sizes
print("\n4.2 Effect sizes (difference in means / pooled std):")
for col in ['age', 'income', 'tenure_months']:
    group0 = df[df['churned'] == 0][col]
    group1 = df[df['churned'] == 1][col]
    pooled_std = np.sqrt((group0.std()**2 + group1.std()**2) / 2)
    effect_size = (group1.mean() - group0.mean()) / pooled_std
    print(f"  {col}: {effect_size:.3f}")

print("\n  Interpretation: |d| < 0.2 small, 0.2-0.8 medium, > 0.8 large")

Step 5: Feature Correlations

Identify redundant features and multicollinearity:

PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 1000

# Create correlated features
x1 = np.random.randn(n)
x2 = x1 * 0.8 + np.random.randn(n) * 0.4  # Highly correlated with x1
x3 = np.random.randn(n)
x4 = x3 * -0.6 + np.random.randn(n) * 0.6  # Negatively correlated with x3
x5 = np.random.randn(n)  # Independent

df = pd.DataFrame({
    'feature_1': x1,
    'feature_2': x2,
    'feature_3': x3,
    'feature_4': x4,
    'feature_5': x5,
    'target': (x1 + x3 + np.random.randn(n) * 0.5 > 0).astype(int)
})

print("=" * 60)
print("STEP 5: CORRELATION ANALYSIS")
print("=" * 60)

# Correlation matrix
corr_matrix = df.corr()
print("\n5.1 Correlation matrix:")
print(corr_matrix.round(3))

# Visualize
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
            mask=mask, square=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

# Identify highly correlated pairs
print("\n5.2 Highly correlated feature pairs (|r| > 0.7):")
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j],
                            corr_matrix.iloc[i, j]))

if high_corr:
    for f1, f2, r in high_corr:
        print(f"  {f1} vs {f2}: r = {r:.3f}")
    print("\n⚠ Consider removing one feature from highly correlated pairs")
else:
    print("  No highly correlated pairs found")

# Correlation with target
print("\n5.3 Correlation with target:")
target_corr = corr_matrix['target'].drop('target').sort_values(key=abs, ascending=False)
print(target_corr.round(3))
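
Pairwise correlations can miss multicollinearity that involves several features at once. The variance inflation factor (VIF) catches this; a minimal sketch assuming statsmodels is installed (a common rule of thumb flags VIF above roughly 5-10):

PYTHON
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Recreate two correlated features and one independent feature
np.random.seed(42)
n = 1000
x1 = np.random.randn(n)
x2 = x1 * 0.8 + np.random.randn(n) * 0.4
x3 = np.random.randn(n)
X = pd.DataFrame({'feature_1': x1, 'feature_2': x2, 'feature_3': x3})

# Add an intercept column, since VIF assumes a regression with an intercept
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns, name='VIF'
)
print(vif.round(2))  # feature_1 and feature_2 should stand out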

Step 6: Identifying Data Quality Issues

Systematic checks for problems:

PYTHON
import pandas as pd
import numpy as np

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'id': range(n),
    'age': np.random.randint(18, 70, n),
    'income': np.random.lognormal(10.5, 0.5, n),
    'score': np.random.randint(0, 100, n),
    'category': np.random.choice(['A', 'B', 'C', 'Other', 'other', ''], n)
})

# Add some issues
df.loc[50, 'age'] = 150  # Impossible value
df.loc[51, 'age'] = -5   # Negative age
df = pd.concat([df, df.iloc[100:105]], ignore_index=True)  # Append copies of a few rows to create duplicates
df.loc[200:210, 'income'] = np.nan  # Missing values

print("=" * 60)
print("STEP 6: DATA QUALITY CHECKS")
print("=" * 60)

# Check 1: Duplicates
n_duplicates = df.duplicated().sum()
print(f"\n6.1 Duplicate rows: {n_duplicates}")
if n_duplicates > 0:
    print("  ⚠ Duplicates found - investigate and deduplicate")

# Check 2: Missing values
print("\n6.2 Missing values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'count': missing, 'percent': missing_pct})
print(missing_df[missing > 0])

# Check 3: Impossible values
print("\n6.3 Value range checks:")
print(f"  Age range: [{df['age'].min()}, {df['age'].max()}]")
if df['age'].min() < 0 or df['age'].max() > 120:
    print("  ⚠ Age values outside reasonable range!")

# Check 4: Inconsistent categories
print("\n6.4 Category values:")
print(f"  Unique categories: {df['category'].unique()}")
print("  ⚠ Check for: typos, case inconsistency, empty strings")

# Check 5: Cardinality
print("\n6.5 Cardinality check:")
for col in df.select_dtypes(include=['object']).columns:
    n_unique = df[col].nunique()
    print(f"  {col}: {n_unique} unique values")
    if n_unique > 50:
        print(f"  ⚠ High cardinality - consider grouping or encoding")

# Summary
print("\n" + "=" * 60)
print("DATA QUALITY SUMMARY")
print("=" * 60)
issues = []
if n_duplicates > 0:
    issues.append(f"• {n_duplicates} duplicate rows")
if missing.sum() > 0:
    issues.append(f"• {missing.sum()} total missing values")
if df['age'].min() < 0:
    issues.append("• Negative age values")
if df['age'].max() > 120:
    issues.append("• Unrealistic age values")

if issues:
    print("\nIssues to address:")
    for issue in issues:
        print(issue)
else:
    print("\n✓ No major quality issues detected")

EDA Checklist

A systematic approach to ensure thorough exploration:

PYTHON
print("""
EDA CHECKLIST
=============

□ DATA OVERVIEW
  □ Shape (rows, columns)
  □ Data types
  □ Memory usage
  □ Sample rows (head, tail, random)

□ MISSING DATA
  □ Count per column
  □ Percentage per column
  □ Patterns (MCAR, MAR, MNAR?)

□ NUMERICAL FEATURES
  □ Summary statistics (mean, median, std, min, max)
  □ Distributions (histograms, KDE)
  □ Skewness and kurtosis
  □ Outliers (box plots, IQR, z-scores)

□ CATEGORICAL FEATURES
  □ Unique values
  □ Value counts
  □ Cardinality
  □ Inconsistencies (typos, case)

□ TARGET VARIABLE
  □ Distribution
  □ Class balance (classification)
  □ Range and spread (regression)

□ RELATIONSHIPS
  □ Numerical correlations
  □ Feature vs target
  □ Pairwise scatter plots
  □ Grouped statistics

□ DATA QUALITY
  □ Duplicates
  □ Impossible values
  □ Inconsistent formats
  □ Data type mismatches

□ FEATURE ENGINEERING IDEAS
  □ Transformations needed?
  □ Interactions worth exploring?
  □ Binning opportunities?
  □ Features to drop?
""")

Automated EDA Tools

Several libraries automate EDA:

PYTHON
print("Automated EDA Libraries")
print("=" * 50)

print("""
1. pandas-profiling (ydata-profiling)
   from ydata_profiling import ProfileReport
   profile = ProfileReport(df)
   profile.to_file("report.html")

   Generates: comprehensive HTML report with all statistics

2. sweetviz
   import sweetviz as sv
   report = sv.analyze(df)
   report.show_html("report.html")

   Generates: comparative analysis, target-focused

3. dtale
   import dtale
   dtale.show(df)

   Provides: interactive web-based exploration

4. autoviz
   from autoviz.AutoViz_Class import AutoViz_Class
   AV = AutoViz_Class()
   AV.AutoViz("data.csv")

   Generates: automatic visualization selection

Note: These are great for quick overviews but don't replace
thoughtful manual exploration for important projects.
""")

Key Takeaways

  • Start simple: Shape, types, head/tail, describe
  • Check distributions: Histograms, box plots, skewness
  • Examine the target: Balance (classification), spread (regression)
  • Find relationships: Correlation matrix, feature-target plots
  • Assess quality: Missing values, duplicates, impossible values
  • Document findings: Keep notes on issues and insights
  • Iterate: Each answer leads to new questions
  • EDA is not optional—it prevents costly mistakes downstream