Train/Validation/Test Split
How do you know if your machine learning model will work on new data it has never seen? You can't wait until deployment to find out—by then, poor predictions could have real consequences. The solution is to simulate the future by holding out part of your data during development.
Proper data splitting is fundamental to building reliable models. It's also one of the most common sources of mistakes, leading to overly optimistic performance estimates that collapse when models meet reality.
Why Split Data?
Machine learning models are excellent at finding patterns—sometimes too excellent. Given enough capacity, a model can memorize the training data perfectly, achieving zero training error while learning nothing generalizable. This is overfitting: the model captures noise and idiosyncrasies of the training set rather than the underlying signal.
To detect overfitting and estimate real-world performance, we need data the model has never seen during training. This is the purpose of data splitting: reserve some data exclusively for evaluation.
Consider this analogy: if a student sees the exact exam questions while studying, their test score won't reflect their true understanding. We need unseen questions to assess genuine learning.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Generate simple data
np.random.seed(42)
X = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + np.random.randn(30) * 0.2
# Fit a high-degree polynomial (will overfit)
poly = PolynomialFeatures(degree=15)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
# Training error looks great!
train_pred = model.predict(X_poly)
train_mse = mean_squared_error(y, train_pred)
print(f"Training MSE: {train_mse:.4f}")
print("Looks great! But is this model actually good?")
print("We can't know without testing on unseen data.")The Three-Way Split
The standard approach divides data into three sets:
Training set (60-80%): Used to fit model parameters. The model sees these examples during learning. All gradient computations, weight updates, and pattern extraction happen here.
Validation set (10-20%): Used for model selection and hyperparameter tuning. We evaluate candidate models on validation data to choose the best one. The model never trains on this data, but we use it to make decisions.
Test set (10-20%): Held out completely until final evaluation. Touched only once to get an unbiased estimate of real-world performance. Never used for any decisions during development.
The key insight is that any data used to make decisions becomes "seen" in some sense. If you tune hyperparameters based on test performance, you're implicitly fitting to the test set, making your final estimate optimistic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
# First split: separate test set (final evaluation only)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)
print("Data Split:")
print(f" Training: {len(X_train):4d} samples ({len(X_train)/len(X):.0%})")
print(f" Validation: {len(X_val):4d} samples ({len(X_val)/len(X):.0%})")
print(f" Test: {len(X_test):4d} samples ({len(X_test)/len(X):.0%})")
print(f" Total: {len(X):4d} samples")The Workflow
A proper machine learning workflow respects the data splits:
- Split data into train/validation/test before any analysis
- Explore and preprocess using only training data statistics
- Train models on training data
- Evaluate and tune using validation data
- Select final model based on validation performance
- Report final performance on test data (once!)
The test set is like a sealed envelope. Opening it should happen only at the very end, and only once.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
# Setup: split data
np.random.seed(42)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Step 2: Preprocess (fit on training only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# Steps 3-4: Train and evaluate multiple models on validation
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
print("Model Selection (using validation set):")
best_model = None
best_val_acc = 0
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val_scaled))
    print(f" {name}: {val_acc:.4f}")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_model = model
        best_name = name
# Step 5-6: Final evaluation on test set (only once!)
test_acc = accuracy_score(y_test, best_model.predict(X_test_scaled))
print(f"\nSelected model: {best_name}")
print(f"Final test accuracy: {test_acc:.4f}")Stratified Splitting
For classification with imbalanced classes, random splitting might put all rare class examples in one set by chance. Stratified splitting ensures each split has the same class proportions as the original data.
from sklearn.model_selection import train_test_split
import numpy as np
# Imbalanced dataset: 90% class 0, 10% class 1
np.random.seed(42)
X = np.random.randn(1000, 5)
y = np.array([0]*900 + [1]*100)
np.random.shuffle(y)
# Regular split (might have imbalanced splits)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Stratified split (preserves class proportions)
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Class Distribution Comparison:")
print(f"Original: {np.mean(y):.2%} class 1")
print(f"Regular test set: {np.mean(y_test_reg):.2%} class 1")
print(f"Stratified test set: {np.mean(y_test_strat):.2%} class 1")
print("\nStratified splitting preserves class balance!")Time Series Splitting
For time series data, random splitting violates temporal ordering—you'd be using future data to predict the past. Instead, use temporal splitting: train on earlier data, test on later data.
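For cross-validation that respects time order, scikit-learn also provides TimeSeriesSplit, which yields expanding-window folds in which every training index precedes every test index. A minimal sketch (the toy array below is just a placeholder):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 time-ordered observations (placeholder features)
X_ts = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_ts)):
    # Each fold trains on an expanding window of the past and tests on the next block
    print(f"Fold {fold}: train [0..{train_idx[-1]}], test [{test_idx[0]}..{test_idx[-1]}]")

The example below applies the simpler single holdout split to a simulated series.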
import numpy as np
import pandas as pd
# Simulated time series
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + np.arange(365) * 0.1
df = pd.DataFrame({'date': dates, 'value': values})
# Wrong: random split (data leakage!)
# train, test = train_test_split(df, test_size=0.2)
# Correct: temporal split
train_size = int(len(df) * 0.8)
train = df.iloc[:train_size]
test = df.iloc[train_size:]
print("Time Series Split:")
print(f"Training: {train['date'].min()} to {train['date'].max()} ({len(train)} days)")
print(f"Test: {test['date'].min()} to {test['date'].max()} ({len(test)} days)")
print("\nTraining data is strictly before test data.")Data Leakage
Data leakage occurs when information from outside the training set improperly influences model training. This leads to overly optimistic performance estimates that don't reflect real-world performance.
Common sources of leakage:
Preprocessing on full data: Fitting scalers, encoders, or imputers on all data before splitting. The training set then contains information about the test set.
Feature engineering using future information: For time series, using features that wouldn't be available at prediction time.
Target leakage: Features that encode the target variable or are caused by it rather than predicting it.
Duplicate or near-duplicate data: Same examples appearing in both train and test sets. Simple guards against the time-based and duplicate sources are sketched below.
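As a rough sketch of those two guards (assuming pandas is available; the data below is made up): build time-based features only from strictly past values, and drop duplicate rows before splitting.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Time-based features: shift by one so each row sees only strictly past values
s = pd.Series(rng.normal(size=100).cumsum())
leaky_feature = s.rolling(window=5, center=True).mean()  # centered window peeks at the future
safe_feature = s.shift(1).rolling(window=5).mean()       # uses only past observations

# Duplicates: remove them before splitting so the same row can't land in both sets
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=[f"f{i}" for i in range(5)])
df = pd.concat([df, df.iloc[:50]])                       # simulate accidental duplication
df = df.drop_duplicates().reset_index(drop=True)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f"Rows after deduplication: {len(df)} (train {len(train_df)}, test {len(test_df)})")

The next example demonstrates the first and most common source, preprocessing on the full data, and how to avoid it.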
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
np.random.seed(42)
X = np.random.randn(1000, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# WRONG: Scale before splitting (leakage!)
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X) # Sees all data!
X_train_wrong, X_test_wrong, y_train, y_test = train_test_split(
    X_scaled_wrong, y, test_size=0.2, random_state=42
)
# CORRECT: Split first, then scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler_correct = StandardScaler()
X_train_correct = scaler_correct.fit_transform(X_train) # Fit only on train
X_test_correct = scaler_correct.transform(X_test) # Transform test
print("Data Leakage Example:")
print("In this simple case, results are similar, but with real data,")
print("leakage can give overly optimistic estimates that fail in production.")
print("\nAlways: Split first, then preprocess!")Using Pipelines to Prevent Leakage
Scikit-learn Pipelines help prevent leakage by bundling preprocessing with modeling. The pipeline ensures preprocessing is fit only on training data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Pipeline ensures proper ordering
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('classifier', LogisticRegression(random_state=42))
])
# During cross-validation, preprocessing is fit fresh on each fold's training data
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
# Final fit and evaluation
pipeline.fit(X_train, y_train)
test_acc = pipeline.score(X_test, y_test)
print("Pipeline ensures no data leakage:")
print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {test_acc:.4f}")When Data is Limited
With small datasets, holding out 20% for validation and 20% for testing leaves little for training. Alternatives:
Cross-validation for validation: Use k-fold cross-validation instead of a fixed validation set. More robust estimates with all data used for training.
Nested cross-validation: Outer loop for test evaluation, inner loop for hyperparameter tuning. Most rigorous but computationally expensive.
Bootstrap estimation: Sample with replacement to create training sets; use out-of-bag samples for validation. The latter two approaches are sketched below.
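A minimal sketch of those two options, with an illustrative C grid for the inner search and a random forest's built-in out-of-bag score standing in for bootstrap validation:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Nested CV: inner loop tunes C, outer loop estimates performance of the whole procedure
inner = GridSearchCV(LogisticRegression(max_iter=1000, random_state=42),
                     param_grid={'C': [0.01, 0.1, 1, 10]}, cv=5)
nested_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {nested_scores.mean():.4f} (+/- {nested_scores.std()*2:.4f})")

# Bootstrap-style estimate: each tree trains on a bootstrap sample,
# and the out-of-bag samples provide a validation score for free
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-bag accuracy: {rf.oob_score_:.4f}")

The example that follows demonstrates the first option, plain k-fold cross-validation, on a small dataset.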
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np
# Small dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
# With limited data, use cross-validation instead of fixed splits
model = LogisticRegression(random_state=42)
cv_scores = cross_val_score(model, X, y, cv=10) # 10-fold CV
print("Cross-Validation with Limited Data:")
print(f"10-Fold CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Each fold uses {len(X)*0.9:.0f} samples for training")
print("\nMore reliable than a single 80/20 split with small data.")Common Mistakes
Using test data for decisions: If you check test performance and then change your model, the test set is no longer unbiased.
Forgetting to stratify: With imbalanced classes, splits might not represent the true distribution.
Ignoring temporal structure: Random splitting time series creates unrealistic train/test scenarios.
Data leakage in preprocessing: Fitting transformers on full data before splitting.
Multiple test evaluations: Each time you evaluate on the test set and adjust, you're implicitly fitting to it; the simulation below shows how selection alone inflates the score.
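A tiny simulation of why this matters: even predictors with no signal at all look better than chance if you keep the best of many test-set evaluations (the numbers below are purely illustrative).

import numpy as np

rng = np.random.default_rng(0)
y_test_sim = rng.integers(0, 2, size=200)    # labels with no learnable signal
accuracies = []
for _ in range(50):                          # 50 "models" that just guess randomly
    preds = rng.integers(0, 2, size=200)
    accuracies.append(np.mean(preds == y_test_sim))
print(f"Average accuracy of a random model: {np.mean(accuracies):.3f}")
print(f"Best-of-50 accuracy after repeated peeking: {np.max(accuracies):.3f}")

The anti-pattern below shows how this plays out during model development.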
# Anti-pattern: "Peeking" at test data
# DON'T DO THIS:
# model_v1.fit(X_train, y_train)
# print(f"Test acc: {model_v1.score(X_test, y_test)}") # Peek at test
#
# # "Hmm, not good enough, let me try something else"
# model_v2.fit(X_train, y_train) # Adjusted based on test result
# print(f"Test acc: {model_v2.score(X_test, y_test)}") # Peek again
#
# # This process makes test accuracy optimistic!
print("Don't peek at test data multiple times!")
print("Use validation set for model development.")
print("Test set is for final, one-time evaluation only.")Practical Guidelines
Decide splits before looking at data: Avoid the temptation to adjust splits based on results.
Use stratification for classification: Especially with imbalanced classes.
Respect temporal order: For time series, always test on future data.
Use pipelines: They help prevent preprocessing leakage.
Document your splits: Record random seeds and methodology for reproducibility, for example by saving the split indices as sketched after this list.
When in doubt, use cross-validation: More robust than single splits, especially with limited data.
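For instance, a minimal sketch of recording a split (the file name and dataset below are arbitrary): save the indices and seed so the exact partition can be reloaded later.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
seed = 42
indices = np.arange(len(X))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=seed, stratify=y)
np.savez("split_indices.npz", train=train_idx, test=test_idx, seed=seed)

# Later (or in another script): reload the exact same split
saved = np.load("split_indices.npz")
X_train, X_test = X[saved["train"]], X[saved["test"]]
print(X_train.shape, X_test.shape)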
Key Takeaways
- Training data fits the model; validation data selects hyperparameters; test data estimates final performance
- The test set should be touched only once for final evaluation
- Stratified splitting preserves class proportions in each split
- Time series requires temporal splitting—train on past, test on future
- Data leakage occurs when test information influences training; use pipelines to prevent it
- With limited data, cross-validation is more reliable than fixed splits
- Common mistakes: peeking at test data, forgetting stratification, preprocessing before splitting
- Proper splitting is essential for realistic performance estimates