Classification Metrics and Evaluation
Evaluating classification models requires more than just counting correct predictions. Different applications have different costs for different types of errors. A spam filter that blocks important emails is worse than one that lets some spam through. A medical test that misses cancer is worse than one that triggers unnecessary follow-up tests.
Understanding classification metrics helps you choose the right metric for your problem and interpret model performance correctly.
The Confusion Matrix
The confusion matrix is the foundation of classification evaluation. For binary classification, it's a 2×2 table showing how predictions compare to actual labels:
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
- True Positive (TP): Correctly predicted positive
- True Negative (TN): Correctly predicted negative
- False Positive (FP): Incorrectly predicted positive (Type I error)
- False Negative (FN): Incorrectly predicted negative (Type II error)
All classification metrics derive from these four numbers.
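Before turning to scikit-learn's helpers in the example below, here is a minimal hand computation (with made-up toy labels) that counts the four outcomes directly and recovers accuracy, precision, and recall from them:
import numpy as np
# Toy labels and predictions (made-up values for illustration)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])
# Count the four outcomes directly
tp = np.sum((y_true == 1) & (y_pred == 1))  # 3
tn = np.sum((y_true == 0) & (y_pred == 0))  # 3
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # 1
# Each metric below is just a ratio of these counts
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.75
precision = tp / (tp + fp)                  # 0.75
recall = tp / (tp + fn)                     # 0.75
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")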
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Generate and split data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
n_classes=2, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTN={cm[0,0]}, FP={cm[0,1]}")
print(f"FN={cm[1,0]}, TP={cm[1,1]}")
# Visualize
ConfusionMatrixDisplay(cm, display_labels=['Negative', 'Positive']).plot()
plt.title('Confusion Matrix')
plt.show()
Accuracy: The Intuitive but Flawed Metric
Accuracy is the proportion of correct predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is intuitive but can be misleading with imbalanced classes. If 99% of emails are not spam, a model that predicts "not spam" for everything achieves 99% accuracy while being completely useless.
import numpy as np
from sklearn.metrics import accuracy_score
# Imbalanced scenario: 95% negative, 5% positive
y_true = np.array([0]*950 + [1]*50)
# Naive model: always predict negative
y_pred_naive = np.zeros(1000, dtype=int)
# A genuinely useful model: catches 40 of the 50 positives at the cost of 20 false positives
y_pred_better = np.array([0]*930 + [1]*20 + [1]*40 + [0]*10)
print("The Accuracy Paradox:")
print(f"Always predict negative: {accuracy_score(y_true, y_pred_naive):.1%} accuracy")
print(f"Useful detector:         {accuracy_score(y_true, y_pred_better):.1%} accuracy")
print("The naive model scores almost as well, yet it detects zero positives!")
print("\nAccuracy is misleading with imbalanced classes.")Precision and Recall
Precision answers: "Of all positive predictions, how many were actually positive?"
Precision = TP / (TP + FP)
High precision means few false positives. Important when false positives are costly (spam filter blocking legitimate email).
Recall (also called sensitivity or true positive rate) answers: "Of all actual positives, how many did we correctly identify?"
Recall = TP / (TP + FN)
High recall means few false negatives. Important when false negatives are costly (cancer screening, fraud detection).
from sklearn.metrics import precision_score, recall_score
import numpy as np
# Simulated predictions
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0])
# TP=2, FP=1, FN=3, TN=4
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print("Precision vs Recall:")
print(f"True labels: {y_true}")
print(f"Predictions: {y_pred}")
print(f"\nPrecision: {precision:.2f} (2 of 3 positive predictions were correct)")
print(f"Recall: {recall:.2f} (2 of 5 actual positives were found)")The Precision-Recall Tradeoff
There's an inherent tradeoff between precision and recall. Lowering the classification threshold catches more positives (higher recall) but also increases false positives (lower precision). Raising the threshold increases precision but decreases recall.
The right balance depends on the application. Cancer screening should favor recall (don't miss cancer). Spam filtering might favor precision (don't block important emails).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score
# Generate data
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model and get probabilities
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Vary threshold
thresholds = np.arange(0.1, 0.9, 0.05)
precisions, recalls = [], []
for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    precisions.append(precision_score(y_test, y_pred, zero_division=0))
    recalls.append(recall_score(y_test, y_pred))
plt.figure(figsize=(8, 5))
plt.plot(thresholds, precisions, 'b-', label='Precision')
plt.plot(thresholds, recalls, 'r-', label='Recall')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("As threshold increases:")
print(" - Precision tends to increase (fewer false positives)")
print(" - Recall tends to decrease (more false negatives)")F1 Score: Balancing Precision and Recall
The F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean penalizes extreme values—both precision and recall must be high for a good F1 score. F1 ranges from 0 to 1, with 1 being perfect.
More generally, the Fβ score allows weighting recall β times as heavily as precision:
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
- β = 1: Equal weight (standard F1)
- β = 2: Recall is twice as important (F2)
- β = 0.5: Precision is twice as important (F0.5)
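Before the larger example below, a quick sanity check of this formula (with made-up labels) against scikit-learn's fbeta_score:
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score
# Made-up labels: TP=2, FN=2, FP=1, TN=3
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 1/2
for beta in (0.5, 1.0, 2.0):
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    manual = (1 + beta**2) * p * r / (beta**2 * p + r)
    print(f"beta={beta}: manual={manual:.3f}, sklearn={fbeta_score(y_true, y_pred, beta=beta):.3f}")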
from sklearn.metrics import f1_score, fbeta_score, precision_score, recall_score
import numpy as np
y_true = np.array([1]*100 + [0]*900) # 10% positive
y_pred = np.array([1]*80 + [0]*20 + [1]*50 + [0]*850) # Some errors
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)
f05 = fbeta_score(y_true, y_pred, beta=0.5)
print("F-scores:")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1 (balanced): {f1:.3f}")
print(f"F2 (recall-weighted): {f2:.3f}")
print(f"F0.5 (precision-weighted): {f05:.3f}")ROC Curve and AUC
The ROC curve (Receiver Operating Characteristic) plots True Positive Rate vs False Positive Rate at various thresholds:
- True Positive Rate (TPR) = Recall = TP/(TP+FN)
- False Positive Rate (FPR) = FP/(FP+TN)
A random classifier produces a diagonal line. Better classifiers curve toward the upper-left corner.
AUC (Area Under the ROC Curve) summarizes the curve in a single number:
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random classifier
- AUC < 0.5: Worse than random (invert predictions!)
AUC measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.
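That ranking interpretation can be checked directly. The short sketch below (made-up scores, separate from the full ROC example that follows) compares every positive score against every negative score and reproduces roc_auc_score:
import numpy as np
from sklearn.metrics import roc_auc_score
# Made-up scores for 3 positives and 4 negatives
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.45, 0.90])
pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Fraction of (positive, negative) pairs ranked correctly; ties count as half
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(f"Pairwise ranking estimate: {pairwise:.3f}")
print(f"roc_auc_score:             {roc_auc_score(y_true, scores):.3f}")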
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate data
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"AUC: {auc:.3f}")
print("AUC represents the probability of ranking a random positive above a random negative")Precision-Recall Curve
For imbalanced datasets, the Precision-Recall curve is often more informative than ROC. It plots precision vs recall at various thresholds.
The area under the PR curve (PR-AUC or Average Precision) summarizes performance. Unlike ROC-AUC, PR-AUC is sensitive to class imbalance—a random classifier achieves PR-AUC equal to the positive class proportion, not 0.5.
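To see the baseline difference concretely, the sketch below scores purely random predictions on imbalanced simulated labels: ROC-AUC stays near 0.5 while average precision lands near the positive-class proportion.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
rng = np.random.default_rng(0)
# Imbalanced labels (~10% positive) and completely uninformative random scores
y_true = (rng.random(100_000) < 0.10).astype(int)
random_scores = rng.random(100_000)
print(f"Positive proportion:     {y_true.mean():.3f}")
print(f"ROC-AUC (random scores): {roc_auc_score(y_true, random_scores):.3f}")   # ~0.5
print(f"PR-AUC (random scores):  {average_precision_score(y_true, random_scores):.3f}")  # ~0.1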
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Imbalanced data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP = {ap:.3f})')
plt.axhline(y=y_test.mean(), color='r', linestyle='--', label=f'Random (baseline = {y_test.mean():.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve (Imbalanced Data)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Average Precision: {ap:.3f}")
print(f"Positive class proportion: {y_test.mean():.2f}")Multiclass Metrics
For multiclass classification, metrics can be computed in several ways:
Macro averaging: Compute metric for each class, then average. Treats all classes equally.
Micro averaging: Aggregate TP, FP, FN across classes, then compute metric. Weighted by class frequency.
Weighted averaging: Average metrics weighted by class support (number of samples per class).
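To make these options concrete before the full iris example below, this sketch (with made-up three-class labels) derives macro F1 from the per-class scores and shows that, for single-label problems, micro F1 equals plain accuracy:
import numpy as np
from sklearn.metrics import f1_score
# Made-up 3-class labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 1, 2, 2, 2, 0])
# Macro: compute F1 per class, then take an unweighted mean
per_class = f1_score(y_true, y_pred, average=None)
print(f"Per-class F1: {per_class}")
print(f"Macro F1 (mean of per-class): {per_class.mean():.3f}")
print(f"Macro F1 (sklearn):           {f1_score(y_true, y_pred, average='macro'):.3f}")
# Micro: pool all decisions; for single-label problems this equals accuracy
print(f"Micro F1 (sklearn):           {f1_score(y_true, y_pred, average='micro'):.3f}")
print(f"Accuracy:                     {(y_true == y_pred).mean():.3f}")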
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Multiclass problem
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Multiclass Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("\nAveraging Methods:")
print(f"Macro F1: {f1_score(y_test, y_pred, average='macro'):.3f}")
print(f"Micro F1: {f1_score(y_test, y_pred, average='micro'):.3f}")
print(f"Weighted F1: {f1_score(y_test, y_pred, average='weighted'):.3f}")Log Loss (Cross-Entropy Loss)
Log loss evaluates predicted probabilities, not just class predictions. It penalizes confident wrong predictions heavily.
Lower log loss is better. Log loss captures calibration—whether predicted probabilities match actual frequencies.
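For binary problems the formula is log loss = -(1/N) Σ [y·log(p) + (1−y)·log(1−p)]. A quick hand check (made-up probabilities, separate from the example that follows) against scikit-learn:
import numpy as np
from sklearn.metrics import log_loss
# Made-up labels and predicted probabilities of the positive class
y_true = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.6, 0.2, 0.4])
# -(1/N) * sum(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(f"Manual log loss:  {manual:.4f}")
print(f"sklearn log_loss: {log_loss(y_true, p):.4f}")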
from sklearn.metrics import log_loss
import numpy as np
y_true = np.array([1, 1, 1, 0, 0, 0])
# Well-calibrated predictions
y_proba_good = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])
# Overconfident wrong predictions
y_proba_bad = np.array([0.99, 0.99, 0.01, 0.01, 0.01, 0.99])
print("Log Loss Comparison:")
print(f"Well-calibrated: {log_loss(y_true, y_proba_good):.4f}")
print(f"Overconfident wrong: {log_loss(y_true, y_proba_bad):.4f}")
print("\nOverconfident wrong predictions are heavily penalized!")Choosing the Right Metric
The right metric depends on your problem:
| Scenario | Recommended Metric |
|----------|--------------------|
| Balanced classes, care about overall accuracy | Accuracy |
| Imbalanced classes | F1, PR-AUC |
| Cost of FP >> Cost of FN | Precision, or F0.5 |
| Cost of FN >> Cost of FP | Recall, or F2 |
| Need to rank predictions | ROC-AUC |
| Probability calibration matters | Log Loss |
| Compare models across thresholds | AUC (ROC or PR) |
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, average_precision_score)
def evaluate_model(y_true, y_pred, y_proba):
    """Compute multiple metrics."""
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Precision': precision_score(y_true, y_pred),
        'Recall': recall_score(y_true, y_pred),
        'F1': f1_score(y_true, y_pred),
        'ROC-AUC': roc_auc_score(y_true, y_proba),
        'PR-AUC': average_precision_score(y_true, y_proba)
    }
    return metrics
# Simulated predictions
np.random.seed(42)
y_true = np.array([1]*100 + [0]*900) # 10% positive
y_proba = np.random.beta(2, 5, 1000) # Random probabilities
y_proba[:100] += 0.3 # Boost positives slightly
y_proba = np.clip(y_proba, 0, 1)
y_pred = (y_proba >= 0.5).astype(int)
metrics = evaluate_model(y_true, y_pred, y_proba)
print("Comprehensive Evaluation:")
for name, value in metrics.items():
    print(f" {name:<12}: {value:.3f}")
print("\nChoose metrics based on your problem's costs and constraints!")Practical Guidelines
Always examine the confusion matrix first: It shows exactly where your model makes errors.
Don't rely on accuracy alone: Especially with imbalanced classes.
Consider business costs: Weight metrics by the actual cost of each error type.
Use multiple metrics: Different metrics reveal different aspects of performance.
Compare to baselines: How does your model compare to always predicting the majority class?
Look at calibration: If you need probabilities, verify they're well-calibrated.
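One practical way to check calibration is a reliability curve, which bins predicted probabilities and compares each bin's mean prediction to its observed positive rate. A minimal sketch using scikit-learn's calibration_curve on simulated data:
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Fit a model, then compare predicted probabilities to observed frequencies
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10)
plt.plot(prob_pred, prob_true, 'bo-', label='Model')
plt.plot([0, 1], [0, 1], 'r--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Observed positive fraction')
plt.title('Reliability (Calibration) Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()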
Key Takeaways
- The confusion matrix shows TP, TN, FP, FN—the foundation of all metrics
- Accuracy is misleading with imbalanced classes
- Precision is reduced by false positives; recall is reduced by false negatives
- There's a tradeoff between precision and recall controlled by the classification threshold
- F1 score balances precision and recall; Fβ allows weighting
- ROC-AUC measures ranking ability across all thresholds
- PR-AUC is more informative for imbalanced data
- Log loss evaluates probability calibration
- Choose metrics based on business costs and class balance
- Always examine the confusion matrix and use multiple metrics