Cross-Validation in Machine Learning using Python
Introduction
Cross-validation is a technique for evaluating how well a machine-learning model performs on unseen data. It helps assess whether the model generalizes by testing it on several different subsets of the data rather than a single split.
Why Use Cross-Validation?
Types of Cross-Validation
Steps for Cross-Validation in Python
Python Implementation
1. Dataset Preparation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris dataset
data = load_iris()
X = data.data # Feature matrix
y = data.target # Target labels
# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
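As a quick sanity check of the split, a small self-contained sketch (it reloads the Iris data so it runs on its own) confirms that a 30% test fraction leaves 105 training samples and 45 test samples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 150 samples total -> 105 for training, 45 held out for testing
print(X_train.shape, X_test.shape)
```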
2. Hold-Out Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train a Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Test the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Hold-Out Validation Accuracy: {accuracy:.2f}")
3. K-Fold Cross-Validation
from sklearn.model_selection import KFold, cross_val_score
# K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(f"K-Fold Cross-Validation Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.2f}")
4. Stratified K-Fold Cross-Validation
from sklearn.model_selection import StratifiedKFold
# Stratified K-Fold Cross-Validation
stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=stratified_kfold, scoring='accuracy')
print(f"Stratified K-Fold Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.2f}")
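To see what stratification actually does, the sketch below (self-contained, reloading the data) inspects each test fold: because Iris has 50 samples per class and we use 5 folds, every test fold should contain exactly 10 samples of each class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    counts = np.bincount(y[test_idx])  # class frequencies in this test fold
    fold_counts.append(list(counts))
    print(f"Fold {fold} test-set class counts: {counts}")
```

Each fold preserves the overall class proportions, which matters most on imbalanced datasets where plain K-Fold can produce folds with very few (or no) minority-class samples.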
5. Leave-One-Out Cross-Validation (LOOCV)
from sklearn.model_selection import LeaveOneOut
# Leave-One-Out Cross-Validation
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')
print(f"LOOCV Accuracy: {scores.mean():.2f}")
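LOOCV is the most exhaustive option: it fits one model per sample, which gets expensive on large datasets. A short self-contained sketch makes the cost explicit for Iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one train/test split (and one fit) per sample
print(f"LOOCV trains {n_splits} models for {len(X)} samples")
```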
6. Repeated K-Fold Cross-Validation
from sklearn.model_selection import RepeatedKFold
# Repeated K-Fold Cross-Validation
repeated_kfold = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=repeated_kfold, scoring='accuracy')
print(f"Repeated K-Fold Mean Accuracy: {scores.mean():.2f}")
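Beyond a single accuracy score, scikit-learn's cross_validate lets you compute several metrics in one pass over the folds. This is a minimal self-contained sketch (the metric names 'accuracy' and 'f1_macro' are standard scikit-learn scorer strings):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=kfold,
                         scoring=['accuracy', 'f1_macro'])

# results is a dict of per-fold arrays: test_accuracy, test_f1_macro, fit times
print(f"Mean accuracy: {results['test_accuracy'].mean():.2f}")
print(f"Mean macro-F1: {results['test_f1_macro'].mean():.2f}")
```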
Tips for Cross-Validation
Advantages of Cross-Validation
Disadvantages of Cross-Validation
Conclusion
Cross-validation is a vital technique for evaluating machine learning models effectively. Methods like K-Fold and Stratified K-Fold give a more reliable estimate of how well your model generalizes and help guard against overfitting to a single train/test split. Try it on other real datasets. Follow me for more insights.
O. Olawale Awe, PhD, MBA.