Cross-Validation in Machine Learning using Python

O. Olawale AWE, Ph.D., MBA

Data Scientist | Statistician | Professor | Consultant | Administrator | Mentor | Editor | Author

Published Dec 6, 2024

Introduction

Cross-validation is a technique for estimating how a machine-learning model will perform on unseen data. It checks how well the model generalizes by repeatedly training it on one subset of the data and testing it on another.

Why Use Cross-Validation?

Detect Overfitting: Reveals when the model is tailored to the training data alone.
Reliable Evaluation: Provides a more robust estimate of the model's performance than a single train/test split.
Effective Data Utilization: Uses every sample for both training and testing, in different iterations.
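To illustrate the data-utilization point, here is a minimal sketch that counts how often each sample lands in a training or test fold under 5-fold splitting. The dataset size and the placeholder array are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 10  # hypothetical dataset size
placeholder_X = np.zeros((n_samples, 1))  # the values do not matter, only the row count

train_counts = np.zeros(n_samples, dtype=int)
test_counts = np.zeros(n_samples, dtype=int)

# Count how many times each sample index is used for training and for testing
for train_idx, test_idx in KFold(n_splits=5).split(placeholder_X):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1

print("Times used for testing: ", test_counts)
print("Times used for training:", train_counts)
```

Every sample is tested exactly once and used for training in the remaining four iterations.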

Types of Cross-Validation

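Hold-Out Validation: Splits the data once into a training set and a test set.
K-Fold Cross-Validation: Splits the data into K folds; each fold serves as the test set once while the remaining K-1 folds are used for training.
Stratified K-Fold: K-Fold in which each fold preserves the class proportions of the full dataset.
Leave-One-Out Cross-Validation (LOOCV): Uses a single sample as the test set in each iteration, so the model is trained as many times as there are samples.
Repeated K-Fold: Runs K-Fold several times with different random splits and combines the scores.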
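The sketch below shows how the scikit-learn splitters for these techniques generate train/test indices. The six-sample toy arrays (X_toy, y_toy) are hypothetical and exist only to keep the output readable; hold-out is omitted because it is a single split.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, RepeatedKFold, StratifiedKFold

# Toy data (hypothetical): 6 samples, two classes with 3 samples each
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 0, 0, 1, 1, 1])

splitters = {
    "K-Fold": KFold(n_splits=3),
    "Stratified K-Fold": StratifiedKFold(n_splits=3),
    "LOOCV": LeaveOneOut(),
    "Repeated K-Fold": RepeatedKFold(n_splits=3, n_repeats=2, random_state=0),
}

n_splits_by_name = {}
for name, splitter in splitters.items():
    # Each splitter yields (train_indices, test_indices) pairs
    folds = list(splitter.split(X_toy, y_toy))
    n_splits_by_name[name] = len(folds)
    print(f"{name}: {len(folds)} splits")
    for train_idx, test_idx in folds:
        print(f"  train={train_idx} test={test_idx}")
```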

Steps for Cross-Validation in Python

Import necessary libraries.
Prepare your dataset.
Choose the cross-validation technique.
Evaluate the model using the chosen technique.

Python Implementation

1. Dataset Preparation

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

# Load the Iris dataset

data = load_iris()

X = data.data # Features

y = data.target # Target

# Split into training and testing datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

2. Hold-Out Validation

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

# Train a Random Forest model

model = RandomForestClassifier(random_state=42)

model.fit(X_train, y_train)

# Test the model

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print(f"Hold-Out Validation Accuracy: {accuracy:.2f}")
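A single hold-out score depends on which rows happen to fall in the test set. As a rough sketch, the loop below repeats the same split with a few different seeds so the spread can be inspected. The seed values are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

holdout_accuracies = []
for seed in range(5):  # five arbitrary split seeds, for illustration only
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    holdout_accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print("Hold-out accuracy per split:", [f"{a:.2f}" for a in holdout_accuracies])
print(f"Range: {min(holdout_accuracies):.2f} to {max(holdout_accuracies):.2f}")
```

Any variation between these scores comes from the split alone, which is the motivation for averaging over several folds.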

3. K-Fold Cross-Validation

from sklearn.model_selection import KFold, cross_val_score

# K-Fold Cross-Validation

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')

print(f"K-Fold Cross-Validation Scores: {scores}")

print(f"Mean Accuracy: {scores.mean():.2f}")
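To make the procedure explicit, the following sketch writes the K-Fold loop by hand. It is intended as an approximation of what cross_val_score does for this setup, not a description of its internals; the names manual_scores and fold_model are introduced here for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    y_pred = fold_model.predict(X[test_idx])
    manual_scores.append(accuracy_score(y[test_idx], y_pred))

manual_scores = np.array(manual_scores)
print(f"Manual K-Fold scores: {manual_scores}")
print(f"Mean accuracy: {manual_scores.mean():.2f}")
```

Cloning the estimator for each fold keeps the folds independent: no fitted state carries over from one iteration to the next.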

4. Stratified K-Fold Cross-Validation

from sklearn.model_selection import StratifiedKFold

# Stratified K-Fold Cross-Validation

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=stratified_kfold, scoring='accuracy')

print(f"Stratified K-Fold Scores: {scores}")

print(f"Mean Accuracy: {scores.mean():.2f}")
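The effect of stratification is easiest to see by counting the classes in each test fold. The sketch below compares unshuffled KFold with StratifiedKFold on Iris, whose rows are sorted by class. The helper class_counts_per_test_fold is a hypothetical name used only here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def class_counts_per_test_fold(splitter):
    """Return the number of samples of each class in every test fold."""
    return [
        np.bincount(y[test_idx], minlength=3).tolist()
        for _, test_idx in splitter.split(X, y)
    ]

plain_counts = class_counts_per_test_fold(KFold(n_splits=5))  # no shuffling
strat_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold (unshuffled):", plain_counts)
print("StratifiedKFold:   ", strat_counts)
```

Without shuffling, plain K-Fold produces test folds dominated by one class, while every stratified fold contains 10 samples of each class.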

5. Leave-One-Out Cross-Validation (LOOCV)

from sklearn.model_selection import LeaveOneOut

# Leave-One-Out Cross-Validation

loo = LeaveOneOut()

scores = cross_val_score(model, X, y, cv=loo, scoring='accuracy')

print(f"LOOCV Accuracy: {scores.mean():.2f}")
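With LOOCV, each fold contains a single sample, so every per-fold accuracy is either 0 or 1 and the model is fitted once per sample. The sketch below checks this. As an assumption made purely for speed, a k-nearest-neighbours classifier stands in for the Random Forest.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()

# Assumption: KNN replaces the Random Forest only to keep the 150 fits fast
fast_model = KNeighborsClassifier(n_neighbors=5)
loo_scores = cross_val_score(fast_model, X, y, cv=loo, scoring="accuracy")

n_fits = loo.get_n_splits(X)  # one fit per sample
print(f"Number of model fits: {n_fits}")
print(f"Distinct per-fold scores: {np.unique(loo_scores)}")
print(f"LOOCV accuracy: {loo_scores.mean():.2f}")
```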

6. Repeated K-Fold Cross-Validation

from sklearn.model_selection import RepeatedKFold

# Repeated K-Fold Cross-Validation

repeated_kfold = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(model, X, y, cv=repeated_kfold, scoring='accuracy')

print(f"Repeated K-Fold Mean Accuracy: {scores.mean():.2f}")
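Repeated K-Fold returns one score per fold per repetition, here 5 x 10 = 50 scores. As a sketch, the scores can be grouped by repetition to see how much the estimate moves between different random splits. The reshape relies on the splitter yielding all folds of one repetition before starting the next.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)
repeated_kfold = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

scores = cross_val_score(model, X, y, cv=repeated_kfold, scoring="accuracy")

# Rows are repetitions, columns are folds within a repetition
per_repeat_means = scores.reshape(10, 5).mean(axis=1)

print(f"Total scores: {len(scores)}")
print(f"Mean of each repetition: {per_repeat_means.round(3)}")
print(f"Overall: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the standard deviation next to the mean gives a sense of how stable the estimate is.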

Tips for Cross-Validation

Use Stratified K-Fold for imbalanced datasets so that each fold preserves the class proportions of the full dataset.
For small datasets, LOOCV provides a thorough evaluation but can be computationally expensive.
Repeated K-Fold reruns K-Fold with different random splits, which reduces the variance of the performance estimate.

Advantages of Cross-Validation

Reduces the dependence of the evaluation on a single train/test split.
Shows how the model performs across different data subsets.
Helps in model selection and hyperparameter tuning.
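As one possible illustration of the hyperparameter-tuning point, the sketch below uses scikit-learn's GridSearchCV with a stratified 5-fold splitter. The parameter grid is hypothetical; its values are examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

# Hypothetical grid: illustrative values only
param_grid = {"n_estimators": [25, 50], "max_depth": [2, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)  # cross-validates every parameter combination

print(f"Best parameters: {search.best_params_}")
print(f"Best mean CV accuracy: {search.best_score_:.2f}")
```

Each of the four parameter combinations is scored by its mean accuracy across the five folds, and the best one is refitted on the full data.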

Disadvantages of Cross-Validation

Computationally expensive, especially for large datasets.
LOOCV estimates can have high variance, particularly with small datasets, because each model is tested on a single sample.

Conclusion

Cross-validation is a vital technique for evaluating machine learning models effectively. Using methods like K-Fold or Stratified K-Fold helps you check whether your model generalizes well and detect overfitting. Try it on other real datasets. Follow me for more insights.

O. Olawale Awe, PhD, MBA.

