Get your machine learning programs right every time - the most comprehensive guide ever (with code)!
As a data scientist, I would fumble once in a while trying to apply the right algorithm to a given problem and end up spending quite a bit of time in the search.
The idea of this article is to ensure we follow a scientific approach when dealing with the machine learning algorithm search space; there will be a similar article for deep learning here.
There are 11 parts to a modern machine learning pipeline:
- Data Peeking
- Visualization
- Data scaling
- Encoding
- Feature selection
- Train and Test splits
- Performance metrics
- Multi-algo search (Classification and Regression )
- Ensemble methods
- Performance Tuning
- Save and Load models
- Dimensionality Reduction ( pending )
Data peeking is the very first step in data science.
After importing the data, here are the things to watch out for:
- data.head() - prints the first 5 rows of your data (by default) so you can physically inspect it.
- data.describe() - prints summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
- data.dtypes - lists the data type of each column (note: dtypes is an attribute, not a method).
- data.groupby('class').size() - class distribution (useful for classification problems)
- data.corr(method='pearson') - Correlations Between Attributes
- data.skew() - Skew of Univariate Distributions
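To make this concrete, here is a minimal peeking sketch; the file name 'pima.csv' and the 'class' column are placeholders for whatever dataset and target column you are actually working with.

import pandas as pd

data = pd.read_csv('pima.csv')        # placeholder file name
print(data.head())                    # first 5 rows by default
print(data.describe())                # summary statistics
print(data.dtypes)                    # column data types
print(data.skew())                    # skew of each numeric column
print(data.groupby('class').size())   # class distribution
print(data.corr(method='pearson'))    # pairwise correlations of numeric columns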
Visualization literally lets you see your data and often gives far better insights into it.
Univariate Plots
- Histograms.
- Density Plots.
- Box and Whisker Plots.
Multivariate Plots
- Correlation Matrix Plot.
- Scatter Plot Matrix.
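Here is a minimal plotting sketch with pandas and matplotlib, assuming 'data' is the all-numeric DataFrame loaded in the peeking step above.

from matplotlib import pyplot
from pandas.plotting import scatter_matrix

data.hist()                                                        # histograms
data.plot(kind='density', subplots=True, sharex=False)             # density plots
data.plot(kind='box', subplots=True, sharex=False, sharey=False)   # box and whisker plots
pyplot.matshow(data.corr())                                        # correlation matrix plot
scatter_matrix(data)                                               # scatter plot matrix
pyplot.show()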
Scaling your data to suit the algorithm
Different algorithms require your data to be scaled differently, so you need to be watchful about this.
- Rescale data
Many algorithms perform better when input attributes are rescaled to the range 0-1.
scaler = MinMaxScaler(feature_range=(0, 1))
- Standardize data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
StandardScaler().fit(X)
- Normalize data
Normalizing involves rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors.
Normalizer().fit(X)
- Binarize data
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.
Binarizer(threshold=0.0).fit(X)
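Putting the four transforms together, here is a minimal sketch; X is assumed to be your numeric feature matrix (a NumPy array or DataFrame of input columns).

from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # rescale to 0-1
standardized = StandardScaler().fit_transform(X)                 # mean 0, std 1
normalized = Normalizer().fit_transform(X)                       # unit-norm rows
binarized = Binarizer(threshold=0.0).fit_transform(X)            # threshold to 0/1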
Encoding - Encoding is essential when you have categorical data, as most machine learning algorithms cannot deal with it directly.
There are two main types of encoders:
Label encoder - Label encoding converts each category to a number: LabelEncoder().fit_transform()
One hot encoder
One hot encoding is more sophisticated than simple label encoding: it creates a separate column for each category, ensuring the algorithm does not treat the labels as plain numbers and read meaning into their magnitudes.
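Here is a minimal sketch of both encoders; the 'color' column is a made-up example standing in for your own categorical feature.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

labels = LabelEncoder().fit_transform(colors['color'])               # e.g. [2, 1, 0, 1]
onehot = OneHotEncoder().fit_transform(colors[['color']]).toarray()  # one column per category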
Feature Selection
Three benefits of performing feature selection before modeling your data are:
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Less data means that algorithms train faster.
Techniques
Univariate Selection. Statistical tests can be used to select those features that have the strongest relationship with the output variable.
SelectKBest(score_func=chi2, k=4) (this uses the chi-squared test and selects the 4 best features)
Recursive Feature Elimination works by recursively removing attributes and building a model on those attributes that remain.
Principal Component Analysis (PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique.
Feature Importance Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
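Here is a minimal feature-selection sketch covering all four techniques; X and Y stand for your own feature matrix and target (note that chi2 requires non-negative features).

from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier

X_best = SelectKBest(score_func=chi2, k=4).fit_transform(X, Y)                               # univariate selection
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit_transform(X, Y)   # recursive feature elimination
X_pca = PCA(n_components=3).fit_transform(X)                                                 # principal component analysis
print(ExtraTreesClassifier().fit(X, Y).feature_importances_)                                 # feature importance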
Splitting your data into train and test sets
Train and Test Sets - Simply split your data into train and test sets so that you can perform a basic evaluation: train on one part and test on the other.
k-fold Cross Validation works by splitting the dataset into k-parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The algorithm is trained on k − 1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
Leave One Out Cross-Validation You can configure cross-validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross-validation is called leave-one-out cross-validation. The downside is that it is computationally very expensive.
Repeated Random Test-Train Splits. Another variation on k-fold cross-validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross-validation. This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross-validation.
K-fold CV is generally the most preferred option provided you have reasonable time and computing power at hand.
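Here is a minimal resampling sketch; X and Y are again your feature matrix and target, and logistic regression is used purely as an example model.

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# simple train/test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
print(model.fit(X_train, Y_train).score(X_test, Y_test))

# 10-fold cross-validation (swap in LeaveOneOut() or ShuffleSplit() for the other variants)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
print(cross_val_score(model, X, Y, cv=kfold).mean())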
Performance Metrics
Performance metrics measure how well a model is performing and drive its optimization strategy, which makes them a vital part of model building.
Classification Accuracy Classification accuracy is the number of correct predictions made as a ratio of all predictions made.
Logarithmic Loss Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class. The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm.
Area Under ROC Curve
Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The AUC represents a model's ability to discriminate between positive and negative classes; the ROC curve itself is a plot of the TPR against the FPR. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is as good as random. ROC can be broken down into sensitivity (the TPR, or recall) and specificity (1 - FPR).
Confusion Matrix
The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on one axis and actual class values on the other; each cell counts the predictions the algorithm made for that combination.
Classification Report
The scikit-learn library provides a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures. The classification_report() function displays the precision, recall, F1-score and support for each class.
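Here is a minimal sketch of the classification metrics above; model, X_test and Y_test are assumed to come from a fitted classifier and a held-out test split like the one in the previous sketch.

from sklearn.metrics import accuracy_score, log_loss, roc_auc_score, confusion_matrix, classification_report

predicted = model.predict(X_test)
probs = model.predict_proba(X_test)

print(accuracy_score(Y_test, predicted))         # classification accuracy
print(log_loss(Y_test, probs))                   # logarithmic loss
print(roc_auc_score(Y_test, probs[:, 1]))        # AUC (binary problems)
print(confusion_matrix(Y_test, predicted))       # confusion matrix
print(classification_report(Y_test, predicted))  # precision, recall, F1, support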
Mean Absolute Error.
The Mean Absolute Error (or MAE) is the sum of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were.
Mean Squared Error.
The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of the error. Taking the square root of the mean squared error converts the units back to the original units of the output variable (giving the Root Mean Squared Error, or RMSE), which can be meaningful for description and presentation.
R2
The R2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination. A value of 1 indicates a perfect fit, 0 indicates the model does no better than predicting the mean, and negative values are possible for models that do even worse.
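And a minimal sketch of the regression metrics; 'regressor' is a hypothetical fitted regression model, and X_test and Y_test a held-out split of your own data.

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

predicted = regressor.predict(X_test)                 # 'regressor' is a placeholder model

print(mean_absolute_error(Y_test, predicted))         # MAE
print(mean_squared_error(Y_test, predicted))          # MSE
print(mean_squared_error(Y_test, predicted) ** 0.5)   # RMSE, back in the original units
print(r2_score(Y_test, predicted))                    # R2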
Algo Search - Classification
If you have read this far, I have to thank you for your patience. Having said that, this is arguably the most important part of your ML programming.
Here are two linear machine learning algorithms:
Logistic Regression
Logistic regression assumes a Gaussian distribution for the numeric input variables and can model binary classification problems. You can construct a logistic regression model using the LogisticRegression class
Linear Discriminant Analysis
Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass classification. It too assumes a Gaussian distribution for the numerical input variables.
Then let's look at four nonlinear machine learning algorithms:
k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) uses a distance metric to find the k most similar instances in the training data for a new instance and takes the mean outcome of the neighbors as the prediction.
Naive Bayes
Naive Bayes calculates the probability of each class and the conditional probability of each class given each input value. These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption). When working with real-valued data, a Gaussian distribution is assumed to easily estimate the probabilities for input variables using the Gaussian Probability Density Function.
Classification and Regression Trees
Classification and Regression Trees (CART or just decision trees) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the Gini index).
Support Vector Machines
Support Vector Machines (or SVM) seek a line that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter. A powerful Radial Basis Function is used by default. You can construct an SVM model using the SVC class
Algo Search - Regression
Linear Regression.
Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity). You can construct a linear regression model using the LinearRegression class
Ridge Regression.
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model measured as the sum squared value of the coefficient values (also called the L2-norm). You can construct a ridge regression model by using the Ridge class
LASSO Linear Regression
The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model measured as the sum of the absolute coefficient values (also called the L1-norm). You can construct a LASSO model by using the Lasso class
Elastic Net Regression
ElasticNet is a form of regularization regression that combines the properties of both Ridge Regression and LASSO regression. It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the L2-norm (sum squared coefficient values) and the L1-norm (sum absolute coefficient values).
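Since no code is shown for the regression algorithms, here is a minimal sketch that cross-validates all four; X and Y are assumed to be a numeric feature matrix and a continuous target.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in [('LR', LinearRegression()), ('Ridge', Ridge()), ('LASSO', Lasso()), ('EN', ElasticNet())]:
    scores = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
    print('%s: %.3f' % (name, scores.mean()))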
Now let's learn to compare all machine learning algos for a given problem in one shot
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. You can achieve this by forcing each algorithm to be evaluated on a consistent test harness.
# Compare Algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# prepare models (X and Y are the feature matrix and target of your dataset)
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn on the same 10-fold test harness
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()
Improve Performance with Ensembles
The three most popular methods for combining the predictions from different models are:
Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models.
Voting. Building multiple models (typically of differing types) and using simple statistics (like calculating the mean) to combine their predictions.
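Here is a minimal ensemble sketch touching all three approaches; X and Y are assumed to be a classification dataset, and the base models and n_estimators values are illustrative only.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)   # bagging
boosting = AdaBoostClassifier(n_estimators=30)                            # boosting
voting = VotingClassifier([('lr', LogisticRegression(max_iter=1000)),
                           ('cart', DecisionTreeClassifier()),
                           ('svm', SVC())])                               # voting

for name, model in [('Bagging', bagging), ('Boosting', boosting), ('Voting', voting)]:
    print(name, cross_val_score(model, X, Y, cv=kfold).mean())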
Performance Tuning
Algorithm tuning is a final step in the process of applied machine learning before finalizing your model. It is one of the easiest ways to improve performance, often dramatically.
It is sometimes called hyperparameter optimization where the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters.
Grid Search Parameter Tuning
Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. You can perform a grid search using the GridSearchCV class
Random Search Parameter Tuning
Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (e.g. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen. You can perform a random search for algorithm parameters using the RandomizedSearchCV class.
Algorithm parameter tuning is an important step for improving algorithm performance right before presenting results or preparing a system for production.
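Here is a minimal tuning sketch for the alpha parameter of Ridge regression; the grid values and the uniform sampling range are illustrative, not recommendations.

from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# grid search over an explicit list of candidate values
grid = GridSearchCV(Ridge(), param_grid={'alpha': [1.0, 0.1, 0.01, 0.001, 0.0001]})
grid.fit(X, Y)
print(grid.best_score_, grid.best_estimator_.alpha)

# random search sampling alpha from a uniform distribution over [0, 1]
rand = RandomizedSearchCV(Ridge(), param_distributions={'alpha': uniform()}, n_iter=100, random_state=7)
rand.fit(X, Y)
print(rand.best_score_, rand.best_estimator_.alpha)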
Save and Load Models
Saving and loading models is often ignored by inexperienced data scientists but adds robustness to your work. Once you save your model by pickling it, you can simply load it later instead of retraining from scratch, saving quite a bit of time and CPU.
Preserve your model with Pickle ( pun intended )
Pickle is the standard way of serializing objects in Python. You can use the pickle module to serialize your machine learning model and save the serialized format to a file. Later you can load this file to deserialize your model and use it to make new predictions.
from pickle import dump, load

# save the model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    dump(model, f)

# load the model from disk and evaluate it on the held-out test set
with open(filename, 'rb') as f:
    loaded_model = load(f)
result = loaded_model.score(X_test, Y_test)
Finalize Your Model with Joblib
The Joblib library is part of the SciPy ecosystem and provides utilities for pipelining Python jobs. It provides utilities for saving and loading Python objects that make use of NumPy data structures, efficiently. This can be useful for some machine learning algorithms that require a lot of parameters or store the entire dataset (e.g. k-Nearest Neighbors).
from joblib import dump, load   # sklearn.externals.joblib has been removed; import joblib directly

# save the model to disk (joblib takes a filename directly)
filename = 'finalized_model.sav'
dump(model, filename)

# load the model from disk
loaded_model = load(filename)
That's it fellas, hope you liked this, feel free to correct my pipeline with any comments and suggestions.
References:
Various Articles from Medium.com
Jason Brownlee's Machine Learning Mastery blog posts.