Machine Learning - All you need to know about Outliers
Identify, visualise and fill outliers in Python
How to identify Outliers in the Dataset?
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. — Wikipedia
In other words, outliers are observations that lie far from the rest of the data. We should make our assumptions about what counts as a “normal” expected value explicit, and take care not to remove values that are important outliers (e.g. fraud).
1. Finding Outliers with Z-score:
The Z-score is the signed number of standard deviations by which the value of an observation or data point lies above (positive) or below (negative) the mean of what is being observed or measured.
In theory, 99.7% of the data points of a normally distributed data set lie within 3 standard deviations of the mean, as shown in the figure below. That means that all values with a Z-score above 3 or below -3 are considered outliers.
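As a rough illustration of this rule, the sketch below computes Z-scores with NumPy and flags values beyond the threshold of 3. The sample values and variable names are made up for the example and do not come from the article's dataset. Note that np.std uses the population standard deviation by default, whereas pandas' .std() uses the sample version, so the two can give slightly different scores.

```python
import numpy as np

# Hypothetical sample: twenty readings near 10 plus one extreme value
values = np.array([10.0, 9.5, 10.5] * 6 + [9.8, 10.2, 30.0])

# Z-score: signed distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

# Flag observations whose absolute Z-score exceeds 3
outlier_mask = np.abs(z_scores) > 3
outliers = values[outlier_mask]

print(outliers)
```

With very small samples a single extreme value inflates the standard deviation so much that its own Z-score may never reach 3, which is why the sample above has more than a handful of points.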
2. Finding Outliers with Elliptical Envelope Method:
The Elliptical Envelope method detects outliers in Gaussian-distributed data. It fits an imaginary elliptical region around the dataset, shown below as the green dotted circle. Values that fall inside the envelope are considered normal data, and anything outside the envelope is returned as an outlier.
3. Finding Outliers with Interquartile Range:
The interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.
The interquartile range is useful in detecting the outliers in the dataset. Quartiles segment any distribution that’s ordered from low to high into four equal parts. The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set.
Any values that fall below the lower bound (Q1 - 1.5 × IQR) or above the upper bound (Q3 + 1.5 × IQR) are classified as outliers.
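The bounds can be worked through on a small example. The sketch below assumes the common 1.5 × IQR rule and uses made-up values; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample: nine evenly spaced values plus one extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0])

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Bounds at 1.5 * IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the bounds is treated as an outlier
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound, outliers)
```

Unlike the Z-score, the quartiles are barely affected by the extreme value itself, so this rule tends to behave better on small or skewed samples.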
How to visualise Outliers in the Dataset?
1. Heatmap:
A heatmap is a two-dimensional graphical representation of data that uses colors to represent values. In the example below, the function std_outlier computes the mean and standard deviation of a column and derives a lower and an upper bound from them, which is equivalent to thresholding the Z-score. For each value it returns True if the value lies between the lower and upper bound and False if it lies outside. The resulting True/False table is passed to the heatmap, where the False entries (the outliers) stand out in a different color, so you can see which rows and columns contain them.
import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the numeric columns we want to inspect
df = df[['temp', 'rain', 'msl', 'dewpt', 'rhum', 'demand', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy']]

def std_outlier(st, nstd=3.0, return_thresholds=False):
    # Bounds at nstd standard deviations on either side of the column mean
    data_mean, data_std = st.mean(), st.std()
    cut_off = data_std * nstd
    lower, upper = data_mean - cut_off, data_mean + cut_off
    if return_thresholds:
        return lower, upper
    # True for values inside the bounds, False for outliers
    return [False if x < lower or x > upper else True for x in st]

# One True/False table per threshold
within_3std = df.apply(std_outlier, nstd=3.0)
within_4std = df.apply(std_outlier, nstd=4.0)
within_5std = df.apply(std_outlier, nstd=5.0)

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, nrows=1, figsize=(22, 12))
ax1.set_title('Outliers with 3 standard deviations')
ax2.set_title('Outliers with 4 standard deviations')
ax3.set_title('Outliers with 5 standard deviations')

# Outliers (False) are drawn in the lightest shade of the colormap
sns.heatmap(within_3std, cmap='Reds', ax=ax1)
sns.heatmap(within_4std, cmap='Reds', ax=ax2)
sns.heatmap(within_5std, cmap='Reds', ax=ax3)
plt.show()
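The heatmap shows where the outliers sit; a per-column count at each threshold can complement it. The sketch below is one possible way to produce such a count. The function count_outliers and the demo dataframe are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 well-behaved readings per column plus one extreme value each
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "temp": np.append(rng.normal(10, 2, 200), 60.0),
    "demand": np.append(rng.normal(3000, 100, 200), 9000.0),
})

def count_outliers(frame, nstd=3.0):
    """Count, per column, the values further than nstd standard deviations from the mean."""
    lower = frame.mean() - nstd * frame.std()
    upper = frame.mean() + nstd * frame.std()
    # Comparing a DataFrame with a Series aligns the Series on the column names
    outside = (frame < lower) | (frame > upper)
    return outside.sum()

# One column of counts per threshold
summary = pd.DataFrame({f"{n} std": count_outliers(demo, n) for n in (3, 4, 5)})
print(summary)
```

Comparing the counts across thresholds gives a quick feel for how sensitive the result is to the choice of 3, 4 or 5 standard deviations.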
2. Boxplot with Histogram:
A boxplot is a standardised way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines (whiskers) extending from the boxes to indicate variability outside the upper and lower quartiles. When reviewing a box plot, an outlier is a data point located outside the whiskers.
import plotly.express as px

# Histogram of temperature with a box plot drawn in the margin
fig = px.histogram(df, x="temp", marginal="box")
fig.show()
3. Scatter Plot:
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
When reviewing a scatter plot, an outlier is defined as a data point or points that are located farthest from the other points or an average line.
import plotly.graph_objects as go

# Data
x = df['demand']
y = df['temp']

# Layout
layout = go.Layout(
    title="Demand Based on Temperature",
    xaxis=dict(
        title="Electricity Demand (MW)"
    ),
    yaxis=dict(
        title="Temperature (Celsius)"
    )
)

fig = go.Figure(layout=layout)

# Add scatter trace with medium sized markers
fig.add_trace(
    go.Scatter(
        mode='markers',
        x=x,
        y=y,
        marker=dict(
            color='darkturquoise',
            size=6,
            opacity=0.4,
            line=dict(
                color='burlywood',
                width=1
            )
        ),
        showlegend=False
    )
)

fig.show()
You can also view the data in a 3D scatter plot by adding the date-time dimension, which helps in judging what a “normal” expected value is. Some data points that look like outliers on the 2D plot recur every year and are therefore “normal” expected values (e.g. the temperature is -7 every December).
import plotly.graph_objects as go

# Data
x = df['demand']
y = df['temp']
z = df['date']

fig = go.Figure()

# Add 3D scatter trace with small markers
fig.add_trace(
    go.Scatter3d(
        mode='markers',
        x=x,
        y=y,
        z=z,
        marker=dict(
            color='darkturquoise',
            size=4,
            opacity=0.3,
            line=dict(
                color='burlywood',
                width=1
            )
        ),
        showlegend=False
    )
)

# Set the axis titles and a background color for each axis
fig.update_layout(
    scene=dict(
        xaxis_title='Electricity Demand (MW)',
        yaxis_title='Temperature (Celsius)',
        zaxis_title='Date',
        xaxis=dict(
            backgroundcolor="rgb(209, 209, 209)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white",
        ),
        yaxis=dict(
            backgroundcolor="rgb(187, 200, 210)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white",
        ),
        zaxis=dict(
            backgroundcolor="rgb(233, 245, 255)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white",
        ),
    ),
    width=700,
    margin=dict(r=10, l=10, b=10, t=10)
)

fig.show()
How to fill Outliers in Python?
1. Filling Outliers with Z-score:
In the example below, we fill the outliers based on the Z-score. The zscore_outliers function first groups the columns by type, because we only want to look for outliers in columns that hold continuous values. In my dataframe all continuous values are of type float, but you can also include integer columns if your dataframe has them. The function sets the ID column and the target column aside, then calculates the Z-score for every continuous column, which yields True where the absolute Z-score is higher than 3 (an outlier) and False otherwise. The flagged values are masked and filled by linear interpolation. Finally, we concatenate all the columns and return the data frame with the outliers replaced.
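import numpy as np
import pandas as pd
from scipy import stats

def zscore_outliers(skewed_dataset):
    # Split the columns by type: continuous (float) and text (object)
    float_columns = list(skewed_dataset.select_dtypes(include=['float64']).columns)
    str_columns = list(skewed_dataset.select_dtypes(include=['object']).columns)
    db_id = skewed_dataset['ID']
    db1 = skewed_dataset[float_columns]
    db2 = skewed_dataset[str_columns]
    target = skewed_dataset['TARGET_FLAG']
    # True where the absolute Z-score exceeds 3, i.e. the value is an outlier
    z = (np.abs(stats.zscore(db1)) > 3)
    # Replace the outliers with NaN, then fill them by linear interpolation
    unskewed_dataset = db1.mask(z).interpolate(method='linear')
    db = pd.concat([db_id, unskewed_dataset, db2, target], axis=1)
    return db

model_dataset = zscore_outliers(df)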
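To see what the mask and interpolate steps do in isolation, here is a minimal sketch on a made-up series. The threshold of 100 is only a stand-in for the Z-score rule, chosen so the effect is easy to follow.

```python
import pandas as pd

# Hypothetical series with one obviously wrong reading
series = pd.Series([10.0, 11.0, 12.0, 500.0, 14.0, 15.0])

# Stand-in rule for the illustration: anything above 100 counts as an outlier
is_outlier = series > 100

# mask() replaces the flagged values with NaN,
# interpolate() then fills each gap from its neighbours on a straight line
filled = series.mask(is_outlier).interpolate(method="linear")

print(filled.tolist())
```

Linear interpolation assumes that neighbouring rows are related, as they are in a time series sorted by date. With the default settings it also leaves an outlier in the very first row as NaN, because there is no earlier value to interpolate from.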
2. Filling Outliers with Interquartile Range:
In the example below, we fill the outliers based on the Interquartile Range. The iqr function holds the logic that calculates the lower and upper bound from the Interquartile Range for each continuous column of the data frame. It returns True where a value lies outside the lower and upper bound (an outlier) and False where it lies between them.
The iqr_outliers function first groups the columns by type, because we only want to look for outliers in columns that hold continuous values. In my dataframe all continuous values are of type float, but you can also include integer columns if your dataframe has them. The function sets the ID column and the target column aside, applies iqr to all continuous columns, masks the flagged values and fills them by linear interpolation. Finally, we concatenate all the columns and return the data frame with the outliers replaced.
import pandas as pd

def iqr(x):
    # Per-column quartiles; np.percentile on a whole DataFrame would pool all columns together
    Q1, Q3 = x.quantile(0.25), x.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    # True where a value lies outside the bounds, i.e. the value is an outlier
    outlier = (x < lower_bound) | (x > upper_bound)
    return outlier

def iqr_outliers(skewed_dataset):
    # Split the columns by type: continuous (float) and text (object)
    float_columns = list(skewed_dataset.select_dtypes(include=['float64']).columns)
    str_columns = list(skewed_dataset.select_dtypes(include=['object']).columns)
    db_id = skewed_dataset['ID']
    db1 = skewed_dataset[float_columns]
    db2 = skewed_dataset[str_columns]
    target = skewed_dataset['TARGET_FLAG']
    iqr_ = iqr(db1)
    # Replace the outliers with NaN, then fill them by linear interpolation
    unskewed_dataset = db1.mask(iqr_).interpolate(method='linear')
    db = pd.concat([db_id, unskewed_dataset, db2, target], axis=1)
    return db

model_dataset = iqr_outliers(df)
3. Removing Outliers with Elliptical Envelope Method:
In the example below, we remove the outliers with the Elliptical Envelope method. First we import EllipticEnvelope from sklearn.covariance. Then we assign the estimator to the variable cov and pass it parameters such as contamination, the expected proportion of outliers in the data set (0.02 here), and random_state, the seed that makes the fit reproducible.
Then we fit it to our dataframe with the fit_predict method, which returns 1 for inliers and -1 for outliers.
import pandas as pd
from sklearn.covariance import EllipticEnvelope

cov = EllipticEnvelope(contamination=0.02, random_state=1)

# Returns 1 for inliers, -1 for outliers
pred = cov.fit_predict(data)

# Remove outliers; reuse the index of data so the labels line up row by row
pred_ = pd.DataFrame(pred, columns=['outliers'], index=data.index)
data_ = pd.concat([data, pred_], axis=1)
model_dataset = data_[data_['outliers'] == 1]
Finally, we attach the values returned by the method as a column called outliers to see which rows of the dataframe are outliers, and then remove them by keeping only the rows where the outliers column equals 1.
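The same steps can be tried end to end on synthetic data. The sketch below is only an illustration: the two-column dataframe, its column names a and b, and the four extreme points are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.covariance import EllipticEnvelope

# Hypothetical data: 200 Gaussian points plus four points far from the cloud
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extremes = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0], [-8.0, -9.0]])
data = pd.DataFrame(np.vstack([inliers, extremes]), columns=["a", "b"])

# contamination is the expected share of outliers in the data
cov = EllipticEnvelope(contamination=0.02, random_state=1)

# fit_predict returns 1 for inliers and -1 for outliers
labels = cov.fit_predict(data)

# Split the rows by label
cleaned = data[labels == 1]
removed = data[labels == -1]

print(len(cleaned), len(removed))
```

Because contamination fixes the share of rows that will be flagged, the method marks roughly that fraction as outliers even when the data is clean, so the value is worth choosing with the dataset in mind.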
Summary
In this article, we covered what an outlier is, how to identify outliers, how to visualise them, and how to fill or remove them in Python.
Framework: Jupyter Notebook, Language: Python, Libraries: sklearn, pandas, numpy, seaborn, plotly, math and matplotlib.