Machine Learning - All you need to know about Outliers
Photo by Daniel Honies on Unsplash

Identify, visualise and fill outliers in Python

Content

  • How to identify Outliers in the Dataset?
  • How to visualise Outliers in the Dataset?
  • How to fill Outliers in Python?

How to identify Outliers in the Dataset?

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses. — Wikipedia

In other words, outliers are observations that lie far from the rest of the data. We should form our assumptions about what a “normal” expected value is, and be careful not to remove values that are important outliers (e.g. fraud).

1. Finding Outliers with Z-score:

The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.

In theory, 99.7% of the data points in a normally distributed data set lie within 3 standard deviations of the mean, as shown in the figure below. That means that any value with a Z-score above 3 or below -3 will be considered an outlier.

Image created by author
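
As a minimal sketch of the rule above, the snippet below computes Z-scores with NumPy and flags values whose absolute Z-score exceeds 3. The sample values and the helper name zscore_flags are made up for illustration and are not part of the original dataset.

```python
import numpy as np

def zscore_flags(values, threshold=3.0):
    """Return a boolean array that is True where |z| > threshold.

    zscore_flags is a hypothetical helper written for this illustration.
    """
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), the same default scipy.stats.zscore uses
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Made-up sample: fifty readings between 48 and 52, plus one extreme value
sample = np.append(np.tile([48.0, 49.0, 50.0, 51.0, 52.0], 10), 120.0)
flags = zscore_flags(sample)
print(sample[flags])  # only the extreme value 120.0 is flagged
```

Note that an extreme value inflates the mean and standard deviation it is measured against, so in a very small sample even an obvious outlier may not reach a Z-score of 3.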

2. Finding Outliers with Elliptical Envelope Method:

The Elliptical Envelope method detects outliers in Gaussian-distributed data. It fits an imaginary elliptical area around a given dataset, shown below as the green dotted circle. Values that fall inside the envelope are considered normal data and anything outside the envelope is returned as an outlier.

Image created by author

3. Finding Outliers with Interquartile Range:

The interquartile range (IQR), also called the midspread, middle 50%, or H‑spread, is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.

The interquartile range is useful in detecting the outliers in the dataset. Quartiles segment any distribution that’s ordered from low to high into four equal parts. The interquartile range (IQR) contains the second and third quartiles, or the middle half of your data set.

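  • Position of Lower Quartile (Q1) = (N+1) * 1 / 4
  • Position of Middle Quartile (Q2) = (N+1) * 2 / 4
  • Position of Upper Quartile (Q3) = (N+1) * 3 / 4
  • Interquartile Range (IQR) = Q3 - Q1
  • Lower Bound = Q1 - 1.5 * IQR
  • Upper Bound = Q3 + 1.5 * IQR

Here N is the number of observations, and the first three formulas give the position of each quartile in the sorted data rather than its value. Any values that fall below the lower bound or above the upper bound are classified as outliers.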
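
The bounds can be sketched in a few lines of NumPy. This is an illustration on made-up numbers, assuming the common 1.5 × IQR fences (the same multiplier used in the iqr function later in this article); iqr_bounds is a hypothetical helper name. np.percentile interpolates between neighbouring values by default, so its quartiles can differ slightly from the (N+1) position rule.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences; iqr_bounds is a hypothetical helper."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one obviously extreme value
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

print(lower, upper)  # -3.5 14.5
print(outliers)      # [100.]
```

Unlike the Z-score, the quartiles barely move when a single extreme value is added, which is why this rule still works on small samples.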

How to visualise Outliers in the Dataset?

1. Heatmap:

A heatmap is a two-dimensional graphical representation of data that uses colors to represent values. In the example below we have created a function std_outlier which calculates the mean and standard deviation of a column and derives a lower and upper bound from them. The function returns True when a value lies between the lower and upper bound and False when it lies outside. The resulting True/False table is then drawn as a heatmap, in which the False cells (the outliers) stand out in a different color.

import matplotlib.pyplot as plt
import seaborn as sns

df = df[['temp', 'rain', 'msl', 'dewpt', 'rhum', 'demand', 'Day sin', 'Day cos', 'Year sin', 'Year cos', 'Wx', 'Wy']]

def std_outlier(st, nstd=3.0, return_thresholds=False):
    # Bounds are the column mean plus/minus nstd standard deviations
    data_mean, data_std = st.mean(), st.std()
    cut_off = data_std * nstd
    lower, upper = data_mean - cut_off, data_mean + cut_off
    if return_thresholds:
        return lower, upper
    # True for values inside the bounds, False for outliers
    return ~((st < lower) | (st > upper))

std3 = df.apply(std_outlier, nstd=3.0)
std4 = df.apply(std_outlier, nstd=4.0)
std5 = df.apply(std_outlier, nstd=5.0)

f, (ax1, ax2, ax3) = plt.subplots(ncols=3, nrows=1, figsize=(22, 12))
ax1.set_title('Outliers with 3 standard deviations')
ax2.set_title('Outliers with 4 standard deviations')
ax3.set_title('Outliers with 5 standard deviations')

sns.heatmap(std3, cmap='Reds', ax=ax1)
sns.heatmap(std4, cmap='Reds', ax=ax2)
sns.heatmap(std5, cmap='Reds', ax=ax3)

plt.show()

2. Boxplot with Histogram:

A boxplot is a standardised way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines (whiskers) extending from the boxes, indicating variability outside the upper and lower quartiles. When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

import plotly.express as px

fig = px.histogram(df, x="temp", marginal="box")
fig.show()

3. Scatter Plot:

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

When reviewing a scatter plot, an outlier is defined as a data point or points that are located farthest from the other points or an average line.

import plotly.graph_objects as go

# Data
x = df['demand']
y = df['temp']

# Layout
layout = go.Layout(
    title="Demand Based on Temperature",
    xaxis=dict(title="Electricity Demand (MW)"),
    yaxis=dict(title="Temperature (Celsius)")
)

fig = go.Figure(layout=layout)

# Add scatter trace with medium sized markers
fig.add_trace(
    go.Scatter(
        mode='markers',
        x=x,
        y=y,
        marker=dict(
            color='darkturquoise',
            size=6,
            opacity=0.4,
            line=dict(color='burlywood', width=1)
        ),
        showlegend=False
    )
)

fig.show()

You can also view the data on a 3D scatter plot by adding the date-time dimension, which helps in judging what a “normal” expected value is. Some data points that look like outliers on the 2D plot may recur every year and therefore be “normal” expected values (e.g. the temperature is -7 every December).

import plotly.graph_objects as go

# Data
x = df['demand']
y = df['temp']
z = df['date']

fig = go.Figure()

# Add 3D scatter trace with small markers
fig.add_trace(
    go.Scatter3d(
        mode='markers',
        x=x,
        y=y,
        z=z,
        marker=dict(
            color='darkturquoise',
            size=4,
            opacity=0.3,
            line=dict(color='burlywood', width=1)
        ),
        showlegend=False
    )
)

# Set the axis titles and a background color for each axis
fig.update_layout(
    scene=dict(
        xaxis_title='Electricity Demand (MW)',
        yaxis_title='Temperature (Celsius)',
        zaxis_title='Date',
        xaxis=dict(
            backgroundcolor="rgb(209, 209, 209)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white"),
        yaxis=dict(
            backgroundcolor="rgb(187, 200, 210)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white"),
        zaxis=dict(
            backgroundcolor="rgb(233, 245, 255)",
            gridcolor="white",
            showbackground=True,
            zerolinecolor="white")
    ),
    width=700,
    margin=dict(r=10, l=10, b=10, t=10)
)

fig.show()

How to fill Outliers in Python?

1. Filling Outliers with Z-score:

In the example below, we fill the outliers using the Z-score. In the zscore_outliers function we first categorise the columns by type, because we only want to find outliers in the columns that hold continuous values. In my dataframe all the continuous values are of type float, but you can also include integers if your dataframe has integer values. We also handle the ID column and the Target column separately. We then calculate the Z-score for all the continuous columns, which gives True where the absolute Z-score is higher than 3 (an outlier) and False where it is 3 or lower. The flagged values are replaced by linear interpolation. Finally, we concatenate all the columns together and return the data frame with no outliers.

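import numpy as np
import pandas as pd
from scipy import stats

def zscore_outliers(skewed_dataset):
    # Split the columns by type: only continuous (float) columns are checked
    float_columns = list(skewed_dataset.select_dtypes(include=['float64']).columns)
    str_columns = list(skewed_dataset.select_dtypes(include=['object']).columns)
    db_id = skewed_dataset['ID']
    db1 = skewed_dataset[float_columns]
    db2 = skewed_dataset[str_columns]
    target = skewed_dataset['TARGET_FLAG']
    # True where a value lies more than 3 standard deviations from its column mean
    z = np.abs(stats.zscore(db1)) > 3
    # Replace the flagged values with NaN, then fill them by linear interpolation
    unskewed_dataset = db1.mask(z).interpolate(method='linear')
    db = pd.concat([db_id, unskewed_dataset, db2, target], axis=1)
    return db

model_dataset = zscore_outliers(df)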
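
To see what the mask and interpolate steps do in isolation, here is a minimal sketch on a made-up pandas Series. The threshold of 100 is an arbitrary stand-in for the Z-score test and is only there to keep the example small.

```python
import pandas as pd

# Made-up readings; 500.0 plays the role of an outlier
readings = pd.Series([10.0, 11.0, 12.0, 500.0, 14.0, 15.0])

# Hypothetical outlier test standing in for the Z-score mask
is_outlier = readings > 100

# mask() turns flagged values into NaN, interpolate() fills each gap
# with a value on the straight line between its neighbours
filled = readings.mask(is_outlier).interpolate(method='linear')
print(filled.tolist())  # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]

# A flagged value in the first row has no earlier neighbour,
# so with the default settings it is left as NaN
edge = pd.Series([500.0, 11.0, 12.0])
edge_filled = edge.mask(edge > 100).interpolate(method='linear')
print(edge_filled.isna().tolist())  # [True, False, False]
```

If outliers can occur in the first row, passing limit_direction='both' to interpolate is one way to have them filled as well.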

2. Filling Outliers with Interquartile Range:

In the example below, we fill the outliers using the Interquartile Range. The iqr function calculates the lower and upper bound based on the Interquartile Range. It returns True where a value is outside the lower and upper bound (an outlier) and False where it lies between them, and this result is used to mask the continuous columns of the data frame.

In the iqr_outliers function we first categorise the columns by type, because we only want to find outliers in the columns that hold continuous values. In my dataframe all the continuous values are of type float, but you can also include integers if your dataframe has integer values. We also handle the ID column and the Target column separately, flag the outliers in all the continuous columns with iqr, and fill them by linear interpolation. Finally, we concatenate all the columns together and return the data frame with no outliers.

import pandas as pd

def iqr(x):
    # Quartiles are taken per column; np.percentile on the whole
    # dataframe would pool the values of every column together
    Q1, Q3 = x.quantile(0.25), x.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    # True where a value lies outside the bounds
    outlier = (x < lower_bound) | (x > upper_bound)
    return outlier

def iqr_outliers(skewed_dataset):
    float_columns = list(skewed_dataset.select_dtypes(include=['float64']).columns)
    str_columns = list(skewed_dataset.select_dtypes(include=['object']).columns)
    db_id = skewed_dataset['ID']
    db1 = skewed_dataset[float_columns]
    db2 = skewed_dataset[str_columns]
    target = skewed_dataset['TARGET_FLAG']
    iqr_ = iqr(db1)
    # Replace the flagged values with NaN, then fill them by linear interpolation
    unskewed_dataset = db1.mask(iqr_).interpolate(method='linear')
    db = pd.concat([db_id, unskewed_dataset, db2, target], axis=1)
    return db

model_dataset = iqr_outliers(df)

3. Removing Outliers with Elliptical Envelope Method:

In the below example, we are removing the outlier values based on the Elliptical Envelope method. First of all we will load the package from sklearn.covariance called EllipticEnvelope. Then we assign the method to the variable cov and pass the parameters in the method such as:

  • contamination which is the amount of contamination of the data set, i.e. the proportion of outliers in the data set. Range is (0, 0.5).
  • random_state which determines the pseudo random number generator for shuffling the data. Pass an int for reproducible results across multiple function calls.

Then we fit the method to our dataframe with the fit_predict method, which returns 1 for inliers and -1 for outliers.

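import pandas as pd
from sklearn.covariance import EllipticEnvelope

cov = EllipticEnvelope(contamination=0.02, random_state=1)

# Returns 1 for inliers, -1 for outliers
# (data is assumed to be a dataframe of numeric columns with no missing values)
pred = cov.fit_predict(data)

# Remove outliers; reuse data's index so the rows line up in the concat
pred_ = pd.DataFrame(pred, columns=['outliers'], index=data.index)
data_ = pd.concat([data, pred_], axis=1)

model_dataset = data_[data_['outliers'] == 1]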

Finally, we concatenate the values returned by the method as a column called outliers, so we can see which rows of the dataframe are outliers. We then remove those rows by keeping only the rows where the outliers column is equal to 1.

Summary

In this article, we covered what an outlier is, how to identify and visualise outliers, and how to fill or remove them in Python.

Framework: Jupyter Notebook, Language: Python, Libraries: sklearn, pandas, numpy, seaborn, plotly, math and matplotlib.

More articles by Gaurav Pahuja
