Handbook of Anomaly Detection: Chapter 1 Introduction
Chapter 1: Introduction
Insurance fraud, cyber hacking, malfunctioning equipment, and production failures are examples of rare events, and anomaly detection technology plays a critical role in catching them. In this chapter, I will start with real-world applications to illustrate the significance of anomaly detection. Then I will explain the philosophy behind the selection of algorithms in this book. Finally, I will describe the general modeling procedure, which will be followed in the subsequent chapters for both supervised and unsupervised learning.
(A) Insurance Fraud
Fraud is the knowing and willful execution of a scheme to defraud a benefits program or to obtain financial gain. In insurance, fraud not only inflicts extra costs on insurance companies but also financially impacts consumers and businesses. The Coalition Against Insurance Fraud reports that fraud costs businesses and consumers $308.6 billion a year [1]. The FBI estimates that fraud costs the average family between $400 and $700 a year in premiums [2].
The growth of fraud has been unprecedented in recent years. Let’s take two popular lines of insurance, property and auto, to see how fraud happens. In home property insurance, fire damage is one of the most expensive claim categories. False fire claims can arise in both personal home claims and commercial property claims: a claimant may exaggerate losses or even deliberately plan a fire. Next to fire claims, fraud occurs in theft and burglary claims. While most claims are genuine and honest, some claimants exaggerate losses or stage a theft to profit from the subsequent claim payout. In auto insurance, staged claims are not new to insurance professionals. For example, fraudsters can stage a car accident and then plant witnesses to tell police that the victim is at fault. Or a car owner can let an accomplice steal the car and sell it for parts, and then file a car theft claim.
Since we are about to apply data science to anomaly detection, let’s use data science terminology to distinguish an outlier from a fraudulent event. Every variable has outliers, but not all outliers are fraudulent. For example, a typical retail trip includes 1 or 2 gallons of milk, but a large family or a business can buy more than 5 gallons in one trip. The latter case is an outlier yet not fraudulent. Fraudulent activities tend to appear among the outliers, so we focus on the outliers to detect them.
(B) Mechanical Failure Detection
Today’s large-scale machines are well designed to ensure normal operation with a very low failure rate. However, mechanical failures can still happen and result in unrecoverable losses. Monitoring systems are installed to signal any abnormal situation. Consider temperature sensors installed in many parts of a machine, such as cooling fans, bearings, turbines, gears, and belts. Detecting an abnormal situation gives engineers the lead time to react and prevent a crisis. For example, a research report [3] by the U.S. Office of Nuclear Energy shows how anomaly detection techniques are applied to nuclear power plants.
No one is likely to disagree with a proactive approach that guards against any possible failure. However, too many false alerts result in time-consuming inspections and shutdowns and may delay the investigations that are truly needed. This trade-off points to an important criterion for an anomaly detection model: a high level of accuracy with a low level of false positives. The challenge for data scientists is to develop models that predict accurately while keeping false positives very low. The aim of this book is to survey a wide range of algorithms that help data scientists achieve better model outcomes.
(C) Cybersecurity
Cyber-attacks happen due to malicious intent, negligent operations, or external malware. Any single attack can destroy the financial assets of an enterprise and result in unrecoverable losses. Intrusion detection systems monitor the traffic through firewalls, web gateways, and all the networks. Machine-learning algorithms are built into those systems to detect any malicious activity.
(D) What Are the Challenges in Anomaly Detection Modeling?
The challenge of anomaly detection modeling is that a model may take anomalies for normal patterns, or normal patterns for anomalies. If a model misclassifies an anomaly as normal and lets it go unnoticed, the business may suffer an unrecoverable loss. On the other hand, a model that creates too many false alarms is not helpful because it constantly disrupts regular operations. Thus, anomaly detection demands multiple algorithms to discover hidden data patterns, careful investigation of the rare events, and a better understanding of the sources of noise.
Outliers have three distinct properties: (1) rare, (2) heterogeneous, and (3) evolving. The rarity of outliers implies the target is extremely imbalanced if a supervised learning model is used. Outliers are heterogeneous because they may include different kinds of rare events. Outliers evolve because fraudsters can learn or invent new tricks to attack a system.
These properties have direct implications for the selection of modeling algorithms. Unsupervised learning algorithms are naturally suitable for detecting new types of outliers, though they may deliver lower predictive performance than supervised learning techniques. Supervised learning, on the other hand, is effective in detecting known types of outliers. Because labeled outliers can cover several distinct types, supervised learning can target those types more precisely. Its drawback is that it cannot detect novel types of anomalies for which there are no labeled examples.
For these reasons, unsupervised learning algorithms are recommended first to explore different types of anomalies. After different types of anomalies have been verified and collected over time, supervised learning algorithms can be built to target those specific types. Consider the fire claim examples mentioned earlier: one type of fraud is a staged arson, and another is the exaggeration of a fire loss. Both types of anomalies deviate from regular insurance claims. From a supervised learning perspective, both types can be labeled 1 for the target variable and the remaining regular claims 0. The supervised learning model will identify the two types of fraud with better precision and may surface other suspicious claims that resemble them.
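To make the labeling step concrete, here is a minimal sketch that collapses several verified fraud types into a single binary target. The DataFrame and its columns (`claim_amount`, `fraud_type`) are hypothetical, not drawn from a real dataset.

```python
import pandas as pd

# Hypothetical claims data: fraud_type is None for regular claims,
# or a verified label such as "staged_arson" or "exaggerated_loss".
claims = pd.DataFrame({
    "claim_amount": [12000, 3500, 98000, 4200, 150000],
    "fraud_type": [None, None, "staged_arson", None, "exaggerated_loss"],
})

# Collapse all verified fraud types into a single binary target:
# 1 = any known type of fraud, 0 = regular claim.
claims["target"] = claims["fraud_type"].notna().astype(int)
print(claims)
```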
The rarity of outliers implies the target will be extremely imbalanced in a supervised learning setting. Class imbalance can result in a serious bias towards the majority class, reducing the classification performance and increasing the number of false negatives. That’s why this handbook includes two chapters on over-sampling and under-sampling techniques for supervised learning.
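As a quick preview of those chapters, here is a minimal sketch that over-samples the minority class with SMOTE from the imbalanced-learn package (assumed to be installed); the data is synthetic and deliberately imbalanced at roughly 1% positives.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, highly imbalanced data: ~1% positives standing in for rare anomalies.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class examples by interpolating between neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```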
(E) The Algorithms in This Book Are for Multivariate Data
Data can largely be categorized into multivariate data, serial data, and image data. Multivariate data is probably the most common type. It can be presented in a data matrix in which the rows correspond to individual observations and the columns correspond to variables. Serial data include any univariate data such as time series, text sentences, or voice streams. Image data refer to digital pictures or data produced by scanning with an electronic device. The algorithms in this book are best suited for multivariate data. Readers interested in anomaly detection methods for time series data are encouraged to read my book “Modern Time Series Anomaly Detection: With Python and R Examples” [4] or the series “Anomaly Detection for Time Series” [5] on Medium.com.
(F) How Do I Select the Algorithms in This Book?
Since there are many anomaly detection algorithms and it is not necessary to cover all of them, I set up three criteria for selecting the algorithms in this book. The first is algorithm speed. Model training and assessment is an iterative process and requires in-depth domain knowledge, so it is useful to highlight fast algorithms from a project management perspective. These algorithms include Empirical Cumulative Distribution-Based Outlier Detection (ECOD), Histogram-Based Outlier Score (HBOS), and Isolation Forest (IForest).
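As a quick illustration of how lightweight these detectors are, here is a minimal sketch using the PyOD library (an assumption; the chapter itself does not prescribe a package) on synthetic data. In practice you would pass in your own feature matrix.

```python
import numpy as np
from pyod.models.ecod import ECOD
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest

# Synthetic data: 1,000 normal points plus a handful of obvious outliers.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(1000, 5)),
               rng.normal(6, 1, size=(10, 5))])

for model in (ECOD(), HBOS(), IForest(random_state=42)):
    model.fit(X)
    # decision_scores_ holds the outlier score of each training instance;
    # labels_ is PyOD's binary 0/1 flag based on the contamination rate.
    print(type(model).__name__, "flagged", model.labels_.sum(), "outliers")
```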
The second reason for an algorithm to be included in the book is its popularity. Readers may have heard of Principal Component Analysis (PCA), K-Nearest Neighbors (KNN), Gaussian Mixture Models (GMM), or the One-Class Support Vector Machine (OCSVM). Some readers may have worked with these algorithms in their statistics courses. The popularity of these algorithms makes them easy to understand and accepted by users for production.
The third reason is that an algorithm can represent a family of similar algorithms. By learning these algorithms, readers can carry the same ideas over to their variants. These algorithms include Local Outlier Factor (LOF), Clustering-Based Local Outlier Factor (CBLOF), Extreme Gradient Boosting-Based Outlier Detection (XGBOD), and Autoencoders. By learning XGBOD, readers can explore other supervised learning methods. By learning Autoencoders, readers can extend to the entire class of deep learning algorithms, in particular Variational Autoencoders (VAE).
Anomaly detection models can be classified into proximity-based, distribution-based, and ensemble-based algorithms. First, since anomalies are rare events that lie far from other data points, an intuitive approach is to measure the closeness between data points. These are the proximity-based algorithms, which include KNN, IForest, OCSVM, LOF, and CBLOF. Second, we can fit the distribution of a variable and flag values that fall in its extreme tails. These are the distribution-based algorithms: HBOS, ECOD, and GMM are all distribution-based. Finally, with advances in machine learning, some algorithms ensemble multiple models to pursue better accuracy. These are the ensemble-based algorithms: IForest and XGBOD are ensemble-based.
The eleven algorithms in this book will hopefully provide both depth and breadth in the algorithm space. Readers are encouraged to compare multiple methods for the best outcome, and even to advance to other methods not included in this book.
(G) Managing a Project from Unsupervised to Supervised Learning
Often we need to work in an unsupervised learning setting because of the lack of sufficient rare events to serve as the target for a supervised learning model. Many unsupervised learning models, such as the One-Class Support Vector Machine (OCSVM), were developed for cases where rare events are extremely difficult to find.
How do we manage the outcome of an unsupervised model? The outcome of an unsupervised learning model shall be verified by human eyes to confirm it is the desired outcome. For example, if an unsupervised learning model is developed for an insurance fraud application, the outliers identified by the model will be investigated by Special Investigation Unit (SIU) professionals.
Over time, there will be enough verified fraudulent claims, and we will learn the percentage of true fraud in the population. We can label the verified claims as the target instances and develop supervised models. The model will then aim to find other instances similar to the target instances. Remember, the target instances in any supervised learning model were verified at some point in the past.
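A hypothetical sketch of this transition is shown below. It uses PyOD's OCSVM for the unsupervised stage and a generic scikit-learn classifier for the later supervised stage; the SIU review is simulated with random labels purely for illustration.

```python
import numpy as np
from pyod.models.ocsvm import OCSVM
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 8))           # stand-in for claim features

# Stage 1 (unsupervised): score all claims and send the top-scored ones to review.
detector = OCSVM()
detector.fit(X)
scores = detector.decision_scores_
review_idx = np.argsort(scores)[-50:]    # 50 most suspicious claims for the SIU

# Stage 2 (supervised, later): once the SIU has verified enough claims,
# train a classifier on the labeled subset. Here the labels are simulated.
verified_labels = rng.binomial(1, 0.3, size=review_idx.size)
clf = RandomForestClassifier(random_state=0)
clf.fit(X[review_idx], verified_labels)
```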
(H) Method & Modeling Procedure
The following modeling procedure will be followed throughout the book. It consists of three steps: (1) model development, (2) threshold determination, and (3) profiling of the normal and abnormal groups. This procedure will keep you focused on model development, assessment, and interpretation of the results.
When Step 1 is completed, the model provides predicted outlier scores for all instances. A high outlier score for an instance indicates it is likely an outlier. However, because we do not have a target to verify the outcome, we do not know the percentage of outliers. Even so, we can still use the outlier scores to understand the relative “outlier-ness” of instances and identify those that depart from the regular instances.
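For example, the sketch below (using PyOD's IForest on synthetic data as an assumed stand-in for any detector from Step 1) ranks instances by their predicted outlier scores to reveal relative outlier-ness even without labels.

```python
import numpy as np
import pandas as pd
from pyod.models.iforest import IForest

rng = np.random.RandomState(1)
X = rng.normal(size=(500, 4))

model = IForest(random_state=1)
model.fit(X)

# Rank instances by predicted outlier score: higher score = more outlier-like.
scores = pd.Series(model.decision_scores_, name="outlier_score")
print(scores.sort_values(ascending=False).head(10))   # the 10 most anomalous rows
```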
In Step 2, we plot the outlier scores in a histogram, then choose a threshold to separate the normal from the abnormal observations. The threshold determines the size of the abnormal group. If prior belief suggests the percentage of anomalies should be no more than 1%, you can choose a threshold that yields approximately 1% of anomalies.
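Here is a minimal sketch of that thresholding step; the scores are simulated stand-ins for the Step 1 output, and the 99th percentile encodes the hypothetical 1% prior.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assume `scores` came from Step 1; here they are simulated for illustration.
rng = np.random.RandomState(2)
scores = np.concatenate([rng.normal(0, 1, 990), rng.normal(5, 1, 10)])

# Step 2: inspect the histogram, then pick a cut-off near the desired anomaly rate.
plt.hist(scores, bins=50)
plt.xlabel("outlier score")
plt.ylabel("count")
plt.show()

threshold = np.percentile(scores, 99)          # prior belief: ~1% anomalies
is_outlier = scores > threshold
print(f"threshold={threshold:.2f}, flagged={is_outlier.sum()} of {scores.size}")
```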
Step 3 is a critical step that communicates the soundness of the model. The descriptive statistics (such as the means and standard deviations) of the features in the two groups shall be carefully examined. If the mean of a feature is expected to be higher or lower in the abnormal group but the result is counter-intuitive, you are advised to review or modify the feature. For example, in a credit card fraud case, a fraudster wants to get as much money as possible in a short period of time, so outliers should be the transactions that deviate greatly from the monthly average. In the descriptive statistics table, the average transaction amount of the outlier group is expected to be higher than that of the normal group. If that is not the case, you shall examine the feature. You will iterate on the modeling until all features make sense in the descriptive statistics table.
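Step 3 typically amounts to a group-by comparison like the hypothetical sketch below, where `amount` and `txn_count` are placeholder features and `is_outlier` stands in for the flag produced in Step 2.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(3)
df = pd.DataFrame({
    "amount": rng.lognormal(4, 1, size=1000),      # hypothetical transaction amount
    "txn_count": rng.poisson(5, size=1000),        # hypothetical monthly transactions
})
df["is_outlier"] = (df["amount"] > df["amount"].quantile(0.99)).astype(int)

# Compare means and standard deviations of each feature between the two groups.
profile = df.groupby("is_outlier").agg(["mean", "std"]).round(2)
print(profile)
# Sanity check: the outlier group should show a higher average amount; if a feature
# moves in a counter-intuitive direction, revisit the feature engineering.
```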
(I) More Public Outlier Detection Datasets
If you are looking for public datasets for model training, here are some sources to consider. A well-organized one is the ODDS (Outlier Detection Datasets) collection (http://odds.cs.stonybrook.edu). It contains many datasets representing various domains and types of data, such as multivariate data, time series data, time series graph data, video data, and cybersecurity data. Some of the datasets also have labeled targets for supervised learning models. Below, I reference two datasets and explain their usage.
The first one is the MNIST dataset with an additional target label variable [6]. The MNIST dataset is well known in the data science community and contains images of handwritten digits. The version in ODDS has an additional target field with value “1” for outliers and “0” for inliers, making it suitable for supervised learning. Interested readers can reference Bandaragoda et al. [7] for how the target was created.
The second one is the wine dataset in ODDS [8]. It contains 13 wine attributes and 3 classes. Of the three wine classes, Class 1 is down-sampled to only 10 instances, which serve as the outliers. Because this dataset has a target label, it is suitable for supervised learning. Interested readers can reference Sathe and Aggarwal [9] for how the target was created.
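ODDS typically distributes these datasets as MATLAB .mat files containing an 'X' feature matrix and a 'y' label vector. The sketch below assumes wine.mat has already been downloaded from the ODDS site into the working directory.

```python
from scipy.io import loadmat

# Assumes wine.mat was downloaded from http://odds.cs.stonybrook.edu beforehand.
mat = loadmat("wine.mat")
X, y = mat["X"], mat["y"].ravel()

print("features:", X.shape)                     # (n_samples, 13 wine attributes)
print("outliers:", int(y.sum()), "of", y.size)  # y == 1 marks the outlier class
```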
(J) A Reminder
Feature engineering is always the most important step in any model. A carefully designed feature can be far more helpful than one thousand dummy variables that create many false positives. The ancient Greek philosopher Heraclitus once said, “One man is worth a thousand if he is extraordinary.” We can say a feature is worth a thousand when it is well designed.
References
https://a.co/d/9Ln1W3I