Intrusion Detection System and Confusion Matrix
Introduction
The evolution of malicious software (malware) poses a critical challenge to the design of intrusion detection systems (IDS). Malicious attacks have become more sophisticated, and the foremost challenge is to identify unknown and obfuscated malware, because malware authors use various evasion techniques to conceal information and prevent detection by an IDS. In addition, security threats such as zero-day attacks aimed at internet users have increased. Computer security has therefore become essential now that information technology is part of our daily lives. Countries such as Australia and the US have been significantly affected by zero-day attacks. According to the 2017 Symantec Internet Security Threat Report, more than three billion zero-day attacks were reported in 2016, and their volume and intensity were substantially greater than before. As highlighted in the Data Breach Statistics in 2017, around nine billion data records have been lost or stolen by hackers since 2013. A Symantec report found that the number of security breach incidents is on the rise. In the past, cybercriminals focused mainly on bank customers, robbing bank accounts or stealing credit cards. The new generation of malware, however, is more ambitious and targets the banks themselves, sometimes trying to take millions of dollars in a single attack. For that reason, detecting zero-day attacks has become the highest priority. High-profile cybercrime incidents have demonstrated how easily cyber threats can spread internationally, as a simple compromise can disrupt a business's essential services or facilities.
There are a large number of cybercriminals around the world who are motivated to steal information, obtain revenue illegitimately, and find new targets. Malware is intentionally created to compromise computer systems and to exploit any weakness in intrusion detection systems. In 2017, the Australian Cyber Security Centre (ACSC) critically examined the different levels of sophistication employed by attackers. There is therefore a need to develop an efficient IDS that can detect novel, sophisticated malware. The aim of an IDS is to identify different kinds of malware as early as possible, which a traditional firewall cannot do. With the increasing volume of computer malware, the development of improved IDSs has become critical.
Over the last few decades, machine learning has been used to improve intrusion detection, and there is currently a need for an up-to-date, thorough taxonomy and survey of this recent work. A large number of related studies use either the KDD Cup 99 or the DARPA 1999 dataset to validate the development of IDSs; however, there is no clear answer to the question of which data mining techniques are more effective. Furthermore, the time taken to build an IDS is not considered in the evaluation of some IDS techniques, despite being a critical factor for the effectiveness of 'on-line' IDSs.
Intrusion detection systems
An intrusion can be defined as any unauthorized activity that causes damage to an information system. This means that any attack that could threaten the confidentiality, integrity or availability of information is considered an intrusion. For example, activities that make computer services unresponsive to legitimate users are considered an intrusion. An IDS is a software or hardware system that identifies malicious actions on computer systems so that system security can be maintained. The goal of an IDS is to identify different kinds of malicious network traffic and computer usage that cannot be identified by a traditional firewall. This is vital to achieving high protection against actions that compromise the availability, integrity, or confidentiality of computer systems. IDSs can be broadly categorized into two groups: Signature-based Intrusion Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS).
Machine Learning in IDS
Machine learning is the process of extracting knowledge from large quantities of data. Machine learning models consist of a set of rules, methods, or complex “transfer functions” that can be applied to find interesting data patterns, or to recognize or predict behaviour. Machine learning techniques have been applied extensively in the area of IDS. Several algorithms and techniques, such as clustering, neural networks, association rules, decision trees, genetic algorithms, and nearest neighbour methods, have been applied to discover knowledge from intrusion datasets.
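As a rough illustration of one of the techniques listed above, the nearest neighbour idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two features (connection duration and failed login attempts), the training records, and the function name nearest_neighbour_predict are made up and do not come from any real intrusion dataset.

```python
import math

def nearest_neighbour_predict(train_features, train_labels, sample):
    """Label a sample with the label of its closest training record."""
    best_index = min(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], sample),  # Euclidean distance
    )
    return train_labels[best_index]

# Made-up records: (connection duration in seconds, failed login attempts).
train_features = [(1.0, 0), (2.0, 0), (1.5, 1), (30.0, 8), (45.0, 12)]
train_labels = ["normal", "normal", "normal", "attack", "attack"]

print(nearest_neighbour_predict(train_features, train_labels, (40.0, 10)))
print(nearest_neighbour_predict(train_features, train_labels, (1.2, 0)))
```

A practical detector would normally scale the features first and vote over several neighbours; this sketch skips both to keep the idea visible.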
Performance Metrics in IDS
There are many classification metrics for IDS, some of which are known by multiple names. The image below shows the confusion matrix for a two-class classifier, which can be used to evaluate the performance of an IDS. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. Let us first look at what a confusion matrix is.
Confusion Matrix
A confusion matrix summarizes how the predicted results compare with the actual results in a classification problem. This summary is needed to judge the performance of a model after it has been trained on some training data.
Actual class 1 has the value 1, which corresponds to the positive outcome in a binary classification.
Actual class 2 has the value 0, which corresponds to the negative outcome in a binary classification.
For an IDS, the positive class is an attack and the negative class is normal traffic. A confusion matrix is built from the following components:
- Positive (P): the predicted result is positive (example: the image is a cat).
- Negative (N): the predicted result is negative (example: the image is not a cat).
- True Positive (TP): the predicted value and the actual value are both 1 (positive).
- True Negative (TN): the predicted value and the actual value are both 0 (negative).
- False Negative (FN): the predicted value is 0 (negative) but the actual value is 1. The two values do not match, so the prediction is a false negative.
- False Positive (FP): the predicted value is 1 (positive) but the actual value is 0. Again the two values do not match, so the prediction is a false positive.
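The four components can be counted directly from a list of actual labels and a list of predicted labels. The following is a minimal Python sketch, assuming labels are encoded as 1 (positive, attack) and 0 (negative, normal); the function name confusion_counts and the eight sample labels are hypothetical and only serve the illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1   # attack correctly flagged
        elif a == 0 and p == 0:
            counts["TN"] += 1   # normal traffic correctly passed
        elif a == 1 and p == 0:
            counts["FN"] += 1   # attack missed
        elif a == 0 and p == 1:
            counts["FP"] += 1   # false alarm on normal traffic
        else:
            raise ValueError("labels must be 0 or 1")
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}

# Hypothetical labels for eight traffic records.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))
# {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```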
Performance metrics for IDS
IDSs are typically evaluated using the following standard performance measures:
True Positive Rate (TPR): It is calculated as the ratio between the number of correctly predicted attacks and the total number of attacks. If all intrusions are detected, the TPR is 1, which is extremely rare for an IDS. The TPR is also called the Detection Rate (DR) or the sensitivity. The TPR can be expressed mathematically as:
TPR = TP / (TP + FN)
False Positive Rate (FPR): It is calculated as the ratio between the number of normal instances incorrectly classified as an attack and the total number of normal instances.
FPR = FP / (FP + TN)
False Negative Rate (FNR): A false negative occurs when a detector fails to identify an anomaly and classifies it as normal. The FNR is the ratio between the number of attacks incorrectly classified as normal and the total number of attacks, so FNR = 1 - TPR. The FNR can be expressed mathematically as:
FNR = FN / (FN + TP)
Classification Rate (CR) or Accuracy: The CR measures how accurately the IDS classifies traffic as normal or anomalous. It is defined as the ratio of correctly predicted instances to all instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
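The four formulas above translate directly into code. The sketch below is only an illustration: the function name ids_metrics and the counts passed to it are assumed values, not results from any real IDS. A rate whose denominator is zero is returned as None rather than guessed.

```python
def ids_metrics(tp, tn, fp, fn):
    """Return TPR, FPR, FNR and accuracy from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # A rate is undefined when its denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "Accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }

# Assumed counts: 100 attacks (90 detected) and 900 normal records (20 false alarms).
metrics = ids_metrics(tp=90, tn=880, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the TPR is 0.9, the FNR is 0.1, the FPR is 20/900 (about 0.022) and the accuracy is 0.97. Accuracy stays high here even though one attack in ten is missed, which is why it is usually read together with the TPR and FPR when normal traffic dominates.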
Receiver Operating Characteristic (ROC) curve: The ROC curve has the FPR on the x-axis and the TPR on the y-axis, so the TPR is plotted as a function of the FPR for different cut-off points. Each point on the curve is an (FPR, TPR) pair that corresponds to one decision threshold. As the classification threshold is varied, a different point on the curve is selected, with a different False Alarm Rate (FAR, another name for the FPR) and a different TPR. A test with perfect discrimination (no overlap between the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).
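The threshold sweep behind a ROC curve can be sketched as follows. This is a minimal example under stated assumptions: the detector is assumed to output an anomaly score where higher means more attack-like, a record is flagged when its score is at or above the threshold, and the function name roc_points and the six scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return one (FPR, TPR) pair per decision threshold.

    scores: anomaly scores, higher means more attack-like (assumption).
    labels: ground truth, 1 = attack, 0 = normal.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one attack and one normal record")

    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical scores for three attacks and three normal records.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves along the curve from (0, 0) towards (1, 1): more attacks are caught, at the cost of more false alarms.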
Intrusion detection datasets
Evaluation datasets play a vital role in the validation of any IDS approach, because they allow us to assess how well a proposed method detects intrusive behavior. The datasets used for network packet analysis in commercial products are not easily available due to privacy issues. However, there are a few publicly available datasets, such as DARPA, KDD, NSL-KDD and ADFA-LD, which are widely used as benchmarks. This section discusses existing datasets that are used for building and comparing IDSs, along with their features and limitations.
DARPA / KDD Cup 99
The earliest effort to create an IDS dataset was made in 1998 by DARPA (Defense Advanced Research Projects Agency), which created the KDD98 (Knowledge Discovery and Data Mining) dataset. In 1998, DARPA introduced a programme at the MIT Lincoln Labs to provide a comprehensive and realistic IDS benchmarking environment. Although this dataset was an important contribution to IDS research, its accuracy and its ability to reflect real-life conditions have been widely criticized.
CAIDA
This dataset contains network traffic traces from Distributed Denial-of-Service (DDoS) attacks and was collected in 2007. This type of denial-of-service attack attempts to interrupt the normal traffic of a targeted computer or network by overwhelming the target with a flood of network packets, which prevents regular traffic from reaching its legitimate destination. One disadvantage of the CAIDA dataset is that it does not contain a diverse set of attacks. In addition, the gathered data does not contain features from the whole network, which makes it difficult to distinguish between abnormal and normal traffic flows.
NSL-KDD
NSL-KDD is a public dataset that was developed from the earlier KDD Cup 99 dataset. A statistical analysis of the KDD Cup 99 dataset raised important issues that heavily influence intrusion detection accuracy and result in a misleading evaluation of AIDS.
Thanks for Reading!!!!