Various studies and practitioners in Machine Learning and predictive modelling suggest that about two-thirds of the effort goes into the Data Understanding and Data Pre-processing stages. This blog covers two techniques used during those stages: Anomaly Detection and Outlier Detection.
Anomaly Detection is also a task in its own right. It finds extensive use in a wide variety of applications such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities.
Before we delve deeper into these techniques, let us briefly familiarize ourselves with the concepts of “Anomalies” and “Outliers”.
Anomaly and Anomaly Detection:
In Data Science, anomalies are data points (usually several points rather than a single one) that do not conform to the expected pattern of the other items in the data set. They can be thought of as a different distribution occurring within a distribution.
Technically, this happens when the observations come from a mixture of more than one distribution. The distribution of the anomalous items calls for more attention or investigation, as it usually signals a change in the underlying conditions.
Anomalies may arise in the data for a variety of reasons, such as malicious activity (e.g. credit card fraud, cyber-intrusion, terrorist activity) or the breakdown of a system. What all of these reasons have in common is that they are interesting to the analyst. This “interestingness”, or real-life relevance, is a key feature of anomaly detection.
Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behaviour. Its importance stems from the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains.
Outlier and Outlier Detection:
An Outlier is a rare, chance occurrence within a given data set. In Data Science, an Outlier is an observation that is distant from the other observations. An Outlier may be due to variability in the measurement, or it may indicate experimental error.
Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations.
Below is a simplistic representation of an Outlier:
While outliers are attributed to rare chance and may not be fully explainable, they can distort predictions and reduce accuracy if you do not detect and handle them.
The contentious decision to keep or discard an outlier needs to be taken at the time of building the model. Outliers can drastically bias the fit estimates and predictions. It is left to the best judgement of the analyst to decide whether treating outliers is necessary and how to go about it.
Treating or altering outlier or extreme values in genuine observations is not a standard operating procedure. If a data point (or points) is excluded from the data analysis, this should be clearly stated in any subsequent report.
Is Anomaly Detection “more than meets the eye”?
Anomalies are patterns of differing data within the given data, whereas outliers are merely extreme data points. If the data is not aggregated appropriately, anomalies may be dismissed as outliers.
Anomalies can often be explained by a few features (possibly new ones). Understanding the pattern of anomalies through Anomaly Detection may lead to new findings (a new, different model) or to new features that can be introduced into the existing model.
The outlier challenge is one of the earliest statistical interests, and since nearly all data sets contain outliers in varying proportions, it continues to be one of the most important. Sometimes outliers grossly distort the statistical analysis; at other times their influence is less noticeable. Statisticians have accordingly developed numerous algorithms for the detection and treatment of outliers, but most of these methods were developed for univariate data sets.
A univariate outlier is a data point with an extreme value on one variable. Popular univariate outlier detection techniques include the Box Plot Rule and Grubbs’ Test.
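As a rough illustration of the Box Plot Rule, the sketch below flags values that lie more than 1.5 times the interquartile range beyond the first or third quartile. The function name `box_plot_outliers`, the multiplier `k=1.5` (the conventional box plot whisker length) and the sample readings are assumptions made for this example, not something prescribed by this blog.

```python
from statistics import quantiles


def box_plot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the usual box plot
    convention, and the quartile method is one of several in common use.
    """
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up readings: one value sits far away from the rest.
readings = [10, 12, 11, 13, 12, 11, 14, 12, 13, 95]
print(box_plot_outliers(readings))  # [95]

# Here the sample maximum (14) is not unusually far from the others,
# so nothing is flagged.
print(box_plot_outliers([10, 11, 12, 13, 14]))  # []
```

The second call shows that the sample maximum and minimum are not automatically outliers: they are flagged only when they fall outside the fences.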
Declaring an observation an outlier on the basis of just one (possibly unimportant) feature could lead to unrealistic inferences. When you have to decide whether an individual entity (represented by a row or observation) is an extreme value or not, it is better to consider collectively the features (X’s) that matter.
A multivariate outlier is a combination of unusual scores on at least two variables. Popular multivariate outlier detection techniques include the Mahalanobis Distance and Cook’s Distance, the latter being a measure of how strongly a single observation influences a regression fit.
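To make the idea of a multivariate outlier concrete, here is a minimal sketch of the Mahalanobis Distance for two variables, written with the standard library only. The function name `mahalanobis_distances_2d` and the sample points are assumptions for illustration; a real analysis would normally use a numerical library and more than two features.

```python
import math


def mahalanobis_distances_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean.

    Hypothetical helper for illustration. It uses the sample covariance
    (n - 1 denominator) and the closed-form inverse of a 2x2 matrix.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    dx = [x - mean_x for x, _ in points]
    dy = [y - mean_y for _, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)

    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T
    return [
        math.sqrt((syy * a * a - 2 * sxy * a * b + sxx * b * b) / det)
        for a, b in zip(dx, dy)
    ]


# Made-up data: y roughly follows x, except for the last point.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
          (6, 5.8), (7, 7.2), (8, 7.9), (2, 7.5)]
distances = mahalanobis_distances_2d(points)
suspect = max(range(len(points)), key=distances.__getitem__)
print(suspect, points[suspect])  # 8 (2, 7.5)
```

The point (2, 7.5) is not extreme on either variable taken alone: x = 2 and y = 7.5 both lie inside the observed ranges. It is the combination that is unusual, which is why a univariate rule would miss it while the Mahalanobis Distance ranks it first.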
Outlier Detection, on the other hand, can improve model accuracy through the treatment of outliers.