Data Cleaning 101 in SQL — #3.1 A Discussion on the Nature of Outliers

Brian Tran
7 min read · Aug 19, 2023


I like weird people. The black sheep, the odd ducks, the rejects, the eccentric, the loners, the lost and forgotten. More often than not, these people have the most beautiful souls.

— Unknown

Photo by Sam Field on Unsplash

Outliers, abnormalities, anomalies, irregularities, and sundry other terms describe a subject you have undoubtedly encountered at least once in your lifetime. Unquestionably, identifying outliers is one of the most bothersome challenges any analyst must confront; in their most fanciful dreams, they would pray to the tooth fairy to relieve them of this despair, if only they had any baby teeth remaining. Were an analyst to chance upon Aladdin's magic lamp, he would use the second wish to make every single database system detect and remove anomalies automatically. Sadly, I do not possess any milk teeth and you cannot locate the magic lamp, so we must employ every brain cell to cope with data abnormality. Nevertheless, despite being a harrowing experience for data analysts, are irregularities truly the trouble?

The black sheep cause trouble, don’t they?

From the statistical viewpoint, abnormality poses a significant challenge in most cases, as the majority of statistical methods rely on calculating the mean and variance of the data. For instance, the result of a t-test would be flawed if a few observations deviated significantly from the mean of the experimental group, inflating that mean beyond what it should be. Consequently, the analyst might erroneously conclude that there is a difference between the means of the two groups when in reality there is none. While, for simplicity, I won't delve into the limitations of an experimentation platform that fails to produce two comparable samples, it is essential to employ proper outlier-removal techniques to eliminate unwanted records (or mitigate system errors). This ensures that unlikely data points are excluded, sparing the analyst incorrect conclusions.
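As a toy sketch with invented numbers (not from any real experiment), a single extreme value is enough to drag a group mean upward and suggest a difference that is not there:

```python
import statistics

# Two hypothetical groups with identical behaviour, except for one
# extreme observation in the experimental group.
control = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
experiment = [10.0, 9.9, 10.2, 9.8, 10.1, 10.0, 9.9, 45.0]  # 45.0 is the outlier

print(statistics.mean(control))          # 10.0
print(statistics.mean(experiment))       # ~14.4, pulled up by a single point
print(statistics.mean(experiment[:-1]))  # ~10.0 once the outlier is removed
```

Any test statistic built on these means, a t-test included, inherits the distortion.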

Illustration of the impact of outliers on regression problem, provided by the author

In other circumstances, traditional statistical techniques, like ordinary least squares or maximum likelihood estimation, are vulnerable to anomalies because they estimate a theoretical function that best fits the given dataset by minimizing or maximizing the sum of something (squared residuals, log-likelihood, etc.). As a result, any "noise" in the data can heavily impact the reliability of the estimation, since a few extreme values can significantly alter that "sum of something". Therefore, assuming outliers are correctly defined, the analyst would want to exclude all anomalies from the dataset to increase the reliability of the employed estimation method. In both cases, the use of outlier-removal techniques aligns with the purpose of the analyst.
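A minimal sketch of that sensitivity, with made-up points: the least-squares slope of y on x is cov(x, y) / var(x), and a single extreme residual drags it far from the true value of 2.

```python
# Least-squares slope of y on x: cov(x, y) / var(x).
def ols_slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

x = list(range(10))                # 0 .. 9
y = [2 * xi for xi in x]           # a perfect line with slope 2
print(ols_slope(x, y))             # 2.0

y_noisy = y[:]
y_noisy[9] = 100                   # one extreme value replaces 18
print(ols_slope(x, y_noisy))       # the slope jumps far above 2
```

The same mechanism affects any estimator built on a sum over all observations.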

What are outliers, to be exact?

Definition

According to Grubbs, an outlier can be defined as follows.

An outlying observation, or “outlier”, is one that appears to deviate markedly from other members of the sample in which it occurs.

“Deviate markedly from other members of the sample” can be interpreted to mean that outliers lie noticeably far from their neighbourhood under some form of measurement. However, the extent and the scale at which data should be considered anomalous remain unclear.

Detection Approaches

Typically, outlier detection uses a single measurement of interest to evaluate the data points in question, an approach known as univariate detection. Statistics describing central tendency and dispersion, such as the mean and standard deviation, are commonly used to determine whether a data point deviates significantly. Unfortunately, outliers themselves can distort the mean and the statistics derived from it, so that extreme values hide one another, a phenomenon known as the masking effect, which reduces the accuracy of these detection techniques.
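In the spirit of this SQL series, here is a sketch of univariate detection using Python's sqlite3 with a made-up orders table. SQLite has no STDDEV(), so the population variance is derived from averages; note the threshold is set at 2 sigma because the extreme row itself inflates sigma (the masking effect described above) enough that a 3-sigma cut would miss it in a sample this small:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [(v,) for v in [12, 14, 13, 15, 11, 14, 13, 250]])

# Derive the population variance from averages, then flag rows whose
# squared deviation from the mean exceeds (2 * sigma)^2.
rows = conn.execute("""
    WITH stats AS (
        SELECT AVG(amount) AS mu,
               AVG(amount * amount) - AVG(amount) * AVG(amount) AS var
        FROM orders
    )
    SELECT amount
    FROM orders, stats
    WHERE (amount - mu) * (amount - mu) > 4 * var
""").fetchall()
print(rows)  # the 250 row is flagged
```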

In the case of multiple measurements (multivariate detection), the relationships between variables are taken into account to identify outliers, and the distance calculations are amplified: the more variables considered, the larger the combined distance (refer to the formula below).

d(A, B) ≥ d’(A, B)

Where:

  • d(A, B) denotes the Euclidean distance in two-dimensional space.
  • d’(A, B) denotes the distance calculated in one-dimensional space.

Mathematically speaking, a distance computed in a higher-dimensional space always equals or exceeds the corresponding distance computed in any lower-dimensional subspace.
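A quick check of the inequality in code, with arbitrary coordinates: dropping a coordinate can only shrink (or preserve) a Euclidean distance, never grow it.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

A, B = (1.0, 5.0), (4.0, 1.0)

d  = euclidean(A, B)          # 2-D distance: sqrt(3**2 + 4**2) = 5.0
d1 = euclidean(A[:1], B[:1])  # 1-D distance on the first axis: 3.0

print(d, d1)  # 5.0 3.0
assert d >= d1
```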

It is noteworthy that the selection of variables for assessment — the context — is critical in this case, as different blends of variables can lead to varying and even contradictory conclusions. As referenced in the “Why do outliers exist?” section, unlikely observations are deemed anomalies only with respect to a particular mix of selected variables and might be completely normal in other subsets of variables. Consider the illustration provided below.

Classification of shop owners, provided by the author

As depicted in the graph, a solitary shop owner is classified as a “Controller” by the K-means algorithm. The classification considers both the number of timer bump services and the total orders purchased, on which this shop deviates substantially from its neighbours. Nonetheless, if we assess only the total number of orders purchased, at around 90 orders this individual’s behaviour is entirely ordinary compared to other observations. Conversely, another shop owner who purchases over 600 orders should be considered an outlier. Thus, context plays a vital role in multivariate outlier detection.

Types of Anomalies

There are three types of abnormality in data as follows.

  1. Point Anomalies.
  2. Contextual Anomalies.
  3. Collective Anomalies.

Point Outliers

Observations that exhibit anomalous behaviour in comparison to the remainder of the sample are classified as Point Outliers (a.k.a. Global Outliers). This is the most frequently observed and straightforward form of abnormality; therefore, this type of outlier can be detected by both univariate and multivariate detection methodologies.

Is it common for a student to buy a piano?

An excellent illustration of such behaviour is financial fraud, wherein the anomalies are discerned based on a solitary metric — the value of a transaction. If a student purchases a piano worth $20,000 while spending only about $1,500 per month, this purchase will be seen as unusual compared to the student’s regular spending habits. It should be noted that the Controller example mentioned earlier also qualifies as a Point Outlier.
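Scoring the hypothetical piano purchase against the student's spending history (all numbers invented) makes the point concrete: its z-score is enormous, so any reasonable univariate threshold flags it.

```python
import statistics

# Invented monthly spending history (USD) and one new transaction.
monthly_spend = [1500, 1430, 1580, 1490, 1520, 1460]
new_transaction = 20_000  # the piano

mu = statistics.mean(monthly_spend)
sigma = statistics.stdev(monthly_spend)
z = (new_transaction - mu) / sigma

print(z > 3)  # True: hundreds of standard deviations from typical spending
```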

Contextual Outliers

When a handful of individuals reveal irregular attributes within a given context, they may be considered contextual outliers (also known as conditional outliers). The distinguishing factor between this type of anomaly and point outliers is the consideration of context. When evaluating a dataset, it is critical to distinguish between two distinct aspects of the data at hand: contextual & behavioural characteristics.

  • Contextual features: The attributes that are utilized to form the background in which the assessed data points are interpreted. An example of a context feature is the location of a retail store, which includes the city, district, and street of the store.
  • Behavioural features: The characteristics describing non-contextual features of the evaluated dataset such as the measure of user behaviour, the classification of products, or the static statuses of retail channels.

In the past, I worked with FMCG data in which the revenue performance of different retail outlets was reported, and I recollect being asked to explain a notable surge in sales for a particular category of biscuits and pies. It turned out that the revenue spike (a behavioural attribute) was attributable to a single district where the Chinese community was celebrating its traditional festivities (contextual features: district & time). Thus, the revenue increase for the category during that month was an anomaly given the context of location & seasonality. In practice, conditional outliers have primarily been examined within the domains of time-series and spatial data.

Indeed, I must question the relevance of contextual features in the specific domain of interest on a case-by-case basis to determine whether contextual outlier detection techniques are suitable. Such techniques are only feasible when the background information is readily accessible and directly aids in defining “normal” scenarios.
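Under invented numbers, a sketch of that idea in SQL (via Python's sqlite3): a small district D1 normally sells around 100 while a larger district D2 sells around 500, so D1's festival month of 300 looks unremarkable against the global average yet stands out within its own district. Each row is therefore compared to the average of its context, here with an arbitrary 1.5x threshold:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (district TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("D1", "Jan", 100), ("D1", "Feb", 105), ("D1", "Mar", 300),   # festival month
     ("D2", "Jan", 500), ("D2", "Feb", 510), ("D2", "Mar", 498)],
)

# Judge each row against its own context (the district), not the global mean.
rows = conn.execute("""
    SELECT s.district, s.month, s.revenue
    FROM sales s
    JOIN (SELECT district, AVG(revenue) AS mu
          FROM sales GROUP BY district) g
      ON s.district = g.district
    WHERE s.revenue > 1.5 * g.mu
""").fetchall()
print(rows)  # only D1's March spike is flagged
```

A global threshold would have flagged nothing here, since 300 sits comfortably below D2's ordinary revenue.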

Collective Outliers

When a subset of related observations deviates markedly from the remaining population, it is classified as a Collective Outlier. An individual data point within the subset may not be deemed an outlier on its own; however, the collection as a whole exhibits behaviour distinct from the rest and is therefore considered a collective anomaly. Collective outliers are commonly tackled in graph data, sequence data, and spatial data.

Collective outliers are easy to picture in time-series data. For instance, if a cockroach were introduced during an ECG examination, the patient being measured would certainly be frightened. The ECG waveform would then display a prominent burst, rendering the data instances during that period collective outliers.

Illustration of intracoronary ECG
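A sliding-window sketch over a synthetic series (loosely mimicking the startled-patient burst; the numbers are invented): no single point lies beyond two standard deviations on its own, yet the windows covering the burst deviate collectively, since the mean of a window of w points has a standard error of roughly sigma / sqrt(w).

```python
import statistics

# Synthetic "ECG-like" series: a flat signal with a burst in the middle.
series = [0, 1, 0, -1, 0, 1, 0, 8, 9, 10, 9, 8, 0, 1, 0, -1, 0]

mu = statistics.mean(series)
sigma = statistics.pstdev(series)

# Individually, every point sits within 2 standard deviations:
print([x for x in series if abs(x - mu) > 2 * sigma])  # []

# Collectively, the burst windows deviate far more than chance allows:
w = 3
flagged = [i for i in range(len(series) - w + 1)
           if abs(statistics.mean(series[i:i + w]) - mu) > 2 * sigma / w ** 0.5]
print(flagged)  # the windows covering the burst
```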

A Quick Recap

In summary, the key characteristics of each type of outlier are as follows.

  • Point outliers: single observations that deviate from the entire sample; detectable by both univariate and multivariate methods (e.g., the $20,000 piano purchase).
  • Contextual outliers: observations that are anomalous only within a context defined by contextual features such as time or location (e.g., the festival-driven revenue spike in one district).
  • Collective outliers: subsets of related observations that jointly deviate from the rest even when each member looks normal (e.g., the ECG burst).

I trust that this article has provided a glimpse into the essence of outliers and has facilitated a comprehensive understanding of the characteristics of various types of outliers. I hope that it has also instilled a greater sense of courage in you to pursue a career in data.

References

  1. Debruyne, M., et al. (2018). “Outlyingness: Why do outliers lie out?” (link)
  2. Singh, K., & Upadhyaya, S. (2012). “Outlier Detection: Applications and Techniques.” (link)
  3. Grubbs, F. E. (1969). “Procedures for Detecting Outlying Observations in Samples.”
