Data Cleaning 101 in SQL — # 3.2 The Origin of Outliers & Detection Techniques
Everything that exists has an explanation for its existence
The larger the volume of the dataset we are handling, the higher the probability of encountering flawed values arising from measurement inaccuracies, data input errors, or similar sources.
Why do outliers exist?
Mother Nature is bountiful to all species; genetic variation exists, and so do outliers in real data. As the sample size or the number of variables grows, the probability increases that some observations will behave quite differently from the rest of the sample on certain variables. In other words, a subject may be outlying on some specific measurements yet entirely normal on others. Even though we should acknowledge that irregularity exists in data, we must closely inspect whether such points are natural extremes or products of some kind of error, since the follow-up courses of action differ for each case.
The odds of nature
Let’s consider the disease diagnosis problem as an example. A patient can be accurately diagnosed with breast cancer if a particular set of diagnostic measurements behaves differently from those of regular patients. Indeed, human beings may encounter diseases and illnesses that naturally result in irregular measures of health, such as high blood pressure, abnormal heartbeats, high cholesterol, and the like. Therefore, the existence of odd individuals is completely natural, and identifying irregular observations, which are now influential data points, through appropriate statistics is of the utmost importance in this case.
Likewise, there exist various real-life issues like financial fraud detection, cheating detection, insider trading detection, disease prediction, defect product identification, and so forth that are essentially outlier detection problems. However, in this article, I will only discuss simple outlier detection procedures and leave advanced topics for later writing.
The unwanted product
On the contrary, abnormality resulting from human error or measurement error is the unwanted case. A wide range of errors can produce invalid anomalies, including:
- Data entry errors: The data inputted by humans is misrecorded, such as a misplaced decimal point or an extra zero at the end.
- Defective measuring equipment: Extreme records erroneously produced by faulty tracking devices, sensors, or scales. For example, an unclean light sensor may severely underestimate the level of illuminance.
- Improper measurement procedure: This type of error occurs when the subjects of interest are measured under inappropriate conditions. Say you have 10 patients waiting to have their blood pressure taken, and one of them has just run a kilometre or two; that patient’s reading will be higher than the rest.
- Erroneous sampling process: Inferential statistics employ samples to make inferences about a particular population, and it is possible to inadvertently include objects or entities that do not belong to the intended population. For instance, the researcher may measure an item or subject under abnormal circumstances that pull the sample away from the specified population, or may unintentionally include items that lie beyond the scope of the target population and therefore possess peculiar qualities.
- Novel patterns: Sometimes, emergent behaviours that have not previously been observed come to the fore as time progresses. As a result, these newly introduced data points are deemed outliers to the existing dataset. The change in the stock price of a newly established company is a good example.
- Unwanted events: In certain circumstances, an unpredictable event that occurs while executing the measurement will lead to outlying records. For instance, if a patient is startled by the presence of a cockroach, the resulting ECG record will be an outlier.
This should prove adequate as a preliminary exposition of outlier definitions, categories, consequences, and the considerations involved in carrying out outlier removal. The subsequent section demonstrates a range of commonly utilized outlier detection techniques.
Basic Methods for Identifying Outliers
1. Z-score Method
Advice: This methodology is suitable for a dataset that conforms to a normal distribution.
The Z-score method assumes the data follows the Gaussian (normal) distribution. If a dataset falls into this type of distribution, we can calculate various statistics describing the dispersion of the data (or the distance we keep mentioning in this article), such as the standard deviation, variance, T-score, or Z-score. As the name suggests, we employ the Z-score, defined for each value x as (x - mean) / standard deviation, to identify unlikely observations in the dataset at hand. This approach is based on the 68–95–99.7 rule (the so-called three-sigma rule), which states that only about 0.3% of the data falls outside the range of +/- 3 sigma from the mean in a normally distributed sample. In other words, only around 0.3% of the data deviates noticeably from the remainder, and those records can be considered outliers.
Pros
- Given a Gaussian distribution, this is a remarkably efficient method thanks to its simplicity.
- It is exceedingly effortless to execute this approach on either an SQL-like database or a local machine using any programming language.
Cons
- The presence of anomalies inflates the mean and standard deviation, producing less pronounced Z-scores (the masking effect), and therefore the outcome becomes less reliable.
- This technique is exclusively suitable for application in a feature space of low dimensions and the context of a dataset of small to medium size.
- It is not advisable to employ this method when the data cannot be assumed to follow a parametric (normal) distribution.
The following code block may assist you in this regard.
SELECT
  repository.created_at,
  repository.name,
  repository.description,
  repository.forks AS forks,
  -- z-score = (value - mean) / standard deviation, computed over the whole table with analytic functions
  (repository.forks - AVG(repository.forks) OVER())/(STDDEV(repository.forks) OVER()) AS fork_zscore,
  (repository.watchers - AVG(repository.watchers) OVER())/(STDDEV(repository.watchers) OVER()) AS watcher_zscore
FROM `bigquery-public-data.samples.github_nested`
Or a more dynamic approach:
WITH stats AS (
  SELECT * FROM (
    SELECT
      col,
      AVG(value) AS mean,
      STDDEV(value) AS std
    FROM (
      SELECT
        repository.forks, repository.watchers, repository.size
      FROM `bigquery-public-data.samples.github_nested` ) -- repository is a struct-type column, so selecting its fields here yields plain columns named forks, watchers and size
    UNPIVOT(value FOR col IN (
      forks, watchers, size -- input the list of columns here
    ))
    GROUP BY 1)
  PIVOT (
    AVG(mean) AS mean,
    AVG(std) AS std
    FOR col IN ('forks', 'watchers', 'size') -- input the list of columns, quoted, here
  ))
SELECT
repository.created_at,
repository.name,
repository.description,
(repository.forks - mean_forks)/std_forks AS fork_zscore,
(repository.watchers - mean_watchers)/std_watchers AS watcher_zscore,
(repository.size - mean_size)/std_size AS size_zscore
FROM `bigquery-public-data.samples.github_nested`
, stats
Once you have computed the Z-score for each metric, you may opt to eliminate any anomalies whose Z-scores are either greater than or equal to 3, or less than or equal to -3. It is noteworthy that this approach may be misleading when applied to a noticeably small dataset since the maximum Z-score is confined to (n-1)/sqrt(n).
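For example, here is a minimal sketch of that filtering step, wrapping a simplified version of the query above in a CTE (the alias scored is hypothetical):
WITH scored AS (
  SELECT
    repository.name,
    (repository.forks - AVG(repository.forks) OVER())/(STDDEV(repository.forks) OVER()) AS fork_zscore,
    (repository.watchers - AVG(repository.watchers) OVER())/(STDDEV(repository.watchers) OVER()) AS watcher_zscore
  FROM `bigquery-public-data.samples.github_nested`
)
SELECT *
FROM scored
WHERE ABS(fork_zscore) < 3 -- keep only rows within +/- 3 standard deviations; flip the comparison to list the outliers instead
  AND ABS(watcher_zscore) < 3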
In reality, I seldom come across normal distributions in my working practices since most quantifiable performance metrics do not encompass negative values and exhibit highly right-skewed distributions. Consequently, I have limited opportunities to employ the Z-score technique for detecting anomalies despite its simplicity.
2. Interquartile Range Rule
Advice: This method is not applicable to certain distributions, such as Bernoulli (binary) distributions.
The underlying concept of this technique bears a resemblance to the Z-score approach. The Interquartile Range Rule (also known as the IQR Rule) states that any data instance lying more than 1.5 times the IQR below the first quartile (Q1) or above the third quartile (Q3) is deemed an outlier, where the IQR = Q3 - Q1 covers the middle 50% of the dataset. For example, if Q1 = 10 and Q3 = 30, then IQR = 20 and the inner fences sit at 10 - 1.5*20 = -20 and 30 + 1.5*20 = 60. The 1.5 multiplier defines the inner fences and is the one predominantly used; nevertheless, you can construct the outer fences by multiplying the IQR by 3 to flag only the most extreme anomalies.
Pros
- This technique is advantageous since it employs percentiles, which are not contingent on a particular distribution.
- Percentiles are comparatively resilient to the existence of outliers in contrast to other quantitative approaches.
- Furthermore, this approach is uncomplicated and straightforward to implement, necessitating only fundamental arithmetic computations.
Cons
- The choice between inner or outer fences is contingent on the domain knowledge and expertise of the analyst.
- This method does not furnish a graphical illustration of the data distribution or of the first and third quartiles that underpin it.
- This technique might be less effective at detecting unlikely data points in heavily skewed distributions or in data containing truly extreme values.
Even though this method avoids the masking effect (as discussed in the “What are Outliers, to be exact?” section) by utilizing Q1 and Q3, which are not significantly influenced by outliers, you should note that this technique may be inappropriate for certain distributions. For instance, binary data containing only 0 and 1 may result in flawed detection because Q1 and Q3 can now be 0 and 1, as the tiny sketch below illustrates. Thus, we may end up removing no values at all, or eliminating a substantial proportion of the data.
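To see this concretely, here is a made-up Bernoulli sample (the inline data below is purely hypothetical):
-- on 0/1 data, Q1 = 0 and Q3 = 1, so the inner fences are 0 - 1.5*1 = -1.5 and 1 + 1.5*1 = 2.5,
-- and no record can ever be flagged as an outlier
WITH binary_data AS (
  SELECT value FROM UNNEST([0, 0, 0, 1, 1, 0, 1, 0, 1, 1]) AS value
)
SELECT
  APPROX_QUANTILES(value, 100)[OFFSET(25)] AS q1,
  APPROX_QUANTILES(value, 100)[OFFSET(75)] AS q3
FROM binary_data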
The following code may assist you in filtering out the anomalies; it keeps only the rows that fall inside the inner fences.
WITH stats AS (
SELECT
col,
APPROX_QUANTILES(value, 100)[OFFSET(25)] AS q1,
APPROX_QUANTILES(value, 100)[OFFSET(75)] AS q3
FROM (
SELECT
repository.forks, repository.watchers, repository.size
FROM `bigquery-public-data.samples.github_nested` )
UNPIVOT(value FOR col IN (
forks, watchers, size
))
GROUP BY 1
)
, iqr_cal AS (
SELECT * FROM (
SELECT
col,
q1 - 1.5*(q3-q1) AS lower_bound,
q3 + 1.5*(q3-q1) AS upper_bound
FROM stats
)
PIVOT (
AVG(lower_bound) AS lower_bound,
AVG(upper_bound) AS upper_bound
FOR col IN('forks', 'watchers', 'size')
))
SELECT
repository.created_at,
repository.name,
repository.description,
repository.forks,
repository.watchers,
repository.size
FROM `bigquery-public-data.samples.github_nested`
, iqr_cal
WHERE TRUE
-- keep only the rows that fall inside the inner fences; use NOT BETWEEN to list the outliers instead
AND (repository.forks BETWEEN lower_bound_forks AND upper_bound_forks)
-- AND (repository.watchers BETWEEN lower_bound_watchers AND upper_bound_watchers)
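If you would rather list the suspicious rows than keep the clean ones, you could invert the condition; here is a sketch reusing the stats and iqr_cal CTEs defined above:
SELECT
  repository.name,
  repository.forks,
  repository.watchers
FROM `bigquery-public-data.samples.github_nested`
  , iqr_cal
-- a row is suspicious if any metric falls outside its inner fences
WHERE (repository.forks NOT BETWEEN lower_bound_forks AND upper_bound_forks)
  OR (repository.watchers NOT BETWEEN lower_bound_watchers AND upper_bound_watchers)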
3. Visualization Method
Advice: This technique requires a skilled analyst to perform effectively. It is important to have a deep understanding of the data and the ability to identify patterns and insights.
In this technique, we utilize a histogram or box plot to identify a relatively suspicious threshold for the variable of concern. Subsequently, you can use this particular value to filter out any outliers. To visualize the data in question in the form of a histogram, kindly refer to the tutorial that I have previously written here.
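If your BI tool cannot draw a histogram for you, one simple option is to pre-bucket the values in SQL and plot the counts. A minimal sketch, assuming an arbitrary bucket width of 100 forks:
SELECT
  CAST(FLOOR(repository.forks / 100) * 100 AS INT64) AS fork_bucket, -- lower edge of each 100-fork bin
  COUNT(*) AS repo_count
FROM `bigquery-public-data.samples.github_nested`
GROUP BY fork_bucket
ORDER BY fork_bucket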
Honestly, I prefer using the graphical approach for outlier detection. It is simple to implement, and I can easily explain to stakeholders what the data distribution looks like and how likely the suspected outliers are.
Pros
- Histograms provide a graphical illustration of the distribution of data, supporting the analyst to sense the mode, the magnitude of dispersion, and potential outliers.
- Histograms help compare multiple datasets side-by-side (e.g. the Ad Effectiveness graph in the Z-score section) and identify differences in variability, symmetry, and skewness.
- Various BI tools can generate histograms easily.
Cons
- Statistical charts like histograms or box plots may be difficult to interpret for non-technical stakeholders (e.g. business units, operations teams, product teams) who are unfamiliar with statistics.
- This approach does not confirm exact values as thresholds for identifying unlikely observations but rather denotes their approximate location in the given dataset.
- This method may fail to detect outliers in certain circumstances, especially when the sample size is relatively small or when the data is heavily skewed.
Kindly bear in mind that not all outliers are insignificant, and exercise caution when deciding to eliminate any questionable data instances.
Well, this article concludes the series. I trust that it has given you a comprehensive understanding of outliers and how to manage them fittingly. Cheers! 🎉🎉🎉