Data Cleaning 101 in SQL — #2 Dealing with Missing Data

Brian Tran
9 min read · Jul 16, 2023


When you are missing someone, time seems to move slower, and when I’m falling in love with someone, time seems to be moving faster.
— Taylor Swift

Once you realize that your dataset has missing values, you will understand just how much time you may spend dealing with them, and how slowly that time seems to pass.

Photo by Sigmund on Unsplash

In the previous tutorial, I introduced what a messy dataset looks like, the consequences of dirty data, how to explore & get to know a dataset, the concept of tidy data, and some fundamental data processing techniques.

In this article, I will walk you through one of the two most common and troublesome data problems: Missing Data.

3. Missing Data Handling

a. Types of Missing Data

You have doubtlessly heard the term “missing value” and have definitely confronted datasets where a few or even thousands of records are null, regardless of whether you are a novice or an experienced analyst. As a wise man once said, “… in this world nothing can be said to be certain, except death and taxes”. I would like to interject and add “and missing data, as well as outliers”. Look at the following examples, which will hopefully bring back some painful memories ( ´ ꒳ ` ).

Public Github Dataset (here)
Public Firebase Tracking Dataset (here), though these may not necessarily be actual missing values
Public Google Analytics Tracking Dataset (here)

According to Tony Fischetti, there are three types of missing data that you may encounter in your work.

  1. Missing Completely At Random (MCAR):
    Missing Completely At Random, MCAR, occurs when data’s missingness is unrelated to the data. This would occur, for example, if rows were deleted from a database at random, or if a gust of wind took a random sample of a surveyor’s survey forms off into the horizon.
  2. Missing At Random (MAR):
    The name Missing At Random is, unfortunately, somewhat misleading. In spite of what it suggests, it means that there is a systematic relationship between the missingness of a variable and other observed variables, but not the missing values themselves. Take climate data as an example, in which we collect humidity, temperature, and CO2. MAR occurs when the humidity and CO2 readings are absent below -10 degrees Celsius (as recorded in the temperature variable) because those sensors freeze and stop functioning properly, while the temperature sensor keeps measuring accurately even at sub-zero temperatures down to -100 degrees Celsius.
  3. Missing Not At Random (MNAR):
    MNAR, conversely, occurs when the absence of a variable is related to the value of the variable itself. For instance, consider a scenario where a weighing scale can only measure up to 3,700 pounds and, as a result, the eight cars that exceeded this threshold were recorded as NULLs. A minimal SQL sketch of all three mechanisms follows this list.
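
To make these three mechanisms concrete, below is a minimal BigQuery sketch of my own that simulates each of them on a tiny, hypothetical climate table; the table, column names, and thresholds are illustrative assumptions, not part of any dataset above.

-- A minimal sketch simulating the three missingness mechanisms on a
-- hypothetical climate table; all names and thresholds are illustrative
WITH climate AS (
  SELECT -12.5 AS temperature, 0.81 AS humidity, 410 AS co2
  UNION ALL
  SELECT 4.0, 0.55, 395
)
SELECT
  temperature,
  -- MCAR: humidity goes missing with a fixed probability, unrelated to any value
  IF(RAND() < 0.1, NULL, humidity) AS humidity_mcar,
  -- MAR: humidity goes missing when ANOTHER observed variable (temperature) is extreme
  IF(temperature < -10, NULL, humidity) AS humidity_mar,
  -- MNAR: co2 goes missing because of its OWN value (the sensor saturates above 400)
  IF(co2 > 400, NULL, co2) AS co2_mnar
FROM climate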

b. Dealing with Missing Data

Analysis of missing values

First of all, you need to summarize how many NULL values there are in your dataset. This can be easily achieved in common programming languages through well-developed methods like .describe() or summary(). In a SQL dialect such as BigQuery, you have to work around it a bit.

WITH tbl AS (
  SELECT
    -- your columns here
    repository.created_at,
    repository.name,
    repository.description,
    repository.url,
    repository.master_branch,
    repository.integrate_branch,
    repository.watchers
  FROM `bigquery-public-data.samples.github_nested` -- replace with your table
)

SELECT
  null_column,
  COUNT(1) AS null_count,
  ROUND((COUNT(1) / (SELECT COUNT(1) FROM tbl)) * 100, 2) AS percent_null
FROM tbl,
  -- serialize each row to JSON, then extract every key whose value is null
  UNNEST(REGEXP_EXTRACT_ALL(
    TO_JSON_STRING(tbl),
    r'\"([a-zA-Z0-9\_]+)\":null'
  )) AS null_column
GROUP BY 1
Analysis of missing values in the given dataset

Upon initial inspection, it appears that you cannot incorporate master_branch and integrate_branch into your work due to their significant proportion of missing data. Once you acquire a general understanding of the missingness in your dataset, you can employ one of the following techniques to address it.

Method 1: Completely Ignore Nulls

Frankly speaking, you just remove all records containing missing values in any of the extracted variables or columns. This method is generally known as complete-case analysis, a.k.a. list-wise deletion, since you only keep fully non-null records in your analysis. Well, sometimes less is more, I guess.

WITH tbl AS (
  SELECT
    repository.created_at,
    repository.name,
    repository.description,
    repository.url,
    repository.master_branch,
    repository.integrate_branch,
    repository.watchers
  FROM `bigquery-public-data.samples.github_nested`
)

SELECT
  created_at,
  name,
  description,
  url,
  watchers
FROM tbl
-- filter out null values in all columns:
-- CONCAT returns NULL if any of its arguments is NULL
WHERE CONCAT(created_at, name, description, url, CAST(watchers AS STRING)) IS NOT NULL
-- or you can filter each column explicitly:
-- WHERE created_at IS NOT NULL
--   AND name IS NOT NULL
--   AND description IS NOT NULL
--   AND url IS NOT NULL
--   AND watchers IS NOT NULL
Returned dataset after list-wise deletion

Upon implementing this procedure, you are essentially creating a subset of the provided dataset. In other words, you are drawing a sample from the (possible) population of the observations of interest. Undoubtedly, this comes at a cost, which varies depending on the type of missingness you are tackling, as summarized below.

Summary of list-wise deletion method’s statistical consequences

Method 2: Partially Ignore Nulls

This methodology is frequently employed when estimating covariance or correlation matrices, which only require pairs of numerical variables to be non-null. That is, you only need to retain records in which at least one pair of variables is non-null, rather than all variables simultaneously. This technique is commonly referred to as available-case analysis or pairwise deletion. As far as I know, a wide range of analysis of variance (ANOVA) techniques likewise require only the variables concerned to be non-null.

It is noteworthy that pairwise deletion may produce nonsensical correlation coefficients (above 1 in absolute value), because the number of inputs used for the calculation varies across the cells of the covariance/correlation matrix.

-- for the purpose of illustration, I will null out some values in forks,
-- size and open_issues
WITH tbl AS (
  SELECT
    -- repository.created_at,
    repository.name,
    repository.description,
    IF(repository.forks < 50, NULL, repository.forks) AS forks,
    repository.watchers,
    IF(repository.size < 200, NULL, repository.size) AS size,
    IF(repository.open_issues = 0, NULL, repository.open_issues) AS open_issues
  FROM `bigquery-public-data.samples.github_nested`
)

SELECT
  *
FROM tbl
-- keep records where there is at least one non-null pair:
-- CONCAT is NULL when either argument is NULL, and COALESCE
-- is NULL only when every pair is incomplete
WHERE COALESCE(
  CONCAT(forks, watchers),
  CONCAT(watchers, size),
  CONCAT(size, open_issues),
  CONCAT(open_issues, forks)) IS NOT NULL
Returned dataset after pairwise deletion
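
To see the available-case idea in action, here is a minimal sketch of my own computing pairwise correlations on the same simulated-null columns. BigQuery's CORR() already skips rows where either input is NULL, so each coefficient is estimated on a different subset of rows, which is exactly the property that can distort the matrix.

-- Minimal sketch: pairwise (available-case) correlations.
-- CORR() ignores rows where either argument is NULL, so every coefficient
-- below is computed on a different subset of rows
WITH tbl AS (
  SELECT
    IF(repository.forks < 50, NULL, repository.forks) AS forks,
    repository.watchers,
    IF(repository.size < 200, NULL, repository.size) AS size,
    IF(repository.open_issues = 0, NULL, repository.open_issues) AS open_issues
  FROM `bigquery-public-data.samples.github_nested`
)
SELECT
  CORR(forks, watchers) AS corr_forks_watchers,
  CORR(watchers, size) AS corr_watchers_size,
  CORR(size, open_issues) AS corr_size_open_issues
FROM tbl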

Method 3: Mean Substitution

This approach is relatively straightforward: all NULL values are substituted with the mean of the variable of interest. In certain instances, the median or mode can be used instead.

WITH tbl AS (
  SELECT
    *,
    -- calculate means over the whole table (AVG ignores NULLs)
    AVG(forks) OVER() AS mean_forks,
    AVG(size) OVER() AS mean_size,
    AVG(open_issues) OVER() AS mean_open_issues
  FROM (
    SELECT
      -- repository.created_at,
      repository.name,
      repository.description,
      IF(repository.forks < 50, NULL, repository.forks) AS forks,
      -- repository.watchers,
      IF(repository.size < 200, NULL, repository.size) AS size,
      IF(repository.open_issues = 0, NULL, repository.open_issues) AS open_issues
    FROM `bigquery-public-data.samples.github_nested`
  )
)

SELECT
  name,
  description,
  COALESCE(forks, mean_forks) AS forks,
  COALESCE(size, mean_size) AS size,
  COALESCE(open_issues, mean_open_issues) AS open_issues
FROM tbl
Returned dataset after substitution

As you may see, this method appears to address the sample size reduction of the previously mentioned techniques, thereby preserving statistical power. Nevertheless, mean substitution yields downward-biased estimates of the variance even in the MCAR case, because it disregards the dispersion the data would have shown had they not been missing. Consequently, despite introducing it here, I would not recommend this technique under any circumstances. The statistical implications are summarized below.

Changes in statistics after implementation of mean substitution
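
If you want to see this variance shrinkage for yourself, here is a minimal sketch of my own, reusing the same simulated-null trick as above, that compares the dispersion of forks before and after mean substitution:

-- Minimal sketch: STDDEV() ignores NULLs, so stddev_before reflects the
-- observed values only; after mean substitution every filled-in value sits
-- exactly at the mean, which shrinks the dispersion
WITH tbl AS (
  SELECT
    IF(repository.forks < 50, NULL, repository.forks) AS forks,
    AVG(IF(repository.forks < 50, NULL, repository.forks)) OVER() AS mean_forks
  FROM `bigquery-public-data.samples.github_nested`
)
SELECT
  STDDEV(forks) AS stddev_before,
  STDDEV(COALESCE(forks, mean_forks)) AS stddev_after
FROM tbl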

Method 4: Hot Deck Imputation

Briefly speaking, this procedure replaces missing values with existing values taken from other records in the dataset. The basic approach uses a random, of course non-null, value from the same column to replace a missing value. In more sophisticated approaches, the replacement value is sourced from a row that is similar to the row with the missing data.

Indeed, I am introducing this method to familiarize you with the subsequent method — regression imputation.

WITH tbl AS (
  SELECT
    *,
    ROW_NUMBER() OVER() AS rn,
    ROW_NUMBER() OVER(ORDER BY RAND()) AS rn_shfl,
  FROM (
    SELECT
      -- repository.created_at,
      repository.name,
      repository.description,
      IF(repository.forks < 50, NULL, repository.forks) AS forks,
      -- repository.watchers,
      IF(repository.size < 200, NULL, repository.size) AS size,
      IF(repository.open_issues = 0, NULL, repository.open_issues) AS open_issues
    FROM `bigquery-public-data.samples.github_nested`
  )
)

SELECT
  name,
  description,
  COALESCE(forks, imp_forks) AS forks,
  forks AS org_forks, -- kept only for comparison
  COALESCE(size, imp_size) AS size,
  size AS org_size, -- kept only for comparison
  COALESCE(open_issues, imp_open_issues) AS open_issues,
  open_issues AS org_open_issues, -- kept only for comparison
FROM (
  SELECT
    t2.*,
    -- replace the columns you need to impute here: take the most recent
    -- non-null value in the shuffled order, i.e. a random non-null donor
    LAST_VALUE(t2.forks IGNORE NULLS) OVER (
      ORDER BY t2.rn_shfl RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS imp_forks,
    LAST_VALUE(t2.size IGNORE NULLS) OVER (
      ORDER BY t2.rn_shfl RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS imp_size,
    LAST_VALUE(t2.open_issues IGNORE NULLS) OVER (
      ORDER BY t2.rn_shfl RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS imp_open_issues,
  FROM tbl t1
  -- self-join on the shuffled row number to shuffle the dataset randomly
  JOIN tbl t2 ON t1.rn = t2.rn_shfl
)
ORDER BY rn
Returned dataset after imputation

Note: please be aware that some rows may remain null, since we use the LAST_VALUE() function with RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which returns NULL for the leading rows of the shuffled order until a non-null value has appeared.
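
One possible patch, my own addition rather than part of the original query: also look forward with FIRST_VALUE(... IGNORE NULLS) and take whichever direction finds a donor first. A minimal sketch on a hypothetical toy column v:

-- Minimal sketch on a toy column; v and rn are hypothetical names.
-- COALESCE falls back to a forward-looking FIRST_VALUE() when the
-- backward-looking LAST_VALUE() has not seen a non-null value yet
WITH t AS (
  SELECT * FROM UNNEST([NULL, 3, NULL, 7, NULL]) AS v WITH OFFSET AS rn
)
SELECT
  rn,
  v,
  COALESCE(
    LAST_VALUE(v IGNORE NULLS) OVER (
      ORDER BY rn ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    FIRST_VALUE(v IGNORE NULLS) OVER (
      ORDER BY rn ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
  ) AS v_imputed
FROM t
ORDER BY rn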

Method 5: Regression Imputation

Although it may be challenging to apply this technique in SQL-like databases, I would like to introduce the general concept here nonetheless, as it may prove useful in circumstances where a programming language like Python can be utilized. As its name suggests, this method employs a regression model to predict replacement values for missing items, using other existing columns as independent variables. Consider our beloved Github Dataset (available here) as an example: the missing values of the forks column can be estimated from the size and watchers columns. Sounds promising, doesn't it?
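
That said, if you are on BigQuery specifically, BigQuery ML lets you approximate the idea without leaving SQL. Below is a hedged sketch of my own, assuming BigQuery ML is enabled in your project; your_dataset.forks_imputer is a placeholder name, and the forks < 50 condition mirrors the simulated NULLs from the earlier examples.

-- Hedged sketch, assuming BigQuery ML is available; dataset and model names
-- are placeholders. Train a linear regression on the rows we treat as observed
CREATE OR REPLACE MODEL `your_dataset.forks_imputer`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['forks']) AS
SELECT
  repository.size,
  repository.watchers,
  repository.forks
FROM `bigquery-public-data.samples.github_nested`
WHERE repository.forks >= 50;

-- Predict replacement values for the rows we treated as missing
SELECT
  size,
  watchers,
  predicted_forks AS imputed_forks
FROM ML.PREDICT(
  MODEL `your_dataset.forks_imputer`,
  (
    SELECT repository.size, repository.watchers
    FROM `bigquery-public-data.samples.github_nested`
    WHERE repository.forks < 50
  )
)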

By utilizing information from other columns and patterns in the dataset, this technique generates more informed replacement values. As long as the relevant predictors are included, the method produces unbiased estimates of the mean and the regression coefficients in the MCAR and MAR cases. However, the predicted values lack the natural residuals around the regression line, which leads to underestimating the variability of the missing values.

Summary of statistics after applying regression imputation

The following chart illustrates the behavior of the different imputation procedures.

Demonstration of different imputation techniques (ref)

As implied by the above diagram, a more effective approach to regression imputation involves incorporating conditional noise into the prediction of the missing values, known as Stochastic Regression Imputation. Additionally, there is an even more appealing and sophisticated approach called Multiple Imputation, whose results are outstanding. However, I cannot fully cover these two methods here due to the limitations of SQL itself.
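
For the curious, the noise half of stochastic regression imputation is at least expressible in SQL. A rough sketch of my own, assuming you already have a prediction and a residual standard deviation from the regression step (both columns below are hypothetical); the Gaussian noise comes from a Box-Muller transform of RAND():

-- Rough sketch: add conditional Gaussian noise to regression predictions.
-- predicted and residual_sd are hypothetical columns from your regression step
WITH preds AS (
  SELECT 120.0 AS predicted, 15.0 AS residual_sd
  UNION ALL
  SELECT 80.0, 15.0
)
SELECT
  predicted,
  -- Box-Muller transform: two uniforms in, one standard normal out;
  -- 1 - RAND() keeps the logarithm away from LN(0)
  predicted + residual_sd * SQRT(-2 * LN(1 - RAND())) * COS(2 * ACOS(-1) * RAND())
    AS stochastic_imputation
FROM preds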

Well, there is one last important section, regarding Outlier Removal, which will be covered in the next part of this series. I hope this article gives you an overview of the missing data problem and some simple handling methods. Cheers! 🎉🎉🎉

References

  1. Missing Data: Two Big Problems with Mean Imputation: https://www.theanalysisfactor.com/mean-imputation/
  2. Mean substitution — skewness and kurtosis: https://stats.stackexchange.com/questions/364711/why-and-how-does-adding-an-interaction-term-affects-the-confidence-interval-of-a
  3. Dealing with missing data in ANOVA models: https://www.r-bloggers.com/2018/06/dealing-with-missing-data-in-anova-models/
  4. Standard error: https://en.wikipedia.org/wiki/Standard_error
  5. Hot deck imputation preserve the distribution of the item values: https://stats.stackexchange.com/questions/307339/hot-deck-imputation-it-preserves-the-distribution-of-the-item-values-how-c

