Data Cleaning 101 in SQL — #1 Tidying Messy Data
Data cleaning has always been a nightmare that every analyst has to walk through (like through fire and flames). It is a hard pill to swallow, but we cannot simply skip this step: avoiding data cleaning only brings more misery to the analyst later on.
This article aims to help you understand the concept of a tidy dataset and provides a structured guideline for data cleaning for beginners.
1. The shape of the data
On a scale of one to ten, how would you rate your understanding of a “messy” dataset? Based on what criteria would you assess that a dataset is not “good” enough? And does untidy data really matter? If you have ever asked yourself those questions, congratulations, you have taken the very first step toward understanding the importance of data cleaning. Let’s talk about why messy data matters.
a. Examples of messy data
A good doctor must learn how to diagnose the symptoms of their patients; likewise, a good analyst should know what a messy dataset looks like. Let’s take a look at the following snapshots.
If you are familiar with Microsoft Excel, you may notice some of the little green triangles on the top left of the cells shown in the sample data of sales contact. Those triangles indicate that the cells’ “numbers” are stored as text rather than in number format. And that is the reason why those “numbers” are aligned to the left while some other numbers, saved as actual numbers, are aligned to the right.
Another problem that you might see in the Firebase sample data is missing values marked as null. And if you have an eye for errors, you will notice that the `event.int_value` column looks quite suspicious because its values are distributed strangely, from negative millions to a few thousand, or 0/1 in some rows. There is technically nothing wrong with this column; however, it is still considered “dirty” data, and I will explain why later.
In the last example, the “��” characters - the replacement characters used to replace unknown, unrecognized, or unrepresentable characters - in the first line may catch your eye.
The examples above are business as usual for data analysts, and they are only the simplest cases of messy data.
b. The consequences of dirty data
SAS has identified 7 key steps in the business analytics flow, including business problem and data source identification, data cleansing, analysis and insight extraction, and interpretation/deployment of insights. Within that process, data scientists report that a substantial 60% of their time is devoted to cleaning and organizing the available data (source). It is worth noting that dealing with dirty data is an inescapable real-world issue that must be addressed on a daily basis.
Utilizing disorganized data inevitably results in irreversible consequences:
- Misleading business insights may be derived from improperly collected and unrefined data, resulting in biased or erroneous conclusions. For instance, you may infer that the majority of male users prefer animated content to static content, based on a dataset that contains 50% null values in the `sex` column. However, this conclusion may be fundamentally flawed if the missing 50% were male users who express a stronger preference for static content.
- Flawed business decisions certainly ensue when they are based on misleading insights. Let’s delve deeper into the previous example: if the management team decides to increase the frequency cap for animated banners, a loss of conversions from the male user group is foreseeable.
- The lack of productivity is yet another issue with unrefined raw data. The greater the level of untidiness, the more time an analyst must allocate to data cleansing. A noteworthy example is the proportion of missing values in a dataset: one with only 10% missing values demands significantly less time for imputation than one with 60% null data.
- Ultimately, dirty data leads to dissatisfaction from business & management teams because of prolonged serving time and erroneous insights.
c. Raw Data Exploration
Diving into and exploring the raw data is necessary for an analyst to comprehend it and detect any anomalies present. Allow me to introduce a few simple steps that can aid in gaining a better understanding of a dataset.
Step 1: Familiarize with the structure of the raw data
Nowadays, almost all Database Management Systems (DBMS) provide the data schema of every table they host. By examining the data schema, you can identify the data type of each column, check whether it is nullable, and learn the column names and the information they contain.
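In BigQuery, for instance, the schema can also be pulled with a query against INFORMATION_SCHEMA.COLUMNS; the dataset and table names below are placeholders.
SELECT
  column_name,
  data_type,
  is_nullable
FROM your_dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'your_table'
ORDER BY ordinal_position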
Step 2: Sense the information contained in the data using preview
You can also choose the preview option to get a glimpse of the real data values. By cross-referencing the data type information from the table schema, you can determine whether any columns require type conversion. Additionally, previewing text data may help you anticipate string processing in some other columns.
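If a preview feature is not handy, a quick query serves the same purpose; the sketch below simply pulls a small, arbitrary sample of rows (the table name is a placeholder).
SELECT *
FROM your_table
-- TABLESAMPLE SYSTEM (1 PERCENT) can be added before LIMIT to sample very large tables
LIMIT 100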
Step 3: Summarization of numerical data
If available, a summary of numerical columns can be conducted to gain an understanding of the distribution and any missing data issues within those variables. Subsequently, you may decide whether to eliminate, impute or remove outliers from the dataset. Below is the sample code for data summarization.
SELECT
  col,
  COUNT(value) AS cnt,
  APPROX_QUANTILES(value, 100)[OFFSET(25)] AS q1,
  APPROX_QUANTILES(value, 100)[OFFSET(50)] AS q2,
  APPROX_QUANTILES(value, 100)[OFFSET(75)] AS q3,
  AVG(value) AS mean,
  STDDEV_POP(value) AS std,
  MIN(value) AS min,
  MAX(value) AS max
FROM your_table
UNPIVOT(value FOR col IN (
  column_1, column_2, column_3 -- numerical columns
))
WHERE condition_1 -- any filtering if needed
  AND condition_2
GROUP BY 1
Step 4: Summarization of categorical data
Likewise, a glance at text columns is also needed for anticipating the missing value problem.
SELECT
  col,
  COUNT(value) AS cnt,
  COUNT(DISTINCT value) AS unq_val,
  MIN(LENGTH(value)) AS min_length,
  MAX(LENGTH(value)) AS max_length
FROM your_table
UNPIVOT(value FOR col IN (
  column_1, column_2, column_3 -- string columns
))
WHERE condition_1 -- your filtering
  AND condition_2
GROUP BY 1
Based on the above information, you may pick some interesting columns and summarize the frequency of each value within them.
d. Visualization Method — Histogram
Using the following code, you can generate summarized data for a single numerical column and visualize it as a histogram.
WITH raw AS (
  SELECT
    numeric_column -- CHANGE HERE
  FROM your_table
  WHERE conditions
)
, stats AS (
  SELECT
    min + step*i AS min, -- lower bound of each bin
    min + step*(i+1) AS max -- upper bound of each bin
  FROM (
    SELECT
      min,
      max,
      (max-min)/10 AS step, -- CHANGE BINS (10) HERE
      GENERATE_ARRAY(0, 10, 1) AS idx -- CHANGE BINS (10) HERE
    FROM (
      SELECT
        MIN(numeric_column) AS min, -- CHANGE HERE
        MAX(numeric_column) AS max -- CHANGE HERE
      FROM raw
    )
  ), UNNEST(idx) AS i
)
SELECT
  COUNT(1) AS cnt, -- number of values falling into the bin
  (min+max)/2 AS avg -- midpoint of the bin, used as the x-axis label
FROM raw r
INNER JOIN stats s
  ON r.numeric_column >= s.min -- CHANGE HERE
  AND r.numeric_column < s.max -- CHANGE HERE
GROUP BY avg
ORDER BY avg
e. Visualization Method — Contingency table
For categorical columns, you simply need to count the frequency of the (unique) values according to your preference.
SELECT
  column_1,
  column_2,
  COUNT(1) AS freq
FROM your_table
WHERE conditions
GROUP BY column_1, column_2
2. Tidying messy data
a. Symptoms of messy data
As a data analyst, you will tackle some of these commonly seen “errors” at some point. Needless to say, there are countless other types of dirty data that you will meet as you are gradually exposed to different kinds of data. To sum up, here is a list of the common symptoms of a messy dataset:
- Missing data: when not all rows/records of the dataset contain information. In many databases, the missing data are usually assigned a special value — `NULL` value.
- Incorrect data type: the type of stored data does not reflect the characteristics of the information. For example, the sales amount, which should be numeric, is stored as text. In some worse scenarios, you may see array/dictionary-type data stored as strings. The worst case I have ever seen is the literal string “null” being used to represent missing values.
- Erroneous/special characters in a string column: this is technically not wrong, but you have to process those characters. A good example is emoji characters within SEO content data, like ✌️💓🎁. Another example is wrongly injected CSS/HTML tags or placeholders such as <div>, <strong>, or ${firstname}.
- Extra “space” in text data: sometimes redundant whitespace characters “<space>” are introduced into a string-type column due to a defective tracking or input process, like “Adam<space><space>Levine” or “<space>John<space>Smith”.
- Inconsistent string values: this is personally the most annoying problem to me, when I see different values indicating the same meaning, i.e. “Male”, “m.”, “MALE”, “M”. This type of error is usually caused by combining data from multiple sources that track the same attribute, or by collecting manually input data.
- DateTime data stored in POSIXct format: this is not necessarily an error, but you have to deal with it anyway. The POSIXct format stores date and time as the number of seconds since the beginning of January 1, 1970, 00:00 (UTC). For example, “1445271300” corresponds to October 19, 2015 (see the quick check below).
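As a quick sanity check of that example in BigQuery (assuming the raw value is stored in seconds):
SELECT
  TIMESTAMP_SECONDS(1445271300) AS ts, -- 2015-10-19 16:15:00 UTC
  DATE(TIMESTAMP_SECONDS(1445271300)) AS dt -- 2015-10-19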
b. Definition of Tidy Data
According to Hadley Wickham, tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
Other than the errors mentioned above, you may also consider the following scenarios as messy data, as per the concept of tidy data:
- Columns in the dataset are values of a variable: for example, the income of different households over the years is recorded in multiple columns like `2013`, `2014`, `2015`, `2016`, and so on. Indeed, those values can be stored in a single column named `year`.
- Variables are stored in both rows & columns: you could see this symptom in the tracking data of Firebase or Google Analytics, where a key-value pair (`event_params.key` & `event_params.value`) records multiple attributes of users. For example, `utm_campaign`, `utm_medium` & `utm_source` in the `event_params.key` column store the campaign name, the medium & the source of the marketing channel that a user clicks on, respectively.
- Multiple variables are stored in the same column: this symptom usually appears in datasets having demographic information. The data distributor may combine multiple demographic characteristics and store them in multiple columns. For example, the global tuberculosis data provided by the World Health Organization combines `sex` (m, f) and `age group` (0–4, 5–14, 15–24, 25–34, …) into columns like `new.sp.m04`, `new.sp.f1524`, `new.sp.m3544`.
c. Clean up the mess
Case 1: Column headers are values, not variable names
When column headers are indeed values of a variable, the table is called a wide-format table, and we need to convert it back to a standard tidy dataset by using the UNPIVOT operator.
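As a minimal sketch of this case, assume a hypothetical table holding household income in yearly columns named income_2013, income_2014 and income_2015 (plus a household_id); UNPIVOT folds them into a single year/income pair.
SELECT
  household_id,
  year,
  income
FROM your_table
UNPIVOT(income FOR year IN (
  income_2013 AS '2013',
  income_2014 AS '2014',
  income_2015 AS '2015'
))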
Case 2: Multiple variables stored in one column
Usually, the messy column is stored as a string, so we can use some string functions to separate it (see the sketch after the lists below):
- LEFT or RIGHT if the length of 1 or more variables are fixed (sex has length of 1).
- REGEXP_EXTRACT if the needed variables follow some patterns.
- SPLIT + OFFSET if there are delimiters within the messy column.
In some other cases, the messy column is stored as an array or struct type like a dictionary, JSON, etc.:
- OFFSET if it is stored as an array.
- JSON_EXTRACT* if it is stored in json format.
- UNNEST or .key (dot notation) if it is stored natively as a BigQuery STRUCT type.
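As an illustration of the string functions above, here is a sketch on a hypothetical column demographic_code holding combined WHO-style codes like “new.sp.m3544”; the regular expressions are assumptions that you would adapt to your own data.
SELECT
  demographic_code,
  REGEXP_EXTRACT(demographic_code, r'^new\.sp\.([mf])') AS sex, -- the single sex letter after the prefix
  REGEXP_EXTRACT(demographic_code, r'^new\.sp\.[mf](\d+)$') AS age_group, -- the trailing age-group digits
  SPLIT(demographic_code, '.')[OFFSET(2)] AS sex_age -- SPLIT + OFFSET when a delimiter exists
FROM your_table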
Case 3: Variables are stored in both rows & columns
In this case, the dataset is called a long-format table, and we can also use the PIVOT operator to convert it back to the tidy standard.
SELECT *
FROM your_table
PIVOT(
  ANY_VALUE(value) AS key
  -- filter only concerned variables, else remove this
  FOR element IN ('tmin', 'tmax')
  -- add more variable names (tmean, tstd) if needed
)
Case 4: Multiple variables stored in one column — Advanced
The query below demonstrates this case on the GA4 public sample dataset: it unnests `event_params` into key-value rows, pivots the selected keys (`page_title`, `ga_session_id`, `page_location`) back into columns, and finally counts distinct sessions and users per day.
DECLARE lower_date DATE DEFAULT DATE '2021-01-01';
DECLARE upper_date DATE DEFAULT DATE '2021-01-10';
WITH raw_fib AS (
  SELECT DISTINCT
    event_timestamp,
    event_name,
    user_id,
    key,
    value.string_value AS str_val
  FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20*`
  , UNNEST(event_params) AS event_params
  WHERE TRUE
    AND REGEXP_CONTAINS(event_name, "page_view|scroll|user_engagement|first_visit")
    AND _TABLE_SUFFIX >= FORMAT_DATE("%y%m%d", lower_date)
    AND _TABLE_SUFFIX <= FORMAT_DATE("%y%m%d", upper_date)
)
, usr_data AS (
  SELECT
    DATE(event_time) AS event_date,
    event_time,
    user_id,
    segment,
    session_id
  FROM (
    SELECT DISTINCT
      DATETIME(TIMESTAMP_MICROS(event_timestamp), "+7") AS event_time,
      user_id,
      -- CASE WHEN TO CREATE SEGMENT
      CASE
        WHEN REGEXP_CONTAINS(event_name, 'your_event_name')
          AND pr_page_title = 'your_action_value' THEN 'segment_name'
        ELSE 'your_value'
      END AS segment,
      CONCAT(pr_ga_session_id, user_id) AS session_id
    FROM raw_fib
    PIVOT(
      ANY_VALUE(str_val) AS pr
      -- add more param (stage, screen) if needed
      -- each key listed below will be pivoted as pr_<key_name>
      FOR key IN ('page_title', 'ga_session_id', 'page_location')
    )
  )
)
SELECT
  event_date,
  COUNT(DISTINCT session_id) AS session_count,
  COUNT(DISTINCT user_id) AS user_count
FROM usr_data
GROUP BY 1
d. Data Type Treatment
Case 1: Type Casting
In some cases, you may need to convert float-type/numerical-type data into integer-type data, or you may need to forcefully convert integer data into a string. Just use the CAST function.
SELECT
  CAST(AMOUNT AS INT64) AS amt,
  CAST(user_id AS STRING) AS user_id
FROM your_table
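A related note: when some rows cannot be converted (for example, stray text in a numeric column), a plain CAST fails the whole query, whereas SAFE_CAST returns NULL for those rows instead. A minimal sketch with the same placeholder column:
SELECT
  SAFE_CAST(AMOUNT AS INT64) AS amt -- NULL instead of an error when AMOUNT is not a valid number
FROM your_table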
Case 2: String processing
The TRIM function can be utilized to remove the unwanted characters like the extra space <space>, double quotes ", underscore _, etc.
SELECT
first_name,
TRIM(first_name," ") AS proc_first_name
FROM your_table
If you happen to work with phone numbers, postal codes, or any data in a “code”/“id” format, you may come across some “short” rows that lack the starting characters, such as “0” or “us_”. This can result in inconsistent data lengths.
SELECT DISTINCT
POSTAL_CODE AS org_code,
LPAD(POSTAL_CODE, 12, "0") AS pad_code -- or RPAD in some cases
FROM your_data
WHERE POSTAL_CODE IS NOT NULL
Case 3: Array-like data
It is common that array-like information can be registered as string-type columns. We can utilize the JSON_EXTRACT_ARRAY function to collect items within the arrays.
SELECT
PRODUCTID AS org_product_id,
JSON_EXTRACT_ARRAY(PRODUCTID, "$") AS product_id
FROM your_table
Sometimes, the string data are not written correctly either and we need to remove the erroneous characters (usually double quotes ").
SELECT
NOTE AS org_note,
JSON_EXTRACT(NOTE, '$.info') AS js_ext_note,
TRIM(JSON_EXTRACT(NOTE, '$.info'),'""') AS trim_ext_note
FROM your_table
Case 4: DateTime data processing
To optimize storage, many tracking systems prefer using POSIXct format data. As a result, it is necessary to convert these numbers into a human-readable date and time format.
SELECT DISTINCT
REPORT_DATE AS org_date,
TIMESTAMP_MILLIS(CAST(REPORT_DATE AS INT64)) AS datetime_date,
DATE(
TIMESTAMP_MILLIS(CAST(REPORT_DATE AS INT64)),
"Asia/Bangkok") AS date
FROM your_table
Or, in some cases, we may need to convert a string-format timestamp expression into date-type data.
SELECT DISTINCT
ISOWEEK_YEAR_CHAR AS org_wy_char,
PARSE_DATE("%W-%Y", ISOWEEK_YEAR_CHAR) AS converted_wy
FROM your_table
You can find the abbreviations of the date part format codes here.
Update:
You may find other datetime functions here. For example, the FORMAT_DATE/FORMAT_TIMESTAMP function is extremely useful.
SELECT DISTINCT
FORMAT_TIMESTAMP('%b %Y', TIMESTAMP '2008-12-25 15:30:00+00') AS fm_ts,
FORMAT_DATE('%m.%d', DATE '2008-12-25') AS fm_date
FROM your_table
Well, there are two more critical topics concerning Missing Data Handling and Outlier Removal, which will be covered in the next part of this series. I hope this article gives you an overview of the common types of messy data and some simple handling methods. Cheers! 🎉🎉🎉