Data Cleaning 101 in SQL — #4.2 A Practical Tutorial for Data Deduplication
I am glad to have you here for the fourth part of my complete guide on data cleaning.
- #1: Tidying Messy Data
- #2: Dealing with Missing Data
- #3.1: A discussion on the Nature of Outliers
- #3.2: The Origin of Outliers & Detection Techniques
- #4.1: Where does Data Duplication come from?
- #4.2: A Practical Tutorial for Data Deduplication <– You are here
I am not a duplicate multiple soul, rather a unique rare blooming nurturer, so never compare me with other precious souls.
— DR. ZAINAB ANSARI
Each person is one-of-a-kind, and we should embrace our individuality instead of trying to be a copy of someone else. Similarly, data — a reflection of our wonderfully unique world through the capture of every moment in time — should be unique, too.
Consider a football player maneuvering down the pitch: his movements create a distinct data sequence. Or a shopper browsing online: her shopping journey forms a personal digital footprint. At every instant, novel data emerges like snowflakes, no two alike.
It is the beauty of modern technology that preserves these fleeting moments in digital form. By acknowledging the uniqueness of individuals and the world around us, we can ensure that the data we collect is accurate and reflects reality. So in this article, allow me to offer a practical handbook for data deduplication.
How Do We Know?
Before we can start cleaning our data, we need to determine whether there are any duplicate records. There are two levels of approach we can take to detect duplicates: a simple examination of the raw data and a careful inspection of the database schema. Which approach to employ depends on the size and complexity of our dataset.
Novice Methods
This approach involves investigating the data itself and manually checking for duplicates using counting and sorting methods. This can be a time-consuming and computationally costly process, especially when we tackle a large dataset. On the plus side, it is a straightforward way to identify duplicates and can be helpful for smaller datasets.
Easy Count
With a simple COUNT function, we can detect duplicate records in our data at the level of granularity (level of detail) we are currently working with.
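As a minimal sketch, assuming a hypothetical births table that should contain one row per County_of_Residence at our current level of detail, a grouped count surfaces any value that appears more than once:

```sql
-- Hypothetical `births` table: at this level of detail, each
-- County_of_Residence should appear exactly once.
SELECT
    County_of_Residence,
    COUNT(*) AS record_count
FROM births
GROUP BY County_of_Residence
HAVING COUNT(*) > 1        -- anything returned here is a duplicate candidate
ORDER BY record_count DESC;
```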
Line Up
We can employ ranking functions (ROW_NUMBER, RANK, DENSE_RANK) to sort the records based on a specific order (e.g. time order, a numerical variable, or even no order at all).
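Here is a rough sketch against the same hypothetical births table; the partitioning column and the absence of a meaningful ordering are assumptions for illustration:

```sql
-- Number the rows within each county; any row_num above 1 is a duplicate.
SELECT
    County_of_Residence,
    ROW_NUMBER() OVER (
        PARTITION BY County_of_Residence
        ORDER BY County_of_Residence   -- no meaningful order needed here
    ) AS row_num
FROM births;
```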
In this case, the County_of_Residence column has more than one ranking, which indicates the presence of duplicates in the data.
General Approaches
This approach involves inspecting the database schema, which is the structure of the database that defines how the data is organized, stored, and related. By examining the schema, we can identify potential sources of duplication, such as multiple tables or columns that contain similar information, or a mismatched level of detail when performing a JOIN.
Data Granularity
At a more granular level, it is easy to overlook the fact that the information we require (such as user_id, event_id, session_id, etc.) may exist in multiple records. To prevent this error, we must examine the table schema to determine which column contains the most detailed level of data.
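One way to check this, sketched against a hypothetical app_events table, is to compare the total row count with the distinct counts of the candidate keys; whichever distinct count matches the total is the true grain of the table:

```sql
-- Compare row counts against candidate keys to find the table's grain.
SELECT
    COUNT(*)                   AS total_rows,
    COUNT(DISTINCT user_id)    AS distinct_users,
    COUNT(DISTINCT session_id) AS distinct_sessions,
    COUNT(DISTINCT event_id)   AS distinct_events
FROM app_events;
```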
Relationship Check
Dynamic data such as transactions, app events, and other backend data requires distinct statuses (e.g. confirmed, delivered, cancelled) to be stored in separate records, which can result in duplication during careless merging. Hence, it is essential to verify whether the tables we intend to join have a one-to-many relationship via the database design schema.
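A quick sanity check, assuming hypothetical orders and order_statuses tables joined on order_id, is to count how many status rows each order has; a maximum above 1 confirms a one-to-many relationship:

```sql
-- If max_statuses_per_order > 1, a naive join on order_id will fan out.
SELECT MAX(status_count) AS max_statuses_per_order
FROM (
    SELECT
        order_id,
        COUNT(*) AS status_count
    FROM order_statuses
    GROUP BY order_id
) AS per_order;
```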
Size Comparison
To ensure that the final joined dataset is free of duplicates, we can compare the total number of records in the original dataset before and after joining the data.
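Using the same hypothetical orders and order_statuses tables, the check is simply two counts; if the second number is larger than the first, the join has introduced duplicates:

```sql
-- Row count before the join
SELECT COUNT(*) AS rows_before
FROM orders;

-- Row count after the join; rows_after > rows_before means fan-out
SELECT COUNT(*) AS rows_after
FROM orders AS o
LEFT JOIN order_statuses AS s
    ON o.order_id = s.order_id;
```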
Beat It
Once you know the most granular level of detail currently being processed (using the General Approaches above), you can confidently apply the following techniques to prevent data duplication.
Method 1: Distinct Selection (Highly Recommended)
To prevent duplication, include the DISTINCT operator right after the SELECT keyword. This ensures that your data remains pristine and aligned with the specified level of detail.
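A minimal example, assuming a hypothetical app_events table and that (user_id, device_model) is the level of detail we want to keep:

```sql
-- One row per (user_id, device_model) pair, regardless of how many
-- events each user generated.
SELECT DISTINCT
    user_id,
    device_model
FROM app_events;
```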
Method 2: Latest Status Extraction (Recommended)
An alternative approach is to extract the most recent occurrence of the data of interest based on the relevant level of granularity. This can be accomplished by utilizing ranking functions, such as ROW_NUMBER.
This method is particularly useful when dealing with slowly changing dimension data such as age, marital status, or device model. The challenge is determining which data instance is accurate, since a single user may use both Samsung and iPhone devices concurrently. However, we must extract a single set of attributes for that user and cannot retain both records, so we assume that the most recently used model is the user's correct mobile device.
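Here is a sketch of that idea, again against the hypothetical app_events table, where event_time is an assumed timestamp column used to decide which record is the latest:

```sql
-- Keep only the most recently observed device model for each user.
SELECT
    user_id,
    device_model
FROM (
    SELECT
        user_id,
        device_model,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY event_time DESC   -- newest record gets rn = 1
        ) AS rn
    FROM app_events
) AS ranked
WHERE rn = 1;
```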
Method 3: Count Before Join
The least favourable approach is to COUNT the frequency of the dimension of interest prior to executing a JOIN operation. This is because modern optimizers in SQL-like databases are likely to recognize that your intention is to select distinct occurrences of the data and produce a distinct-selection query plan rather than counting the data, so it is advisable to apply the distinct selection method in the first place.
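For completeness, a sketch of this pre-aggregation pattern with the hypothetical orders and order_statuses tables; the many-side is collapsed to one row per order_id before joining, so the join cannot fan out:

```sql
-- Aggregate the one-to-many side first, then join one-to-one.
SELECT
    o.order_id,
    s.status_count
FROM orders AS o
LEFT JOIN (
    SELECT
        order_id,
        COUNT(*) AS status_count
    FROM order_statuses
    GROUP BY order_id
) AS s
    ON o.order_id = s.order_id;
```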
Well, this is the end of the Data Cleaning in SQL series, and I hope you enjoyed this journey through the fundamental data cleansing operations. If there are any other topics you are interested in, please let me know.