Data Cleaning 101 in SQL — #4.1 Where does data duplication come from?
Unlike the Queen in Snow White, we data analysts do not possess a Magic Mirror that reveals the location of duplicated data.
Queen:
Mirror, Mirror, on the wall, now, who is the fairest of us all?
Magic Mirror:
Over the seven jewelled hills, beyond the seventh fall, in the cottage of the Seven Dwarfs, dwells Snow White, fairest of us all.
Ah, the story of Snow White and the Evil Queen never gets old, does it? But let’s not forget the powerful Magic Mirror that the Queen used to find Snow White. Speaking of magical tools, do you have one to locate duplicate records in your data quickly? I mean, you wouldn’t want to be caught off guard by data duplication, right?
Over the years, I have encountered countless cases of data duplication, some as simple as pie and others that were downright terrifying. Don’t be fooled by how easy it may seem to spot duplicates. The issue can quickly spiral out of control and cause chaos. But fear not, my friend! This article will reveal the true identity of the duplication monster and equip you with the strategies you need to hunt it down like an experienced analyst.
The Bad Luck We Avoid
What bad luck!
I can easily imagine the scenarios you face when tackling a data duplication problem. It shows up in the most unexpected places, like an otherwise uneventful morning, or a meeting where someone questions a suspicious spike in the numbers. This is how things usually happen:
Duplicate data is a serious problem with severe consequences across many areas. It often appears in complex forms rather than as a simple copy of an original record. So, where do the duplicated records come from?
Causes of data duplication
From my personal experience, I have observed that the issue of data duplication can arise due to a multitude of errors, which can be succinctly summarized as follows:
- Inadequate Data Preparation: Ineffective data preparation when merging multiple data sources.
- Lack of Data Governance: Unregulated data entry procedures and modification of user data over time.
- External Data Integration: Lack of consolidation during integration with external sources.
- System Deficiency: Software bugs and system errors.
Deficient data prep
In daily practice, you commonly have to pull data from multiple tables or even different data sources and combine them all to obtain the desired outputs. There is no such thing as an ideally_cleaned_ready_to_use_table waiting for you in the first place. Chances are you will hit some duplicated records after joining numerous tables without effective data cleansing procedures. Let’s take a look at the following example.
Upon merging the booking_table and the hotel_table, you may observe seemingly identical rows for the accommodation facility IDs `130890` and `97014`. Such rows are not necessarily incorrect, as they may represent distinct bookings at the same homestay, but repetitive data points like these deserve caution: there is a high likelihood that data duplication has occurred.
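If you suspect a join has fanned out, a quick sanity check is to group the joined result by the key that is supposed to stay unique and count how many rows each key produces. This is only a sketch: the column names `booking_id` and `accommodation_id` are assumptions for illustration, not the actual schema of the tables above.

```sql
-- Hypothetical schema: booking_table(booking_id, accommodation_id, ...),
-- hotel_table(accommodation_id, ...). A booking should appear exactly once.
SELECT
    b.booking_id,
    COUNT(*) AS joined_rows
FROM booking_table AS b
JOIN hotel_table   AS h
    ON b.accommodation_id = h.accommodation_id
GROUP BY b.booking_id
HAVING COUNT(*) > 1;   -- any result here means the join multiplied rows
```

If this query returns rows, the join key is not as unique as you assumed, and the duplication should be resolved before the result feeds any report.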
Lack of data governance
For those of you who may be unfamiliar with the term, “data governance” is essentially a framework or procedure that guarantees the availability, quality, and security of an organization’s data flow, covering how data is extracted, transformed, loaded, and distributed.
As you can imagine, a company that lacks proper data governance or management may be prone to erroneous duplicate data entries. This can occur due to a variety of reasons such as:
- Inconsistent data ingestion processes can be a major source of data duplication. For example, a batch of notification data may be ingested twice in the morning due to a system glitch on Airflow. However, it is essential to note that duplicated records can also arise from human error. For instance, after a long day of moving from store to store, a salesman may accidentally input the sales of a convenience store three times because he believes the CRM system is unresponsive. In reality, the issue is perhaps caused by an unstable Wi-Fi network.
- Insufficient data validation processes: The entry errors in the aforementioned examples could have been eliminated by an automatic data validation procedure. For instance, the Data Team can implement an ingestion sensor that checks whether incoming requests from the salesperson’s handheld machine arrive too close to each other (say, under 100 ms), which would suggest network errors firing multiple records into the system; a sketch of such a check follows this list.
- Inadequate data quality control measures can also contribute to data duplication issues. Data validation is not sufficient on its own. The entire data system may contain multiple layers of ETL jobs that transform raw data, and duplication problems can arise during these processes. In certain circumstances, an error in an upstream job (such as a double-ingested notification job) can cause downstream jobs to process and manipulate duplicated data, leading to doubled numbers in the notification delivery report.
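To make the ingestion-sensor idea from the second point a bit more concrete, here is a minimal, PostgreSQL-flavored sketch. The table and columns (`sales_events`, `store_id`, `salesperson_id`, `event_ts`) are assumptions for illustration, and the interval arithmetic syntax varies between databases.

```sql
-- Flag events from the same salesperson at the same store that arrive
-- within 100 ms of the previous event; these are likely retries caused
-- by network errors rather than genuine separate sales.
SELECT
    store_id,
    salesperson_id,
    event_ts,
    CASE
        WHEN event_ts - LAG(event_ts) OVER (
                 PARTITION BY store_id, salesperson_id
                 ORDER BY event_ts
             ) < INTERVAL '100 milliseconds'
        THEN 1 ELSE 0
    END AS suspected_retry
FROM sales_events;
```

Events marked as suspected retries can then be quarantined or dropped before they flow into downstream jobs.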
Therefore, companies must establish robust data governance policies and procedures to ensure the accuracy and reliability of their data.
External data source integration
In today’s data-driven world, it is often necessary to extract data from external sources such as Facebook Graph API, Google Ads API, Salesforce API, and many others. However, the structure of these external data sources is not always standardized, and each party records data in their own unique way. As a result, certain types of data such as customer information, user segments, device profiles, and other kinds of static information can be duplicated across different sources. For instance, a customer may be recognized as a high-income female adult in one source, but in another source, that same customer may be identified as a fresh female graduate. These discrepancies can lead to confusion and inaccurate insights if left unaddressed.
To ensure the accuracy and reliability of data extracted from external sources, it’s essential to establish proper data integration and management processes.
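One practical way to surface those discrepancies is to join the staging tables loaded from two sources on the shared customer key and keep only the rows where their attributes disagree. The table names below are hypothetical, and `IS DISTINCT FROM` is the PostgreSQL-style NULL-safe comparison; other engines may need a different expression.

```sql
-- List customers whose segment differs between the two external sources.
SELECT
    a.customer_id,
    a.segment AS segment_source_a,
    b.segment AS segment_source_b
FROM stg_source_a_customers AS a
JOIN stg_source_b_customers AS b
    ON a.customer_id = b.customer_id
WHERE a.segment IS DISTINCT FROM b.segment;
```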
System deficiency
System errors are often considered the most common culprit for anything that goes wrong in modern computing. These errors can take many forms, ranging from simple bugs caused by incorrect configurations to more complex issues like a shortage of system resources leading to delayed and erroneously retried pipelines. Other factors, such as an abnormal peak in server-to-server data streaming or a cascade of bugs affecting multiple ETL jobs, can also contribute.
Unfortunately, system errors are often unavoidable and can be the most unpredictable reason for data duplication. They can occur at any time and without warning, making them difficult to anticipate or prevent.
Consequences of data duplication
When it comes to data duplication, the consequences can be quite severe.
- Misleading business insights: When raw data is duplicated, it can have a significant impact on the accuracy of subsequently reported numbers. Thus, data analysts may translate the data into inaccurate insights and present them to business stakeholders like the marketing team, the sales team, or the product team. This, in turn, can lead to misleading business decisions.
- Flawed business decisions: If a company’s marketing team relies on duplicated data to make decisions, they may end up targeting the wrong audience or investing in ineffective campaigns. This can result in wasted resources and lost revenue.
- Expensive storage & processing cost: It’s no secret that duplicated data can be costly for businesses. The cost of storing and processing duplicated events can quickly add up, resulting in wasted resources and increased expenses. This is especially true for companies that deal with large volumes of data regularly.
- Incorrect customer information: Duplicated data can lead to reputational damage if it causes customers to receive duplicate communications or incorrect information.
In short, the consequences of data duplication can be significant and far-reaching. To avoid these issues, companies need to establish robust data governance policies and procedures that ensure the accuracy and reliability of their data. By doing so, they can make informed business decisions based on accurate insights and maintain the trust of their customers.
Common duplication variants
There are three common variants of duplication in data:
- Exact duplication
- Mismatched data across sources
- Non-exact duplication
Exact duplication
This type of data duplication occurs when the same or nearly identical data is recorded for the same ID or entity. This can be caused by a variety of factors, including system deficiencies, data entry errors, or poor data preparation.
It is worth noting that for the `parent_id` numbers 2031, 2052, and 2056, the information in the `country_code`, `display_feature_type`, and `status` fields is identical.
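A common way to isolate exact duplicates like these is a `ROW_NUMBER()` window over the columns that should be unique together. The sketch below reuses the column names from the example and assumes a hypothetical table name, `feature_table`.

```sql
-- Number the rows inside each group of identical values; anything with
-- rn > 1 is a redundant copy of the first row in its group.
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY parent_id, country_code, display_feature_type, status
            ORDER BY parent_id
        ) AS rn
    FROM feature_table
)
SELECT *
FROM ranked
WHERE rn > 1;
```

Flipping the filter to `rn = 1` keeps exactly one representative per group, which is the usual deduplication step.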
Mismatched data across sources
This variant occurs when information for the same ID or entity is partially mismatched across multiple datasets or data sources. Data entry errors are a common cause, as individuals may accidentally record incorrect or incomplete information. External data source integration can also contribute to this kind of data duplication.
User profile mismatch between systems is a common example of data duplication. This can occur when user profile data slowly changes over time, resulting in different copies of each stage being kept in different data sources. For example, a customer may upgrade from an iPhone 12 to a new iPhone 13, but some sources may not have updated the user’s new status. Similarly, a user may move to a new city on a business trip, but this information may not be reflected in all data sources.
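If each source keeps an `updated_at` timestamp, a simple reconciliation sketch is to union the profiles and keep only the most recent record per user, so the stale copies (the old phone model, the old city) are dropped. The table and column names here are assumptions for illustration, not a prescribed schema.

```sql
-- Combine both profile sources and keep the latest record per user.
WITH unioned AS (
    SELECT user_id, device_model, city, updated_at, 'crm' AS source
    FROM crm_user_profiles
    UNION ALL
    SELECT user_id, device_model, city, updated_at, 'app' AS source
    FROM app_user_profiles
)
SELECT user_id, device_model, city, source
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM unioned
) AS latest
WHERE rn = 1;   -- the most recently updated profile wins
```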
Non-exact duplication
Non-exact duplication refers to the situation where two or more documents or pieces of content are similar but not identical. This can happen for various reasons such as paraphrasing, summarizing, or using synonyms. In the context of data, this type of duplication results in partially different information for the same ID or entity.
There are several factors that can contribute to non-exact duplication, including data entry errors, lack of data governance, and system deficiencies. For example, a saleswoman may quickly write down her client’s name as MJ Sue because she needs to rush over to the client building. However, the client had previously entered her name as Mary Jane on the company’s website to request a demo. This can lead to non-exact duplication and partially different information for the same client.
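Catching this variant usually means matching on something more stable than the name itself. Here is a hedged sketch that blocks on a normalized email address, assuming hypothetical `website_leads` and `crm_contacts` tables that both store one, and flags name conflicts for manual review; true fuzzy matching may additionally call for a function such as Levenshtein distance where the database provides one.

```sql
-- Same person, different spellings of the name: join on a normalized
-- email and surface the conflicting names for review.
SELECT
    w.client_name AS website_name,   -- e.g. 'Mary Jane'
    c.client_name AS crm_name,       -- e.g. 'MJ Sue'
    LOWER(TRIM(w.email)) AS shared_email
FROM website_leads AS w
JOIN crm_contacts  AS c
    ON LOWER(TRIM(w.email)) = LOWER(TRIM(c.email))
WHERE LOWER(w.client_name) <> LOWER(c.client_name);
```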
This visual representation can aid in comprehending the concept of data duplication more easily. Please note that:
- Near duplicates include partially different information (the image of dogs) & non-exact duplication (the image of Michael Jackson’s album).
- Visually similar images correspond to partially different information.
- Full duplicate images are similar to having identical copies of the same dataset stored in different locations or systems.
References
- IBM, What is Data Governance? link