Navigating Data Quality: Exploring Issue Types, Impacts, and Solutions
Issues with data quality can feel like things are falling apart

"Bad Data" is a face-value term commonly used to describe issues with data quality. While widely accepted in use, it is not a productive starting point for resolving issues: it increases the time spent investigating or elaborating on what exactly is "bad" about the data, and it reflects a lack of specificity and context. Changing this narrative into a more productive dialogue requires establishing a culture of excellence: setting a high bar that disallows issues from persisting, encouraging teams to discover why issues came to be, and finding innovative ways to solve them.

Developing a culture of data excellence starts with understanding the different kinds of data quality issues a team will encounter. The most commonly accepted and broadly applied framework is the six dimensions of data quality. Let's unpack each dimension and look at examples, impacts, and actions to resolve each type.

Completion

Completion of a data record evaluates whether all the required fields are filled in. A missing or incomplete field in a dataset is like a brick wall missing bricks along a row: if key bricks are missing, the distribution of weight is affected, causing the wall to sag or collapse. In a dataset, if key fields are missing, a data record might not be identifiable, usable, or connectable to corresponding datasets. Incomplete data commonly originates from the timing of business and system processes, and from mismatches with the needs of downstream consumers. Organizational data, such as a product list, needs to be finalized before marketing activities can begin, so it is critical that stage gates are in place to prevent incomplete data from reaching stakeholders.

Resolving Issues with Incomplete Data

The starting point for resolving issues with data completion is people and processes. The first step is to define when data needs to be complete, and to ensure that downstream stakeholders understand when it will be made available for them to use. The simplest method is to provide a timeline of when data entry begins and when it will be complete. In addition, maintaining an open line of communication as the status progresses can reassure stakeholders who are waiting on completion. Another helpful practice is to proactively onboard and educate new stakeholders, using flow charts or process demonstrations to build awareness of timelines and the overall process.

The second step is to apply technology within systems to make fields required during data entry. More specifically, these controls should disallow submission if any of the required fields are missing. Depending on the lifecycle of the event being captured, completion may need to be continuously re-evaluated as the surrounding context of the data record changes.
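As a minimal sketch of such a control, the check below rejects a record when any required field is missing or blank. The field names are illustrative, not from any particular system:

```python
# Minimal sketch of a completion gate: disallow submission if any
# required field is absent or empty. Field names are hypothetical.
REQUIRED_FIELDS = ["prospect_name", "potential_value"]

def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

def can_submit(record: dict) -> bool:
    """Allow submission only when every required field is filled in."""
    return not missing_fields(record)
```

In practice the same rule would live in the entry form itself, so the user is told which fields to fill in before the record is ever saved.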

Evolving Basis for Completion Requirements

An example of completion requirements changing scope: when a sales executive logs a prospect into the system, the system may require entry of the prospect's name and the potential value of the sale. At the beginning of a sales engagement, it is unknown whether a prospect will close or how much they might negotiate the offer, and the final sales amount would not be logged until after an agreement has been signed. In this scenario, completion across all required fields could be enforced by a workflow that requires an entry in the "final sales amount" field when the customer record is updated from "prospect" to "customer".

A wall missing a brick along a row - Using incomplete data results in missed insights

Consistency

Consistency of data evaluates the homogeneity of values within a field or across a dataset. Identifying inconsistent data is similar to quality assurance of paint colors in a car factory: a car intended to be painted entirely in “Pearl White” is inadvertently assembled with a silver door. The difference between "Pearl White" and "Silver" is apparent enough to prevent a sale, so the car must be disassembled and returned to the paint shop to have the correct color applied. Correcting the paint job to ensure a consistent color across the vehicle delays production timelines and increases costs through extra materials. In the context of data, if values are inconsistent (“USA”, “U.S.A.”, “United States of America”, “America”), analysts must take the time to normalize them into a single format ("USA").

Short Term Fixes for Inconsistent Data Result in Long Term Consequences

A quick fix that proves costly in the long run is when analysts resolve inconsistencies in their own report or Excel file rather than in the source system. While expedient, the analyst must then devise a way to maintain their custom classification. This deviation from how the data exists in the original system also becomes an issue when others use data from the same source: if inconsistencies are not fixed at the source, other downstream users will invent their own classifications as well. Reports and analysis are then built on top of these custom classifications, increasing reliance on a house of cards. What would have been a simple fix at the source system, applied for all users, instead results in inconsistencies from department to department, mounting technical debt, and increased maintenance work.

Optimal Solutions for Inconsistent Data

Consistency of data can be enforced in systems by using appropriate input types: making data fields drop-downs instead of free-text fields, disallowing certain character types (such as allowing numbers only), or having a master data management team responsible for reviewing data and assigning it to the right category. It is critical to maintain a robust feedback loop between the people creating the data, the people reviewing or processing it, and the people who eventually use it in reports or analysis. Data governance or other cross-functional committees can manage the queue of issues and facilitate changes across all areas of the data lifecycle: updates to source systems to change field types or values, changes to review processes, and accommodation of the updates in reporting.
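For the country-name example above, normalization at the source can be as simple as mapping known variants to one canonical value. This is a sketch under the assumption that the variant list is maintained centrally:

```python
# Sketch of source-side normalization: fold known variants of a value
# into a single canonical form, instead of each analyst inventing one.
COUNTRY_ALIASES = {
    "usa": "USA",
    "u.s.a.": "USA",
    "united states of america": "USA",
    "america": "USA",
}

def normalize_country(value: str) -> str:
    """Map case/punctuation variants onto the canonical value."""
    key = value.strip().lower()
    # Unknown values pass through untouched, flagging them for review.
    return COUNTRY_ALIASES.get(key, value)
```

The important design choice is that the alias table lives with the source system, so every downstream consumer receives the same canonical value.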

A slightly off-shade panel of a car - Fixing inconsistency results in wasted effort

Accuracy

Accuracy of a dataset evaluates whether the values within a data record match reality. Measuring accuracy is similar to purchasing a new T-shirt: you’re shopping and find a T-shirt you like in the $20 section and go to the register to check out. When the employee scans the price tag, it rings up at $100. Shocked at the price, you point out that it should be $20 based on where you found it. The cashier responds that it was inaccurately placed in the sale section by a team member. While this was an honest mistake by an employee, it’s a discouraging experience for you as a shopper, and a potential lost sale for the company. Inaccuracy in a dataset can manifest as exorbitantly high values (i.e. Car Mileage = "999,999,999"), or data that doesn’t match expectations (i.e. Sky Color = “Green”).

Impacts of Inaccurate Data

Inaccuracy across a dataset can eliminate the ability to extract meaningful insights because analysts may not trust that the data is reliable. Being forced to use inaccurate data also requires additional work from analysts to investigate why the data is wrong, and then find a way to fix it or work around it. Resolving inaccurate data may also require investment from data engineers to overwrite values in a way that satisfies business stakeholders - though, as noted above, that is not a preferred long-term solution.

Resolving Issues with Data Accuracy

Data accuracy can be systematically enabled by developing processes that evaluate data against business rules. For example, a rule can check whether an inventory record for a product has a “sale” tag applied, and cross-verify at the register that the discount is appropriately applied. Data accuracy requires tight coupling between business rules, operational processes, and the technical systems beneath them. Supporting processes or workflows that mandate review by business stakeholders should be owned and regularly facilitated by master data management. Approval workflows put multiple people in place to validate the accuracy of data, so it is robustly vetted before it becomes available to others in the organization.
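The sale-tag rule above can be expressed as a small cross-check; the record layout and discount field here are assumptions for illustration:

```python
# Illustrative accuracy rule: if inventory tags an item as on sale,
# the register price must reflect the discounted price.
def price_is_accurate(inventory: dict, register_price: float) -> bool:
    """Cross-verify the register price against the inventory record."""
    expected = inventory["list_price"]
    if "sale" in inventory.get("tags", []):
        expected = round(expected * (1 - inventory["discount"]), 2)
    # Allow for sub-cent rounding differences.
    return abs(register_price - expected) < 0.01
```

A batch of such rules, run nightly against the point-of-sale feed, would surface mispriced items before customers find them at the register.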

Inaccurate price tags - Incorrect data results in loss of trust and potential loss of revenue

Uniqueness

Uniqueness of a dataset evaluates the presence of duplicate values within a given field, and whether or not they should be allowed. A common uniqueness issue most people are familiar with is duplicate customer records in rewards systems. When making a purchase at a retail store, I want it logged to my account for ease of returns and to earn points that can be redeemed for future discounts. However, when the cashier asks for my name, they find multiple instances of it in their system. My expectation of the system and the store's employees is that I should have a single customer profile. If I have three separate profiles, my rewards won't display accurately, and it will take employees longer to find my purchase history if I need to make a return.

Challenges with Uniqueness Violations in Data

If poor customer profile management persists across a business, it negatively impacts analytics and reporting. This manifests as inaccurate metrics, such as an inflated count of repeat customers and a deflated average customer spend, and it also simply results in a poor customer experience at checkout.

Using Culture and Technology to fix Uniqueness Issues

Issues surrounding uniqueness of a dataset can be resolved at the front lines by building a culture of vigilance to find, use, and consolidate customer profiles. This may involve improving search capabilities within business systems, so that existing profiles or data records can be found and updated, rather than needlessly creating new ones. Some companies, like Marriott, also offer customers self-service tools to combine their own accounts. It is also recommended to use tools that monitor for duplicate data records and alert the team when they are detected.
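A simple monitoring pass for the customer-profile case might group records by a normalized key and flag any group with more than one entry. This sketch assumes name and email are the matching fields; real matching is usually fuzzier:

```python
from collections import defaultdict

# Sketch of duplicate detection: flag customer profiles that share a
# normalized name + email, so staff can review and merge them.
def find_duplicate_groups(profiles: list[dict]) -> list[list[dict]]:
    """Group profiles by a normalized key; return groups with >1 entry."""
    groups = defaultdict(list)
    for p in profiles:
        key = (p["name"].strip().lower(), p["email"].strip().lower())
        groups[key].append(p)
    return [g for g in groups.values() if len(g) > 1]
```

Exact-key matching like this only catches trivial duplicates; dedicated tools add fuzzy matching (typos, nicknames, address variants) on top of the same idea.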

Duplicate Customer Record - Fixing duplication issues incurs additional costs and poor customer experience

Timeliness

Timeliness of a dataset evaluates whether data has been modified, refreshed, or made available within the agreed-upon timeline. Timeliness of data can be compared to hailing a cab or calling an Uber: when the arrival time is stated as 10 minutes, the expectation is that the car is in fact 10 minutes away, not 30. Timeliness is just as important in data and analytics: reporting should be refreshed to reflect the most up-to-date metrics of the business and be available when it is needed. Analysts and executives expect data to be refreshed overnight, or periodically throughout the day, so that appropriate action can be taken from reports and operational needs can be satisfied.

Challenges that come with Untimely Data or Reports

Using data that has not been refreshed within the expected timeframe can result in inaccurate conclusions and decreased trust in reports. Suppose Amazon is measuring the success of Cyber Monday promotions: on Tuesday morning, if the provided data still reflects Sunday's metrics, analysts and business stakeholders won't be able to evaluate the success of the sale or recommend follow-up actions, such as extending it.

Monitoring for and Resolving Data Timeliness

The simplest fix for data timeliness is to adjust when source systems send data to analytical systems - for example, scheduling data processing to run an hour earlier so that it completes in time. Another fix is to improve the business processes themselves, creating efficiencies so that processes execute faster, or changing the sequence of events to limit dependencies. The last option, though less preferable, is to adjust stakeholder expectations for when data will be available based on the business process (i.e. acknowledging that retail ledgers won't be closed or confirmed until end of day). A prime example of reframing expectations through design is how the Houston airport reduced complaints about waiting at baggage claim: it moved baggage claim farther away, so the walk took longer than the wait for the bags to arrive.
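Monitoring for timeliness usually reduces to a freshness check against an agreed SLA. The 24-hour window below is an illustrative assumption:

```python
from datetime import datetime, timedelta

# Sketch of a freshness check: alert when a dataset's last refresh
# falls outside the agreed SLA window (assumed here to be 24 hours).
def is_stale(last_refreshed: datetime, now: datetime,
             sla: timedelta = timedelta(hours=24)) -> bool:
    """True when the data has not been refreshed within the SLA."""
    return now - last_refreshed > sla
```

A scheduler would run this against each dataset's refresh timestamp and page the owning team when it returns True, rather than waiting for an analyst to notice stale numbers in a report.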

Customers expect timely service when hailing an Uber - Untimely responses result in cancellations + revenue loss

Validity

Validity of a dataset or data point evaluates whether a value conforms to business or technical rules, or matches the intended structure. Validity of data can be compared to managing incoming guests at a bar: a bouncer checks the identification of incoming patrons to ensure that only people aged 21 and above are allowed in, and anyone below that age is turned away. The bouncer may also monitor the number of guests inside the establishment to limit the total headcount, both to minimize crowding and to adhere to fire codes. A related data scenario applies to a list of US zip codes: they should consist of exactly five digits with no alphabetical characters (zip code = "90210"). If a Canadian postal code is entered, which contains letters ("V8E 1A9"), it should be rejected because the list is supposed to contain US zip codes only.

Outcomes of Invalid Data

The biggest issue with invalid records in a dataset is that they cause numerous delays in driving insights from data. Analysts have to spend time understanding how an invalid record made its way into the dataset, scope out other potentially invalid records, justify whether the data can still be trusted, and figure out how to normalize or clean it. It may also require isolating any records that break the rules, which results in an incomplete perspective - after all, the data made its way in there somehow.

Resolving Issues with Invalid Records

Issues with the validity of a data record can be solved by solutions similar to those noted above: creating field controls at the point of data entry, establishing monitoring and alerting that evaluates the profile of a dataset, or using master data management tools that automatically normalize and isolate bad values.
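The zip-code rule from the example above is a one-line pattern check; the sketch below partitions a list into valid and invalid values so the invalid ones can be isolated for review:

```python
import re

# Validity rule for the US zip code example: exactly five digits.
US_ZIP = re.compile(r"^\d{5}$")

def split_valid_zips(values: list[str]) -> tuple[list[str], list[str]]:
    """Partition values into (valid, invalid) against the zip rule."""
    valid = [v for v in values if US_ZIP.match(v)]
    invalid = [v for v in values if not US_ZIP.match(v)]
    return valid, invalid
```

Applied at the point of entry, the same pattern becomes a field control; applied in a pipeline, it becomes a quarantine step that keeps invalid records out of downstream reports.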

A dog among sheep - Invalid records result in loss of time from questioning integrity

Summary

While each dimension of data quality is distinct in its own right, they are all closely related: an issue in one dimension may cause an issue in another - for example, timeliness issues can result in completion issues. Dimensions can also share solutions; creating field controls, for instance, can address both validity and completion.

Despite their interrelated nature, resolving data quality issues can range from quick fixes to complex projects. As data quality issues are identified within an organization, I recommend referencing the specific kind of issue being encountered, to best manage stakeholder expectations and support a faster resolution process.

My belief is that most data quality problems are people and process issues. Solving them involves understanding who is entering the data, how expectations of data quality are set, enforced, and incentivized, which business processes the data is used in, when it is needed, and so on. Unravelling the context and organizational dynamics behind these factors takes time, and careful change management is needed to successfully implement solutions across business units.

That being said, applying data quality solutions becomes much easier with dedicated tools that support the practice. Some tools recommended in pursuit of improving data quality are:

  • Data Catalog - A singular place to document the context behind the data: its meaning, availability, applicable policies, and reflect issue status if there are any data quality issues.
  • Observability & Monitoring - A way to evaluate datasets, identify data quality issues, and create workflows that manage prioritization, ownership, and resolution
  • Metadata Management - A singular place to manage data transformation and normalization when data comes from numerous places.

All said and done, the purpose of this article was to provide some real-world examples of the different kinds of data quality issues, and some high-level direction to apply within your own organization. If you can relate, or have anything you'd like to add - drop it in the comments below!


#DataGovernance #DataQuality #DataAnalytics #BusinessAnalytics #DataProducts #DataMesh #InformationTechnology #observability #monitoring

