Our Blog

where we write about the things we love

06

Dec

Garbage in, garbage out: The importance of data quality

As business intelligence and data warehousing continue to mature and with increasing attention being given to “big data”, I thought it important to address an often overlooked or under-addressed issue; Data Quality. Over the course of three blog posts in the coming weeks, I’ll touch on Data Quality, Master Data Management and Data Governance. What do they mean and how are they useful to your organisation? 

Let’s start with Data Quality. Several years ago I was working with the Vice-President of US Sales at a large multi-national organisation. We designed a BI solution to provide the VP and all sales management access to sales information relevant to their area of responsibility. A data mart was created based on the needs, the customer’s database team designed and built the ETL process to populate the data mart and the BI solution was built on the mart.

The BI solution allowed everyone, including individual sales people, to see their sales performance. The project started with standard sales metrics such as actual vs. targets, year-over-year comparison, biggest customers, most profitable customers and so forth and provided the ability to slice, or view, the information in a number of different ways. As the implementation was reviewed by sales reps and district managers, it was praised for providing them valuable information. 

When I met with the VP of US Sales (the project sponsor) to review the solution all went well until we got to the biggest customer list. He examined the list for a few moments and then asked “Where is Company ABC?” (name changed to protect the innocent). He continued, “I know that if they aren’t the largest, they are nearly the largest and yet they do not appear on the list.” I had not heard this concern when working with individual sales reps and district managers.  What could be wrong? I told him I would look into the issue and get back to him.

Upon examining the customer information contained within the data mart, the following was an example of the type of records found for ABC company.

 

The type of records found for ABC Company. ABC. ABC, Inc. ABC Company. ABC, New York. ABC, New York. A.B.C.

 

When all the data from the various records were added together, it showed that ABC company was, in fact, the largest customer. However, due to data issues, that analysis could not be seen on any metric, dashboard or report.

The BI solution worked properly but the underlying data prevented it from delivering accurate information for some of the metrics. If the customer learned of this oversight, it could lead to a very awkward conversation between the vendor and the customer or – worse – result in the customer changing vendors.

What kind of a problem was this and what are options for solving it?

The problem was related to Data Quality; the data was all there, but the problem was that the data was inconsistent between regions, sales people, and so forth. The data, being inconsistent, could not be added together to get a true picture of customer activity. Further analysis showed that the issue of inconsistent data was pervasive and, due to that fact, it rendered the largest and most profitable customer metrics meaningless.

Fixing the data in the mart would address the data quality issue for the sales team. They discussed how, as the data travelled through the ETL process, they could put transformation logic in place to fix the issue of having multiple variations of a company name. However, what if the company name was entered in a format that was outside the rules implemented to catch and change names to ensure consistent naming? They would constantly have to review the data for new inconsistencies and revise the transformation logic. 

What about other groups within the organisation? Marketing – are they sending the same customer multiple copies of promotions or new product information? Accounting – are they sending bills to the wrong address? Customer Service – are they unable to find the proper customer information because they didn’t use the correct company name for the person who is calling? Fixing the data issue for the sales group is good for the sales team but leaves the root problem untouched.

In this example, the data quality issue was generated from a single source system. Imagine how data problems can escalate when you have the same data in multiple systems. For example, what if the company name in a CRM system differs from the name for the same company in an accounting system. Add in a third, a fourth, or even more additional applications with the “same” data, when there is a difference in data between the systems, which is correct?

As you can see, data quality can be a significant problem for companies. Data quality problems can be addressed in one of three ways (at a high level) and each has times where it is a great choice and there are also cost and time factors that must be evaluated in deciding the best resolution for your organisation. The three high level methods of addressing data quality issues are:

  • As an exception. For example, we’ll fix it next week (or next month, etc.) and then all is good. This is the best approach for modifying data from a legacy system and bringing it into a warehouse.
  • As an after-the-fact solution. This includes addressing the data issues via ETL when the data is moved into a data warehouse. This is something I sometimes refer to as Report Data Quality Management as it addresses the issue for any reports coming from the warehouse, but not for anything done with data in the source system.
  • Addressing the problem at the root. In this instance, the source systems are altered to ensure the data always enters the system in the correct way, ensuring there is no inconsistency either within the application or between different applications.

It is important to recognise that Data Quality is generally not a ‘tool’ problem, but rather it is usually a business process issue. Tools can be used to help but addressing business process issues is often at the root of the problem. There can be times where poor Data Quality is enabled through an application but, most generally, it is a business process issue.

But is inconsistent company naming the only type of Data Quality problem? In a word, no: Not all quality problems are the same and not all have the same impact on your business.

 

The Five Categories of Data Quality Issues

In fact, there are five major categories of Data Quality issues. They are: Data Completeness, Data Consistency, Data Acuracy, Internal Integrity and External Integrity.

 

Data Completeness

Data completeness indicates that all requisite data items have been populated. Not every data item needs to be populated in every instance. For example, in a purchasing system, if a customer is paying by credit card, the credit card number is required but if they are paying with cash, that data item is not necessary. What we need to ensure is that all the requisite data items are populated for each record.

Data Completeness has a second aspect as well: suppose that a particular data item is deemed optional and so is entered into the system sporatically. As an example, perhaps email was an optional field for a customer but then there is a desire to do an ongoing email marketing program and for that, they need to have customer email addresses. Or perhaps there is a desire to analyse information in a new way such as by gender, another data item that was not required and therefore is sporadically populated. 

This is a quality issue that cannot be found though a few simple rules, as the necessity of a data item is sometimes not realised until after an application has been in place for years; it is not until analysis of the data is attempted that the lack of data in a certain area is discovered. Additionally, even when the data item is determined to be necessary for analysis, there may still be cases where null values are to be expected as there are often exceptions to general rules.

 

Data Consistency 

The problem described in the story above is a data consistency problem. This problem is most frequently found in cases where an application allows for free-form entry of text in regard to important informational items or can also arise when merging data from disparate systems (more on that in the External Integrity section). The absence of data consistency makes accurate analysis difficult. While correcting a data consistency issue can largely be resolved in an ETL process, it means that the information in the data warehouse does not accurately reflect the source data and so reports on the same data, one from the source and one from the warehouse, may not match due to the differences.

 

Data Accuracy 

This can be a difficult item to validate and correct. What do you do if the address provided by the customer is invalid or if the person entering the address makes a mistake? Such information inaccuracies will exist in the data and can be difficult to detect and correct. While data accuracy items do not impact analysis as much as consistency or completeness, they can still impact analysis. If an analysis is being done by city and the address has the wrong city, the analysis will be incorrect. Depending on the amount of data, the inaccuracy is often only found in a small number of records and so the impact is minimised.

 

Internal Integrity 

This problem refers to referential problems within the database; in other words, how well does the data fit together?  Perhaps a company is deleted from the system but the contacts for that company are not deleted; with the company gone, those contacts are now ‘orphans’ as they have no parent. Depending on the type of analysis being done, this can lead to analysis problems. This type of issue is often caused by either a problem within the source application or through ‘meddling’ with the data via SQL calls.

 

External Integrity

External Integrity is linked with data consistency. When identical data is merged between systems, it is hoped that the data between the systems will match. If merging data between sales and accounting, the sales database may have customer ‘ABC’ while the accounting database may have customer ‘ABC, Inc.’. Maintaining external integrity means ensuring data consistency across source applications.

As you can see, there are various ways in which data quality issues can impact reporting and analysis. In the next post in this series, I’ll discuss how Master Data Management can be used to help maintain the quality of organisational data.

Posted by: Mark Worthen, Senior Consultant, Enterprise Applications | 06 December 2012

Tags: CRM, Business Intelligence, Marketing, Sales, BI, Data, Data Accuracy, Data Completeness, Data Consistency, Data Governance, Data Integrity, Data Mart, Data Quality, Master Data Management


Blog archive

Stay up to date with all insights from the Intergen blog