How to Improve Customer Data Quality

Saumya Chaki, Information Management Newsletters, January 19, 2010

In any customer-centric business, be it hospitality, banking, retail or insurance, there are numerous touchpoints where the consumer interacts with the business. Many interactions take place between the consumer and the business through various direct and indirect channels: direct marketing campaigns (email, mailers, telemarketing, etc.); points of sale; information kiosks; online shopping portals; and feedback forms for services rendered.

During all these transactions or points of contact, consumer data is collected in varying ways. The trouble lies in the lack of a consistent framework for collecting consumer attributes. Most organizations collect data about the same consumer through multiple channels with no consistency in the attributes collected. Hence, when these organizations build data warehouses and data marts to study consumer behavior, they end up with a large number of duplicates in the consumer tables of the warehouse or mart. This can be disastrous for any business.

It can result in multiple mailers to the same consumer, or mailers to consumers who have opted out of direct marketing campaigns, leading to legal complications and loss of consumer loyalty. Any ROI analysis will yield skewed figures if consumer data is not consistent. A consistent, single view of consumer data across the enterprise is necessary to prevent such scenarios.

Consumer Deduplication Strategy

Data deduplication is the process of identifying duplicate consumer data in consumer-centric databases and taking corrective action to cleanse the duplicates from the data, while ensuring that no coherent, accurate and relevant data is lost in the process.
Follow these steps to formulate and implement a successful consumer deduplication strategy.

1. Understand Data Quality

Data quality issues include inconsistency in attributes, invalid data and duplicate records. It is recommended that data quality be enhanced and issues be resolved before a deduplication process is run. This ensures that the deduplication process runs on better quality data.
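
As a minimal illustration of such pre-deduplication checks, the Python sketch below flags missing and invalidly formatted attribute values before any matching is attempted; the records, field names and validation rules are hypothetical.

    import re

    # Hypothetical consumer records; field names are illustrative only.
    consumers = [
        {"id": 1, "name": "J. Smith", "email": "j.smith@example.com", "postcode": "700091"},
        {"id": 2, "name": "John Smith", "email": "not-an-email", "postcode": ""},
    ]

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def quality_issues(record):
        """Return a list of data quality problems found in one record."""
        issues = []
        for field in ("name", "email", "postcode"):
            if not record.get(field):
                issues.append(f"missing {field}")
        if record.get("email") and not EMAIL_RE.match(record["email"]):
            issues.append("invalid email format")
        return issues

    for rec in consumers:
        problems = quality_issues(rec)
        if problems:
            print(rec["id"], problems)   # resolve these before running deduplication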

2. Investigate Data and Data Quality Issues

Data investigation is important not only to determine the data quality issues but also to understand, through data profiling, the key attributes needed to define a consumer uniquely. The records in the data environment under investigation must be a representative sample of the data quality issues and deduplication scenarios in the production database. Data investigation can be done with tools or manually, using hand-written SQL queries. Automated tools expose data patterns more readily and are usually the preferred approach.
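
As a rough sketch of what such an investigation might look like, the following Python/pandas snippet profiles a sample extract for attribute fill rates, cardinality and candidate duplicates; the file name, column names and trial match key are assumptions for illustration.

    import pandas as pd

    # Load a representative sample of consumer records (file name and columns are
    # assumptions; substitute the organization's own extract).
    df = pd.read_csv("consumer_sample.csv")

    # Fill rate per attribute: attributes that are rarely populated are poor
    # candidates for match keys.
    print((df.notna().mean() * 100).round(1).sort_values(ascending=False))

    # Cardinality per attribute: very low-cardinality fields (e.g. gender) cannot
    # identify a consumer on their own but can support other match keys.
    print(df.nunique().sort_values(ascending=False))

    # Candidate duplicates on a trial key, e.g. name + postcode.
    key = ["name", "postcode"]
    dupes = df[df.duplicated(subset=key, keep=False)].sort_values(key)
    print(f"{len(dupes)} records share a name/postcode combination")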

3. Determine Match Rules and Criteria

Results of the data profiling exercise should be published, and the consumer attributes proposed for matching records must be understood and confirmed by business users of the system. This is important to ensure that the match criteria make business sense. Typically, matching is of three types: commercial matching, household matching and individual matching.

Commercial matching involves matching businesses or consumers belonging to business houses. Household matching involves matching consumers to households; often country-specific third-party data is used to do household or family matching. There are, however, some scenarios that need to be handled when dealing with third-party data:

  • Third-party data providers normally charge for each instance of consumer verification. This may turn out to be a costly, time-consuming exercise and is usually done once a month or at larger time intervals (like bimonthly or quarterly).
  • Third-party consumer data may not exactly match the consumer data that an organization builds up over a period of time. When no record is found in the third-party database corresponding to a consumer in the organization's database, a decision needs to be made on how these consumers will be matched.

Individual matching involves matching consumers belonging to the same household and is usually done after household matching. In some cases it may be useful to match on other consumer attributes such as number, name suffix and gender. Matching is usually performed by data cleansing tools.
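
The sketch below illustrates the idea of individual matching within an already-formed household group, using only the Python standard library. In practice a data cleansing tool would apply far richer standardization and matching rules; the records, attributes and similarity threshold here are purely illustrative.

    from difflib import SequenceMatcher
    from itertools import combinations

    def normalize(value):
        """Crude standardization; real cleansing tools apply much richer rules."""
        return " ".join(value.lower().replace(".", " ").split())

    def similar(a, b, threshold=0.85):
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

    # Hypothetical records already grouped into one household by a household key.
    household = [
        {"id": 101, "name": "Jonathan Doe", "gender": "M"},
        {"id": 102, "name": "Jonathon Doe", "gender": "M"},
        {"id": 103, "name": "Jane Doe",     "gender": "F"},
    ]

    # Individual matching: within a household, pair records whose names are close
    # and whose supporting attributes (here, gender) agree.
    for a, b in combinations(household, 2):
        if a["gender"] == b["gender"] and similar(a["name"], b["name"]):
            print(f"match candidate: {a['id']} <-> {b['id']}")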

4. Identify Survivorship Criteria

Now that records belonging to the same matched group have been identified, select a survivor record in each matched group. Survivorship criteria are a product of the initial data investigation/data profiling exercise. It is highly recommended that business users agree with the survivorship criteria, because identifying survivors based on attributes of limited business significance can be detrimental to the efficiency and quality of the deduplication process. The survivor should be the record that best satisfies these criteria. As consumer data is often highly sensitive, it is important to retain the best consumer data possible.
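
A minimal sketch of survivor selection follows. The scoring criteria (completeness first, then most recent update) and the field names are assumptions that, in a real project, would come out of profiling and business sign-off.

    from datetime import date

    # One matched group of records flagged as the same consumer.
    matched_group = [
        {"id": 7,  "name": "A. Kumar",   "email": "",                 "phone": "9830000000", "updated": date(2009, 3, 1)},
        {"id": 12, "name": "Arun Kumar", "email": "arun@example.com", "phone": "",           "updated": date(2009, 11, 15)},
    ]

    def survivorship_score(rec):
        completeness = sum(1 for f in ("name", "email", "phone") if rec[f])
        return (completeness, rec["updated"])   # most complete, then most recent

    survivor = max(matched_group, key=survivorship_score)
    duplicates = [r for r in matched_group if r["id"] != survivor["id"]]
    print("survivor:", survivor["id"], "duplicates:", [r["id"] for r in duplicates])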

5. Determine Merge Rules and Criteria

Once the significant problem of finding the survivor is resolved, recognize that some attributes of the records marked as duplicates may be more recent, more complete or of better quality. In such cases, it is necessary to merge these better attributes of the duplicate records into the survivor record. Again, this action needs to be performed on the basis of merge rules/criteria, which are also defined from data investigation and data profiling. For instance, a merge rule could be to update the survivor record's address field with the address field of the duplicate record that has the longest address value. Where there are date fields, it is necessary to retain the latest date (i.e., the one indicating the last change of address). It is highly recommended that these merge rules be certified by business users.
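
The following sketch applies the two example merge rules just described (longest address value, latest date) to a survivor/duplicate pair; the attribute names and values are illustrative.

    from datetime import date

    # Survivor and duplicate after matching; attribute names are illustrative.
    survivor  = {"id": 12, "address": "12 Park St",              "last_updated": date(2009, 5, 1)}
    duplicate = {"id": 7,  "address": "12 Park Street, Kolkata", "last_updated": date(2009, 11, 20)}

    def merge(survivor, duplicate):
        """Apply the merge rules described above to a survivor/duplicate pair."""
        merged = dict(survivor)
        # Rule 1: keep the longest (most complete) address value.
        if len(duplicate["address"]) > len(merged["address"]):
            merged["address"] = duplicate["address"]
        # Rule 2: keep the latest date, indicating the most recent change.
        merged["last_updated"] = max(merged["last_updated"], duplicate["last_updated"])
        return merged

    print(merge(survivor, duplicate))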

6. Maintain Survivor Duplicates Trail History

While a set of records may have been marked as duplicates and must be purged from the consumer-related tables in the warehouse, it is worthwhile to retain these deleted records in trail tables, which store the relationship of each duplicate record to its survivor record as well as the deleted record's attributes.
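
A minimal sketch of such a trail table, using an in-memory SQLite database as a stand-in for the warehouse, might look as follows; the table and column names are assumptions.

    import sqlite3
    from datetime import date

    conn = sqlite3.connect(":memory:")   # stand-in for the warehouse
    conn.execute("""
        CREATE TABLE consumer_dedupe_trail (
            duplicate_id    INTEGER,
            survivor_id     INTEGER,
            duplicate_attrs TEXT,     -- snapshot of the purged record's attributes
            purged_on       TEXT
        )
    """)

    # Before the duplicate is purged from the consumer table, preserve it here.
    conn.execute(
        "INSERT INTO consumer_dedupe_trail VALUES (?, ?, ?, ?)",
        (7, 12, '{"name": "A. Kumar", "phone": "9830000000"}', date.today().isoformat()),
    )
    print(conn.execute("SELECT * FROM consumer_dedupe_trail").fetchall())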

7. Establish Match Rules Repository

It is highly recommended that all defined match rules be stored in the warehouse as part of a match rules master table. This table captures the relationship between duplicate records and the reason code for matching. The match rules master table becomes a one-stop shop where the current match rules can be analyzed.
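
A simple sketch of how such a repository could tie survivor-duplicate pairs to reason codes follows; the rule codes and descriptions are invented for illustration.

    # Sketch of a match rules master table (codes and descriptions are illustrative).
    match_rules = {
        "HH01": "household match on standardized address and surname",
        "IN01": "individual match on fuzzy name within a household",
        "CM01": "commercial match on registered business name",
    }

    # Each survivor-duplicate relationship carries the reason code that produced it,
    # so the current rules and their impact can be analyzed from one place.
    match_results = [
        {"duplicate_id": 7,  "survivor_id": 12, "reason_code": "IN01"},
        {"duplicate_id": 33, "survivor_id": 12, "reason_code": "HH01"},
    ]

    for m in match_results:
        print(m["duplicate_id"], "->", m["survivor_id"], ":", match_rules[m["reason_code"]])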

8. Establish a Reconciliation Report and Measurement for Efficiency of Deduplication

Reconciliation reports are highly recommended in consumer deduplication scenarios. These reports give vital information about the dedupe process, such as the number of records marked as duplicates in a given iteration of matching. Details about the survivor-duplicate relationships can also be determined from these reports. The overall efficiency of the dedupe process can be measured as the percentage of duplicates removed by the process. For example, if 15 percent of the records in the warehouse were duplicates prior to running the deduplication process and the duplicates were reduced to 10 percent afterward, then the efficiency of the deduplication process would be 33 percent, as shown by the formula:

Efficiency (%) = (% of duplicates before running dedupe – % of duplicates after running dedupe) × 100 / (% of duplicates before running dedupe)

The ROI on the dedupe process can be similarly measured by finding how much the direct marketing costs have gone down after dedupe. This can be derived using the following equation:

ROI (%) = (direct mail costs before running dedupe – direct mail costs after running dedupe) × 100 / (direct mail costs before running dedupe)
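
The two measures can be computed directly, as in the short sketch below; the cost figures passed to the ROI function are illustrative.

    def dedupe_efficiency(pct_dupes_before, pct_dupes_after):
        """Percentage of existing duplicates removed by the dedupe run."""
        return (pct_dupes_before - pct_dupes_after) * 100 / pct_dupes_before

    def dedupe_roi(mail_costs_before, mail_costs_after):
        """Percentage reduction in direct mail costs after the dedupe run."""
        return (mail_costs_before - mail_costs_after) * 100 / mail_costs_before

    print(round(dedupe_efficiency(15, 10)))    # 33, matching the example above
    print(round(dedupe_roi(120000, 102000)))   # 15 (cost figures are illustrative)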

Deduplication in Enterprises – A Status Check

The previous sections laid out the deduplication strategy itself; this strategy can be applied in the data warehouse/data mart or even in the source systems. Implementing deduplication in the source systems should be seriously considered as a viable option for two primary reasons:

  • Integration of source systems with data cleansing systems. For instance, some software integrates with transaction systems and customer relationship management systems, giving organizations in different business domains considerable flexibility to cleanse their data at the source layer.
  • Organizations have realized the power of clean data. Through an industry-wide effort, they have come to understand the benefit of clean data in the source systems and in downstream data warehouses and data marts, and many are now ready to invest in data cleansing at the source layer. The benefits are manifold: if data is cleaned in the source system, any number of downstream systems benefit from better quality data. If deduplication were done only at the warehouse level, the source would still contain duplicates, and every application that depends on the source would have to cleanse the data separately.

There are, however, issues to consider when implementing deduplication in the source systems. It is often a complex process and may affect data availability timelines. Hence, where transaction systems have high performance benchmarks, it may be more practical to do the deduplication in the data warehouse or other downstream systems.

For real-time data warehouses, it may be a better strategy to implement the data cleansing in the warehouse, because data needs to flow from the transaction system into the warehouse in real time or near real time.

Deduplication Across Enterprise Solutions

Deduplication processes can be applied to the following enterprise areas to ensure that businesses get the best out of their data.

Supply chain management. Data quality issues are of key importance in managing supply chains because data quality affects a company's ability to support not only its own business processes but also those of its network with reliable, usable data. Modern supply chains typically involve interactions not only between the companies whose operations the supply chain runs but also partners, suppliers, retailers and warehouse operators. Good quality data supports vendor tracking, inventory management, customer invoicing, business analytics and overall business effectiveness based on more accurate and timely information. Deduplication helps improve data quality in supply chains by standardizing product information, unifying complex vendor views and improving contact information, increasing the efficiency of product delivery and service routing.

Enterprise resource planning. ERP systems are also data centric and are normally designed around key business processes like materials management, financial planning, human resources and inventory control. Hence, data quality is of paramount importance here, as these transaction systems are often the backbone of data flowing into downstream data warehouses and data marts.

Business intelligence/data warehousing. BI dashboards/portals and data warehouses are the backbone of decision support systems used by enterprises today. Hence, the quality of data in these warehouses and marts is of vital importance for accurate business information and business strategy. Enterprises are increasingly trying to integrate data across multiple source systems to get a consistent view of business process data, and deduplication plays a vital role in achieving this goal.

Customer relationship management. As mentioned, customer data is sensitive and holds key information about customer preferences. This makes it imperative for companies to enhance the quality of customer data to build better relationships with customers. The key to more efficient CRM is a unified view of the customer across the enterprise.

Regulatory compliance. Local and global legislation around financial asset control and privacy supervision is forcing enterprises to reexamine the accuracy and reliability of information. Traditionally, enterprises have tackled compliance projects serially, reacting to regulations with new IT initiatives. However, increased regulatory requirements such as Basel II and Sarbanes-Oxley demand a reconsideration of that strategy. Multiple data integration efforts can leverage common metadata and business rules so that better data benefits all business processes. In this manner, compliance moves from being a business cost to being a competitive advantage.

Saumya Chaki is a principal consultant with the business intelligence practice at PwC India. He has 12 years experience in IT consulting specializing in DW and BI strategy. His interest areas include data architecture, data governance and data quality, MDM, business intelligence ROI and Web semantics.
