Data Integration versus Data Quality: Friends or Foes?

Contributed by: Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!


This article is based on our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.

Traditionally, applications and databases were organized around functional domains such as accounting, human resources, logistics, CRM, etc. Every department or business unit worked with its own data silo (e.g., a file or database) with no cross-department integration. Whereas the silos mainly aimed at operational support, a next phase saw the emergence of business intelligence (BI) and analytics applications, fueled by the need for data-driven tactical and strategic decision making with a company-wide impact. To sustain this company-wide view, data from the silos was transformed, integrated and consolidated into a company-wide data warehouse. ETL (Extract, Transform, Load) processes supported the asynchronous extraction and transfer of data from the source systems (the operational data silos) to the target data warehouse. However, the ETL process was typically time-consuming, so there was a certain latency between the up-to-date operational data stores and the slightly outdated data warehouse. This latency was tolerable: real-time business intelligence was not the goal in traditional data warehousing.
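To make this more concrete, here is a minimal sketch of such an ETL routine in Python. It assumes two hypothetical SQLite databases (operational.db as the source, warehouse.db as the target) and hypothetical customers / dim_customer tables; real ETL pipelines are of course far more elaborate.

```python
import sqlite3

# Minimal ETL sketch (hypothetical schemas): extract from an operational
# store, apply simple cleansing, and load into a warehouse table.

def extract(operational_db):
    """Pull raw customer rows from the operational source."""
    with sqlite3.connect(operational_db) as conn:
        return conn.execute(
            "SELECT customer_id, name, country, birth_date FROM customers"
        ).fetchall()

def transform(rows):
    """Cleanse the rows: trim and title-case names, normalize country codes."""
    cleaned = []
    for customer_id, name, country, birth_date in rows:
        cleaned.append((
            customer_id,
            name.strip().title() if name else None,
            (country or "").strip().upper(),
            birth_date,
        ))
    return cleaned

def load(warehouse_db, rows):
    """Insert the cleansed rows into the warehouse's customer dimension."""
    with sqlite3.connect(warehouse_db) as conn:
        conn.executemany(
            "INSERT OR REPLACE INTO dim_customer "
            "(customer_id, name, country, birth_date) VALUES (?, ?, ?, ?)",
            rows,
        )

if __name__ == "__main__":
    # Asynchronous, batch-style transfer from source silo to warehouse.
    load("warehouse.db", transform(extract("operational.db")))
```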

Because of this evolution, we were confronted for nearly two decades with a dual data storage and processing landscape, supported by two very distinct scenes of tool vendors and products. Nowadays, we see a convergence of the operational and tactical/strategic data needs, and of the corresponding data integration tooling. This trend was initiated by new marketing practices centered on proactive (instead of reactive) actions requiring a complete understanding of the customer, and it quickly spread to other functional domains. It culminates in the term operational BI, which has a twofold meaning. First, analytics techniques are increasingly used at the operational level as well, by front-line employees. Second, analytics for tactical/strategic decision making increasingly uses real-time operational data combined with the aggregated and historical data found in more traditional data warehouses. This evolution poses interesting challenges to the landscape of data storage and data integration solutions.

Data integration aims at providing a unified view and/or unified access over different, and possibly distributed, data sources. The data itself may be heterogeneous and reside in different resources (XML files, legacy systems, relational databases, etc.). The desired extent of data integration will depend heavily on the required Quality of Service characteristics. Data will never be of perfect quality, so a certain level of inaccurate, incomplete or inconsistent data may have to be tolerated for operational BI to succeed. Different data integration patterns exist to provide this unified view:

  • Data consolidation aims at capturing data from multiple, heterogeneous source systems and integrating it into a single persistent store (e.g., a data warehouse, data mart or data lake). This is typically accomplished using ETL routines, as in the sketch above.
  • Data federation typically follows a pull approach, where data is pulled from the underlying source systems on demand. Enterprise Information Integration (EII) is an example of a data federation technology and can be implemented by realizing a virtual business view on the dispersed underlying data sources (a minimal sketch of this idea follows the list).
  • Data propagation corresponds to the synchronous or asynchronous propagation of updates or, more generally, events in a source system to a target system. It can be applied in the interaction between two applications (Enterprise Application Integration (EAI)) or in the synchronization between two data stores (Enterprise Data Replication (EDR)).
  • Changed Data Capture (CDC) can detect update events in the source data store and trigger the ETL process based on these updates. In this way, a ‘push’ model for ETL is supported.
  • Data virtualization builds upon the basic data integration patterns discussed previously, but isolates applications and users from the actual integration patterns used.
  • Data as a Service (DaaS) offers data services as part of the overall Service Oriented Architecture (SOA) where business processes are supported by a set of loosely coupled software services.
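As an illustration of the federation/virtualization idea, the sketch below exposes a single, virtual "customer" view over two hypothetical heterogeneous sources (a relational table and a CSV export). The file, table and column names are assumptions; the point is that data is pulled on demand from the sources rather than copied into a consolidated store.

```python
import csv
import sqlite3

# Minimal data-federation sketch (assumed file/table names): a virtual,
# on-demand "customer" view over two heterogeneous sources -- a relational
# table and a flat-file export -- without consolidating the data.

def customers_from_database(db_path):
    """Pull customer records from a relational source at query time."""
    with sqlite3.connect(db_path) as conn:
        for customer_id, name, city in conn.execute(
            "SELECT customer_id, name, city FROM customers"
        ):
            yield {"id": str(customer_id), "name": name, "city": city}

def customers_from_csv(csv_path):
    """Pull customer records from a legacy CSV export at query time."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {"id": row["cust_id"], "name": row["full_name"],
                   "city": row["city"]}

def unified_customer_view(db_path, csv_path):
    """The federated (virtual) business view: data stays in the sources."""
    yield from customers_from_database(db_path)
    yield from customers_from_csv(csv_path)

# Usage: applications iterate over the virtual view as if it were one data set.
# for customer in unified_customer_view("crm.db", "legacy_export.csv"):
#     print(customer)
```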

Data integration is also closely related to data quality. Data quality can be defined as “fitness for use,” meaning that the required level of data quality depends on the context or task at hand. Data quality is a multi-dimensional concept involving various aspects or criteria by which to assess the quality of a data set or individual data record. The following data quality dimensions are typically highlighted as being important (a small sketch of corresponding checks follows the list):

  • Data accuracy: referring to whether the data values stored are correct (e.g., the name of the customer should be spelled correctly)
  • Data completeness: referring to whether both metadata and values are represented to the degree required and are not missing (e.g., a birth date should be filled out for each customer)
  • Data consistency: relating to consistency between redundant or duplicate values, and consistency amongst different data elements referring to the same or a related concept (e.g., the name of a city and postal code should be consistent)
  • Data accessibility: reflecting how easily the data can be retrieved
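As a purely illustrative example of what checks along these dimensions might look like, the sketch below validates a single customer record for completeness, accuracy and consistency. The field names and the postal-code reference table are assumptions.

```python
import datetime
import re

# Hypothetical reference data: postal code -> city.
POSTAL_CODE_TO_CITY = {"3000": "Leuven", "9000": "Ghent"}

MANDATORY_FIELDS = ("customer_id", "name", "birth_date", "city", "postal_code")

def check_quality(customer):
    """Return a list of data quality issues found in one customer record."""
    issues = []

    # Completeness: every mandatory field must be present and non-empty.
    for field in MANDATORY_FIELDS:
        if not customer.get(field):
            issues.append(f"completeness: missing value for '{field}'")

    # Accuracy (plausibility): birth date must be a valid, non-future date.
    try:
        birth = datetime.date.fromisoformat(customer.get("birth_date", ""))
        if birth > datetime.date.today():
            issues.append("accuracy: birth_date lies in the future")
    except ValueError:
        issues.append("accuracy: birth_date is not a valid ISO date")

    # Accuracy (syntax): names should not contain digits.
    if re.search(r"\d", customer.get("name", "")):
        issues.append("accuracy: name contains digits")

    # Consistency: city and postal code must agree with the reference data.
    expected_city = POSTAL_CODE_TO_CITY.get(customer.get("postal_code"))
    if expected_city and expected_city != customer.get("city"):
        issues.append("consistency: city does not match postal_code")

    return issues

# Usage: this record triggers both an accuracy and a consistency issue.
print(check_quality({"customer_id": "42", "name": "Ann Peeters",
                     "birth_date": "1985-02-30", "city": "Leuven",
                     "postal_code": "9000"}))
```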

Approached from this angle, it is important to mention that data integration can aid in improving data quality, but it might also hamper it. Data consolidation and ETL allow various transformation and cleansing operations to be performed, so the consolidated view of the data should be of higher quality, but one might – appropriately so – wonder whether it wouldn’t be better to invest in data quality improvements at the source. The same remark holds for environments where, over time, different integration approaches have been combined, leading to a jungle of legacy and newer systems and databases that now all must be maintained and integrated with one another. This is a key challenge for many organizations and one that is indeed very difficult to solve. In these settings, master data management is frequently mentioned as a management initiative to counteract these quality-related issues.

Master data management (MDM) comprises a series of processes, policies, standards, and tools to help organizations define and provide a single point of reference for all data that is “mastered.” Its key concern is to provide a trusted, single version of the truth on which to base decisions, so as to ensure that organizations do not use multiple, potentially inconsistent versions of the same concept in different parts of their operations. The focus is on unifying company-wide reference data types such as customers and products. This might seem straightforward, but imagine a large bank where one department uses an operational customer database for its day-to-day interactions, while the marketing department sets up a campaign by selecting leads from a data warehouse using a BI tool, which lags behind the operational view. A customer who has just taken out a mortgage at the bank might receive a mortgage solicitation a week later, as the customer information used by the marketing department lacks fast or solid integration with the operational customer systems. Putting a data federation or virtualization solution on top can help, but converting all departments and applications to go through this newer layer can take years, let alone coming up with a clear map of the current data and systems landscape and architecture. Modern information systems can be very complicated and entangled constructs, which underscores the necessity of master data management.

Setting up a master data management initiative involves many steps and tools, including data source identification, mapping out the systems architecture, constructing data transformation, cleansing and normalization rules, providing data storage capabilities, monitoring and governance facilities, and so on. Another key element is a centrally governed data model and metadata repository. Perhaps surprisingly, many vendor “solutions” for setting up an MDM initiative look very similar to data integration solutions, i.e., data consolidation, federation, propagation or virtualization techniques. These integration approaches can indeed serve as a means to achieve maturity in master data management. Note, however, that this assumes these solutions are used to set up a trusted, single version of the truth on which decisions are based, and that no other representation of the data is used anywhere in the organization. The challenge hence lies in the execution: following these core principles is a daunting task, and integration officers must avoid adding yet another few strands of “spaghetti systems” and cross-links between data stores, causing the master data repository to become yet another half-integrated data silo.
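To make the “single version of the truth” a bit more tangible, the following sketch illustrates one small MDM building block: consolidating duplicate customer records from different source systems into a golden record, using a naive matching rule and a simple survivorship (source precedence) rule. The matching key, source names and precedence order are assumptions; real MDM tooling uses far more sophisticated matching, stewardship and governance.

```python
from collections import defaultdict

# Minimal master-data sketch (illustrative only): build one "golden record"
# per customer by matching records on a simple key and preferring values
# from the most trusted source per attribute.

SOURCE_PRECEDENCE = ["crm", "billing", "legacy"]  # assumed trust order

def match_key(record):
    """Naive matching rule: normalized name + birth date."""
    return (record["name"].strip().lower(), record["birth_date"])

def golden_records(records):
    """Consolidate source records into one golden record per customer."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    consolidated = []
    for group in groups.values():
        # Survivorship: per attribute, take the first non-empty value,
        # visiting sources in order of trust.
        group.sort(key=lambda r: SOURCE_PRECEDENCE.index(r["source"]))
        golden = {}
        for record in group:
            for field, value in record.items():
                if field != "source" and value and field not in golden:
                    golden[field] = value
        consolidated.append(golden)
    return consolidated

# Usage: two source records describing the same customer are merged.
print(golden_records([
    {"source": "legacy", "name": "ann peeters", "birth_date": "1985-02-13",
     "email": "ann@example.com", "phone": ""},
    {"source": "crm", "name": "Ann Peeters", "birth_date": "1985-02-13",
     "email": "", "phone": "+32 470 00 00 00"},
]))
```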

For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.