Contributed by: Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens
This article first appeared in Data Science Briefings, the DataMiningApps newsletter.
This article is based on our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.
Data management entails the proper management of data as well as the corresponding data definitions or metadata. It aims at ensuring that (meta)data is of good quality and thus a key resource for effective and efficient managerial decision making. Data quality (DQ) is often defined as ‘fitness for use,’ which implies the relative nature of the concept. Data that is of acceptable quality in one decision context may be perceived to be of poor quality in another decision context, even by the same business user. For instance, the degree of completeness that accounting tasks demand of the data may not be needed for analytical sales prediction tasks. Data quality determines the intrinsic value of the data to the business. Information technology only serves as a magnifier for this intrinsic value. Hence, high-quality data combined with effective technology is a great asset, but poor-quality data combined with effective technology is an equally great liability. This is sometimes also referred to as the GIGO, or Garbage In, Garbage Out, principle, stating that bad data results in bad decisions, even with the best technology available. Decisions made based on useless data have cost companies billions of dollars. A popular example of this is the address of a customer. It is estimated that approximately 10% of customers change their address on a yearly basis. Obsolete customer addresses can have substantial consequences for mail order companies, package delivery providers or government services.
Poor DQ impacts organizations in many ways. At the operational level, it reduces customer satisfaction, increases operational expenses, and lowers employee job satisfaction. Similarly, at the strategic level, it affects the quality of the decision-making process. The magnitude of DQ problems is continuously being exacerbated by the exponential increase in the size of databases. This certainly qualifies data quality management as one of the most important business challenges in today’s data-based economy.
Organizations are hiring for various data management-related job profiles to ensure high data quality and to transform data into actual business value. In what follows, we review the information architect, database designer, data owner, data steward, database administrator and data scientist. Depending on the size of the database and the company, multiple profiles may be merged into one job description.
The information architect (also called information analyst) is responsible for designing the conceptual data model, preferably in dialogue with the business users. He/she bridges the gap between the business processes and the IT environment and closely collaborates with the database designer, who may assist in choosing the type of conceptual data model (e.g. EER or UML) and the database modeling tool. A good conceptual data model is a key requirement for storing high-quality data in terms of data accuracy and data completeness.
The database designer translates the conceptual data model into a logical and internal data model. He/she also assists the application developers in defining the views of the external data model, thereby contributing to data security. To facilitate future maintenance of the database applications, the database designer should define company-wide uniform naming conventions when creating the various data models, which promotes data consistency.
Every data field in every database in the organization should be owned by a data owner, who has the authority to ultimately decide on the access to, and usage of, the data. The data owner could be the original producer of the data, one of its consumers, or a third party. The data owner should be able to fill in or update the field’s value, which implies that the data owner has knowledge about the meaning of the field and has access to the current correct value (e.g. by contacting a customer, by looking into a file, etc.). Data owners can be requested by data stewards (see below) to check or complete the value of a field, thereby correcting a data quality issue.
Data stewards are the DQ experts in charge of ensuring the quality of both the actual business data and the corresponding metadata. They assess DQ by performing extensive and regular data quality checks. These checks involve, amongst other evaluation steps, the application or calculation of data quality indicators and metrics for the most relevant DQ dimensions. Data stewards are also in charge of taking the initiative to act upon the results of these assessments. A first type of action to be taken is the application of corrective measures. However, data stewards are not in charge of correcting data themselves, as this is typically the responsibility of the data owner. The second type of action to be taken upon the results of the data quality assessment involves a deeper investigation into the root causes of the data quality issues that were detected. Understanding these causes may allow designing preventive measures that aim at eradicating data quality problems. Preventive measures may include modifications to the operational information systems from which the data originate (e.g., making fields mandatory, providing drop-down lists of possible values, rationalizing the interface, etc.). Also, values entered in the system may immediately be checked for validity against predefined integrity rules and the user may be requested to correct the data if these rules are violated. For instance, a corporate tax portal may require employees to be identified based upon their social security number, which can be checked in real time by contacting the social security number database. Implementing such preventive measures obviously requires the close involvement of the IT department in charge of the application. Overall, preventing erroneous data from entering the system is often more cost-efficient than correcting errors afterward. However, care should be taken not to slow down critical processes because of non-essential data quality issues in the input data.
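To make the two ideas in the previous paragraph more concrete, the following minimal Python sketch computes a simple completeness indicator per field and checks entered values against predefined integrity rules. The sample records, the field names, the list of allowed country codes and the deliberately simplistic e-mail pattern are all hypothetical and serve only as an illustration; real DQ metrics and rules depend on the organization and the decision context.

```python
import re

# Hypothetical customer records; None marks a missing value.
records = [
    {"name": "Alice", "email": "alice@example.org", "country": "BE"},
    {"name": "Bob", "email": None, "country": "Belgium"},
    {"name": "Carol", "email": "carol-at-example.org", "country": "NL"},
    {"name": None, "email": "dave@example.org", "country": "BE"},
]


def completeness(rows, field):
    """Share of rows in which `field` holds a non-empty value."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


# Illustrative integrity rules (assumptions, not a standard):
# the e-mail field is mandatory and must roughly look like an address,
# and the country must come from a fixed list, as a drop-down would enforce.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
ALLOWED_COUNTRIES = {"BE", "NL", "FR"}

RULES = {
    "email": lambda v: v is not None and EMAIL_PATTERN.fullmatch(v) is not None,
    "country": lambda v: v in ALLOWED_COUNTRIES,
}


def violations(row):
    """Names of the fields in `row` that break an integrity rule."""
    return [field for field, is_valid in RULES.items()
            if not is_valid(row.get(field))]


if __name__ == "__main__":
    for field in ("name", "email", "country"):
        print(field, completeness(records, field))
    for row in records:
        print(row["name"], violations(row))
```

In this sketch, a data steward could track the output of completeness over time as a DQ indicator, whereas violations would be called at data entry so that the user can be asked to correct the offending fields before the record is stored.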
The database administrator (DBA) is responsible for the implementation and monitoring of the database. Example activities include installing and upgrading the DBMS software, backup and recovery management, performance tuning and monitoring, memory management, replication management, and security and authorization. A DBA closely collaborates with network and system managers. He/she also interacts with database designers to reduce operational management costs and guarantee agreed-upon service levels (e.g. response times and throughput rates). The DBA can contribute to data availability and accessibility, two other key data quality dimensions.
The data scientist is a relatively new job profile within the context of data management. He/she is responsible for analyzing data using state-of-the-art analytical techniques to provide new insights into, e.g., customer behavior. A data scientist has a multidisciplinary profile combining ICT skills (e.g., programming) with quantitative modeling (e.g., statistics), business understanding, communication, and creativity. A good data scientist should possess sound programming skills in languages such as Java, R, Python or SAS. The programming language itself is not that important, as long as the data scientist is familiar with the basic concepts of programming and knows how to use these to automate repetitive tasks or perform specific routines. Obviously, a data scientist should have a thorough background in statistics, machine learning and/or quantitative modeling. Essentially, data science is a technical exercise. There is often a huge gap between the analytical models and business users. To bridge this gap, communication and visualization facilities are key. A data scientist should know how to represent analytical models, accompanying statistics and reports in user-friendly ways by using traffic-light approaches, OLAP (on-line analytical processing) facilities, if-then business rules, etc. A data scientist needs creativity on at least two levels. First, on a technical level, it is important to be creative with regard to data selection, data transformation and cleaning. The steps of the standard analytical process must be adapted to each specific application and the “right guess” can often make a big difference. Second, analytics is a fast-evolving field. New problems, technologies and corresponding challenges pop up on an ongoing basis. It is important that a data scientist keeps up with these new evolutions and technologies and has enough creativity to see how they can yield new business opportunities.
It is no surprise that such data scientists are hard to find in today’s job market. However, data scientists contribute to the generation of new data and/or insights, which could open up new strategic business opportunities.
To conclude, ensuring high-quality data is a multidisciplinary exercise combining various skills. In this article we reviewed the following data management job profiles from a data quality perspective: information architect, database designer, data owner, data steward, database administrator and data scientist.
For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.