Contributed by: Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens
This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at firstname.lastname@example.org and let’s get in touch!
This article is based on our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.
Many organizations nowadays are struggling with finding the appropriate data stores for their data. In this article, we zoom in on some key data structures to facilitate corporate decision making by means of business intelligence. More specifically, we discuss data warehouses, data marts, operational data stores and data lakes and indicate their differences and similarities.
A data warehouse was first formally defined by Bill Inmon as :
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
Subject-oriented implies that the data is organized around subjects such as customers, products, sales, etc. The data warehouse is integrated in the sense that it integrates data from a variety of operational sources and a variety of formats such as RDBMSs, legacy DBMSs, flat files, etc. Time variant refers to the fact that the data warehouse essentially stores a time series of periodic snapshots. Operational data is always up-to-date and represents the most recent state of the data elements, whereas a data warehouse is not necessarily up-to-date but represents the state at some specific moment(s) in time. Non-volatile implies that the data is primarily read-only, and will thus not be frequently updated or deleted over time. Hence, the two most important types of data manipulation operations for a data warehouse are: data loading and data retrieval.
A data mart is a scaled down version of a data warehouse aimed at meeting the information needs of a homogeneous small group of end-users such as a department or business unit (e.g., marketing, finance, logistics, HR, etc.). It typically contains some form of aggregated data and is used as the primary source for report generation and analysis by this end-user group. There are various reasons for setting up data marts. First of all, they provide focused content such as finance, sales or accounting information, in a format tailored to the user group at hand. They also improve query performance by offloading complex queries, and therefore workload, from other data sources (e.g., a data warehouse). Data marts can be located closer to the end-users, hereby alleviating heavy network traffic and giving them more control. Finally, certain reporting tools assume predefined data structures which can be provided by a customized data mart. In order to denote the contrast with a data mart, a full blown data warehouse is often called an enterprise data warehouse (EDW) to emphasize the organization wide aspect.
An Operational Data Store (ODS) is another way of dealing with the disadvantage of data warehouses not containing up to date data. An ODS can be considered as a staging area that provides query facilities. A normal staging area is only meant for receiving the operational data from the transactional sources for the sake of transforming the data and loading it into the data warehouse. An ODS also offers this functionality, but in addition it can be queried directly. In this way, analysis tools that need data that is closer to real time, can query the ODS data as it is received from the respective source systems, before time consuming transformation and loading operations. The ODS then only provides access to the current, fine grained and non-aggregated data, which can be queried in an integrated manner without burdening the transactional systems. However, more complex analyses requiring high-volume historical and/or aggregated data are still conducted on the actual data warehouse.
|Data Warehouse||Data Lake|
|Transformation||Before entering the DW||Before analysis|
|Users||Decision makers||Data Scientists|
To summarize, in this article we highlighted the differences and similarities between data warehouses, data marts, operational data stores and data lakes. All these data structures clearly serve different purposes and user profiles and it is important to be aware of their differences in order to make the right investment decisions.
 W.H. Inmon, Building the Data Warehouse, 2nd edition, John Wiley & Sons, New York, 1996.
For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.
Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail (just reply to this one) and let’s get in touch!