Many companies are investing in data lakes these days. How do they differ from a data warehouse?

By: Bart Baesens, Seppe vanden Broucke

This QA first appeared in Data Science Briefings, the DataMiningApps newsletter.

You asked: Many companies are investing in data lakes these days. How do they differ from a data warehouse?

Our answer:

A key distinguishing property of a data lake is that it stores raw data in its native format, whether structured, semi-structured or unstructured. This makes data lakes suitable for the more exotic, high-volume data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs and sensor data. A data lake collects data from operational sources ‘as is’, often without knowing up front which analyses will be performed on it, or even whether the data will ever be analyzed at all. For this reason, no or only very limited transformations (formatting, cleansing, …) are performed on the data before it enters the data lake. Consequently, when data is drawn from the data lake, quite a bit of processing will typically be required before it is fit for analysis. The schema is determined only when the data is read (schema-on-read) rather than when it is loaded (schema-on-write), as is the case for a data warehouse.

Storage costs for data lakes are also relatively low, because most implementations build on open-source software that runs on low-cost commodity hardware. Since a data warehouse assumes a predefined structure, it is less agile than a data lake, which imposes no structure at load time. Data warehouses have also been around for much longer, so their security facilities tend to be more mature. Finally, in terms of users, a data warehouse is targeted at decision makers at middle and top management level, whereas a data lake requires a data scientist, a more specialized profile in terms of data handling and analysis.
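To make the schema-on-read versus schema-on-write distinction concrete, here is a minimal Python sketch using only the standard library. Everything in it is an illustrative assumption rather than part of the original answer: the sample clickstream events, the field names (`user_id`, `page`, `ms`), the `page_views` table, and the use of an in-memory SQLite database as a stand-in for a warehouse.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (the "lake").
# Nothing is validated or reshaped at load time.
raw_lake = [
    '{"user_id": "u1", "page": "/home", "ms": "120"}',           # ms is a string
    '{"user_id": "u2", "page": "/cart"}',                        # ms is missing
    '{"user_id": "u3", "page": "/home", "ms": 95, "ref": "ad"}',  # extra field
]


def read_with_schema(lines):
    """Schema-on-read: structure and types are imposed only when reading."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user_id": str(rec["user_id"]),
            "page": str(rec["page"]),
            "ms": int(rec["ms"]) if "ms" in rec else None,
        }


def load_warehouse(records):
    """Schema-on-write: the table layout is fixed before any row is loaded."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE page_views ("
        "user_id TEXT NOT NULL, page TEXT NOT NULL, ms INTEGER NOT NULL)"
    )
    rejected = 0
    for rec in records:
        try:
            con.execute(
                "INSERT INTO page_views VALUES (:user_id, :page, :ms)", rec
            )
        except sqlite3.IntegrityError:
            # The row violates the predefined schema and never enters the table.
            rejected += 1
    return con, rejected


parsed = list(read_with_schema(raw_lake))
con, rejected = load_warehouse(parsed)
loaded = con.execute("SELECT COUNT(*) FROM page_views").fetchone()[0]
print(f"lake records: {len(raw_lake)}, warehouse rows: {loaded}, rejected: {rejected}")
```

In this sketch the lake accepts all three events regardless of shape, and the cleaning work (type casts, dropping the unexpected `ref` field) happens only in `read_with_schema`. The warehouse table, by contrast, refuses the record with the missing `ms` value at load time, which shows the trade-off between agility and guaranteed structure.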