QA: I’ve heard Hadoop being mentioned for storage, distributed jobs, queries and data analytics… So what is Hadoop and what is it not?

By: Bart Baesens, Seppe vanden Broucke

This QA first appeared in Data Science Briefings, the DataMiningApps newsletter, as a “Free Tweet Consulting Experience” — where we answer a data science or analytics question of 140 characters maximum. Want to submit your own question? Just Tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to date on data science news, or follow us @DataMiningApps.

You asked: I’ve heard Hadoop being mentioned for storage, distributed jobs, queries and data analytics… So what is Hadoop and what is it not?

Our answer:

A good question indeed. We all know of Hadoop and hear it being mentioned every time a team is talking about some big daunting task related to the analysis or management of big data. Have lots of volume? Hadoop! Unstructured data? Hadoop! Streaming data? Hadoop! Want to run super-fast machine learning in parallel? You guessed it… Hadoop!

Hadoop started as part of a search engine called Nutch, which was being worked on by Doug Cutting and Mike Cafarella. In 2006, Cutting joined Yahoo! to work in its search engine division. The part of Nutch that dealt with distributed computing and processing (initially built to parse enormous numbers of web links simultaneously and efficiently) was split off and renamed “Hadoop”. In 2008, Yahoo! open-sourced Hadoop, and today Hadoop is part of an ecosystem of technologies managed by the non-profit Apache Software Foundation.

Even when talking about “raw” Hadoop, it is important to know that the term describes a stack containing (as of this moment) four core modules: Hadoop Common (a set of shared libraries), the Hadoop Distributed File System or HDFS (a Java-based file system to store data across multiple machines), MapReduce (a programming model to process large data sets in parallel) and YARN (a framework to schedule and handle resource requests in a distributed environment).
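To make the MapReduce programming model concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python. This only imitates the map → shuffle → reduce flow; real Hadoop jobs are written against the Java MapReduce API (or via Hadoop Streaming) and run distributed across a cluster.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce model, NOT Hadoop's actual API.

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts per word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big clusters", "big data analytics"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle_phase(pairs))
# result == {"big": 3, "data": 2, "clusters": 1, "analytics": 1}
```

The point of the model is that the map and reduce functions are trivially parallelizable: the framework can run the map phase on many machines at once and route each key group to a reducer, which is what Hadoop does at scale.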

In itself, the core Hadoop stack is actually pretty bare-bones. It provides you with a distributed file system and a framework to construct parallelized programs — definitely not a turn-key solution for most environments. As such, many implementations and vendors also mix in a number of additional projects such as:

  • Pig: a framework to manipulate data stored in HDFS without having to write complex MapReduce programs from scratch;
  • Hive: a data warehouse solution with SQL-like query capabilities to present data in the form of tables;
  • HBase: a distributed database which runs on top of the Hadoop core stack;
  • Cassandra: another distributed database;
  • Ambari: a web interface for managing Hadoop stacks;
  • Flume: a framework to collect and deal with streaming data intakes;
  • Oozie: a job scheduler;
  • Zookeeper: a framework to coordinate distributed processes;
  • Sqoop: a connector to move data between Hadoop and relational databases;
  • Spark: a computing framework geared towards data analytics.
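To give a feel for what a project like Hive adds, the sketch below uses Python’s built-in sqlite3 module to show the *kind* of SQL-style aggregation Hive lets you run over tabular data, instead of hand-writing a MapReduce job. To be clear, this is not Hive’s actual interface — Hive queries data stored in HDFS using HiveQL, which differs from SQLite’s SQL dialect — it only illustrates the concept.

```python
import sqlite3

# Conceptual analogue of a Hive-style query; NOT Hive's real API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/about")],
)

# An aggregation of the sort Hive expresses in HiveQL: page views per page.
rows = conn.execute(
    "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
# rows == [("/about", 1), ("/home", 2)]
```

Behind the scenes, Hive compiles such a query into distributed jobs over the cluster — which is precisely why it is attractive to analysts who know SQL but not MapReduce.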

In many cases, when big vendors (Cloudera, IBM, Oracle) talk about their Hadoop-based solution, they typically describe a combination of the projects listed above, modified to suit particular needs, extended with proprietary modules, and oftentimes packaged together with easier-to-use web interfaces to handle management, authentication, and monitoring.

What is clear from the above is that whenever Hadoop is mentioned, all stakeholders should be aware of exactly which modules will be included, since these directly determine the usability and features of the stack in your environment. In a typical data-science-oriented setup, for instance, what matters most is interoperability with relational databases, SQL-like querying, and the availability of machine learning algorithms. Finally, from a data science perspective, it is good to know that many companies provide additional libraries and implementations on top of the modules mentioned above. The current hype leans towards a Spark-based stack (Hadoop with Hive, HBase and Spark) with additional libraries such as H2O, which can run on top of such clusters.
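Part of Spark’s appeal for analytics is its functional style of chained transformations over distributed datasets. The snippet below is a rough single-machine analogue in plain Python, purely to illustrate that programming style — in actual PySpark you would write something like `sc.textFile("hdfs://...").map(...).filter(...).count()`, executed lazily across the cluster.

```python
# Single-machine imitation of Spark's chained-transformation style.
# This illustrates the programming model only, not Spark's distributed
# (and lazy) execution.

log_lines = [
    "INFO starting job",
    "ERROR disk failure",
    "INFO job done",
    "ERROR network timeout",
]

# filter -> map pipeline: keep error lines, extract the failing component.
errors = list(filter(lambda line: line.startswith("ERROR"), log_lines))
error_words = list(map(lambda line: line.split()[1], errors))
# error_words == ["disk", "network"]
```

Because Spark keeps intermediate results in memory rather than writing them to disk between stages (as classic MapReduce does), iterative workloads such as machine learning tend to run considerably faster on it.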

The key thing to keep in mind, however, is to make a careful analysis of whether you really need a Hadoop-based environment. Dealing with distributed analytics can be tricky at first, and in some cases it can make more sense to start with an in-memory setup. Machines offering up to 6 TB of memory exist, are relatively affordable (compared to setting up a Hadoop cluster), and allow you to perform advanced analyses with the same or similar tooling as you’re already using.

P.S.: Don’t forget to submit your suggestions for questions!