Beginner's Guide to Understanding Hadoop and Spark in Data Science

By POOJA BISHT | Mar 25, 2019

With vast amounts of data produced every day and the arrival of the concept of Big Data, businesses soon recognized the need to manage and analyze it. They collect, clean, process and analyze this data continuously to improve their results. The applications of big data analysis in healthcare, finance, recommender systems, customer satisfaction and other areas are already well known (if they are new to you, please refer to the previous articles on Data Science, Data Analytics, and Big Data). In this article, we discuss two software frameworks that are used by many of the top businesses today and have gained much popularity in recent times.

We will discuss Apache Hadoop and Apache Spark and compare the two, with the intention of giving you a fair idea of both. This will take you a step ahead in the field of Data Science, which is generating a lot of buzz these days.

  • Both Hadoop and Spark are used by businesses today to process big data. Big data refers to the large amount of data created at every moment by your online purchases, your searches, social networking sites, and any other activity in the digital world.

  • Both Apache Spark and Hadoop are open source software frameworks: their source code is freely available to everyone, and the only costs are the infrastructure costs of running them on whatever hardware or platform you choose. Hadoop processes data in parallel across a cluster of computers by distributing files across the nodes in the cluster.

  • Hadoop includes the Hadoop Distributed File System (HDFS), which stores data in a distributed way. Spark has no storage system of its own, and for that reason it is often used with Hadoop's HDFS or with another storage service such as cloud storage.

  • Processing is considerably slower in Hadoop than in Spark. "The MapReduce workflow looks like this: Read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc." (Kirk Borne, Principal Data Scientist at Booz Allen Hamilton). This is not the case with Spark, which offers in-memory cluster computing: intermediate results can stay in memory instead of being written back to the cluster after every operation, so processing is faster than in Hadoop.

  • Apart from HDFS, Hadoop has its MapReduce programming model for processing large datasets.

  • Hadoop is well suited to batch processing (work that runs with minimal human interaction and not in real time), while Spark is efficient at handling real-time data. Spark therefore stands apart from Hadoop in this respect.

  • In terms of cost, setting up a Spark system is more expensive than setting up Hadoop, largely because in-memory processing needs machines with plenty of RAM.
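
To make the MapReduce programming model mentioned above a little more concrete, here is a minimal sketch of a word count written in plain Python. It runs on a single machine and does not use the Hadoop or Spark APIs at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are made up for illustration. It only shows the three logical steps that a MapReduce job goes through.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases one after another (hypothetical helper)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Illustrative input only, not real data.
    sample = [
        "big data needs big tools",
        "spark and hadoop process big data",
    ]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel on many nodes, and the shuffle step would move data between them. In the workflow quoted above, the results are also written back to the cluster between operations, which this single-machine sketch does not model.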

Source: HOB