Technical Content Writer, currently writing content for House of Bots.
Beginner's Guide to Understanding Hadoop and Spark in Data Science
- Hadoop and Spark are both used by businesses today to process big data. Big data refers to the large volume of data generated every moment by online purchases, searches, social networking sites, and any other activity in the digital world.
- Both Apache Spark and Hadoop are open-source software frameworks: their source code is freely available to everyone, and the only costs are for the infrastructure needed to run them on whatever hardware or platform you choose. Hadoop processes data in parallel across a cluster of computers by distributing files across the nodes of the cluster.
- Hadoop includes the Hadoop Distributed File System (HDFS), which provides storage in a distributed way. Spark has no storage layer of its own, and for that reason it is sometimes used together with Hadoop or a cloud service for storage.
- Processing is considerably slower in Hadoop than in Spark. As Kirk Borne, Principal Data Scientist at Booz Allen Hamilton, puts it: "The MapReduce workflow looks like this: Read data from the cluster, perform an operation, write results to the cluster, read updated data from the cluster, perform next operation, write next results to the cluster, etc." Spark avoids this cycle. It offers in-memory cluster computing, so the work is done in a single step, which makes processing faster in Spark than in Hadoop.
- Apart from HDFS, Hadoop has its own MapReduce programming model for processing large datasets.
- Hadoop is efficient for batch processing, that is, work that runs with minimal human interaction and not in real time, while Spark is efficient at handling real-time data. Spark therefore stands apart from Hadoop in this respect.
- In terms of cost, setting up a Spark system is more expensive than setting up Hadoop.
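
To make the workflow difference described above a little more concrete, here is a minimal sketch in plain Python. It does not use the Hadoop or Spark APIs. The function names (`map_phase`, `reduce_phase`, `run_disk_style`, `run_in_memory_style`) and the intermediate file name are made up for illustration, and a local temporary folder stands in for the cluster's storage. The first runner writes the intermediate result to storage and reads it back before the next operation, in the spirit of the MapReduce workflow quoted above. The second keeps everything in memory, which is the idea behind Spark's speed advantage.

```python
import json
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Reduce step: sum the counts for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_disk_style(lines, workdir):
    """Toy stand-in for a MapReduce-like workflow (not the Hadoop API).

    The result of the first operation is written to storage, then read
    back before the next operation runs.
    """
    path = os.path.join(workdir, "mapped.json")  # hypothetical file name
    with open(path, "w") as f:
        json.dump(map_phase(lines), f)   # write results to storage
    with open(path) as f:
        mapped = json.load(f)            # read the updated data back
    return reduce_phase(mapped)


def run_in_memory_style(lines):
    """Toy stand-in for in-memory processing (not the Spark API).

    The intermediate result is passed straight to the next operation.
    """
    return reduce_phase(map_phase(lines))


if __name__ == "__main__":
    sample = ["big data is big", "Spark and Hadoop process big data"]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_disk_style(sample, workdir))
    print(run_in_memory_style(sample))
```

Both runners return the same word counts; the only difference is the extra write and read between the two operations. On a real cluster that round trip happens after every operation and across many machines, which is where the time goes.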