Jyoti Nigania

Know the First Step of a Spark Program

Jul 5, 2018 | 471 Views

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.

History of Spark:
Spark was developed as a data processing framework at UC Berkeley's AMPLab by Matei Zaharia in 2009. In 2010 it became an open-source project under a Berkeley Software Distribution (BSD) license. In 2013 the project was donated to the Apache Software Foundation and the license was changed to Apache 2.0. In February 2014 Spark became an Apache top-level project, and in November of that year Databricks used it to set a world record in large-scale sorting. Today it exists as a next-generation framework for both real-time and batch processing.


Understand Batch vs. Real-Time Processing
The points below compare batch and real-time analytics in enterprise use cases (a short code sketch follows the comparison):

Batch Processing:
  • A large group of data or transactions is processed in a single run.
  • Jobs run without any manual intervention.
  • The entire data set is pre-selected and fed in using command-line parameters and scripts.
  • It is used to execute multiple operations, handle heavy data loads, and support reporting and offline data workflows.
Example: regular reports that support decision making.

Real-Time Processing:
  • Data is processed instantaneously, as soon as it is entered or a command is received.
  • It must respond within stringent time constraints.
Example: fraud detection.
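
As a preview of how the two models look in code, here is a minimal sketch contrasting a batch job with a real-time job using PySpark's DataFrame and Structured Streaming APIs. The file paths, socket source, and the "suspicious record" rule are hypothetical placeholders chosen for illustration, not part of the original article.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-realtime").getOrCreate()

# Batch processing: the entire pre-selected data set is read and
# processed in a single run (e.g. a daily sales report).
batch_df = spark.read.csv("/data/sales/2018-07-05.csv",   # hypothetical path
                          header=True, inferSchema=True)
daily_report = batch_df.groupBy("region").agg(F.sum("amount").alias("total_sales"))
daily_report.write.mode("overwrite").parquet("/reports/daily_sales")  # hypothetical path

# Real-time processing: records are handled as they arrive
# (e.g. flagging incoming transactions for fraud).
stream_df = (spark.readStream
             .format("socket")            # hypothetical source: a TCP text feed
             .option("host", "localhost")
             .option("port", 9999)
             .load())
alerts = stream_df.filter(F.length("value") > 100)   # placeholder "suspicious" rule
query = alerts.writeStream.format("console").start()
query.awaitTermination()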


The need for Spark arose from the limitations of MapReduce, the data processing framework in Hadoop. The following limitations of MapReduce gave rise to Spark:

  • Unsuitable for real-time processing: being batch oriented, it takes minutes to execute jobs, depending on the amount of data and the number of nodes in the cluster.
  • Unsuitable for trivial operations: for operations like filter and join, you might need to rewrite the jobs, which becomes complex because of the key-value pattern.
  • Unfit for large data on the network: although it works on the data-locality principle, it cannot handle data that requires a lot of shuffling over the network well.
  • Unsuitable for online transaction processing (OLTP): OLTP requires a large number of short transactions, whereas MapReduce is a batch-oriented framework.
  • Unfit for processing graphs: graph processing needs the Apache Giraph library, which adds complexity on top of MapReduce.
  • Unfit for iterative execution: being stateless in execution, MapReduce does not fit use cases like K-means that need iterative execution (see the sketch after this list).
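
To make the last point concrete, the sketch below is a hypothetical PySpark example (not from the article) that caches a small dataset in memory so an iterative loop can reread it cheaply on every pass; this is exactly what a stateless, disk-bound MapReduce job cannot do well. The data and the single-centre "K-means-like" loop are toy placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical dataset of 2-D points; in practice this would be loaded from storage.
points = sc.parallelize([(1.0, 2.0), (3.0, 4.0), (5.0, 0.5), (8.0, 8.0)])

# cache() keeps the RDD in memory, so each iteration rereads it from RAM
# instead of recomputing it or rereading it from disk on every pass.
points.cache()

center = (0.0, 0.0)
for i in range(10):  # toy iterative loop with a single centre
    sums = points.map(lambda p: (p[0], p[1], 1)) \
                 .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    center = (sums[0] / sums[2], sums[1] / sums[2])

print("final centre:", center)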

What is Spark?
As introduced above, Apache Spark is an open-source cluster-computing framework, originally developed at the University of California, Berkeley's AMPLab and later donated to the Apache Software Foundation, which has maintained it since. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance, and it addresses all the limitations of MapReduce listed above.
  • Spark is an open-source cluster-computing framework.
  • It is suitable for real-time processing, trivial operations, and processing large data on the network.
  • It provides up to 100 times faster performance for some applications with in-memory primitives, compared to the two-stage disk-based MapReduce paradigm of Hadoop.
  • It is also suitable for machine learning algorithms, as it allows programs to load and query data repeatedly (a minimal first program is sketched below).
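
Since the article is about the first step of a Spark program, here is a minimal word-count sketch in PySpark. It shows the usual first step of creating a SparkSession (and from it a SparkContext) and then transforming an RDD; the input path is a hypothetical placeholder.

from pyspark.sql import SparkSession

# The usual first step of any Spark program: create a SparkSession,
# which also gives access to the lower-level SparkContext.
spark = SparkSession.builder.appName("first-spark-program").getOrCreate()
sc = spark.sparkContext

# Classic word count on an RDD; the input path is a hypothetical placeholder.
counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(10):
    print(word, n)

spark.stop()

Saved as, say, word_count.py, such a script would typically be launched with spark-submit word_count.py, or the same lines can be typed directly into the pyspark shell.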

Components of Spark: 
Following are the components of Spark:
  1. Spark Core and RDDs
  2. Spark SQL
  3. Spark Streaming
  4. MLlib
  5. GraphX
Spark also supports various development languages such as Java, Scala, Python, and R.
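
As an illustration of how the higher-level components build on Spark Core, the sketch below uses the Spark SQL component through the DataFrame API of a SparkSession; the sample rows and column names are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: work with structured data as a DataFrame (made-up sample rows).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# The same data can be queried with SQL or with the DataFrame API,
# both of which execute on top of Spark Core.
spark.sql("SELECT name FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name").show()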

Advantages of Spark:
  • Speed: Spark extends the MapReduce model to support more types of computation, such as stream processing and interactive queries.
  • Combination: Spark covers workloads that previously required separate distributed systems, which makes it easy to combine different processing types and manage the tooling.
  • It contains various closely integrated components for distributing, scheduling and monitoring applications with many computational tasks.
  • It powers various higher-level components specialized for different workloads, such as machine learning or SQL.
  • It allows the same data to be accessed from the Python shell for ad-hoc analysis and from standalone batch applications (see the sketch after this list).
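
To illustrate the last point, the same transformation can be typed interactively into the pyspark shell or wrapped in a standalone script and submitted as a batch job. The file names and path below are hypothetical examples.

# Typed interactively in the pyspark shell, where `spark` already exists:
#   >>> spark.read.json("/data/events.json").groupBy("type").count().show()
#
# The same logic as a standalone batch application (hypothetical file
# events_report.py), launched with: spark-submit events_report.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-report").getOrCreate()
spark.read.json("/data/events.json").groupBy("type").count().show()  # hypothetical path
spark.stop()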
Hence, Spark is a very popular cluster-computing framework.





Source: HOB