Introduction to Spark
- A large group of data or transactions is processed in a single run.
- Jobs run without any manual intervention.
- The entire data set is pre-selected and fed in using command-line parameters and scripts.
- It is used to execute multiple operations and to handle heavy data loads, reporting, and offline data workflows.
- Data processing takes place instantaneously upon data entry or command receipt.
- It must respond within stringent time constraints.
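The difference between the two processing styles described above can be sketched in plain Python. This is only an illustration: the names `process_batch`, `RunningTotal`, and `on_record`, and the sample transaction amounts, are assumptions made for this example and do not belong to any framework.

```python
from typing import Iterable, List


def process_batch(records: Iterable[int]) -> int:
    """Batch style: the whole pre-selected data set is processed in one run."""
    return sum(records)


class RunningTotal:
    """Real-time style: each record is handled the moment it arrives."""

    def __init__(self) -> None:
        self.total = 0

    def on_record(self, value: int) -> int:
        # Update the state and answer immediately for this single record.
        self.total += value
        return self.total


# Hypothetical transaction amounts used by both styles.
transactions: List[int] = [120, 35, 250, 75]

# Batch: one answer, available only after the full run.
batch_total = process_batch(transactions)

# Real time: an up-to-date answer after every incoming record.
stream = RunningTotal()
running_totals = [stream.on_record(value) for value in transactions]
```

Both styles reach the same final total; the real-time version simply makes an intermediate answer available after each record instead of only at the end of the run.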
- Unsuitable for real-time processing: Being batch-oriented, it takes minutes to execute jobs, depending on the amount of data and the number of nodes in the cluster.
- Unsuitable for trivial operations: For operations like filters and joins, you might need to rewrite the jobs, which becomes complex because of the key-value pattern.
- Unfit for large data on the network: Although it works on the data locality principle, it cannot efficiently process large amounts of data that require shuffling over the network.
- Unsuitable for online transaction processing (OLTP): OLTP requires a large number of short transactions, which does not fit a batch-oriented framework.
- Unfit for processing graphs: The Apache Giraph library processes graphs, but it adds complexity on top of MapReduce.
- Unfit for iterative execution: Because its execution is stateless, MapReduce does not fit use cases like k-means that need iterative execution.
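To see why a simple join becomes awkward under the key-value pattern, here is a minimal sketch that mimics the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API; the function names and the `users` and `orders` sample data are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (key, tagged value) pairs so both inputs share one key space."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Per key, separate the tagged values again and pair users with orders."""
    for key, values in groups.items():
        names = [v for tag, v in values if tag == "user"]
        amounts = [v for tag, v in values if tag == "order"]
        for name in names:
            for amount in amounts:
                yield key, name, amount


# Hypothetical inputs: (user_id, name) and (user_id, order_amount).
users = [(1, "ana"), (2, "ben")]
orders = [(1, 30), (1, 12), (2, 7), (3, 99)]

joined = sorted(reduce_phase(shuffle(map_phase(users, orders))))
```

Even for this small inner join, every record has to be tagged, regrouped by key, and untangled again in the reducer, which is the extra work the bullet on trivial operations refers to.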
- Spark is an open-source cluster computing framework.
- It is suitable for real-time processing, trivial operations, and processing large data over the network.
- Provides up to 100 times faster performance for some applications through in-memory primitives, compared with the two-stage, disk-based MapReduce paradigm of Hadoop.
- Is also suitable for machine learning algorithms, as it allows programs to load and query data repeatedly.
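The benefit of loading data once and querying it repeatedly can be illustrated without Spark itself. The sketch below is a plain-Python stand-in, not the Spark API: `CountingSource` is a hypothetical class that simulates a disk-backed data set and counts how often it is read, and `refine` is an arbitrary iterative step chosen for the example.

```python
class CountingSource:
    """Stands in for a disk-backed data set and counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0  # number of simulated disk reads

    def load(self):
        self.reads += 1
        return list(self._values)


def refine(estimate, data):
    """One iteration: move the estimate halfway toward the data mean."""
    return estimate + 0.5 * (sum(data) / len(data) - estimate)


def run_rereading(source, iterations):
    """Disk-based style: every iteration reads the data set again."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, source.load())
    return estimate


def run_cached(source, iterations):
    """In-memory style: read once, then iterate over the cached copy."""
    data = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, data)
    return estimate


disk_style = CountingSource([2.0, 4.0, 6.0])
memory_style = CountingSource([2.0, 4.0, 6.0])

result_disk = run_rereading(disk_style, iterations=5)
result_memory = run_cached(memory_style, iterations=5)
```

Both runs produce the same estimate, but the rereading version touches the source once per iteration while the cached version touches it once in total. On a real cluster the gap would depend on data size, storage, and network, so this only shows the shape of the saving.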
- Spark Core and RDDs
- Spark SQL
- Spark Streaming
- Speed: Spark extends the MapReduce model to support computations like stream processing and interactive queries.
- Combination: Spark covers various workloads that used to require different distributed systems, which makes it easy to combine different processing types and simplifies tool management.
- It contains various closely integrated components for distributing, scheduling and monitoring applications with many computational tasks.
- Empowers various higher-level components specialized for different workloads like machine learning or SQL.
- Allows accessing the same data through the Python shell for ad-hoc analysis and in standalone batch applications.