Introduction to Spark
- In batch processing, a large group of data or transactions is processed in a single run.
- Jobs run without any manual intervention.
- All the data is pre-selected and fed in using command-line parameters and scripts.
- It is used to execute multiple operations and to handle heavy data loads, reporting, and offline data workflows.
- In real-time processing, data is processed instantaneously upon data entry or command receipt.
- It must respond within stringent time constraints.
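The contrast between the two styles can be sketched in plain Python. This is only an illustration under assumptions: the names `run_batch` and `RealTimeProcessor`, the sample amounts, and the 0.5-second deadline are all hypothetical and do not come from any framework.

```python
import time
from typing import Iterable


def run_batch(records: Iterable[float]) -> float:
    """Batch style: the whole pre-selected data set is processed in one run."""
    total = 0.0
    for amount in records:
        total += amount
    return total


class RealTimeProcessor:
    """Real-time style: each event is handled the moment it arrives."""

    def __init__(self, deadline_seconds: float) -> None:
        self.deadline_seconds = deadline_seconds
        self.running_total = 0.0
        self.missed_deadlines = 0

    def on_event(self, amount: float) -> float:
        started = time.perf_counter()
        self.running_total += amount
        elapsed = time.perf_counter() - started
        # A real-time system has to notice when it breaks its time constraint.
        if elapsed > self.deadline_seconds:
            self.missed_deadlines += 1
        return self.running_total


transactions = [10.0, 25.5, 4.5]

# Batch: nothing is known until the whole run finishes.
batch_total = run_batch(transactions)

# Real time: an up-to-date answer is available after every single event.
processor = RealTimeProcessor(deadline_seconds=0.5)
live_totals = [processor.on_event(amount) for amount in transactions]
```

Both styles reach the same final total. The difference is when the answer becomes available and whether each step is held to a response-time limit.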
- Unsuitable for real-time processing: Being batch oriented, MapReduce takes minutes to execute jobs, depending on the amount of data and the number of nodes in the cluster.
- Unsuitable for trivial operations: For operations like filters and joins, you might need to rewrite the jobs, which becomes complex because of the key-value pattern.
- Unfit for large data on the network: Although it works on the data locality principle, it cannot efficiently process large amounts of data that must be shuffled over the network.
- Unsuitable for online transaction processing (OLTP): OLTP requires a large number of short transactions, which a batch-oriented framework handles poorly.
- Unfit for processing graphs: Graphs are processed with the Apache Giraph library, which adds complexity on top of MapReduce.
- Unfit for iterative execution: Being stateless, MapReduce does not fit use cases like k-means that need iterative execution.
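To see why even a simple join becomes awkward under the key-value pattern, here is a minimal pure-Python sketch of a reduce-side join. It only imitates the map, shuffle, and reduce phases in a single process. The data and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are hypothetical and are not Hadoop APIs.

```python
from collections import defaultdict

# Two small "tables" keyed by user id (made-up sample data).
users = [(1, "Asha"), (2, "Ben")]
orders = [(1, "book"), (1, "pen"), (2, "lamp"), (3, "desk")]


def map_phase(users, orders):
    """Emit (key, tagged_value) pairs; the tag records the source table."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)


def shuffle_phase(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every user name with every order that shares the key."""
    names = [value for tag, value in values if tag == "user"]
    items = [value for tag, value in values if tag == "order"]
    return [(key, name, item) for name in names for item in items]


joined = []
for key, values in sorted(shuffle_phase(map_phase(users, orders)).items()):
    joined.extend(reduce_phase(key, values))
```

A join that a query language expresses in one line needs three hand-written phases here, plus a tagging scheme to tell the two inputs apart. Order 3 has no matching user, so it drops out, which makes this an inner join.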
- Spark is an open-source cluster computing framework.
- It is suitable for real-time processing, trivial operations, and processing large data on the network.
- It provides up to 100 times faster performance for some applications by using in-memory primitives, compared with the two-stage, disk-based MapReduce paradigm of Hadoop.
- Is also suitable for machine learning algorithms, as it allows programs to load and query data repeatedly.
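Loading data once and querying it repeatedly might look like the following PySpark sketch. It assumes PySpark is installed and that a local master (`local[*]`) is acceptable. The sample log lines and the helper names `is_error`, `mentions_disk`, and `repeated_queries` are made up for illustration.

```python
SAMPLE_LOG = [
    "ERROR disk full",
    "INFO job started",
    "ERROR network timeout",
    "WARN disk almost full",
]


def is_error(line: str) -> bool:
    """True for lines logged at ERROR level."""
    return line.startswith("ERROR")


def mentions_disk(line: str) -> bool:
    """True for lines that mention the disk."""
    return "disk" in line


def repeated_queries(lines, master="local[*]"):
    """Run two queries against one cached in-memory data set."""
    # Imported here so the helpers above remain usable without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master(master).appName("cache-sketch").getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(lines)
        # Ask Spark to keep the data in memory once it has been computed,
        # so the second query does not have to rebuild it.
        rdd.cache()
        error_count = rdd.filter(is_error).count()
        disk_count = rdd.filter(mentions_disk).count()
        return error_count, disk_count
    finally:
        spark.stop()
```

Calling `repeated_queries(SAMPLE_LOG)` should return `(2, 2)` for this sample. With a data set this small the cache makes no measurable difference. The pattern matters for iterative workloads such as machine learning, where the same data is scanned many times.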
- Spark Core and RDDs
- Spark SQL
- Spark Streaming
- Speed: Spark extends the MapReduce model to support computations like stream processing and interactive queries.
- Combination: Spark covers various workloads that used to require separate distributed systems, which makes it easy to combine different processing types and simplifies tool management.
- It contains various closely integrated components for distributing, scheduling and monitoring applications with many computational tasks.
- Powers various higher-level components specialized for different workloads, such as machine learning or SQL.
- Allows accessing the same data through the Python shell for ad-hoc analysis and in standalone batch applications.