Welcome to IE-LAB!

The similarities and differences between Hadoop and Spark

Anyone who studies big data will have come across Hadoop and Apache Spark. But what do the two have in common, and where do they differ? Let's take a closer look:

  • Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. It lets users write distributed programs without knowing the low-level details of distribution, and make full use of a cluster's power for high-speed computation and storage. The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive amounts of data, and MapReduce provides computation over it.
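To make the MapReduce model more concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to a word count. It is plain Python, not Hadoop's actual API; the function names are made up for illustration, and a real job would spread these phases across the nodes of a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Spark processes data"]))
```

Because each map call looks at only one line and each reduce call at only one key, the work can in principle be split across many machines, which is the property the framework relies on.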

  • Spark

Spark is a platform for fast, general-purpose cluster computing. In terms of speed, Spark extends the widely used MapReduce computing model and efficiently supports more types of computation, including interactive queries and stream processing.

The Spark project contains multiple tightly integrated components. At its core, Spark is a computing engine that schedules, distributes, and monitors applications made up of many computing tasks, running across multiple machines or a computing cluster.

Different objectives:

Both Hadoop and Spark are big data frameworks, but they serve different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of ordinary computers for storage. Hadoop also indexes and keeps track of this data. Spark is a tool for processing large data sets held in distributed storage; it does not do distributed storage itself.


The core of Hadoop is the HDFS distributed storage component together with the MapReduce processing component. This means Hadoop can be used without Spark, relying on its own MapReduce to process the data.

Spark does not provide a file management system of its own, so it needs to be integrated with a distributed file system such as HDFS.

You can also choose another cloud-based data platform, but Spark is still most commonly used on top of Hadoop.

Data processing:

Spark is generally much faster than MapReduce because the two process data in different ways. MapReduce works step by step, writing the result of each step back to disk before the next step reads it, while Spark keeps as much of the work as it can in memory. Spark's batch processing is commonly cited as nearly 10 times faster than MapReduce, and its in-memory data analysis as nearly 100 times faster.
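The difference between the two processing styles can be sketched in plain Python. The example below is only an illustration under simplifying assumptions: it uses local temporary files to stand in for cluster storage, the function names are hypothetical, and it says nothing about real-world speedups. One function persists every intermediate result to disk, the other chains the same steps in memory.

```python
import json
import os
import tempfile


def step_by_step_on_disk(numbers, workdir):
    """MapReduce-style: each stage writes its output to disk, the next reads it back."""
    stage1 = os.path.join(workdir, "stage1.json")
    stage2 = os.path.join(workdir, "stage2.json")

    # Stage 1: square each value and persist the intermediate result.
    with open(stage1, "w") as f:
        json.dump([n * n for n in numbers], f)

    # Stage 2: reload stage 1, keep the even values, persist again.
    with open(stage1) as f:
        squared = json.load(f)
    with open(stage2, "w") as f:
        json.dump([n for n in squared if n % 2 == 0], f)

    # Final stage: reload stage 2 and sum it.
    with open(stage2) as f:
        return sum(json.load(f))


def chained_in_memory(numbers):
    """Spark-style: the same steps chained in memory, with no intermediate files."""
    squared = (n * n for n in numbers)
    evens = (n for n in squared if n % 2 == 0)
    return sum(evens)


def run_both(numbers):
    """Run both variants on the same input and return their results as a pair."""
    with tempfile.TemporaryDirectory() as workdir:
        on_disk = step_by_step_on_disk(numbers, workdir)
    return on_disk, chained_in_memory(numbers)
```

Both variants return the same answer; the in-memory one simply skips two rounds of writing and re-reading, which is where the time goes when the data is large and the storage is remote.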

Data recovery:

The two take very different approaches to failure recovery. Hadoop writes data to disk after every operation, which makes it naturally resilient to system faults. Spark's data objects are stored in resilient distributed datasets (RDDs) spread across the data cluster. These data objects can be held in memory or on disk, so RDDs can also provide full recovery from failures.
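One way an RDD can recover lost data is by remembering how each dataset was derived, so a missing partition can be recomputed from its parent. The sketch below illustrates that idea in plain Python. It is a toy model, not Spark's API: `MiniRDD`, `compute`, and `lose_partition` are hypothetical names invented for this example.

```python
class MiniRDD:
    """Toy stand-in for an RDD: cached partitions plus the recipe that built them."""

    def __init__(self, partitions=None, parent=None, func=None):
        self.parent = parent  # dataset this one was derived from, if any
        self.func = func      # transformation applied to each parent element
        # Source datasets start with their partitions; derived ones start empty.
        self._cache = dict(enumerate(partitions)) if partitions is not None else {}

    def map(self, func):
        """Record a transformation without computing anything yet."""
        return MiniRDD(parent=self, func=func)

    def compute(self, index):
        """Return one partition, rebuilding it from the parent if it is missing."""
        if index not in self._cache:
            if self.parent is None:
                raise RuntimeError("source partition lost and nothing to rebuild it from")
            parent_data = self.parent.compute(index)
            self._cache[index] = [self.func(x) for x in parent_data]
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure by dropping one cached partition."""
        self._cache.pop(index, None)
```

For example, with `source = MiniRDD(partitions=[[1, 2], [3, 4]])` and `doubled = source.map(lambda x: x * 2)`, calling `doubled.compute(1)` yields `[6, 8]`. After `doubled.lose_partition(1)`, the same call rebuilds the partition from `source` and yields `[6, 8]` again.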

