Hadoop and Spark are the two main infrastructure frameworks available to store and process your big data. When Spark became a top-level Apache project in 2014, many predicted it would overtake Hadoop as the leading framework, but both remain widely used. The biggest reason for this is the substantial differences between the two, including file organisation, the way they process data, and how they cope with system failures.
File Organisation
Hadoop contains a file system not dissimilar to the one found on your average desktop computer, but with the added ability to distribute files across many machines. This is known as the Hadoop Distributed File System (HDFS). Spark has no distributed file system of its own and relies on external storage such as HDFS, which is one of the factors explaining why it hasn’t completely overtaken its older competitor.
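To make the idea of spreading a file across machines more concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function names, the node names, the tiny block size and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks (128 MB by default in recent versions) and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` different nodes.

    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster and a toy 10-byte "file".
data = b"x" * 10
nodes = ["node-a", "node-b", "node-c", "node-d"]

blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Because every block lives on several machines, losing any single machine in this toy cluster still leaves at least two copies of each block.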
Processing Data
Many argue that Spark works at a much faster rate than Hadoop. This is down to the way the two systems handle data during processing. Hadoop uses MapReduce to process and analyse the data held in HDFS. MapReduce writes the results of each step back to disk, rather than holding them in RAM. This is because data held in RAM is far more volatile than data stored on disk.
Spark’s speed advantage comes largely from the fact that it loads data from disk into RAM and keeps it there between steps, which is known as an ‘in-memory’ operation. This makes it quicker than Hadoop’s MapReduce, as it greatly reduces the time spent reading from and writing to disk.
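The difference between the two approaches can be sketched with a toy word count in plain Python. This is an illustration only and uses neither the Hadoop nor the Spark API: the function names are hypothetical, and a temporary JSON file stands in for the disk writes that happen between MapReduce steps.

```python
import json
import os
import tempfile
from collections import Counter


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_via_disk(lines):
    """MapReduce-style: the intermediate result is written to disk, then read back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.json")
        with open(path, "w") as f:
            json.dump(map_phase(lines), f)
        with open(path) as f:
            pairs = [tuple(p) for p in json.load(f)]
    return reduce_phase(pairs)


def run_in_memory(lines):
    """In-memory style: the intermediate result never leaves RAM."""
    return reduce_phase(map_phase(lines))


lines = ["big data big ideas", "data moves fast"]
disk_result = run_via_disk(lines)
memory_result = run_in_memory(lines)
print(disk_result == memory_result)
```

Both routes give the same answer. The in-memory route simply skips the round trip to disk, which is where the time saving comes from when a job chains many steps together.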
Failure Recovery
Both offer good systems for recovery after a failure, albeit different ones. Hadoop has a natural resilience to system failures, as data is written to disk after every operation. Spark has a comparable built-in resiliency due to the fact that its data objects are stored in Resilient Distributed Datasets (RDDs), which are distributed across the cluster and can be rebuilt if part of one is lost.
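One way to picture how a dataset held in memory can survive a failure is for it to remember how it was built, so that lost results can be recomputed from their source. The sketch below shows that idea in plain Python. `ToyDataset` and its methods are hypothetical names invented for this example, not the Spark RDD API, and a real RDD tracks this information per partition across many machines.

```python
class ToyDataset:
    """A toy dataset that remembers its parent and the function that produced it."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source   # stands in for durable storage such as HDFS
        self._parent = parent   # the dataset this one was derived from
        self._fn = fn           # the transformation applied to the parent
        self._cache = None      # in-memory result, which can be lost

    def map(self, fn):
        """Describe a new dataset derived from this one; nothing is computed yet."""
        return ToyDataset(parent=self, fn=fn)

    def collect(self):
        """Return the data, recomputing it from its history if the cache is empty."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return self._cache

    def lose_cache(self):
        """Simulate a failure that wipes the in-memory copy."""
        self._cache = None


base = ToyDataset(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)

first = doubled.collect()
doubled.lose_cache()           # simulated machine failure
recovered = doubled.collect()  # rebuilt from the recorded history
print(first == recovered)
```

Nothing was restored from a backup here. The lost result was simply recalculated, which is why keeping data in volatile memory need not mean losing it for good.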
Better Together
Whilst many people view Hadoop and Spark as direct competitors, more and more companies are using a combination of the two in order to effectively store, process, and subsequently utilise their big data. You can, of course, use either without the other; your choice will depend on the needs of your business.
Discover more about how Apache Hadoop and Spark can help your business to become data-driven by registering free for Big Data LDN 2017