2,601 words · about a 5-minute read

Credit: unsplash.com/@gferla

Right now, the digital universe is catching up with the physical universe at a breakneck pace, with the amount of global data doubling every two years. It was estimated that by 2020 the digital universe would reach 44 zettabytes, a number of digital bits comparable to the number of stars in the observable universe.

The amount of data increases over time

More and more distributed systems are coming to market to process this data. Among them, Hadoop and Spark are often compared as direct competitors.

When deciding which framework is more appropriate, there are a few basic parameters to compare.

Performance

Spark is very fast and outperforms the Hadoop framework: up to 100 times faster when data fits in memory and about 10 times faster on disk. In addition, in a well-known benchmark, Spark sorted 100 terabytes of data three times faster than Hadoop while using one-tenth as many machines.
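The "3x faster with 10x fewer machines" claim comes from the 2014 Daytona GraySort benchmark. A quick back-of-the-envelope calculation shows what that implies per machine; the figures below are the publicly reported benchmark numbers, so treat them as approximate:

```python
# Per-node throughput for the 2014 Daytona GraySort results:
# Hadoop MapReduce (2013 record) vs. Spark (2014). Approximate figures.

hadoop_tb, hadoop_min, hadoop_nodes = 102.5, 72, 2100
spark_tb, spark_min, spark_nodes = 100.0, 23, 206

def per_node_throughput(tb, minutes, nodes):
    """Terabytes sorted per node per hour."""
    return tb / nodes / (minutes / 60)

hadoop_rate = per_node_throughput(hadoop_tb, hadoop_min, hadoop_nodes)
spark_rate = per_node_throughput(spark_tb, spark_min, spark_nodes)

print(f"Hadoop: {hadoop_rate:.3f} TB/node/hour")
print(f"Spark:  {spark_rate:.3f} TB/node/hour")
print(f"Spark per-node advantage: ~{spark_rate / hadoop_rate:.0f}x")
```

Sorting three times faster on one-tenth the hardware works out to roughly a 30x difference in per-node throughput, which is why the benchmark is quoted so often.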

Hadoop vs Spark

Spark is fast because it processes data in memory. Spark’s in-memory processing can deliver real-time analytics for marketing campaigns, IoT sensors, machine learning, and social media sites.
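The key architectural difference is that MapReduce materializes intermediate results to disk between stages, while Spark hands them to the next stage in memory. This toy sketch (plain Python, not real Hadoop or Spark) only illustrates the two shapes of pipeline:

```python
# Toy illustration: a two-stage pipeline that writes its intermediate
# result to disk (MapReduce-style) versus one that passes it along
# in memory (Spark-style). Both produce the same answer.
import json, os, tempfile

data = list(range(10))

def stage1(xs):          # e.g. a "map" step
    return [x * x for x in xs]

def stage2(xs):          # e.g. an aggregate/"reduce" step
    return sum(xs)

# MapReduce-style: materialize stage1's output on disk, then read it back
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(stage1(data), f)
    path = f.name
with open(path) as f:
    disk_result = stage2(json.load(f))
os.remove(path)

# Spark-style: keep the intermediate list in memory
mem_result = stage2(stage1(data))

print(disk_result, mem_result)  # prints: 285 285
```

The answers match; the difference is that the disk round trip between stages is pure overhead, and real pipelines have many stages over far larger data.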

However, when Spark shares a YARN cluster with other services, its performance can degrade because there is not enough RAM to go around. Hadoop handles this situation better, and for batch processing Hadoop is much more efficient than Spark.
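On a shared YARN cluster, the usual mitigation is to cap how much memory Spark requests per executor so it does not starve other tenants. A sketch of a `spark-submit` invocation (the job name and sizing numbers are illustrative, not a recommendation):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g \
  my_job.py   # hypothetical job script
```

The total ask per executor (heap plus overhead) has to fit inside what the YARN NodeManager is configured to hand out, or containers will be killed.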

Note: Hadoop and Spark have different processing methods. Whether to use Hadoop or Spark in performance comparison depends on project requirements.

Facebook’s transition to the Spark framework

Data on Facebook is growing by the minute, and Facebook uses analytics to process and use that data to make decisions. It uses the following platforms to do so:

1. The Hive platform, which runs part of Facebook’s batch analytics

2. Corona, a custom MapReduce implementation

3. Presto, an ANSI-SQL-based query engine

From a computing perspective, the Hive platform discussed above was resource-intensive and challenging to maintain, so Facebook decided to turn to Apache Spark to manage that data. By integrating Spark, Facebook has since deployed a faster, more manageable pipeline for its entity ranking system.

Facebook using the Spark framework

Security

Spark’s security is still evolving and out of the box supports only authentication via a shared secret (password authentication). Apache Spark’s own documentation also cautions that its security features vary by deployment type and do not protect against everything by default.

Hadoop, on the other hand, offers authentication (via Kerberos), authorization, auditing, and encryption, and all of this can be combined with Hadoop security projects such as Apache Knox and Apache Sentry.
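As a concrete sketch of the Hadoop side, switching a cluster from the default "simple" mode to Kerberos is done in `core-site.xml` (values shown are the standard ones; a real deployment needs a full Kerberos setup around them):

```xml
<!-- core-site.xml: enable Kerberos authentication and service-level authorization -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

On the Spark side, the shared-secret mechanism mentioned above is turned on with the `spark.authenticate` configuration property.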

Key point: Spark is somewhat less secure than Hadoop. However, when Spark is deployed on top of Hadoop, it can take advantage of Hadoop’s security features.

Cost

Both Hadoop and Spark are open-source frameworks, so both are free; both run on commodity servers, both run in the cloud, and both appear to have similar hardware requirements:

Hadoop vs. Spark specs

How do the two compare from a cost perspective?

Note that Spark uses a lot of RAM to run things in memory. RAM costs more than hard disks, so this can affect costs.

Hadoop, on the other hand, relies on inexpensive hard disks, which saves the cost of buying RAM. However, Hadoop needs more machines to spread out its disk I/O.
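The RAM-versus-disk trade-off is easy to put rough numbers on. The prices below are assumed ballpark figures for illustration only, not quotes:

```python
# Illustrative cost comparison: keeping a working set in RAM vs. on disk.
# Both per-GB prices are rough assumptions, not real quotes.
RAM_USD_PER_GB = 3.00   # assumed server RAM price per GB
HDD_USD_PER_GB = 0.03   # assumed spinning-disk price per GB

working_set_gb = 10_000  # a 10 TB working set

ram_cost = working_set_gb * RAM_USD_PER_GB
hdd_cost = working_set_gb * HDD_USD_PER_GB

print(f"RAM:  ${ram_cost:,.0f}")
print(f"Disk: ${hdd_cost:,.0f}")
print(f"RAM is ~{ram_cost / hdd_cost:.0f}x more expensive per GB")
```

Whatever the exact prices, memory stays one to two orders of magnitude more expensive per gigabyte than disk, which is the whole basis of the cost argument in this section.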

Therefore, organizations must take requirements into account when comparing the cost parameters of the Hadoop and Spark frameworks.

Hadoop is the right choice if you are processing large volumes of historical data, because hard disk space costs much less than memory.

On the other hand, when dealing with real-time data, Spark is more cost-effective because it can perform the same task faster with less hardware.

Note: In a straight cost comparison between Hadoop and Spark, Hadoop is undoubtedly cheaper. But Spark is more cost-effective when an organization has to process smaller amounts of data in real time.

Ease of use

Spark provides user-friendly APIs for Scala, Java, Python, and Spark SQL (also known as Shark).

Spark’s simple building blocks make writing user-defined functions much easier. In addition, Spark’s support for batch processing and machine learning makes it easy to simplify the data processing infrastructure. Spark also includes an interactive mode for running instant feedback commands.
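Those "simple building blocks" are operations such as flatMap, filter, and reduceByKey, chained one after another. The sketch below mimics that shape in plain Python (no Spark required; the real RDD API distributes these steps across a cluster):

```python
# A pure-Python sketch of the flatMap -> filter -> reduce-by-key style
# that Spark's RDD API popularized. Only the shape matches Spark; this
# runs locally on ordinary Python iterables.
from collections import Counter

lines = [
    "spark makes pipelines easy",
    "hadoop makes batch cheap",
    "spark keeps data in memory",
]

words = (w for line in lines for w in line.split())   # flatMap: lines -> words
kept = (w for w in words if len(w) > 4)               # filter: drop short words
counts = Counter(kept)                                # reduceByKey: count per word

print(counts.most_common(2))
```

In Spark the same pipeline reads almost identically (`flatMap`, `filter`, `reduceByKey`), which is a large part of why it is considered easier to program than raw MapReduce.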

Hadoop is written in Java and is notoriously difficult to program, with no interactive mode. While Pig (an add-on tool) makes programming easier, its syntax takes time to learn.

Key point: Comparing ease of use, both have their own user-friendly aspects. But if you have to choose between the two, Spark is both easier to program and interactive.

Can Apache Hadoop and Spark work together?

Credit: unsplash.com/@starks73

We’re excited about this possibility, so let’s look at how collaboration might work.

The Apache Hadoop ecosystem includes components such as HDFS and Apache Hive. How does Spark leverage them?

Combining Apache Spark and HDFS

The purpose of Apache Spark is to process data. But to process data, the engine needs input from storage, and Spark uses HDFS for this purpose. HDFS is not the only option, but it is the most popular one, since both projects come from the Apache family.
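Getting data into HDFS so Spark can read it is done with the standard HDFS shell. A minimal sketch (the file name and paths are hypothetical, and a running cluster is assumed):

```shell
# Stage raw data in HDFS so Spark (or any engine) can read it
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put events.csv /data/raw/
hdfs dfs -ls /data/raw
```

Spark can then read the file via an `hdfs://` path just as it would a local one.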

Apache Hive and Apache Spark

Apache Hive and Apache Spark are highly compatible, and together they can solve many business problems.

For example, a company analyzing consumer behavior needs to collect data from a variety of sources such as social media, reviews, clickstream data, customer mobile applications, and so on.

The enterprise can store that data in HDFS, with Apache Hive serving as a bridge between HDFS and Spark.

How Uber combines them

Uber using Hadoop and Spark

Uber combines Spark and Hadoop to process consumer data, using real-time traffic conditions to put drivers in the right place at the right time. To achieve this, Uber uses HDFS to upload raw data into Hive and uses Spark to process billions of events.

In the comparison between Hadoop and Spark, the winner is…

Although Spark is fast and easy to operate, Hadoop provides high security, large storage capacity, and low cost for batch processing. Which one to choose depends entirely on the requirements of the project, and a combination of the two can work.

Combine Spark with some of Hadoop’s attributes to form a new framework: Spoop.
