Spark Programming

1.1 The era of big data

1.1.2 The third wave of informatization

Big data is regarded as the hallmark of the third wave of informatization, following the first wave driven by the personal computer and the second driven by the Internet.

1.1.2 Information technology provides technical support for the era of big data

  • Storage device capacity continues to increase
  • CPU processing power has improved dramatically
  • Network bandwidth continues to increase

1.1.3 The change of data generation mode contributes to the advent of big data era

1.2 Big data concepts

1.2.1 Large Amount of Data

  • According to IDC estimates, data has been growing at about 50% per year, i.e. doubling roughly every two years (a "Moore's Law" of big data)
  • Humanity generated as much data in the last two years as in all of prior history
  • By 2020, the world was expected to hold 35 ZB (zettabytes) of data, nearly 30 times the 2010 volume
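The "nearly 30 times" figure follows directly from the doubling rate; a quick sanity check in plain Python:

```python
# Doubling every two years: 2010 -> 2020 spans five doubling periods.
growth_factor = 2 ** ((2020 - 2010) / 2)  # 2^5
print(growth_factor)  # 32.0, i.e. "nearly 30 times" the 2010 volume

# Equivalently, ~50% annual growth compounds to roughly a doubling in two years:
two_year_growth = 1.5 ** 2
print(two_year_growth)  # 2.25, close to 2
```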

1.2.2 Various data types

  • Big data consists of structured and unstructured data
    • About 10% is structured data, stored in databases
    • About 90% is unstructured data, closely tied to human activity

  • Scientific research

The Human Genome Project, the LHC accelerator, and Earth and space exploration

  • Enterprise application

Email, documents, files, application logs, transaction records

  • Web 1.0 data

Text, image, video

  • Web 2.0 data

Clickstreams, Twitter/blog/SNS posts, wikis

1.2.3 Fast processing speed

  • The time window from data generation to consumption is very small, leaving little time to reach a decision
  • The "one-second rule": results must be delivered within seconds; this is also what fundamentally distinguishes big data processing from traditional data mining techniques

1.2.4 Low value density

Low value density, but high commercial value.

Surveillance video is a typical example: hours of continuous monitoring may yield only a second or two of useful footage, yet that footage has high commercial value.

1.3 Influence of big data

Dr. Jim Gray, Turing Award winner and renowned database expert, observed and summarized the four paradigms that human scientific research has passed through: experiment, theory, computation, and data.

  • In terms of way of thinking, big data has completely overturned the traditional way of thinking
    • Full sample rather than sampling
    • Efficiency over accuracy
    • Correlation, not causation

1.4 Key technologies of big data

Big data technology spans multiple layers (data collection, storage and management, processing and analysis, privacy and security), each serving a different function.

The two core technologies are distributed storage (e.g., HDFS) and distributed processing (e.g., MapReduce).

1.5 Big data computing modes

Typical computing modes include batch processing (MapReduce, Spark), stream computing (Storm, S4), graph computing (Pregel, GraphX), and interactive query analysis (Hive, Impala, Dremel).

1.6 Representative big data technology

1.6.1 Hadoop

Hadoop – MapReduce

  • MapReduce abstracts complex, parallel computing processes running on large clusters into two functions: Map and Reduce
  • Programming is easy: developers need not master the details of distributed parallel programming to run their own programs on a distributed system and process massive data
  • MapReduce adopts a “divide and conquer” strategy. A large data set stored in a distributed file system is split into many independent fragments, which can be processed by multiple Map tasks in parallel.
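The divide-and-conquer flow above can be sketched in plain Python, with no Hadoop involved; `map_phase`, `shuffle`, and `reduce_phase` are illustrative names, not Hadoop's API:

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit (word, 1) for every word in one input split
    return [(word, 1) for line in split for word in line.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

# Two "splits" standing in for blocks of a distributed file;
# each split could be handled by an independent Map task in parallel.
splits = [["big data spark", "spark hadoop"], ["hadoop spark"]]
mapped = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 1, 'data': 1, 'spark': 3, 'hadoop': 2}
```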

Hadoop – YARN

The goal of YARN is to achieve “multiple frameworks in one cluster”. Why?

  • There are different business application scenarios in an enterprise, which require different computing frameworks

    • MapReduce for offline batch processing
    • Impala for real-time interactive query analysis
    • Storm for real-time stream processing
    • Spark for iterative computation
  • These products usually come from different development teams and have their own resource scheduling mechanisms

  • To prevent different types of applications from interfering with one another, enterprises split their internal servers into multiple clusters, installing and running a different computing framework on each: "one framework, one cluster"

  • This leads to problems:

    • Low cluster resource utilization
    • Data cannot be shared across clusters
    • High maintenance cost
  • YARN's goal is "multiple frameworks in one cluster": deploy a single unified resource-scheduling and management framework, YARN, on the cluster, and deploy the other computing frameworks on top of it

  • YARN provides unified resource scheduling and management services to these frameworks and adjusts the resources each occupies according to its load, achieving cluster-wide resource sharing and elastic scaling of resources

  • Different application workloads can be mixed in one cluster, effectively improving cluster utilization

  • Different computing frameworks can share the underlying storage, avoiding moving data sets between clusters
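In practice, "multiple frameworks in one cluster" largely comes down to pointing every framework at the same ResourceManager. A minimal yarn-site.xml sketch (the hostname is a placeholder, not from the original text):

```xml
<!-- yarn-site.xml: minimal sketch; "master" is a placeholder hostname -->
<configuration>
  <property>
    <!-- All frameworks request resources from this one ResourceManager -->
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <!-- Auxiliary service so MapReduce jobs can shuffle on YARN -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```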

1.6.2 Spark

Comparison between Hadoop and Spark

  • Hadoop MapReduce has the following disadvantages:
    • Limited expressive capability (every computation must be cast as Map and Reduce)
    • High disk I/O overhead
    • High latency
    • Handing results between tasks involves I/O overhead
    • A task cannot start until all tasks of the previous stage finish, making complex, multi-stage computation difficult

  • Spark inherits the advantages of Hadoop MapReduce while solving the problems MapReduce faces

Compared with Hadoop MapReduce, Spark has the following advantages:

  • Spark's computation model also follows MapReduce, but it is not limited to Map and Reduce operations: it provides many more types of data-set operations, giving a more flexible programming model than Hadoop MapReduce
  • Spark offers in-memory computation; intermediate results can be kept in memory, which makes iterative computation far more efficient
  • Spark's DAG-based task scheduling and execution mechanism is superior to Hadoop MapReduce's iterative execution mechanism
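The in-memory advantage can be illustrated with a toy model in plain Python (not Spark's API): an iterative job that pays a disk round-trip between every pair of MapReduce stages versus one that keeps intermediates in memory:

```python
# Toy comparison: count simulated disk I/O for an iterative job.
# "MapReduce style": every iteration writes its result to disk and the
# next iteration reads it back.  "Spark style": intermediates stay in memory.
data = list(range(5))
iterations = 3

disk_io = 0
mr_result = data
for _ in range(iterations):
    mr_result = [x + 1 for x in mr_result]
    disk_io += 2            # one write + one read between stages

spark_result = data
for _ in range(iterations):
    spark_result = [x + 1 for x in spark_result]  # cached in memory, no I/O

assert mr_result == spark_result    # the same answer either way...
print(disk_io)                      # ...but the MapReduce-style run paid 6 round-trips
```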

1.6.3 Flink

Performance comparison

Both are in-memory computing frameworks, so both deliver very good performance for real-time computation; in tests, Flink performs slightly better.

Both Spark and Flink can run on Hadoop YARN, and the measured performance is Flink > Spark > Hadoop (MapReduce); the more iterations, the more pronounced the gap. The main reason Flink outperforms Spark and Hadoop is that Flink supports incremental iteration and automatically optimizes iterative jobs.
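Incremental (delta) iteration, the style Flink optimizes automatically, can be sketched in plain Python: each round re-examines only the vertices whose value changed in the previous round, so the workset shrinks as the computation converges. The graph and the min-label propagation here are illustrative, not Flink's API:

```python
# Toy delta iteration: connected components by min-label propagation.
# Only "changed" vertices (the workset) propagate in each round.
edges = {1: [2], 2: [1, 3], 3: [2], 4: []}
labels = {v: v for v in edges}   # each vertex starts with its own id
workset = set(edges)             # initially, everything counts as changed

while workset:
    next_workset = set()
    for v in workset:
        for nb in edges[v]:
            if labels[v] < labels[nb]:   # propagate the smaller component id
                labels[nb] = labels[v]
                next_workset.add(nb)     # only updated vertices re-propagate
    workset = next_workset

print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4}: one component {1,2,3}, one isolated vertex
```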

Stream computing comparison

Both support stream computing, but Flink processes data record by record, while Spark processes small batches (micro-batches) based on RDDs, so Spark inevitably adds some latency in stream processing. Flink's streaming performance is similar to Storm's, supporting millisecond-level latency, whereas Spark Streaming only achieves second-level latency.
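The latency difference can be made concrete with a toy model in plain Python (logical ticks, not either engine's API): a record in a micro-batch waits until its batch closes, while record-at-a-time processing handles it the moment it arrives:

```python
# Records arrive at ticks 0..9; measure how long each waits to be processed.
records = list(range(10))

# Flink-style: each record is processed the tick it arrives -> zero wait
flink_waits = [0 for _ in records]

# Spark-style micro-batches of 4: a record waits until its batch closes
batch_size = 4
spark_waits = []
for arrival in records:
    batch_end = ((arrival // batch_size) + 1) * batch_size - 1
    spark_waits.append(batch_end - arrival)

print(max(flink_waits), max(spark_waits))  # 0 3: batching adds latency
```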

SQL support

Both support SQL. Spark's SQL support is broader than Flink's; Spark supports optimization of SQL queries, while Flink supports optimization at the API level.

1.6.4 Beam
