
Preface

Welcome to star our GitHub repository: github.com/bin39232820… The best time to plant a tree was ten years ago; the second-best time is now.


After a brief introduction to Scala in the last article, let's dive into Spark today.

Spark distributed cluster environment construction


Introduction to Spark

Spark is a general-purpose parallel in-memory computing framework developed by the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley. Spark entered the Apache Incubator in June 2013 and became an Apache top-level project only eight months later, a remarkably fast rise. With its advanced design, Spark quickly became a popular community project, and Spark SQL, Spark Streaming, MLlib, and GraphX have since been developed around it; together they form BDAS (the Berkeley Data Analytics Stack), a one-stop platform for big data processing. By most accounts, Spark's ambition is to replace Hadoop as the mainstream standard for big data processing, though it still has to prove itself in many large projects before it gets there. Spark is implemented in Scala, an object-oriented, functional programming language that can manipulate distributed data sets as easily as local collection objects. (Scala provides a parallel model based on actors, which send and receive asynchronous messages through their inboxes rather than sharing data; this approach is known as the shared-nothing model.) According to the Spark website, Spark is fast, easy to use, general-purpose, and runs everywhere.

  • Fast running speed

Spark has a DAG execution engine that supports iterative computation on data in memory. According to official benchmarks, Spark is more than 10 times faster than Hadoop MapReduce when reading data from disk, and more than 100 times faster when the data is in memory.

  • Easy to use

Spark supports not only Scala but also Java and Python for application programming. Scala in particular is an efficient and extensible language that can handle complex processing tasks with simple code.

  • Strong generality

The Spark ecosphere, i.e. BDAS (the Berkeley Data Analysis Stack), contains Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. These components cover, respectively, in-memory computation (Spark Core), real-time processing (Spark Streaming), ad-hoc queries (Spark SQL), machine learning (MLlib/MLbase), and graph processing (GraphX). All are provided by AMPLab, integrate seamlessly, and together offer a one-stop solution platform.

How Spark differs from Hadoop

Spark was developed on the basis of MapReduce, inheriting the advantages of distributed parallel computing while fixing several of MapReduce's obvious defects:

First, Spark keeps intermediate data in memory, enabling efficient iteration. In MapReduce, intermediate results must be written to disk, which slows down the overall computation. Spark's DAG-based distributed parallel computing framework reduces disk writes between iterations and thus improves processing efficiency.

Second, Spark has high fault tolerance. Spark introduces the Resilient Distributed Dataset (RDD) abstraction: a collection of read-only objects distributed across a group of nodes. These collections are resilient: if part of a data set is lost, it can be reconstructed from its "lineage", the record of how it was derived from other data sets. In addition, fault tolerance during RDD computation can be achieved through checkpointing, of which there are two methods: checkpointing the data and logging the updates. Users can control which method is used, as in the sketch below.
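
As a minimal sketch of how lineage and checkpointing look in code (the checkpoint directory and data are illustrative, and a local Spark 1.x setup is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddFaultTolerance {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-fault-tolerance").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Directory where checkpointed RDD data is materialized (path is illustrative).
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val numbers = sc.parallelize(1 to 1000000)       // base RDD
    val squares = numbers.map(n => n.toLong * n)     // derived via lineage
    val evens   = squares.filter(_ % 2 == 0)         // lineage grows with each step

    // Checkpointing truncates the lineage: a lost partition is re-read
    // from the checkpoint data instead of being recomputed from scratch.
    evens.checkpoint()

    println(evens.count())                           // action triggers computation
    sc.stop()
  }
}
```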

Finally, Spark is more general. Unlike Hadoop, which provides only Map and Reduce operations, Spark offers a rich set of operations on data sets, divided into transformations and actions. Transformations include Map, Filter, FlatMap, Sample, GroupByKey, ReduceByKey, Union, Join, Cogroup, MapValues, Sort, and PartitionBy; actions include Count, Collect, Reduce, Lookup, and Save. In addition, communication between processing nodes is no longer limited to Hadoop's Shuffle pattern: users can name, materialize, and control the storage and partitioning of intermediate results. A small example of the two kinds of operations follows.
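
A minimal, hedged sketch of the split between transformations and actions (assumes an existing SparkContext named sc; the data and names are made up):

```scala
import org.apache.spark.SparkContext._     // pair-RDD functions (needed on Spark 1.x)

// Transformations are lazy: each one only describes a new RDD.
val words  = sc.parallelize(Seq("spark", "hadoop", "spark", "scala"))
val pairs  = words.map(w => (w, 1))        // transformation
val counts = pairs.reduceByKey(_ + _)      // transformation

// Actions trigger the actual computation and return results to the driver.
val total  = counts.count()                // action
val asList = counts.collect()              // action: Array[(String, Int)]
asList.foreach { case (w, n) => println(s"$w -> $n") }
```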

Application scenarios of Spark

Currently, big data processing scenarios mainly fall into the following three types:

  • Complex batch data processing, which emphasizes the ability to handle massive amounts of data; processing speed can be tolerably slow, usually tens of minutes to several hours.
  • Interactive queries over historical data, which typically take tens of seconds to tens of minutes.
  • Streaming data processing over real-time data streams, which usually completes within hundreds of milliseconds to a few seconds.

Relatively mature frameworks already exist for all three scenarios: Hadoop's MapReduce handles mass batch processing in the first case, Impala serves interactive queries in the second, and the Storm distributed processing framework handles real-time streaming in the third. But these three frameworks are independent of one another, and each must be maintained separately, which is costly. Spark can meet all three requirements with a single one-stop platform.

Based on this analysis, Spark's applicable scenarios are as follows:

  • Spark is a memory-based iterative computing framework, so it suits applications that operate on a specific data set repeatedly. The more often the data is reused and the larger it is, the greater the benefit; when the data volume is small but the computation is intensive, the benefit shrinks.
  • Because of the nature of RDDs, Spark does not suit applications that make asynchronous, fine-grained state updates, such as the storage layer of a web service or an incremental web crawler and index: its model is a poor fit for incremental modification.
  • Scenarios where the data volume is not particularly large, but real-time statistical analysis is required.

The ecosystem

The Spark ecosystem, also known as BDAS (the Berkeley Data Analytics Stack), is a platform created by Berkeley's AMPLab to showcase big data applications through large-scale integration of Algorithms, Machines, and People. AMPLab combines big data, cloud computing, and communication resources with flexible technical solutions to sift through vast amounts of opaque data and turn it into useful information, helping people understand the world better. The ecosystem already touches machine learning, data mining, databases, information retrieval, natural language processing, and speech recognition. The Spark ecosphere is built around Spark Core: it reads data from persistence layers such as HDFS, Amazon S3, and HBase, and relies on Mesos, YARN, or Spark's own Standalone resource manager to schedule jobs for Spark applications. These applications can come from different components, such as Spark Shell/Spark Submit batch processing, Spark Streaming real-time processing, Spark SQL ad-hoc queries, BlinkDB approximate queries, MLlib/MLbase machine learning, GraphX graph processing, and SparkR mathematical computation.

Spark Core

The basics of Spark Core have been introduced above; the Spark kernel architecture can be summarized as follows:

  • Provides a distributed parallel computing framework based on directed acyclic graphs (DAGs), along with a cache mechanism to support repeated iteration and data sharing, which greatly reduces the cost of re-reading data between iterations and substantially improves performance for data mining and analysis workloads that iterate many times (see the sketch after this list)
  • Introduces the Resilient Distributed Dataset (RDD) abstraction: a collection of read-only objects distributed across a group of nodes; these collections are resilient
  • Moves computation to the data rather than moving the data: RDD partitions allow each node to read blocks of the distributed file system into its own memory for computation
  • Uses a multi-threaded pool model to reduce task startup latency
  • Adopts Akka, which is fault-tolerant and highly scalable, as the communication framework
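
A minimal sketch of the cache mechanism mentioned in the first point (assumes an existing SparkContext named sc; the HDFS path is hypothetical):

```scala
// Load a data set once and keep it in memory across iterations.
val points = sc.textFile("hdfs:///data/points.txt")   // hypothetical path
  .map(_.split(",").map(_.toDouble))
  .cache()                                            // persist in memory after first computation

// Repeated passes now read from memory instead of re-reading HDFS.
var total = 0.0
for (_ <- 1 to 10) {
  total += points.map(_.sum).reduce(_ + _)            // each pass reuses the cached data
}
println(total)
```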

Spark Streaming

Spark Streaming is a high-throughput, fault-tolerant system for processing real-time data streams. It can perform complex operations such as Map, Reduce, and Join on data from multiple sources like Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, then save the results to an external file system or database, or feed them to a real-time dashboard.

Computation flow: Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine is Spark Core: the input data of Spark Streaming is divided, according to the batch size (for example, one second), into segments of a discretized stream (DStream), each of which is a Resilient Distributed Dataset (RDD) in Spark. Transformations on the DStream in Spark Streaming are then turned into transformations on the underlying RDDs in Spark, with intermediate results kept in memory. Depending on the needs of the business, the streaming computation can accumulate these intermediate results or write them out to external devices.
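
A minimal sketch of this batching model (a local run with a TCP text source on port 9999; host, port, and batch size are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream functions (Spark 1.x)

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    // Batch size of 1 second: the input stream is cut into 1-second RDDs.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)   // DStream of text lines
    val counts = lines.flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)   // DStream transformation, applied as an RDD transformation per batch

    counts.print()          // output operation on each batch's result
    ssc.start()
    ssc.awaitTermination()
  }
}
```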

Spark SQL

Shark is the predecessor of Spark SQL. It was released about three years earlier, when Hive was practically the only choice for SQL on Hadoop, responsible for compiling SQL into scalable MapReduce jobs. Shark was created to stay compatible with Hive while gaining performance from Spark: it is, in essence, Hive on Spark. It reuses Hive's HQL parsing to translate HQL into RDD operations on Spark and uses Hive's metadata to obtain table information; the data and files themselves are fetched from HDFS by Shark and computed on Spark. Shark is fast and fully compatible with Hive. In shell mode, APIs such as RDD2SQL() can be used to compute an HQL result set in Scala and then write simple machine learning or analysis functions to process the results further.

At the Spark Summit on July 1, 2014, Databricks announced that development of Shark was being discontinued in favor of Spark SQL. Databricks said Spark SQL would cover all of Shark's features and that users would be able to upgrade seamlessly from Shark 0.9. Databricks also noted that Shark is essentially a reworking of Hive that replaces Hive's physical execution engine, which is why it runs much faster. However, Shark inherits a large amount of Hive code, which causes considerable optimization and maintenance headaches. As performance optimization and the integration of advanced analytics deepened, the MapReduce-based design became a bottleneck for the whole project. Therefore, to move forward and provide a better user experience, Databricks terminated the Shark project and focused on Spark SQL.

Spark SQL allows developers to work directly with RDDs while also querying external data, for example data stored in Apache Hive. An important feature of Spark SQL is its ability to unify relational tables and RDDs, so developers can mix SQL commands for external queries with more complex data analysis. Alongside Spark SQL, Michael also introduced the Catalyst optimization framework, which lets Spark SQL automatically optimize query plans so that SQL executes more efficiently.

Features of Spark SQL

  • Introduces a new RDD type, SchemaRDD, which can be defined much like a traditional database table: a SchemaRDD consists of row objects together with a schema that describes the data type of each column. A SchemaRDD can be converted from an ordinary RDD, read from a Parquet file, or obtained from Hive via HiveQL.

  • Ships with the Catalyst query optimization framework: after SQL is parsed into a logical execution plan, classes and interfaces in the Catalyst package apply some straightforward plan optimizations before the plan finally becomes RDD computation
  • Lets an application mix data from different sources, for example joining data obtained via HiveQL with data obtained via SQL (see the sketch after this list)
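
A minimal sketch of the SchemaRDD workflow described above, using the Spark 1.1/1.2-era SQLContext API (the file, its format, and the Person case class are made up):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

// Assumes an existing SparkContext named sc.
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

// Build a SchemaRDD from an ordinary RDD of case classes.
val people = sc.textFile("people.txt")        // lines like "Alice,29"
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")            // expose the SchemaRDD to SQL

// Query with SQL; the result is again a SchemaRDD.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Name: " + row(0)).collect().foreach(println)
```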

Shark improved SQL-on-Hadoop performance by 10 to 100 times compared with Hive. With Hive out of the picture, how does Spark SQL perform? Although its improvement over Shark is not as dramatic as Shark's was over Hive, it still performs very well.

Why does Spark SQL get such a big performance boost? Its main optimizations are as follows:

  • In-Memory Columnar Storage: Spark SQL stores table data in memory using a columnar format rather than raw JVM objects.
  • Bytecode generation: Spark 1.1.0 added a CodeGen module to Catalyst that uses dynamic bytecode generation to compile matching expressions into specialized code. In addition, SQL expressions get CG optimization, whose implementation relies mainly on the runtime reflection of Scala 2.10.
  • Scala code optimization: when Spark SQL is written in Scala, it avoids inefficient, GC-prone code; although this makes the code harder to write, the interface presented to users stays unified.
BlinkDB

BlinkDB is a massively parallel query engine for running interactive SQL queries on massive amounts of data. It lets users trade query accuracy for response time within an allowed error bound. To achieve this, BlinkDB relies on two core ideas:

  • An adaptive optimization framework that builds and maintains a set of multidimensional samples of the raw data over time.
  • A dynamic sample-selection strategy that picks a sample of the appropriate size based on the accuracy and/or response-time requirements of the query.

Unlike a traditional relational database, BlinkDB is an interesting interactive query system that works like a seesaw: users must trade off query accuracy against query time. To get results faster, the user gives up some accuracy; to get higher-precision results, the user gives up some response time. Users can specify an error bound at query time, for example asking for an answer within a stated relative error at a given confidence level, or within a fixed time budget.