
This article was first published in the Cloud + Community and may not be reproduced without permission.

Author: Guopeng Zhang | Tencent Operations Development Engineer

1. Foreword

As a big data computing engine, Spark is fast, stable, and easy to use, and it has rapidly taken hold in the big data computing field. This article presents the author's understanding of Spark gained while building and operating a computing platform, in the hope of giving readers some ideas for learning it. It covers Spark's role on the DataMagic platform, how to learn Spark quickly, and how to use Spark well on the DataMagic platform.


2. Spark's role on the DataMagic platform

Figure 2-1

The overall architecture is shown in Figure 2-1. Its main functions are log access, query (real-time and offline), and computation. The offline computing platform is mainly responsible for computation, and the system's storage is COS (an internal object storage service) rather than HDFS.

The platform uses the Spark on YARN architecture; Figure 2-2 shows the running process of Spark on YARN.

Figure 2-2


3. How to quickly master Spark

To understand Spark, I think the following four steps are enough.

1. Understand the Spark terminology

To get started, study the Spark architecture diagram to quickly grasp the key terms; once you have mastered them, you will have a basic understanding of Spark. The architecture terms include Shuffle, Partitions, MapReduce, Driver, Application Master, Container, Resource Manager, and Node Manager; the key API programming terms are RDD and DataFrame. Architecture terms help you understand how Spark works, while API terms are what you use when writing Spark code.
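As a quick illustration of the two API terms, here is a minimal PySpark sketch; the session setup and sample data are purely illustrative and use the SparkSession API available since Spark 2.x:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("terminology-demo").getOrCreate()
sc = spark.sparkContext

# RDD: a low-level distributed collection manipulated with functional operators
rdd = sc.parallelize([("2018-04-01", 3), ("2018-04-02", 5)])
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))

# DataFrame: a distributed table with named columns that can be queried with SQL
df = spark.createDataFrame(doubled, ["day", "cnt"])
df.createOrReplaceTempView("counts")
spark.sql("SELECT day, cnt FROM counts WHERE cnt > 5").show()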

2. Master the key configurations

When Spark runs, it reads a lot of runtime information from configuration files, usually spark-defaults.conf. To use Spark properly, you need to master some key configurations, such as the memory-related settings spark.yarn.executor.memoryOverhead and spark.executor.memory, and the related spark.network.timeout. Much of Spark's behavior can be changed through configuration, so you need a good command of it. For example, spark.speculation enables speculative execution: if Worker1 runs a task slowly, Spark launches a Worker2 that performs the same task. This feature is very useful for ordinary computing tasks. However, if a task writes data out to MySQL and two identical workers run at the same time, duplicate rows will appear in MySQL. So understand what a configuration does before you use it; many of the configurations are listed directly in the official Spark configuration page, which you can find by searching for "Spark conf".
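As a hedged sketch (the values below are placeholders, not the platform's actual settings), these keys can live in spark-defaults.conf or be set programmatically, for example:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("config-example")
        .set("spark.executor.memory", "4g")                  # executor heap size
        .set("spark.yarn.executor.memoryOverhead", "1024")   # off-heap overhead, in MB
        .set("spark.network.timeout", "300s")                # default timeout for network interactions
        .set("spark.speculation", "false"))                  # keep off for jobs that write to MySQL
sc = SparkContext(conf=conf)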

3. Use Spark parallelism well

The reason we use Spark for computation is that it is fast, and its speed comes largely from parallelism. By learning how Spark provides parallelism, we can tune it better.

To improve parallelism for RDD operations, there are several knobs: 1. configure num-executors; 2. configure executor-cores; 3. configure spark.default.parallelism. As a general rule of thumb, spark.default.parallelism = num-executors * executor-cores * 2~3 is appropriate. For Spark SQL, set spark.sql.shuffle.partitions together with num-executors and executor-cores.
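As an illustrative sketch (the cluster sizing numbers are assumptions, not recommendations), the rule of thumb above translates to something like:

from pyspark import SparkConf

num_executors = 10       # assumed sizing; matches --num-executors on spark-submit
executor_cores = 4       # assumed sizing; matches --executor-cores
parallelism = num_executors * executor_cores * 3   # upper end of the 2~3x rule

conf = (SparkConf()
        .set("spark.executor.instances", str(num_executors))
        .set("spark.executor.cores", str(executor_cores))
        .set("spark.default.parallelism", str(parallelism))      # for RDD operations
        .set("spark.sql.shuffle.partitions", str(parallelism)))  # for Spark SQL shuffles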

4. Learn how to modify Spark code

Beginners can feel confused, especially when they need to optimize or modify Spark. In fact, Spark is modular, so we can focus on the relevant part first; there is no need to treat Spark as something complex and hard to understand.

First, the directory structure of Spark is shown in Figure 3-1; the folder names let you quickly locate the SQL, GraphX, and other code. The Spark runtime environment is mainly made up of JAR packages, as shown in Figure 3-2, and all of them can be built from the Spark source code. To modify a function, you only need to find the code belonging to the corresponding JAR package, rebuild that JAR, and replace it.

Figure 3-1
Figure 3-2

Compiling the source code is also straightforward: install Maven, Scala, and the other related dependencies, download the source, and build it. Knowing how to modify source code is a very important skill when working with open-source projects.


4. Spark on the DataMagic platform

Using Spark on DataMagic has been a process of learning while doing. Some important lessons from that process are listed below.

1. Rapid deployment

The number of computing tasks and the volume of data change every day, so the Spark platform needs to be deployable quickly. For physical machines, a one-click deployment script brings a machine with 128 GB of memory and 48 cores online. However, physical machines usually require an application and approval process to obtain, so Docker is also used to supply computing resources.

2. Use configuration wisely to optimize calculation

Most of Spark's behavior is controlled through configuration, so you can change how Spark runs simply by changing its configuration. For example, the number of executors can be adjusted automatically through dynamic allocation, which is configured as follows.

2.1 Add the configuration to yarn-site.xml on the NodeManager nodes:

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
   </property>

2.2 Copy the spark-2.1.0-yarn-shuffle.jar file to the hadoop-yarn/lib directory (the YARN library directory).

2.3 Add the configuration to Spark's spark-defaults.conf:

# minimum number of executors
spark.dynamicAllocation.minExecutors 1
# maximum number of executors
spark.dynamicAllocation.maxExecutors 100
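One hedged note: on its own, the snippet above only bounds the executor count. In stock Spark, dynamic allocation also has to be switched on explicitly; the usual companion settings (added here as a reminder, they are not part of the original excerpt) are:

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true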

With this configuration, the number of executors is adjusted automatically.

3. Allocate resources wisely

As a platform, its computing tasks are certainly not fixed: some datasets are large and some are small, so resources must be allocated sensibly. A dataset on the order of tens of billions of records, for example, needs far more computing resources. For the related parallelism settings, refer to point 3 of Section 3.

4. Meet business needs

In fact, the purpose of computation is to serve the business, and meeting business needs should be what the platform pursues. When the business has a reasonable need, the platform should try to satisfy it. For example, to support high-concurrency, real-time queries, Spark on DataMagic supports writing results out to CMongo.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# dbparameter holds space-separated key=value pairs describing the target CMongo instance
database = dict(l.split('=') for l in dbparameter.split())

parquetFile = sqlContext.read.parquet(file_name)
parquetFile.registerTempTable(tempTable)
result = sqlContext.sql(sparksql)

url = "mongodb://" + database['user'] + ":" + database['password'] + "@" + database['host'] + ":" + database['port']
result.write.format("com.mongodb.spark.sql").mode('overwrite') \
    .options(uri=url, database=database['dbname'], collection=pg_table_name).save()

5. Adapt the code to the scenario

As a general computing platform, Spark does not need to be modified in common application scenarios. On the DataMagic platform, however, we need to "change it as we use it". Here is a simple scenario: in log analysis, the volume of logs pulled in reaches billions of records per day. When a field in the underlying logs cannot be parsed as UTF-8, an exception is thrown during the Spark computation and the job fails. Filtering out the garbled data before it lands, however, could hurt the efficiency of data collection, so we decided to handle the problem inside the Spark computation instead, by adding exception handling to the code that converts the data.
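A minimal sketch of this idea (not the platform's actual code; raw_rdd and the field layout are assumptions) is to wrap the decoding step so that undecodable bytes are tolerated rather than failing the job:

def safe_decode(value):
    # Tolerate fields that are not valid UTF-8 instead of letting the whole job fail
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        return value.decode('utf-8', errors='replace')

cleaned = raw_rdd.map(lambda fields: [safe_decode(f) if isinstance(f, bytes) else f
                                      for f in fields])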

6. Locate the Job problem

When a Spark job fails, you need to locate the cause. If the job has already failed, you can use yarn logs -applicationId <application id> to aggregate the task logs, open them, and find the traceback. Generally speaking, failures fall into several categories.

A. Code problems: the SQL has syntax errors, or the Spark code has bugs.

B. Spark problems, for example the way older Spark versions handle NULL values.

C. The task stays in the Running state for a long time, which is often caused by data skew.

D. The task exceeds its memory limits.

7. Cluster management

The Spark cluster needs day-to-day operation and maintenance so that problems can be found early and the cluster optimized. The following practices help keep the cluster robust and stable and keep tasks running smoothly.

A. Periodically check for lost or unhealthy nodes; alarms can be set up with periodic scripts (a hedged sketch appears after this list).

B. Periodically scan the HDFS runtime logs to check whether the log directories are filling up, and delete expired logs regularly.

C. Periodically check whether cluster resources are sufficient for the computing tasks, so that additional capacity can be deployed in advance.
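For point A, here is a hedged sketch of a node-health check built on the standard yarn node -list command; the output parsing is approximate (it depends on the Hadoop version's report format), and send_alarm is a placeholder for whatever alarm hook the platform uses:

import subprocess

def check_yarn_nodes():
    # List all YARN nodes and flag any whose state is not healthy
    out = subprocess.check_output(["yarn", "node", "-list", "-all"]).decode("utf-8", errors="replace")
    bad = [line for line in out.splitlines()
           if any(state in line for state in ("UNHEALTHY", "LOST", "DECOMMISSIONED"))]
    if bad:
        send_alarm("Problem nodes found:\n" + "\n".join(bad))  # send_alarm: hypothetical alarm hook

check_yarn_nodes()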


5. Summary

This article mainly describes the author's understanding of Spark and how Spark is used on the DataMagic platform. The platform is currently used for offline analysis, and the amount of data computed and analyzed every day has reached the 100-billion to 100-trillion level.





Published by the Tencent Cloud + Community with the author's authorization. Original link: https://cloud.tencent.com/developer/article/1092587?fromSource=waitui