Abstract:

Headache

A common question from MaxCompute users is: why has the same periodic task been running slower in recent days than before? Or why does a job that used to deliver its output on time now finish past its deadline?

Typically, slow job execution can be caused by a lack of quota, changes in input data, changes in data distribution, platform hardware failures causing reruns, etc. This article describes how to use MaxCompute Studio’s job execution graph and job details to help locate problems caused by data changes.

MaxCompute Studio debut

Let’s take an example. Below are two executions of the same task, on May 7 and May 9 respectively, referred to as Job 1 and Job 2:

Job 1: 24 minutes and 38 seconds

Job 2: 9 minutes and 40 seconds

First, compare the SQL and execution plans

Generally speaking, the first thing to do is compare the two SQL scripts to check whether they are the same. If the job script has been changed, analyze the impact of that change first. If the script content is identical, the next step is to compare the execution plans: you can use MaxCompute Studio’s job execution plan graph feature (see the documentation) to visually check whether the two execution plan graphs are identical. (Job comparison is under development; in the next version of Studio you will be able to compare two jobs with one click and see the differences between their SQL scripts and other parts. Stay tuned!)

In this example, the SQL scripts and execution plans for the two jobs are exactly the same. For data security reasons, the specific SQL content is not pasted here.

Next, compare the amount of data

The second step is to check whether the amount of data input to the job has changed significantly. It is possible that the data in an input table or partition grew sharply on a certain day, slowing down the job. The job’s input data is shown on the left side of the Detail page, which lists all of the job’s input tables and final output tables. Expand Input -> table name to view the details, including which Fuxi task read the table, how many records were read, and how many bytes were read. This information is also marked on the job execution plan graph for easy viewing, as shown in the pictures below.

Input tables and data read for Job 1

Input tables and data read for Job 2

You can also read the number of input records or bytes directly from the graph (records are displayed by default; right-click an edge to switch to bytes):

Input record counts on the Job 1 graph

Input record counts on the Job 2 graph

In this example, in terms of the amount of input data, there is not much difference between the two executions.
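
If you want to double-check the input volume outside of Studio, a quick spot check on the source partitions also works. Below is a minimal sketch assuming a daily-partitioned input table; the table name and partition spec are placeholders, not the actual inputs of this job:

-- Placeholder table and partition names; substitute the job's real inputs.
-- DESC ... PARTITION reports the physical size of a single partition,
-- and COUNT(*) gives its record count, so the two runs' inputs can be compared.
DESC your_input_table PARTITION (ds='20180507');

SELECT COUNT(*) AS record_cnt
FROM your_input_table
WHERE ds = '20180507';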

Job playback is the best tool

Step three: how exactly did the job run? Where did all that time go? No hurry, just use job playback! The job playback toolbar at the bottom of Studio’s job execution plan graph lets you replay the entire progress of the job’s execution within 30 seconds, making it easy to see which Fuxi task took the most time, processed the most data, output the most data, and so on.

During playback, Studio provides both a progress chart and heat maps, so the job can be analyzed from different dimensions.

Playback of the Job 1 progress chart

Playback of the Job 2 progress chart

Two things can be seen from the playback: (1) Task J6_2_4_5 is the bottleneck of the whole job and consumes the most time. The task heat map also shows that this node is noticeably hotter than the other nodes. (On the time heat map, redder means a longer running time; on the data heat map, redder means more data processed.)

(2) Playing back the two jobs side by side shows that Job 1 spends significantly more time in J6_2_4_5 than Job 2 does, indicating that Job 1 slows down in J6_2_4_5; a long-tail node is the initial suspect.

Next, switch to the Timeline tab and compare the execution timelines of the two jobs:

Job 1 execution timeline

Job 2 execution timeline

Although the two timelines use different time scales, it is obvious that J6_2_4_5 takes up a much larger share of Job 1, so it can be concluded that the execution of J6_2_4_5 on May 7 (i.e., in Job 1) probably has a long tail, which slows down the whole job.

Going deeper

The fourth step is to compare the amount of input data to J6_2_4_5, using either the graph or the Detail tab.

Job 1 Detail tab

Job 2 Detail tab

From the J6_2_4_5 input data volume and statistics shown above, we can see that the input data volume of J6_2_4_5 is basically the same in the two jobs: 1.58 trillion bytes vs 1.7 trillion bytes. The statistics also show that the cumulative and average execution times are basically the same: 292398 vs 307628, and 72 s vs 76 s. However, the longest-running Fuxi instance in Job 1 takes 710 s, nearly ten times the average, so it can be concluded that a long tail in a few Fuxi instances causes the long tail of the J6_2_4_5 Fuxi task.

The fifth step is to sort the J6_2_4_5 Fuxi instance list by latency:

Job 1

Job 2

Alternatively, go to the Long Tail page under the Analysis tab to find the long-tail nodes. Note that the scale in this chart increases geometrically rather than linearly, so a spike that sticks out noticeably above the rest is very likely a long-tail instance; this can be confirmed from the details in the hover window. Studio also provides a diagnosis page that automatically finds long-tail instances whose execution time is more than twice the average instance execution time.

Job 1

Job 2

From the comparison of Fuxi instance input data above, it can be determined that data skew in instance J6_2_4_5#1912_0 causes the long tail of the whole job: the input data of J6_2_4_5#1912_0 is roughly 7 times that of the other instances.

Getting to the bottom of it

The sixth step is to examine the execution log of a single instance and analyze the detailed execution plan of J6_2_4_5 in the job graph, in order to find the data source that causes the long tail.

Opening the operator graph of J6_2_4_5 shows that it contains two joins, Merge Join2 and Merge Join3. Merge Join2 is used here as an example to show how to trace the data sources.

In Merge Join2, you can see the join keys: _col13 and user_id. _col13 comes from J5 and is computed as IF(ISNULL(HOT_SELLER_MAP(sellerID)), sellerID, CONCAT(seller, TOSTRING(RAND()))), where seller comes from M4 and M3.

J5 Operator Details
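
As an aside, the _col13 expression above is a common trick for easing a hot join key: sellers that appear in the hot-seller map get a random suffix appended, so a single hot key is spread across multiple join instances. A minimal sketch of the pattern is shown below; seller_id stands in for the sellerID/seller column in the expression above, source_table is a placeholder, and HOT_SELLER_MAP is the job's own lookup function:

-- Illustrative only: the salting pattern behind _col13.
-- Sellers not in the hot-seller map keep their original key; hot sellers get
-- a random suffix so that one hot key no longer lands on a single instance
-- (the other side of the join must be expanded with the same suffixes).
SELECT
    IF(ISNULL(HOT_SELLER_MAP(seller_id)),
       seller_id,
       CONCAT(seller_id, TOSTRING(RAND()))) AS _col13,
    user_id
FROM source_table;  -- placeholder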

Open the operator details of M4 and M3 respectively, and you can see that seller comes from tMP_S_DW_LOG_APP_USER_track_PRE_1_20180508 and DIM_feed_shop respectively.

M4 Operator details

M3 Operator details

The user_id of Merge Join2 is from DIM_tb_shop.

Finally, write SQL to simulate generating the corresponding user_id and _col13 values, compare the data volume of these two fields, and then optimize the SQL script accordingly.
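
A minimal sketch of such a check is shown below, assuming the simulated keys have been written to an intermediate table (simulated_join_input is a placeholder; in practice _col13 would be regenerated with the salting expression shown earlier):

-- Illustrative skew check: count how many rows each join key value contributes.
-- A value with a disproportionately large count is the likely cause of the long tail.
SELECT _col13, COUNT(*) AS cnt
FROM simulated_join_input   -- placeholder for the regenerated keys
GROUP BY _col13
ORDER BY cnt DESC
LIMIT 100;

Running the same count on user_id from DIM_tb_shop makes it easy to compare the distributions of the two join keys and see which values are skewed, which is what the final SQL optimization needs to address.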
