MaxCompute is dedicated to the storage and computation of batch structured data. It provides massive data warehouse solutions as well as analytical modeling services. Even so, MaxCompute SQL tasks can sometimes run slowly.

Slow tasks generally fall into the following categories:

  1. Queuing caused by insufficient resources (usually in subscription, i.e. annual/monthly, projects)
  2. Data skew and data bloat
  3. Inefficiency caused by the user’s own logic

1. Insufficient resources

An ordinary SQL task consumes CPU and memory resources; see the reference link for how to view this usage.

1.1 Look at job elapsed time and execution phases

1.2 Waiting for task submission

If “Job Queueing…” is still displayed after you submit a task, it may be that other users’ tasks are occupying the resource group’s resources, so your task is queued.

In SubStatusHistory, this shows up as “Waiting for scheduling”.

1.3 Insufficient resources after task submission

There is another case in which the task is submitted successfully, but the current resource group cannot start all of its instances at the same time because the task requires a large amount of resources, so the task makes progress but does not execute quickly. This can be observed through the Latency Chart in LogView: click the corresponding task in the Detail view to see the chart.

The figure above shows a task running with sufficient resources: the bottom edge of the blue area is flush, indicating that all instances started at about the same time.

If the lower edge of the chart instead rises like a staircase, the task’s instances are being scheduled bit by bit, and resources were insufficient while the task ran. If the task is important, consider adding resources or raising its priority.

1.4 Reasons for insufficient resources

1. Use the CU Manager to check whether the CUs are fully occupied: click the corresponding task point and locate the corresponding time to see how jobs were being submitted.



Sort by CPU proportion

(1) If a single task consumes too many CUs, locate that large task and find out what is causing it (too many small files, or a data volume that genuinely needs that many resources). (2) If CU usage is spread evenly across tasks, multiple large tasks were submitted at the same time and together filled up the CU resources.

2. CUs filled up by too many small files

The degree of parallelism in the Map stage is determined by the split size of the input files, which indirectly controls the number of workers in each Map stage. The default split size is 256 MB. If every small file is read as a separate split, then, as shown below, the I/O bytes of each instance of task M1 in the Map stage are only about 1 MB or even tens of KB, so the parallelism rises above 2,500 and the resources are filled up instantly. This indicates that the table contains too many small files and they need to be merged.

Merging small files: https://help.aliyun.com/knowl…
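As a rough sketch of what the merge operation in the linked document looks like, MaxCompute provides an ALTER TABLE … MERGE SMALLFILES statement; the project, table, and partition names below are placeholders:

```sql
-- Hedged sketch: merge the small files of one partition of a table.
-- my_project.ods_log and the dt partition value are hypothetical placeholders.
ALTER TABLE my_project.ods_log PARTITION (dt = '20240101') MERGE SMALLFILES;
```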

3. A large amount of data fills up the resources

You can purchase more resources. If it is a one-off job, you can add the setting set odps.task.quota.preference.tag=payasyougo; which lets the specified job temporarily run in the larger pay-as-you-go resource pool.
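A minimal sketch of how this flag is used, assuming a pay-as-you-go quota is available to the project; the query and table name are placeholders:

```sql
-- Route this one-off job to the pay-as-you-go resource pool (flag taken from the text above).
set odps.task.quota.preference.tag=payasyougo;

-- Placeholder query; my_project.ods_log is a hypothetical table.
SELECT dt, COUNT(*) AS pv
FROM my_project.ods_log
GROUP BY dt;
```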

1.5 How to adjust task parallelism

MaxCompute automatically estimates parallelism based on the input data and the complexity of the task, so there is generally no need to adjust it. Ideally, the greater the parallelism, the faster the processing. The main settings are listed below, followed by a usage sketch.

Map phase parallelism

odps.stage.mapper.split.size: modifies the input data volume of each Map worker, that is, the split size of the input files, thereby indirectly controlling the number of workers in each Map stage. The unit is MB; the default value is 256.

odps.stage.reducer.num: modifies the number of workers in each Reduce stage.

odps.stage.num: modifies the concurrency of all workers of the specified MaxCompute task; its priority is lower than the odps.stage.mapper.split.size, odps.stage.reducer.num, and odps.stage.joiner.num settings.

odps.stage.joiner.num: modifies the number of workers in each Join stage.
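A minimal sketch of setting these flags for a single job; the values and the query below are illustrative only, not recommended defaults:

```sql
-- Illustrative values only; tune them against the actual data volume.
set odps.stage.mapper.split.size=512;   -- larger splits -> fewer Map workers (unit: MB, default 256)
set odps.stage.reducer.num=200;         -- number of workers in each Reduce stage
set odps.stage.joiner.num=200;          -- number of workers in each Join stage
-- set odps.stage.num=500;              -- overall concurrency; lower priority than the flags above

-- Placeholder query; my_project.dwd_orders is a hypothetical table.
SELECT buyer_id, COUNT(*) AS order_cnt
FROM my_project.dwd_orders
GROUP BY buyer_id;
```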

2. Data skew


Most of the instances in a task have finished, but a few have not. As shown in the figure below, most (358) instances have finished while 18 are still Running. These instances run slowly either because they have to process much more data or because the particular data they process is slow to handle.



The solution: https://help.aliyun.com/docum…
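The linked document describes the fixes in detail. As one common, hedged example: when the skew comes from joining a large table against a small dimension table on a hot key, a MAPJOIN hint broadcasts the small table so the hot key is not funneled into a single worker (the table names below are made up):

```sql
-- One common skew mitigation: broadcast the small table with a MAPJOIN hint.
-- my_project.dwd_orders (large, skewed) and my_project.dim_city (small) are hypothetical tables.
SELECT /*+ MAPJOIN(d) */
       f.order_id,
       d.city_name
FROM my_project.dwd_orders f
JOIN my_project.dim_city d
  ON f.city_id = d.city_id;
```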

3. Logic problems

This means the user’s SQL or UDF logic is inefficient, or optimal parameter settings are not used. It typically shows up as a Fuxi task that runs for a long time while each of its instances takes a roughly even amount of time. The scenarios vary widely: some logic is genuinely complex, while other cases leave plenty of room for optimization.

Data bloat

The output data of a task is much larger than the input data.

For example, 1 GB of data becomes 1 TB after processing. If that 1 TB is processed by a single instance, efficiency drops sharply. The input and output data volumes are shown in the Task’s I/O Record and I/O Bytes:

Solution: confirm that the business logic really requires this, then increase the parallelism of the corresponding stages.

UDF execution is inefficient

A task executes inefficiently and contains a user-defined function (UDF). It may even fail with a UDF execution timeout: “Fuxi job failed – WorkerRestart errCode: 252, errMsg: kInstanceMonitorTimeout, usually caused by bad udf performance”.

First, determine where the UDF is. Click the slow Fuxi task and check whether a UDF appears in its Operator Graph. For example, the following figure shows a Java UDF.



You can see how fast this operator runs by looking at the StdOut of a Fuxi instance in LogView. Normally, Speed(records/s) is in the hundreds of thousands to millions.

Solution: check the UDF logic and use built-in functions instead wherever possible.
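A minimal sketch of what “use built-in functions” means in practice; the UDF name and table are hypothetical, and TO_CHAR stands in for whichever built-in covers the same logic:

```sql
-- Before (slow): a hypothetical per-row Java UDF that only formats a datetime.
-- SELECT my_format_date_udf(gmt_create) AS dt FROM my_project.ods_log;

-- After: the built-in TO_CHAR does the same formatting without UDF invocation overhead.
SELECT TO_CHAR(gmt_create, 'yyyy-mm-dd') AS dt
FROM my_project.ods_log;   -- hypothetical table
```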


This article is original content from Alibaba Cloud and may not be reproduced without permission.