Reposted from: yq.aliyun.com/articles/60…

Data skew

Data skew during the shuffle phase can make optimization ineffective in some cases. The main reason is that the Counters obtained after a job completes are totals for the entire job, and tuning based on the averages of those Counters loses meaning when the data is skewed: the amount of data processed by individual map tasks differs too much for the average to be representative. Hive executes in stages, and the amount of data a map task processes depends on the reduce output of the previous stage, so distributing data evenly across the reducers is the fundamental cure for data skew. Avoiding the problem up front is more effective than fixing it afterwards.

1 Causes of data skew

1.1 Operations

| Keyword | Situation | Consequence |
| --- | --- | --- |
| Join | One of the tables is small, but its keys are concentrated | The data distributed to one or a few reducers is far above average |
| Join | Large table joined with a large table, but the join field has too many 0 or null values | All the null values are processed by a single reducer, which is very slow |
| group by | Too few group-by dimensions, with too many rows for certain values | The reducer processing a hot value is time-consuming |
| count distinct | Too many special values | The reducer processing these special values is time-consuming |

1.2 Causes

1. Uneven key distribution
2. Characteristics of the business data itself
3. Some SQL statements inherently cause data skew

1.3 Symptoms

The task progress stays at 99% (or 100%) for a long time. On the task monitoring page, only one or a few reduce subtasks remain unfinished, because the amount of data they process differs greatly from that of the other reducers. The number of records handled by a single reducer differs from the average by a factor of three or more, and its running time is far longer than the average.

2 Solutions to data skew

2.1 Parameter Adjustment

hive.map.aggr=true: performs partial aggregation on the map side, equivalent to a Combiner.

hive.groupby.skewindata=true: performs load balancing when data is skewed. When this parameter is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly to the reducers; each reducer does a partial aggregation and emits its result, so rows with the same Group By key may land on different reducers, which balances the load. The second job then distributes the pre-aggregated results to the reducers by the Group By key (this guarantees that identical Group By keys go to the same reducer) and completes the final aggregation.
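As a minimal sketch (the table and column names below are assumptions, not from the original), these parameters are typically set in the session before running the skewed query:

```sql
-- Enable map-side partial aggregation (Combiner-like behavior).
SET hive.map.aggr=true;
-- Rewrite the GROUP BY into two MR jobs to balance skewed keys.
SET hive.groupby.skewindata=true;

-- Hypothetical skewed aggregation over an assumed page_view_log table.
SELECT user_id, COUNT(*) AS pv
FROM page_view_log
GROUP BY user_id;
```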

2.2 SQL Statement Adjustment

  • Choice of join: pick the table with the most uniform join-key distribution as the driver table, and apply column pruning and filtering first, so the amount of data is relatively small when the two tables are joined.
  • Small table joining a large table: use map join to load the small dimension table (fewer than 1000 rows) into memory and finish the join on the map side, avoiding the reduce phase.
  • Large table joining a large table: turn null keys into strings with a random number appended, so the skewed data is spread across different reducers; since null values cannot match anything, the final result is unaffected.
  • count distinct with many identical special values: null values can be filtered out before computing count distinct; if nulls themselves must be counted, add 1 to the final result directly. If other aggregations also need the group by, process the null records separately first and then union them with the other results.
  • group by on low-cardinality dimensions with count(distinct): replace count(distinct) with a sum() ... group by rewrite (see the sketch after this list). Special cases need special handling: when business-logic optimization has little effect, it is sometimes possible to split off the skewed data, process it separately, and union it back at the end.
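A sketch of the count(distinct) rewrite mentioned above; the log(dt, user_id) schema is an assumption for illustration:

```sql
-- Skew-prone form: one reducer must see every user_id for a given dt.
-- SELECT dt, COUNT(DISTINCT user_id) AS uv FROM log GROUP BY dt;

-- Rewritten form: deduplicate with an inner GROUP BY first, then count.
SELECT dt, SUM(1) AS uv
FROM (
  SELECT dt, user_id
  FROM log
  GROUP BY dt, user_id
) t
GROUP BY dt;
```

The inner GROUP BY spreads the deduplication work across many reducers instead of concentrating it on one.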

2.3 Data skew caused by null values

Scenario: fields are often missing in logs, such as the user_id. If the log table's user_id is joined with user_id in the users table, data skew will occur.

Solution 1: exclude null user_ids from the join and union them back:

```sql
SELECT * FROM log a
JOIN users b
  ON a.user_id IS NOT NULL
 AND a.user_id = b.user_id
UNION ALL
SELECT * FROM log a
WHERE a.user_id IS NULL;
```

Solution 2: assign a new random key to null values:

```sql
SELECT * FROM log a
LEFT OUTER JOIN users b
  ON CASE WHEN a.user_id IS NULL
          THEN concat('hive', rand())
          ELSE a.user_id
     END = b.user_id;
```

Conclusion: solution 2 is more efficient than solution 1, with fewer I/O operations as well as fewer jobs. Solution 1 reads log twice and runs 2 jobs, while solution 2 runs only 1 job. This optimization suits skew caused by invalid ids (such as -99, '', null). Turning the null key into a string with a random number appended spreads the skewed data across different reducers and resolves the skew.

3 Data skew when different data types are joined

Scenario: user_id in the users table is an int, while user_id in the log table contains both string and int values. When the two tables are joined on user_id, the default hash is computed on the int value, so all records with string ids end up on a single reducer.

Solution: cast the int id to a string, so the hash is computed on string values for every row:

```sql
SELECT *
FROM users a
LEFT OUTER JOIN log b
  ON a.user_id = CAST(b.user_id AS STRING);
```

3.1 Using map join to solve skew

Map join solves the data-skew problem when a small table (with few records) is joined with a large table, and it is used frequently. But when the small table is not small enough, map join triggers bugs or exceptions and needs special handling. For example:

```sql
SELECT * FROM log a
LEFT OUTER JOIN users b
  ON a.user_id = b.user_id;
```

The users table has more than 6 million records, so shipping users into every map task is no small overhead, and map join does not support such a large "small" table. Yet an ordinary join runs into the data-skew problem again.

Solution: map join only the subset of users that actually appears in log:

```sql
SELECT /*+mapjoin(x)*/ *
FROM log a
LEFT OUTER JOIN (
  SELECT /*+mapjoin(c)*/ d.*
  FROM (SELECT DISTINCT user_id FROM log) c
  JOIN users d ON c.user_id = d.user_id
) x
ON a.user_id = x.user_id;
```

If log contained millions of distinct user_ids, this would degenerate into the original map join problem. Fortunately, daily member UV is not that large: there are not that many members with transactions, clicks, or commissions. So this approach solves the data-skew problem in many scenarios.

4 Summary

Making the map output distribute evenly across the reducers is our ultimate goal. Because of the limitations of the hash algorithm, hashing by key will always produce some skew. Plenty of experience shows that data skew is usually caused by carelessness in table design or can be circumvented in the business logic. More general steps are given here:

  1. Sample the log table to find which user_ids are skewed, producing a result table tmp1. The computing framework cannot know the distribution of the data it receives, so sampling is indispensable.
  2. Data distribution follows the sociological rule of inequality: the skewed keys will be few, just as a society has few rich people and few outliers, so tmp1 will contain few records. Map join tmp1 with users to produce tmp2, and load tmp2 into the distributed file cache. This is a map-only process.
  3. Each map task reads both users and log. For a record from log, check whether its user_id is in tmp2: if so, write it to local file a; otherwise emit the key/value pair to the reduce phase.
  4. Finally, file a is merged with the files output by the reducers of step 3 and written to HDFS.
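Hive cannot express the local-file details of steps 3 and 4 directly, but the overall split-and-union idea can be approximated in HiveQL. The following is a rough sketch under invented assumptions: the table schemas, the sampling clause, and the skew threshold are all illustrative, not from the original.

```sql
-- Step 1: sample log to find skewed user_ids (assumed 1% sample and
-- an assumed threshold of 10000 occurrences).
CREATE TABLE tmp1 AS
SELECT user_id
FROM log TABLESAMPLE(BUCKET 1 OUT OF 100 ON rand()) s
GROUP BY user_id
HAVING COUNT(1) > 10000;

-- Step 2: tmp1 is small, so map join it with users to build tmp2.
CREATE TABLE tmp2 AS
SELECT /*+mapjoin(t)*/ d.*
FROM tmp1 t
JOIN users d ON t.user_id = d.user_id;

-- Steps 3-4: handle skewed keys with a map join against tmp2, handle
-- the remaining keys with an ordinary join, then union the results.
SELECT /*+mapjoin(x)*/ a.*, x.*
FROM log a
JOIN tmp2 x ON a.user_id = x.user_id
UNION ALL
SELECT a.*, b.*
FROM log a
JOIN users b ON a.user_id = b.user_id
WHERE a.user_id NOT IN (SELECT user_id FROM tmp1);
```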

If it is determined that the business truly needs such skewed logic, consider the following optimizations:

  1. For joins, use map join when the small table is under 1 GB.
  2. For group by or distinct, set hive.groupby.skewindata=true.
  3. Use the SQL statement rewrites described above wherever possible.