1. Memory overflow

1.1 Client Memory Overflows

Hive > select count(1) from test_tb_1_1;

Query ID = hdfs_20180802104347_615d0836-cf41-475d-9bec-c62a1f408b21
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. Java heap space

Cause: The statement scans all partitions of the table. If the table has many partitions and a large amount of data, an error may be reported indicating that client memory is insufficient. Note: this is a client-side memory overflow. The client log shows the memory overflow exception without ever printing an application ID for the job, and no information about the job appears on the ResourceManager.

Note: these are client-side parameters; they must be specified before Hive starts and cannot be changed once the session is running. Therefore, before running the Hive command, change the environment variable: export HIVE_CLIENT_OPTS="-Xmx1536m -XX:MaxDirectMemorySize=512m". The default heap is 1.5 GB; increase it as required, for example export HIVE_CLIENT_OPTS="-Xmx2g -XX:MaxDirectMemorySize=512m".

1.2 ApplicationMaster Memory Overflows

For demonstration purposes, reduce the memory of the ApplicationMaster to 512 MB. The client then prints the following error:

set yarn.app.mapreduce.am.resource.mb=512;
select count(1) from (select num, count(1) from default.test_tb_1_1 where part>180 and part<190 group by num) a;

Query ID = hdfs_20180802155500_744b90db-8a64-460f-a5cc-1428ae61931b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 849
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1533068988045_266336, Tracking URL = http://xxx:8088/proxy/application_1533068988045_266336/
Kill Command = hadoop job -kill job_1533068988045_266336
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2018-08-02 15:55:48,910 Stage-1 map = 0%, reduce = 0%
Ended Job = job_1533068988045_266336 with errors
Error during job, obtaining debugging information...
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec


The error message contains an application ID, which shows the job was submitted to the ResourceManager, so the failure is not on the client side; the next step is to check why the container failed. The preliminary judgment is that the ApplicationMaster ran out of memory, which can be confirmed in two ways:

1) Check the program running information on the ResourceManager page, and the following error message is displayed:

Current usage: 348.2 MB of 512 MB physical memory used; 5.2 GB of 1.5 GB virtual memory used. Killing container.

This means memory ran out: Application application_1533068988045_266336 failed 2 times due to AM Container for appattempt_1533068988045_266336_000002 exited with exitCode: -103

2) You can run the yarn command to obtain log information

yarn logs -applicationId application_1533068988045_266336 > application_1533068988045_266336.log
The information in this log is the same as above and is interpreted the same way. Note: the log file can be very large, so it is best not to write it to the system disk. For an ApplicationMaster memory overflow, set the following parameters (in MB):

set yarn.app.mapreduce.am.resource.mb=4096;
set yarn.app.mapreduce.am.command-opts=-Xmx3276M;   -- note: the -Xmx value should be about 80% of resource.mb
set yarn.app.mapreduce.am.resource.cpu-vcores=8;    -- number of virtual CPUs used by the MR ApplicationMaster
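For instance, the failing demo query from this section could be retried in the same session with the ApplicationMaster enlarged (a minimal sketch; the values are the illustrative ones above and should be tuned to your cluster):

-- retry the demo query from 1.2 with a larger ApplicationMaster (illustrative values)
set yarn.app.mapreduce.am.resource.mb=4096;
set yarn.app.mapreduce.am.command-opts=-Xmx3276M;
select count(1) from (select num, count(1) from default.test_tb_1_1 where part>180 and part<190 group by num) a;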

1.3 Map and Reduce Memory Overflows

When memory overflows on the Map or Reduce side, you will first see a message indicating that the container ran out of memory (running beyond physical memory limits):

Container [pid=192742,containerID=container_e1383_1533068988045_250824_01_003156] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.7 GB of 6.2 GB virtual memory used. Killing container.

The r or m in the attempt ID (here attempt_1533068988045_250824_r_000894_3) tells you whether a Map or a Reduce container failed; in this error log it is a Reduce container that was killed.

Task with the most failures(4):
Task ID: task_1533068988045_250824_r_000894
URL: http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1533068988045_250824&tipid=task_1533068988045_250824_r_000894
Diagnostic Messages for this Task:
Container [pid=192742,containerID=container_e1383_1533068988045_250824_01_003156] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.7 GB of 6.2 GB virtual memory used. Killing container.


Solution: Modify the following parameters (unit: MB) before executing the SQL statement:

Map container memory overflow: set the following parameters

set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276M;

Reduce container memory overflow: set the following parameters

set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276M;
Note: java.opts should be about 80% of memory.mb. Increase memory in steps of 1 GB, based on the size reported in the overflow message (for example: Current usage: 2.0 GB of 2 GB physical memory used).

Parameter Description:

  • mapreduce.map.memory.mb: the memory limit (in MB) that a Map Task can use; the default is 1024. If the Map Task's actual resource usage exceeds this value, the task is forcibly killed.
  • mapreduce.map.java.opts: JVM options for a Map Task, where the Java heap size is configured; typically set to 70%-80% of the previous parameter.
  • mapreduce.reduce.memory.mb: the memory limit (in MB) that a Reduce Task can use; the default is 1024. If the Reduce Task's actual resource usage exceeds this value, the task is forcibly killed.
  • mapreduce.reduce.java.opts: JVM options for a Reduce Task, where the Java heap size is configured; typically set to 70%-80% of the previous parameter.
  • mapreduce.map.cpu.vcores: number of virtual CPU cores allotted to each Map Task (the default is 1).
  • mapreduce.reduce.cpu.vcores: number of virtual CPU cores allotted to each Reduce Task (the default is 1). A combined example follows this list.
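For reference, a minimal sketch of combining these settings in one session before a memory-hungry query; the values below are illustrative only and should be tuned to the overflow size reported in the container log:

-- illustrative values; java.opts kept at roughly 80% of memory.mb
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276M;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.reduce.java.opts=-Xmx3276M;
set mapreduce.map.cpu.vcores=2;      -- illustrative; the default is 1
set mapreduce.reduce.cpu.vcores=2;   -- illustrative; the default is 1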

2. Data skew

To put it simply, data skew means the keys in the data are unevenly distributed, so some tasks receive far too much data and others far too little.

Take word count as an example. The map stage emits pairs of the form ("AAA", 1), and the reduce stage adds up the values to get the number of occurrences of "AAA". If the text being counted is 100 GB and 80 GB of it is the single word "AAA", those 80 GB of records all go to one reducer to be summed, while the remaining 20 GB are spread across the other reducers by key. The result is data skew, and the typical symptom is a job that runs to 99% on the reduce side and then sits waiting for the one reducer handling the 80 GB to finish.
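A rough HiveQL rendering of this scenario (the table words and its column w are hypothetical, used only for illustration):

-- hypothetical word count: if 80 GB of the rows share the key 'AAA',
-- the single reducer that receives 'AAA' does almost all of the work
select w, count(1) as cnt
from words
group by w;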

2.1 Computational equalization optimization in Join

In Hive, joins are usually executed in the Reduce phase. When writing SQL, put the smaller table on the left side of the join: during the Reduce phase, the contents of the table on the left of the join operator are loaded into memory, so placing the table with fewer rows on the left effectively reduces the chance of an out-of-memory error. Reduce-side joins between a large table and a small configuration table often produce unbalanced computation.

select user.id,gender_config.gender_id
from gender_config 
join user on gender_config.gender=user.gender
-- gender_config(gender string, gender_id int)

Gender has only two values, male and female. Hive uses the join key as the reduce key when processing a join, so this produces the same kind of reduce-side imbalance as a group by.

select /*+ MAPJOIN(gender_config) */ user.id, gender_config.gender_id 
from gender_config 
join user on gender_config.gender=user.gender

Joins between a large table and a configuration table like this are usually solved with a map join. Hive currently uses the hint /*+ MAPJOIN(gender_config) */ to tell the compiler to translate the SQL into a map join, and the hint must name the small configuration table. Each map reads the small table into an in-memory hash table and then hash-joins the large table against it. The key requirement of a map join is therefore that the small table fits into the memory of the map process; if it does not, it is spilled to disk and efficiency drops sharply.

With thousands of maps each reading the small table from HDFS into their own memory, reading the small table becomes the bottleneck of the join, and sometimes some maps fail to read it at all (because too many processes read it at the same time), causing the join to fail. A temporary workaround is to increase the replication factor of the small table. A further optimization is to put the small table into the Distributed Cache so that each map reads a local copy.

Parameters

  1. Map join
  • set hive.auto.convert.join=true; -- convert a common join into a map join based on the size of the input file; disabled (false) by default
  • set hive.mapjoin.smalltable.filesize=50000000; -- size threshold for the small table's input file; if the file is smaller than this threshold, Hive tries to convert the common join into a map join
  2. Skew join
  • set hive.optimize.skewjoin=true; -- set to true if the join is skewed
  • set hive.skewjoin.key=100000; -- if the number of records for a join key exceeds this value, the key is treated as skewed and split off to be processed separately. Set it according to the actual data volume: Hive cannot know in advance how skewed a key will be, so this parameter defines the threshold that separates skewed keys from the rest (a combined sketch follows this list).
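A sketch of how these settings might be combined before a skew-prone join, reusing the gender_config/user example from above (the thresholds are the illustrative values already shown):

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=50000000;
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;
select user.id, gender_config.gender_id
from gender_config
join user on gender_config.gender = user.gender;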

Scenarios

1. Associating large tables with small tables

For example, a join between a record table with tens of millions of rows and a table with thousands of rows can easily produce data skew. Why are joins between large and small tables prone to skew (sometimes seen as a long-delayed reduce phase)? For details, see "Join performance analysis of small- and medium-sized tables in Hive".

Solution:

  1. If multiple tables are joined, put the smaller tables (those with fewer records per join key) first; this reduces the amount of data handled in the reduce phase and shortens the running time.
  2. You can also use a map join to cache the small dimension table in memory and complete the join on the map side, skipping the reduce phase entirely. To use this feature, enable map-side joins with set hive.auto.convert.join=true (default false). You can also set the size threshold below which a table is treated as small: set hive.mapjoin.smalltable.filesize=25000000 (the default is 25 MB).
  3. Map join hint: the join is completed in the Map phase, with no reduce phase.
 select /*+ MAPJOIN(b) */ a.key, a.value
 from a join b on a.key = b.key;
2. Association between large tables and large tables

1) For example, when two large tables are joined but one of them has many null values or zeros in the join key, all of those rows are shuffled to a single reducer, making the job slow.

Solution: turn the null values into a string plus a random number, so the skewed data is spread across different reducers.

select * from log a left outer join users b on case when a.user_id is null then concat('hive', rand()) else a.user_id end = b.user_id;
3. Avoid association of different data types

Scenario: The user_id field in the user table is an int, while the user_id field in the log table contains both string and int values. When the two tables are joined on user_id, the default hash partitioning is done on the int id, so all records whose id is a string end up on a single reducer.

Solution: cast both columns to the same data type

select * from user a 
left outer join log b
on a.user_id=cast(b.user_id as string);
4. Avoid many-to-many relationships

In a join query, check whether a many-to-many relationship exists, and make sure at least one of the tables has no duplicate join-key values in its result set.
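One way to guarantee this, sketched here with made-up table and column names (fact, dim, key, attr), is to deduplicate one side before the join:

-- hypothetical: collapse duplicate keys in dim so each fact row matches at most one dim row
select f.id, d.attr
from fact f
join (select key, max(attr) as attr from dim group by key) d
  on f.key = d.key;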

5. Avoid cartesian products

If the join has no ON clause and the condition appears only in the WHERE clause, a Cartesian product is produced.

select a.value, b.value from A join B on a.value = b.value;  -- add an ON clause instead of relying on WHERE

2.2 Computational equalization optimization in Group By

Partial aggregation on the Map side

hive.map.aggr
select user.gender, count(1) from user group by user.gender;

Without map-side partial aggregation, each map sends the groupby_key directly to reduce as the reduce_key, which causes unbalanced computation: even with about a million map tasks, there are only two aggregation keys on the reduce side, so each of the two reducers has to process on the order of 10 billion records.

set hive.map.aggr=true

This parameter is on by default and controls whether the map does local (partial) aggregation during a group by. With it enabled, the computation proceeds as shown in the figure above: because local aggregation has already been done on the map side, although there are still only two reducers doing the final aggregation, each reducer only needs to process about 1 million rows, 10,000 times fewer than the 10 billion before optimization.

The map-side aggregation switch is on by default, but not every aggregation benefits from it. Consider the SQL above: if the groupby_key were the user ID, then since user IDs are not duplicated, map-side aggregation achieves nothing and only wastes resources.

set hive.groupby.mapaggr.checkinterval=100000;
set hive.map.aggr.hash.min.reduction=0.5;
select user.id, count(1) from user group by user.id;

These two parameters control when map-side aggregation is switched off. The map first tries to hash-aggregate the first 100,000 records; if the number of distinct keys divided by 100,000 is greater than 0.5, the groupby_key is barely duplicated, continuing local aggregation is pointless, and the aggregation is automatically turned off after those 100,000 records.

You will see the following prompt in the map log:

2011-02-23 06:46:11,206 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Disable Hash Aggr: #hash table = 99999 #total = 100000 reduction = 0.0 minReduction = 0.52
hive.groupby.skewindata
select user.gender,count(distinct user.id) from user group by user.gender;

Data skew usually appears when a DISTINCT is involved. For the SQL above, the map-side aggregation switch is effectively useless because each map would have to hold every user.id it sees, so the computation remains unbalanced: only two reducers do the aggregation, and each processes about 10 billion records.

set hive.groupby.skewindata=true;

This parameter makes Hive translate the SQL above into two MR jobs.

The reduce_key of the first MR job is gender+id. Because id hashes essentially randomly, the reduce work of this job is spread very evenly, and its reducers complete the partial aggregation.

The second MR job does the final aggregation, counting the distinct ids for each gender. Each map of this job outputs only two records, so it does not matter that there are only two reducers; most of the computation was already finished in the first MR job.

select user.id, count(distinct gender) from user group by user.id;

hive.groupby.skewindata is disabled by default, so you need to enable it manually when you detect unbalanced data. Of course, not every group by with a distinct needs this switch; take the SQL above: because user.id hashes evenly, the computation is already balanced and all reducers are loaded evenly. The switch is only worth turning on when the groupby_key is unevenly distributed while the distinct_key is evenly distributed; otherwise map-side aggregation optimization is sufficient.
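The same effect can also be sketched by hand in the style that section 2.3 below recommends: first group by gender and id to spread the work, then count in an outer query:

-- manual two-stage version of count(distinct user.id) per gender (assumes id is not null)
select gender, count(1)
from (select gender, id from user group by gender, id) t
group by gender;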

2.3 Count distinct

When the data volume is very large, SQL of the form select a, count(distinct b) from t group by a can run into data skew.

Solution: use sum... group by instead, for example select a, sum(1) from (select a, b from t group by a, b) tmp group by a;

Try not to use DISTINCT for deduplication, especially on large tables; use GROUP BY instead.

Before optimization: select count(distinct key) from a

After optimization: select sum(1) from (select key from a group by key) t

3. Adjust the number of Map/Reduce tasks

3.1 Controlling the Number of Maps in Hive Tasks

  • Typically, a job will generate one or more Map tasks from the input directory.

The main determining factors are: the total number of input files, the size of the input files, and the block size configured for the cluster (currently 128 MB; it can be viewed in Hive with set dfs.block.size; but cannot be customized from Hive).

For example:

A) Suppose the input directory contains a file a of 780 MB; Hadoop splits it into 7 blocks (six 128 MB blocks and one 12 MB block), producing 7 maps.

B) Suppose the input directory contains three files a, b and c of 10 MB, 20 MB and 130 MB; Hadoop splits them into 4 blocks (10 MB, 20 MB, 128 MB and 2 MB), producing 4 maps. In other words, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block.

  • Is the more maps the better?

The answer is no. If a task has many small files (much smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, whose start-up and initialization time far exceeds its actual processing time, wasting a huge amount of resources. Moreover, the number of maps that can run at the same time is limited.

  • Is it safe to make sure that each map handles close to 128MB of blocks?

The answer is not necessarily. For example, a 127 MB file would normally be handled by a single map, but if the file has only one or two small fields yet tens of millions of records, and the map logic is complex, doing it with one map task will certainly be slow.

To solve these two problems we need two opposite measures: reducing the number of maps, and increasing the number of maps.

3.1.1 How do I Merge Small Files to Reduce the Number of Map Files?

For example: select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The input directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 contains 194 files, many of them far smaller than 128 MB, with a total size of about 9 GB. Normal execution uses 194 map tasks, and the total map compute consumed is SLOTS_MILLIS_MAPS = 623,020.

I reduce the number of maps by merging small files before map execution:

set hive.hadoop.supports.splittable.combineinputformat = true;
set mapreduce.input.fileinputformat.split.maxsize=100000000;          -- maximum amount of data processed by each mapper: 100 MB (this value determines the size of the merged files)
set mapreduce.input.fileinputformat.split.minsize.per.node=100000000; -- minimum split size on a node (this value determines whether files on multiple DataNodes need to be merged)
set mapreduce.input.fileinputformat.split.minsize.per.rack=100000000; -- minimum split size on a rack/switch (this value determines whether files on multiple switches need to be merged)
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

SLOTS_MILLIS_MAPS= 333,500

For this simple SQL task, the execution time may be similar, but the computing resources are saved in half.

Roughly speaking, 100000000 means 100 MB, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; merges small files before execution. Blocks larger than 128 MB are split on 128 MB boundaries; remainders smaller than 128 MB but larger than 100 MB are split on 100 MB boundaries; everything smaller than 100 MB (the original small files plus the leftovers from splitting large files) is merged together, which in this case produced 74 blocks.

3.1.2 How to Properly Increase the Number of Maps?

If the input files are large, the task logic is complex, and the map execution is slow, you can increase the number of maps to reduce the amount of data processed by each map and improve the task execution efficiency. Imagine a task like this:

 select data_desc,
        count(1),
        count(distinct id),
        sum(case when ...),
        sum(case when ...),
        sum(...)
 from a
 group by data_desc;

If table a has only one file of 120 MB but it contains tens of millions of records, completing the task with a single map will certainly be slow. In this case we should consider splitting the file into several smaller ones, so that multiple map tasks can work on it.

set mapreduce.job.reduces=10;
create table a_1 as 
select * from a 
distribute by rand(123); 

This statement randomly distributes the records of table a into table a_1, which ends up with 10 files. Then replace table a with a_1 in the SQL, and the query will use 10 map tasks.

With each map task processing a bit more than 12 MB (several million records), efficiency will certainly be much better.
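A sketch of the follow-up step, simply pointing the original aggregation at the redistributed copy:

-- the aggregation now reads a_1 (10 files), so it gets about 10 map tasks
select data_desc,
       count(1),
       count(distinct id)
from a_1
group by data_desc;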

These two approaches appear contradictory: one merges small files, the other splits a large file into smaller ones. This is exactly where attention is needed. Based on the actual situation, follow two principles to control the number of maps: use an appropriate number of maps for a large data volume, and have each map task process an appropriate amount of data.

3.2 Controlling the Reduce Number of Hive Jobs

3.2.1 How Hive Determines the Number of Reduce Jobs:

The number of reducers greatly affects task execution efficiency. If the number of reducers is not specified, Hive estimates it based on the following two settings:

hive.exec.reducers.bytes.per.reducer  -- amount of data processed by each reducer; the default is 1000^3 = 1 GB
hive.exec.reducers.max                -- maximum number of reducers per job; the default is 1009

If the user does not set mapred.reduce.tasks, Hive computes the total size of the files in the input directory and divides it by the per-reducer value to get the number of reducers:

   reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
   reducers = Math.max(1, reducers);
   reducers = Math.min(maxReducers, reducers);
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
-- the input /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 totals a bit more than 9 GB, so this query uses 10 reducers

More reducers are not always better;

Like Map, starting and initializing Reduce consumes time and resources;

In addition, there will be as many output files as there are reducers; if many small files are generated and they become the input of the next job, the too-many-small-files problem appears again.

3.2.2 Adjusting the Number of Reduce Tasks:

Method 1: adjust the value of hive.exec.reducers.bytes.per.reducer;

set hive.exec.reducers.bytes.per.reducer=500000000; -- 500 MB
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
-- this time 20 reducers are used

Method 2: Adjust the number of Reduce tasks mapreduce.job.reduces.

set mapreduce.job.reduces=15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
-- this time 15 reducers are used

3.2.3 When Is There Only One Reduce?

Many times you will find that a job has only one reduce task no matter how large the data is and no matter how you adjust the reducer parameters.

In fact, besides the case where the data volume is smaller than hive.exec.reducers.bytes.per.reducer, a job ends up with a single reduce task for the following reasons:

A) Aggregation without a group by

For example, writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';. This is very common, and such queries should be rewritten with a group by whenever possible.

B) Order by

C) It has a Cartesian product

Usually in these cases there is no good solution other than finding a way to work around and avoid them, because these operations are global and Hadoop has to complete them with a single reduce. Likewise, when setting the number of reducers, keep two principles in mind: use an appropriate number of reducers for a large data volume, and have each reduce task process an appropriate amount of data.

4. Merge small files

4.1 Map Input Merging

Same as 3.1.1 "How do I Merge Small Files to Reduce the Number of Map Files?"

4.2 Automatically Merging Small Output Files

set hive.mergejob.maponly=true;            -- default is true
-- If the Hadoop version supports CombineFileInputFormat, a map-only job is used for the merge; otherwise a MapReduce merge job is started. Map-side file combining is more efficient.

set hive.merge.mapfiles=true;              -- default is true
-- After a normal map-only job finishes, whether to start a merge job to merge the map output files

set hive.merge.mapredfiles=true;           -- default is false
-- After a normal map-reduce job finishes, whether to start a merge job to merge the reduce-side output files; recommended to enable

set hive.merge.smallfiles.avgsize=16000000; -- default is 16 MB
-- For non-partitioned tables, a merge job is started if the average size of the output files is below this value; for partitioned tables, the average file size is computed per partition and only partitions below this value are merged. Takes effect only when hive.merge.mapfiles or hive.merge.mapredfiles is set to true.

set hive.merge.size.per.task=256000000;    -- default is 256 MB
-- Target size of each file produced by the merge job. The total size of the previous job's output divided by this value determines the number of reducers of the merge job. The map side of the merge job is an identity map; data is shuffled to reduce, and each reducer writes one file, which controls both the number and the size of the files.
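A minimal combination often used to merge small output files (the sizes below are simply the defaults mentioned above, written out explicitly):

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=256000000;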

5. Hive on Tez insert union all problem

Problem description

When Hive runs with the Tez engine, the output of the query is written into subdirectories created under the corresponding table directory. As a result, a Hive client that is not configured with Tez cannot read any data from the table.

Looking at the table's output directory, two subdirectories, 1 and 2, appear under the partition directory:

/user/hive/test1/20170920000000/1
/user/hive/test1/20170920000000/2

why

Tez optimizes insert ... union all by running the branches in parallel to speed things up. To prevent the parallel branches from writing files with the same name, each branch writes its output into its own subdirectory, and the results are stored there. If every Hive client and related engine is set to Tez there is no problem, but a client still using the MR engine cannot read the data. Hive handles the contents of a table directory as follows:

By default, Hive can only read files in the corresponding table directory.

1) If both directories and files exist in the table directory

When hive is used to read data, an error is reported indicating that the directory is not a file

2) There are only directories in the table directory

Hive does not recurse into subdirectories; it only reads the directory it was asked to query, so the query result is empty

3) There are only files in table directory

Can be queried normally

In Hive on Tez, the insert union operation is optimized to write output in parallel into the table directory; to prevent files with the same name, a subdirectory is created for each branch to hold its results.

Such a directory is readable by a Tez-engine client. An MR-engine client, however, only scans the partition level itself and returns nothing when it finds no files directly under the partition.

solution

Enable recursive input reading for MapReduce: set mapreduce.input.fileinputformat.input.dir.recursive=true

After setting this in Hive, running the query still reports an error asking you to also set hive.mapred.supports.subdirectories=true. Once both are set, Hive can read the data generated by the Tez engine.
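Put together, the two settings of this workaround look like this in a Hive session:

set mapreduce.input.fileinputformat.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;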

At the same time, to check how Tez output affects other engines in this situation, SQL access through sparkthriftserver was also tested:

For sparkthriftserver, adding set mapreduce.input.fileinputformat.input.dir.recursive=true is enough. The differences between sparkthriftserver and Hive are described below:

1) When there are files and directories under the table directory (non-partitioned directory)

Both sparkThriftServer and Hive report a non-file error

2) When there are only directories under the table directory (non-partitioned directory)

In this case, sparkThriftServer reports a non-file error, and Hive has no error, but the query is empty

The solution

1) Solution 1: switch all Hive clients to the Tez engine

This avoids the problem above, but it affects other engines such as Spark, MR, Presto, etc.

2) Solution 2: Set recursive read for Hadoop and Hive

This scheme is relatively feasible, but its effect on ordinary MR jobs still needs to be evaluated.