Optimization 1: Enable local mode

Most Hadoop jobs need the full scalability Hadoop provides in order to handle large data sets. Sometimes, however, the input to a Hive query is very small, and the overhead of launching the tasks triggered by the query can take much longer than the actual job execution. For most of these cases, Hive can run all the tasks of a query on a single machine in local mode, which can significantly reduce execution time for small data sets.

To enable local mode for a single query, run the following set statements on the command line and then execute the HQL statement. To enable it for every session on the current machine, put the statements in $HOME/.hiverc. If you want all users to use this configuration, write it into the hive-site.xml file instead.

set hive.exec.mode.local.auto=true;
-- Enables local MapReduce; the default is false (off).
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum input size for which local mode is used: when the input data is smaller than this value,
-- the job runs in local mode. The default is 134217728, that is, 128 MB; here it is set to 50000000,
-- which need not be an integer multiple of 128 MB.
set hive.exec.mode.local.auto.input.files.max=10;
-- Maximum number of input files for which local mode is used: when the number of input files is
-- smaller than this value, the job runs in local mode. The default is 4.
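For the hive-site.xml approach mentioned above, the entries would look roughly like the following (a minimal sketch; the property names are the standard Hive ones and the values simply mirror the set statements above):

<property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.mode.local.auto.inputbytes.max</name>
    <value>50000000</value>
</property>
<property>
    <name>hive.exec.mode.local.auto.input.files.max</name>
    <value>10</value>
</property>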

Optimization 2: Turn on strict mode

Hive provides a strict mode that prevents users from executing queries that may have unintended, adverse effects.

The property hive.mapred.mode defaults to nonstrict (non-strict mode). To enable strict mode, change the value of hive.mapred.mode to strict. With strict mode enabled, three types of queries are prohibited, as described and illustrated below.

<property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
    <description>
      The mode in which the Hive operations are being performed.
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>

1) For partitioned tables, a query is not allowed to run unless its WHERE clause contains filter conditions on the partition columns that limit the scan range. In other words, users are not allowed to scan all partitions. The reason for this restriction is that partitioned tables usually hold very large data sets that grow rapidly, and a query without a partition filter could consume an unacceptably large amount of resources.

2) Queries that use ORDER BY must also use a LIMIT clause. Because ORDER BY distributes all of the result data to a single Reducer in order to perform the sort, forcing the user to add a LIMIT clause prevents that Reducer from running for an excessively long time.

3) Queries that produce a Cartesian product are restricted. Users familiar with relational databases may expect to put the join condition in a WHERE clause instead of an ON clause when executing a JOIN, relying on the database's execution optimizer to efficiently convert the WHERE predicate into an ON condition. Unfortunately, Hive does not perform this optimization, so if the tables are large enough, the query gets out of hand.
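As an illustration of the three restrictions above, here is a minimal sketch. The tables and columns are hypothetical: sales_detail is assumed to be partitioned by statis_date, and table_a, table_b, amount and id are made-up names.

-- 1) Scanning a partitioned table without a partition filter is rejected in strict mode.
select * from sales_detail;
select * from sales_detail where statis_date = '20180228';    -- allowed: the partition column is filtered

-- 2) ORDER BY without LIMIT is rejected in strict mode.
select * from sales_detail where statis_date = '20180228' order by amount;
select * from sales_detail where statis_date = '20180228' order by amount limit 100;    -- allowed

-- 3) A JOIN without an ON clause (a Cartesian product) is rejected in strict mode.
select a.id from table_a a join table_b b where a.id = b.id;
select a.id from table_a a join table_b b on a.id = b.id;    -- allowed: the join condition is in the ON clause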

Optimization 3: Enable Fetch

Fetch means that some Hive queries can be executed without any MapReduce computation. For example, for SELECT * FROM employees; Hive can simply read the files in the employees storage directory and write the query results to the console.

In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older Hive versions defaulted to minimal). With this property set to more, global lookups (SELECT *), column lookups, and LIMIT queries do not go through MapReduce.

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>

Case practice: set hive.fetch.task.conversion to more and then run the queries below; none of them will launch a MapReduce job.

    hive (fdm_sor)>set hive.fetch.task.conversion=more;

    hive (fdm_sor)>select * from SOR_EVT_TB_REPAY_PLAN_HIS;

    hive (fdm_sor)>select name, account from SOR_EVT_TB_REPAY_PLAN_HIS;

    hive (fdm_sor)>select name, account from SOR_EVT_TB_REPAY_PLAN_HIS limit 3;
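Conversely, if hive.fetch.task.conversion is set to none, even a plain SELECT * is submitted as a MapReduce job instead of being served by a direct fetch (a sketch against the same table as above):

    hive (fdm_sor)>set hive.fetch.task.conversion=none;

    hive (fdm_sor)>select * from SOR_EVT_TB_REPAY_PLAN_HIS;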

Optimization 4: Enable parallel execution

Hive converts a query into one or more stages. These can be MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. However, a particular job may contain many stages, and those stages are not always fully dependent on one another; in other words, some stages can be executed in parallel, which can shorten the execution time of the whole job. The more stages that can run in parallel, the faster the job is likely to complete.

Parallel execution is enabled by setting the hive.exec.parallel parameter to true. Keep in mind that in a shared cluster, increasing the number of parallel stages in a job also increases cluster utilization.

set hive.exec.parallel=true;
-- Enables parallel execution of independent stages.
set hive.exec.parallel.thread.number=16;
-- Maximum degree of parallelism allowed for a single SQL statement; the default is 8.
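As an example of a query with independent stages, the two branches of the UNION ALL below do not depend on each other, so with the settings above their stages can run concurrently (a sketch; the table names and partition value are hypothetical):

select * from (
select 'a' as src, count(*) as cnt from table_a where statis_date = '20180228'
union all
select 'b' as src, count(*) as cnt from table_b where statis_date = '20180228'
) t;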

Of course, this only pays off when system resources are relatively idle; otherwise, with no spare resources, the stages cannot actually run in parallel.

Optimization 5: Column pruning and row filtering (code optimization)

Column processing: in a SELECT, take only the columns you need; if the table is partitioned, use partition filters wherever possible and avoid SELECT *.
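A minimal sketch of column pruning combined with a partition filter, again using the hypothetical sales_detail table partitioned by statis_date:

select name, account from sales_detail where statis_date = '20180228';
-- Reads only the two needed columns from a single partition instead of SELECT * over the whole table.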

Row processing: with partition pruning, when using an outer join, if the filter condition on the secondary table is written in the WHERE clause, the whole table is joined first and only then filtered, which is inefficient; instead, filter in a subquery first and then join. (This is very useful: in real development, processing large data volumes this way is much more efficient, so use a subquery and then join.)

Case practice:

(1) Test: join the two tables first, then filter with the WHERE condition.

hive (default)> select o.id from bigtable b join ori o on o.id = b.id where o.id <= 10;
Time taken: 34.406 seconds, Fetched: 100 row(s)

(2) Test: filter through a subquery first, then join the tables.

hive (default)> select b.id from bigtable b join (select id from ori where id <= 10) o on b.id = o.id;
Time taken: 30.058 seconds, Fetched: 100 row(s)

However, this equivalence does not always hold for left outer joins and right outer joins; be careful not to rewrite a query this way when it relies on NULL values produced by the join.

Let’s look at the difference between putting the filter in the outer WHERE clause and putting it inside the subquery.

select b.due_bill_no
from
(
select distinct due_bill_no from FDM_DM.DM_PLSADM_FEEDBACK_LOAN_RECOVERY_INFO_M where statis_date = '20180228'
) b
left join
(select distinct loan_no from FDM_DM.tmp_dmp_plsadm_tradeinfo_m_report where statis_date = '20180228') a
on b.due_bill_no = a.loan_no
where a.loan_no is null

If, instead, the a.loan_no is null filter is moved inside the subquery for table a, as in the query below, the due_bill_no values it returns are not the same as those returned by the query above.

select b.due_bill_no
from
(
select distinct due_bill_no from FDM_DM.DM_PLSADM_FEEDBACK_LOAN_RECOVERY_INFO_M where statis_date = '20180228'
) b
left join
(select distinct loan_no from FDM_DM.tmp_dmp_plsadm_tradeinfo_m_report where statis_date = '20180228' and loan_no is null) a
on b.due_bill_no = a.loan_no

Optimization 6: Enable JVM reuse

JVM reuse is a Hadoop tuning parameter that has a significant impact on Hive performance, especially for scenarios where it is difficult to avoid small files or where there are a large number of tasks, most of which have very short execution times.

Hadoop's default configuration usually uses a forked JVM to execute each map and reduce task, and the JVM startup overhead can be considerable, especially when the job contains hundreds or thousands of tasks. With JVM reuse, JVM instances are reused up to N times within the same job. The value of N can be configured in Hadoop's mapred-site.xml file, usually between 10 and 20, depending on tests against the business scenario.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
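Besides the cluster-wide mapred-site.xml setting, the same property can usually be overridden per session from the Hive CLI (a sketch; it assumes the cluster honors per-job overrides of this property):

set mapreduce.job.jvm.numtasks=10;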

The downside of this feature is that enabling JVM reuse keeps the task slots it has used occupied for reuse and does not release them until the job is complete. If a few reduce tasks in an unbalanced job take much longer to execute than the other reduce tasks, the reserved slots sit idle and cannot be used by other jobs until all of the tasks have finished.

Optimization 7: Merge small files

Merge small files before the map stage to reduce the number of map tasks: CombineHiveInputFormat (the default input format) combines small files, while HiveInputFormat does not merge small files.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;