1. Differences between internal tables and external tables

  • Managed tables are internal tables; tables created with the EXTERNAL keyword are external tables.
  • Internal table data is managed by Hive itself; external table data lives on HDFS outside of Hive's control.
  • Internal table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse). If a table is created without a LOCATION clause, Hive creates a folder named after the table under /user/hive/warehouse on HDFS and stores the table's data in that folder.
  • Dropping an internal table removes both the metadata and the stored data. Dropping an external table removes only the metadata; the files on HDFS are untouched.
  • If partition directories are added to an external table's location directly on HDFS, run MSCK REPAIR TABLE table_name; to sync those partitions into the metastore.
  • Some modifications to an external table only take effect if you first convert it to an internal table, apply the change, and then convert it back to an external table.
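
A minimal sketch of these differences in practice (table names and the HDFS path are hypothetical):

```sql
-- Internal (managed) table: DROP TABLE deletes both metadata and data.
create table managed_logs (id int, msg string);

-- External table: DROP TABLE deletes only the metadata; the HDFS files stay.
create external table ext_logs (id int, msg string)
location '/data/ext_logs';

-- Convert between the two kinds via table properties.
alter table ext_logs set tblproperties ('EXTERNAL'='FALSE');
alter table ext_logs set tblproperties ('EXTERNAL'='TRUE');
```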

2. How does Hive implement partitioning?

Declare the partition columns with a PARTITIONED BY clause in the table-creation statement:

create table tablename (id int) partitioned by (dt string);

alter table tablename add partition (dt = '2016-03-06');

alter table tablename drop partition (dt = '2016-03-06');
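
Data can then be loaded into a specific partition, or routed by dynamic partitioning (the local file path and staging_table are hypothetical):

```sql
-- Static partition: the partition value is given explicitly.
load data local inpath '/tmp/logs_2016_03_06.txt'
into table tablename partition (dt = '2016-03-06');

-- Dynamic partitioning: Hive derives the partition value from the query.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tablename partition (dt)
select id, dt from staging_table;
```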

3. What are the advantages and disadvantages of Hive metadata storage methods?

Embedded Derby (the default): metadata is stored in a local Derby database. Only one Hive client can connect at a time, so this method is not recommended for production.

External database such as MySQL: metadata is stored in the MySQL database, and multiple clients can connect concurrently.

4. Differences and relationships between order by, distribute by, sort by, and cluster by in Hive

order by

order by performs a global sort of the data. Like ORDER BY in Oracle, MySQL and other databases, it runs in a single reducer, so it is very inefficient when the data volume is large. When hive.mapred.mode=strict, a LIMIT clause must also be specified.

sort by

sort by sorts within each reducer separately, so global order is not guaranteed. sort by is generally used together with distribute by, and distribute by must be written before sort by. When mapred.reduce.tasks=1, sort by has the same effect as order by; when mapred.reduce.tasks>1, the output is split into several files, each sorted by the specified field. sort by is not affected by whether hive.mapred.mode is strict or nonstrict.

distribute by

distribute by controls how map output is partitioned among the reducers; it guarantees that records with the same key are sent to the same reducer.

cluster by

cluster by is equivalent to distribute by plus sort by on the same column. However, cluster by cannot specify ASC or DESC; it only sorts in ascending order.
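
A side-by-side sketch of the four clauses (the table sales(store_id, amount) is hypothetical):

```sql
-- Global sort in a single reducer:
select * from sales order by amount desc limit 100;

-- Rows with the same store_id go to the same reducer,
-- and each reducer's output is sorted by amount:
select * from sales distribute by store_id sort by amount desc;

-- Shorthand when distributing and sorting by the same column
-- (ascending only):
select * from sales cluster by store_id;
```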

5. What are the differences between the compression formats RCFile, TextFile, and SequenceFile?

TextFile: the default format. Data is not compressed, so disk usage and parsing cost are high.

SequenceFile: a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible, supporting three compression modes: NONE, RECORD, and BLOCK.

RCFile: a combination of row and column storage. First, the data is split into row groups, ensuring that a single record stays within one block, so reading one record never requires reading multiple blocks. Second, within each block the data is stored column by column, which benefits compression and fast column access. Loading data into RCFile is relatively expensive, but it gives a good compression ratio and query response.
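
The storage format is chosen at table-creation time (the table names are illustrative):

```sql
create table logs_text (id int, msg string);                  -- TextFile by default
create table logs_seq  (id int, msg string) stored as sequencefile;
create table logs_rc   (id int, msg string) stored as rcfile;
```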

6. How to optimize Hive?

  • Join optimization: try to place the small table on the left side of the join; if one table is small enough, a map join can be used.
  • Sort optimization: order by runs in a single reducer and is inefficient; distribute by + sort by can also be used to achieve a global sort.
  • Partitioning saves time by reducing the amount of data scanned during queries.
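
The map-join point can be sketched as follows (table names are hypothetical); the small table is loaded into memory and joined on the map side, skipping the shuffle:

```sql
-- Classic map-join hint:
select /*+ MAPJOIN(small_t) */ big_t.id, small_t.name
from big_t join small_t on big_t.id = small_t.id;

-- Newer Hive versions can convert eligible joins automatically:
set hive.auto.convert.join=true;
```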

7. Massive data is distributed across 100 computers. How can the top 10 of this data be collected efficiently?

Step 1:

  1. Find the top 10 on each computer using a heap of 10 elements (to find the 10 smallest, use a max-heap; to find the 10 largest, use a min-heap).
  2. For example, to find the 10 largest, first take the first 10 elements and build a min-heap. Then scan the remaining data, comparing each element with the heap top; if an element is larger than the heap top, replace the top with that element and re-heapify.
  3. The elements left in the heap are that computer's top 10.

Step 2:

  1. After finding the top 10 on each computer, combine the top 10 lists from all 100 computers, giving 1,000 values in total.
  2. Use the same heap method as above to find the overall top 10.
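
In Hive, the same two-step idea can be sketched with a window function: compute per-bucket top-10 candidates first, then merge them in one final sort (the table metrics(value) and the bucketing expression are hypothetical):

```sql
select value
from (
  select value,
         row_number() over (partition by bucket order by value desc) as rn
  from (select value, pmod(hash(value), 100) as bucket from metrics) s
) t
where rn <= 10        -- per-"computer" top 10
order by value desc   -- merge the 100 candidate lists
limit 10;             -- global top 10
```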

8. Differences between row_number(), rank(), and dense_rank()

row_number() assigns a distinct, consecutive number to each row of the query result, so it never produces ties; it is typically used for pagination rather than for ranking scores. rank() assigns the same rank to rows with equal values, and the ranks after a tie skip numbers (1, 2, 2, 4). dense_rank() also assigns the same rank to tied rows, but leaves no gaps afterwards (1, 2, 2, 3).
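
A side-by-side comparison (the table scores(name, score) is hypothetical, holding scores 100, 90, 90, 80):

```sql
select name, score,
       row_number() over (order by score desc) as rn,   -- 1, 2, 3, 4
       rank()       over (order by score desc) as rk,   -- 1, 2, 2, 4
       dense_rank() over (order by score desc) as drk   -- 1, 2, 2, 3
from scores;
```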

9. Window functions in Hive

Window functions are generally used for data analysis, computing aggregate values over a group of rows.

The difference from an aggregate function is that a window function can return multiple rows per group, whereas an aggregate function returns only one row per group. The window clause specifies the size of the data window the analysis function works on, and this window size can vary from row to row.

Basic structure: an analysis function (e.g. sum(), max(), row_number()…) plus a window clause (the over(...) part).

For example: sum(...) over (partition by user_id order by order_time desc)

Analysis functions include: avg(), min(), max(), sum()

Sort functions: row_number(), rank(), dense_rank()
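
A running-total example of the window clause (the table orders(cookieid, createtime, pv) is hypothetical):

```sql
select cookieid, createtime, pv,
       sum(pv) over (partition by cookieid order by createtime
                     rows between unbounded preceding and current row)
         as running_total
from orders;
```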

10. Local mode

Most Hadoop jobs require the full scalability provided by Hadoop to handle large data sets. However, sometimes Hive inputs are very small. In this case, the execution of the task triggered by the query may take much longer than the actual job execution time. For most of these cases, Hive can handle all tasks on a single machine in local mode. For small data sets, the execution time can be significantly reduced.

Users can set hive.exec.mode.local.auto to true to let Hive automatically start this optimization when appropriate.

```sql
-- Enable local-mode execution automatically
set hive.exec.mode.local.auto=true;
-- Maximum total input size for local mode; when the input is smaller than
-- this, local mode is used (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local mode; when there are fewer
-- input files than this, local mode is used (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
```

11. Convert HQL into MR task flow

1. On entry, the Antlr framework defines the syntax rules of HQL; syntax analysis converts the HQL into an AST (abstract syntax tree).
2. Traverse the AST to abstract out QueryBlocks, the basic component units of the query, which can be understood as the smallest query execution units.
3. Traverse the QueryBlocks and convert them into an OperatorTree, which can be understood as indivisible logical execution units.
4. Use the logical optimizer to optimize the OperatorTree, for example merging unnecessary Reduce tasks to reduce shuffled data.
5. Traverse the OperatorTree and convert it into a TaskTree, i.e. translate the logical execution plan into a physical (MR task) execution plan.
6. Use the physical optimizer to optimize the TaskTree.
7. Generate the final execution plan and submit the task to the Hadoop cluster for execution.
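
The plan Hive generates from these stages can be inspected with EXPLAIN (the query is illustrative):

```sql
-- Show the stage plan Hive generated for a query
explain select dt, count(*) from tablename group by dt;

-- EXTENDED adds more detail about the physical plan
explain extended select dt, count(*) from tablename group by dt;
```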