Hive Data Organization and Management mode (5)

Hive data organization and management mode

The Database and Table

A Database is also called a namespace. It is used for organization and naming conflicts, such as table name conflicts and view conflicts

There are two databases stored in hive(HDFS), but both of them have their own.db suffix. If you go to the command line, there is no such suffix

Database and Table are both folders, and they are both folders. Later, when you learn about partitioning, you will see that partitioning is also the same way

Note that the folder name is the table name without the suffix

Partition and bucket division

We know that the traditional DBMS system generally has the function of table partition, through table partition can retrieve data in a specific area, reduce the cost of scanning, improve query efficiency to a certain extent, of course, we can further improve query efficiency by establishing indexes on partitions. I won’t repeat it here. For example, there are partitioned tables in mysql, but we don’t use them very often, so we ignore them.

A partitioned table has the same logical distinction as an unpartitioned table. A partitioned table stores data in subdirectories of the table directory based on the column value of the partition key. Directory name = Partition key = key Value. It should be noted that the partition key value does not have to be based on a column (field) of the table, it can specify any value, as long as the corresponding partition key is specified to query. We can add, delete, rename, and empty partitions. Because partitioning retrieves data in a specific area (subdirectory), it serves the same purpose as DNMS partitioning to reduce scan costs.

Bucket partitioning is to specify a column in the bucket partitioning table and distribute the data in the column randomly and evenly to each bucket file according to the mode of hashing. The bucket partitioning operation requires the hash mode operation based on a specific column data. Therefore, the bucket partitioning column must be based on a specific column (field) in the table. Because bucket splitting changes the way data is stored, it hashes rows that are identical or within a certain range into the same bucket file. In this way, query efficiency can be improved. For example, when we want to JOIN two tables with buckets in the same column, we only need to JOIN the buckets with the same column values. Barrel splitting can also make Sampling more efficient.

In a word, buckets are divided to improve query efficiency, and the principle is to reduce data scanning

partition

Refers to one or some columns according to the data table is divided into multiple zones, zone from the form can be understood as a folder, such as we want to collect a log data of large site, a site daily log data exist on the same table, because every day can generate large amounts of log, led to huge, the content of the data table in the query to a full table scan use resources very much. In fact, in this case, we can partition the data table according to the date, the data of different dates are stored in different partitions, as long as you specify the value of the partition field, you can directly search from the partition.

When creating a partitioned table, use the keyword Partitioned BY (Name String) to declare that the table is partitioned and partitioned according to column Name. All records with the same value of Name are stored in a partition, and the type of partition attribute name is string. Of course, it is possible to partition by more than one column, that is, to continue partitioning data from one partition by some column.

Partition (name= “Jack”) specifies the partition to which the data is to be imported. This indicates that the data is to be imported into the partition whose partition is name= Jack.

The so-called partition is to pack the records that meet certain conditions and put them in the same folder to improve the efficiency of query. It is equivalent to classifying files according to folders. The folder name can be analogous to the partition field.

This partitioned field exists in the data table formally, when the query will be displayed on the client, but is not really stored in the data table file, is called pseudo column. But there is another concept that is somewhat counter to this: dynamic partitioning, where a partitioned column is usually one or more columns in a table

CREATE TABLE ods.u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
 partitioned by(year string,month string ,day string) 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Copy the code

So let me load it a few more times to see what happens

LOAD DATA LOCAL INPATH '/Users/liuwenqiang/ml-100k/u.data' OVERWRITE INTO TABLE ods.u_data partition(year='2020'.month='2020-12'.day='2020-12-21');
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/ml-100k/u.data' OVERWRITE INTO TABLE ods.u_data partition(year='2020'.month='2020-12'.day='2020-12-22');
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/ml-100k/u.data' OVERWRITE INTO TABLE ods.u_data partition(year='2020'.month='2020-12'.day='2020-12-23');
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/ml-100k/u.data' OVERWRITE INTO TABLE ods.u_data partition(year='2021'.month='the 2020-01'.day='2021-01-01');
Copy the code

The first thing we see is that the physical implementation of a partition is a folder, the name of the partition partition name =’ partition value ‘

Multi-field partitioning is managed by nesting files down to the minimum partition

In fact, the above way according to the time partition is our most common, we pay attention to and the following way to distinguish, you can consider why, and do not understand the comment area message

Significance of Zoning

The significance of partitioning is to better organize and manage data. Partitioning conditions can be added to reduce the amount of data to be scanned and retrieved during query, so as to improve the efficiency of query.

Matters needing attention

Do not create too many partitions that affect the ability to load and retrieve data
If the number of partitions is larger than 1GB, adjust the memory size of HiveServer2 and MetaStore
Do not nest partitions too deeply – there are many partitions
If you want to insert data into multiple partitions, you need to enable dynamic partitioning
Require your partition key will not appear a large number of small partition, and the key partition is the best uniform

Points barrels

Bucket division is a finer – grained division than partition. Buckets divide the entire data content according to the hash value of a column of attribute values. If the name attribute is to be divided into three buckets, the hash value of the name attribute value is taken by 3, and the data is divided into buckets according to the modulus result. For example, the data record of module 0 is stored in a file, the data of module 1 is stored in a file, and the data of module 2 is stored in a file.

In fact, about this finer granularity, to be honest, I always disagree, why? Say before I said the first, why everyone say points bucket is a finer way of organization, because most of the time we split barrels are used with partition, that is to say, we are usually finished partition, barrels for points in a single partition, so we say is a more granular, however, from another Angle to see your problem, If I do dynamic partitioning in accordance with the table of a field, it enters the field values of the same partition is the same data, what would you do if for this field points barrels, this time depends on how much your bucket, if your barrel is less, then the same data points in the barrel field value is not necessarily equal, only surplus is equal to the number of barrels.

The performance improvement of bucket query is reflected in join and convenient data sampling

So this raises the question that if you go to buckets without partitions, you don’t have to think about buckets first

CREATE TABLE ods.u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
 partitioned by(year string,month string ,day string) 
 CLUSTERED BY (`movieid` ) INTO 3 BUCKETS      
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/ml-100k/u.data' OVERWRITE INTO TABLE ods.u_data partition(year='2020',month='2020-12',day='2020-12-21');
Copy the code

Here’s how the data is organized, and we see that the data is split into three files according to the MovieID field

The meaning of buckets

Obtain higher query processing efficiency. Buckets add an extra structure to tables that Hive can use to handle some queries. In particular, joining two tables that have buckets on the same column (containing join columns) can be implemented efficiently using a Map-sidejoin. Consider the JOIN operation. For JOIN, two tables have the same column if buckets are performed on both tables. In this case, the bucket with the same column value can be joined, which can greatly reduce the amount of JOIN data.
Make sampling more efficient. When working with large datasets, it is convenient to be able to test run queries on a small portion of the dataset during the development and modification phases of the query
In the case of map-side connections, both tables divide buckets in the same way. A Mapper that processes a bucket in the left table knows that the matching row in the right table is in the corresponding bucket. Therefore, mapper only needs to grab that bucket (which is just a small part of the data stored in the table on the right) to join. This optimization method does not necessarily require the two tables to have the same number of buckets, but the multiple relationship between the two tables can also be used.
It can be seen that if buckets are well used in sampling and join, our efficiency can be greatly improved

Matters needing attention

Pay attention to data skew

Select a column with a large number of unique values as the bucket partitioning key
Is a slanted value. Create buckets separately. You can use list buckets

The number of barrels

Generally, one CPU writes data to only one bucket. Therefore, ensure that the number of buckets is large
For joins, ensure that the bucket keys are the same

conclusion

Both databases and tables and buckets are used to organize and manage data to improve efficiency
Partition is not necessarily based on column, can be specified any value, can also be based on column dynamic partition, save in Hdfs is a directory
Buckets are hash based on column values, and bucket data is the number of HDFS files. Two tables with the same buckets can be directly joined on the map side, and the Cartesian product is also greatly reduced. If the bucket field doesn’t exist you getSemanticException [Error 10002]: Invalid column reference 'XXX'The abnormal
Partitions are managed as folders, and buckets are managed as files

Hive Data Organization and Management mode (5)