
This document describes Hive data storage formats, which are the basis for Hive data operations. Choosing an appropriate underlying file format can improve performance by an order of magnitude without changing any existing Hive SQL. This kind of optimization should be familiar to anyone who has worked with MySQL or other relational databases: choosing a different storage engine means a different way of organizing data, which in turn affects database performance.

Common Hive data storage formats are as follows:

  • Row storage:

    TextFile

    SequenceFile (binary serialization file)

  • Column storage:

    Record Columnar File (RCFile)

    Optimized Record Columnar File (ORCFile)

    Apache Parquet

Note: RCFile and ORCFile are not pure column storage. They first group the table's data by row (into row groups), and then store the data within each row group by column.

Let’s take a look at the pros and cons of each storage structure:

  1. Horizontal row storage structure:

The most common pattern is to store an entire row together, containing all of its columns. This structure adapts well to dynamic queries.

For example, both select a from tableA and select a, b, c, d, e, f, g from tableA work well.

In this case, the data in the same row is on the same HDFS block, so the cost of rebuilding a row of data is low.

But there are two major weaknesses to this:

  • When there are many columns in a row and we only need a few of them, we also have to read all the columns in the row and then extract some of them. This greatly reduces the efficiency of query execution.

  • When multiple columns are compressed, the compression ratio is not too high due to different column data types and value ranges.

  2. Vertical column storage structure:

Column storage stores each column individually, or stores several columns together as a column group. Column stores avoid reading unnecessary columns when executing queries. In addition, the data type within a column is generally the same and the value range is narrower than across multiple columns, so the data can achieve a relatively high compression ratio.

However, this structure makes it harder to reconstruct rows, especially when the columns of a row are not on a single HDFS block. For example, if column a comes from the first DataNode, column b from the second DataNode, and column c from the third DataNode, then putting a, b, and c back together into one row requires gathering the three columns to rebuild the row, which incurs significant network and computing overhead.

  3. Hybrid PAX storage structure:

The PAX structure combines row storage and column storage. It was originally designed to improve CPU cache utilization in traditional databases and cannot be used directly on HDFS, but RCFile and ORC inherit its idea of partitioning by row first and then storing by column.

Let’s take a look at some common storage formats in Hive:

This article focuses on the last two, Apache ORC and Apache Parquet, because they are widely used in real-world production environments with their efficient data storage and data processing capabilities.

I. TextFile

The TextFile format is Hive's default format. If no storage format is specified when a table is created, TextFile is used by default. When data is imported, the data file is copied directly to HDFS without any processing.

Create Hive table in TextFile format:

create table if not exists textfile_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by '\t'
stored as textfile;  -- Optional (default format)

Load data into TextFile table:

load data local inpath "/root/rating.csv" overwrite into table textfile_table;

Advantages and disadvantages of TextFile:

The TextFile format does not process imported data files in any way, so data can be loaded directly with the load command; the other storage formats cannot load data files directly. TextFile therefore has the fastest data loading.

Although the TextFile format can be combined with the Gzip compression algorithm, the compressed files cannot be split. During deserialization, Hive must check character by character whether each character is a field delimiter or an end-of-line marker, so deserialization costs dozens of times more than SequenceFile.
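As an illustration of combining TextFile with Gzip, here is a minimal sketch reusing the rating columns from the example above (the table name is made up; the set properties are the standard Hadoop/Hive compression switches):

-- Enable Gzip compression for the output written into the table.
-- Note: the resulting .gz files are not splittable, so each file is processed by a single mapper.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

create table if not exists textfile_gzip_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Populate via insert so that the compression settings above take effect
insert overwrite table textfile_gzip_table select * from textfile_table;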

II. SequenceFile

SequenceFile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible.

The internal format of SequenceFile depends on whether compression is enabled; with compression, it can be divided into record compression and block compression.

No compression (NONE): If compression is not enabled (the default), each record consists of its record length (in bytes), key length, key, and value. The length fields are 4 bytes each.

RECORD compression: The RECORD compression format is basically the same as the uncompressed format, except that the value bytes are compressed by the encoder defined in the header. Note: Keys are not compressed.

BLOCK compression: BLOCK compression compresses multiple records at once, so it is more compact than record compression and is generally preferred. Records are accumulated until the number of uncompressed bytes reaches a minimum size, and then the block is written. This minimum size is defined by the io.seqfile.compress.blocksize property and defaults to 1000000 bytes. A block stores the number of records, the key lengths, the keys, the value lengths, and the values. Because the compression ratio of RECORD compression is low, BLOCK compression is recommended.
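For reference, a minimal sketch of tuning the block threshold described above (the value shown is simply the default; the table creation and BLOCK compression settings follow below):

-- Minimum number of uncompressed bytes accumulated before a block is compressed
set io.seqfile.compress.blocksize=1000000;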

Create a SequenceFile Hive table:


create table if not exists seqfile_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited
fields terminated by '\t'
stored as sequencefile;

Set the compression format to block compression:

set mapred.output.compression.type=BLOCK;

Load data into SequenceFile table:

insert overwrite table seqfile_table select * from textfile_table;

SequenceFile advantages:

  • Supports data compression based on Record or Block.

  • Splittable: can be used as input splits for MapReduce.

  • Easy to modify: application code mainly deals with the corresponding business logic and does not need to care about the specific storage format.

Disadvantages of SequenceFile:

  • An extra process is required to merge files into SequenceFile, and the merged files are not easy to view directly.

III. RCFile

RCFile is a Hive file storage format developed by Facebook. A table is first divided into several row groups, and the data within each row group is stored by column, with each column stored separately; this is exactly the idea of horizontal partitioning followed by vertical partitioning.

The table is first divided by row into multiple row groups. A row group consists of:

  • The 16-byte HDFS block synchronization information is mainly used to distinguish between adjacent row groups on one HDFS block.

  • The header information of metadata mainly includes the number of rows stored in the row group, the field information of the column and so on.

  • In the data part, we can see that RCFile stores each row as a column and each column as a row (the data is transposed within the row group), because when the table is large and has many fields, we usually only need to fetch a fixed set of columns.

In a normal row store, select a from table fetches only the value of field a, but it still scans the entire row, so the I/O cost is the same as select * from table. In RCFile, as described above, only column a of each row group is read.

Create a table for RCFile:


create table if not exists rcfile_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by '\t'
stored as rcfile;

In terms of storage space:

RCFile partitions data by row and stores it by column, with run-length encoding, so identical values are not stored repeatedly. This greatly saves storage space, especially when fields contain large amounts of duplicate data.

Lazy loading:

Data stored in the table is compressed; Hive decompresses it when reading, but for a specific query it skips the columns that are not needed, eliminating unnecessary decompression.

For example:

select c from table where a>1;

For each row group, column a is decompressed first. If some value in the current row group satisfies a > 1, column c is then decompressed; if no value in the row group satisfies a > 1, the entire row group is skipped and column c is never decompressed.

IV. ORCFile

1. Advantages of ORC compared with RCFile

ORC extends RCFile to some extent and is an optimization of RCFile:

  1. ORC extends RCFile's compression: in addition to run-length encoding, it introduces dictionary encoding and Bit encoding.

  2. Each task outputs only a single file to reduce NameNode load.

  3. Support for additional data types such as datetime and decimal, as well as complex types (struct, list, map, etc.);

  4. Files are splittable. Using ORC as the table file storage format in Hive not only saves HDFS storage resources, but also reduces the input data of query tasks and the number of MapTasks.

With dictionary encoding, the data finally stored is the values in the dictionary, the length of each dictionary value, and the position of each field in the dictionary. With Bit encoding, every field is associated with a bit value that is 0 if the field is NULL and 1 otherwise; fields whose bit is 0 are not stored at all during the actual encoding, which means NULL fields take up no storage space.
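The following is a minimal sketch of creating and populating an ORC table, reusing the rating columns from the earlier examples (ZLIB is the default codec and is shown only for illustration):

create table if not exists orcfile_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress'='ZLIB');

-- Like the other non-TextFile formats, an ORC table cannot load plain text files directly;
-- populate it from an existing table instead
insert overwrite table orcfile_table select * from textfile_table;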

2. Basic structure of ORC

Building on RCFile, ORCFile introduces the concepts of Stripe and Footer. Each ORC file is first cut horizontally into multiple stripes, and the data inside each stripe is stored by column, with all columns stored in one file. The default stripe size is 250MB, compared with RCFile's default row group size of 4MB, so ORC is more efficient than RCFile.

The following is an ORC file structure diagram:

The ORC file structure consists of three parts:

  • Stripe: Where ORC files store data.

  • File footer: Contains a list of the stripes in the file, the number of rows in each stripe, and the data type of each column. It also contains aggregate information about the minimum, maximum, row count, sum, and so on for each column.

  • Postscript: Contains information about compression parameters and compression sizes.

The stripe structure can in turn be divided into three parts: Index Data, Row Data, and Stripe Footer:

  • Index Data: stores statistics about the stripe and the position index of the data within the stripe.

  • Row Data: where the actual data is stored; it consists of multiple row groups, and the data is stored as streams.

  • Stripe Footer: stores the directory of the data streams.

Row Data contains two kinds of streams, the metadata stream and the data stream:

  • Metadata Stream: Metadata information used to describe each row group.

  • Data stream: a place where data is stored.

ORC provides three levels of indexing in each file:

  • File level: This level of index information records the location of all stripes in the file and the statistics for each column of data stored in the file.

  • Stripe level: This level records statistics about data stored in each stripe.

  • Row group level: Every 10,000 rows in the stripe constitute a row group. The index information at this level is the statistics about the data stored in the row group.

Programs can use indexes provided by ORC to speed up data lookup and reading. When the program queries the table of ORC file type, it first reads the index information of each column, compares the search conditions with the index information, and finds the file that meets the search conditions.

Then, according to the index information in the file, find the stripe that stores the query condition data, and read all the stripe blocks that meet the query condition based on the index information of the stripe.

Based on the index information of each row group in the stripe and the comparison result of query conditions, you can find the row group that meets the requirements.

Using ORC's indexes, you can quickly locate the data blocks that satisfy a query and skip most of the files and data blocks that do not. Compared with traditional data files, which must be scanned in full during a search, ORC avoids wasted disk and network I/O, improves lookup efficiency, and increases the workload the whole cluster can handle.
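As a sketch of how these indexes are typically exercised from Hive (the table and column names below are hypothetical; hive.optimize.index.filter is the switch that enables ORC predicate pushdown):

-- Enable predicate pushdown so the ORC reader can skip stripes and row groups
-- whose min/max statistics cannot satisfy the filter
set hive.optimize.index.filter=true;

create table if not exists user_actions_orc
(
    userid STRING,
    action STRING,
    action_ts BIGINT
)
stored as orc;

-- The range predicate below is compared against the file-, stripe- and row-group-level
-- statistics, so stripes and row groups that cannot match are skipped
select userid, action
from user_actions_orc
where action_ts >= 1636646400 and action_ts < 1636732800;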

3. ORC data types

When Hive uses ORC files to store data, the field names, field types, and encodings are stored in the ORC file together with the data.

Data in each block of ORC is self-describing, independent of external data, and not stored in Hive metadata.

The data types provided by ORC include the following:

  • Integer types: contains boolean (1 bit), tinyint (8 bit), smallint (16 bit), int (32 bit), and bigint (64 bit).

  • Floating-point: contains floats and doubles.

  • String types: contains string, char, and varchar.

  • Binary type: contains binary.

  • Date and time types: contains timestamp and date.

  • Complex types: contains struct, list, map, and Union types.

At present, ORC is basically compatible with most of the field types used in daily use. In addition, all types in ORC accept NULL values.
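As an illustration only, here is a sketch of a hypothetical table that exercises most of the types listed above (note that the ORC list type corresponds to ARRAY in Hive DDL):

create table if not exists orc_types_demo
(
    flag BOOLEAN,
    tiny_col TINYINT,
    small_col SMALLINT,
    int_col INT,
    big_col BIGINT,
    f_col FLOAT,
    d_col DOUBLE,
    dec_col DECIMAL(10,2),
    s_col STRING,
    c_col CHAR(10),
    v_col VARCHAR(64),
    bin_col BINARY,
    ts_col TIMESTAMP,
    dt_col DATE,
    addr STRUCT<city:STRING, zip:STRING>,   -- struct type
    tags ARRAY<STRING>,                     -- ORC "list" type
    props MAP<STRING, STRING>,              -- map type
    mixed UNIONTYPE<INT, STRING>            -- union type
)
stored as orc;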

4. ACID transaction support for ORC

Before Hive 0.14, table data could only be added or deleted at the granularity of a whole partition or table; individual records could not be modified.

Since Hive 0.14, ORC files ensure that Hive transactions satisfy the ACID properties of atomicity, consistency, isolation, and durability, making data update operations possible.

Hive is OLAP-oriented, so its transactions differ somewhat from those of an RDBMS. Hive transactions are designed to have each transaction update a large amount of data, rather than to update small amounts of data frequently.

The following shows how to create and use a Hive transaction table:

  1. Set Hive environment parameters:
-- Enable concurrency support for insert, delete, and update transactions
set hive.support.concurrency=true;

-- A table that supports ACID transactions must be a bucketed table
set hive.enforce.bucketing=true;

-- To enable transactions, dynamic partitioning must be set to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;

-- Set the transaction manager to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- the default org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager does not support transactions
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Start the initiator and cleaner threads in the same metastore instance
set hive.compactor.initiator.on=true;

-- Set the number of compactor worker threads per metastore instance
set hive.compactor.worker.threads=1;
  2. Create a table:
create table student_txn
(id int,
 name string
)
clustered by (id) into 2 buckets          -- the table must be bucketed
stored as orc TBLPROPERTIES ('transactional'='true');  -- add transaction support in the table properties
  3. Insert data:
-- Insert a record with id 1001 and name 'student_1001'
insert into table student_txn values('1001','student_1001');
  4. Update data:
update student_txn
set name='student_lzh'
where id='1001';
  5. If you query the table data, you will see that the name for id 1001 has been changed to student_lzh.
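For example, a minimal check:

select id, name from student_txn where id = 1001;
-- expected result: 1001    student_lzh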
5. ORC-related Hive configurations

The configuration items of the table are as follows:

  • orc.compress: the compression type of ORC files. The options are NONE, ZLIB, and SNAPPY. The default value is ZLIB.

  • orc.compress.size: the size of each compression chunk. The default value is 262144 (256KB).

  • orc.stripe.size: the size of the memory buffer that can be used when writing a stripe. The default value is 67108864 (64MB).

  • orc.row.index.stride: the number of rows between row-group-level index entries. The default value is 10000, and the value must be at least 1000.

  • orc.create.index: whether to create row-group-level indexes. The default value is true.

  • orc.bloom.filter.columns: the columns for which Bloom filters should be created.

  • orc.bloom.filter.fpp: the false positive probability of the Bloom filters. The default value is 0.05.

Note: When Hive uses a Bloom filter, it can quickly determine, with little extra file space, whether data exists in a table. However, it may also judge data that does not belong to the table as belonging to it; this is called the false positive probability. The probability can be adjusted manually, but the lower it is set, the more space the Bloom filter needs.
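A minimal sketch that combines these table-level properties (the values are the defaults listed above, except the codec and the Bloom filter column, which are illustrative):

create table if not exists orc_tuned_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
stored as orc
tblproperties (
    'orc.compress'='SNAPPY',
    'orc.compress.size'='262144',
    'orc.stripe.size'='67108864',
    'orc.row.index.stride'='10000',
    'orc.create.index'='true',
    'orc.bloom.filter.columns'='userid',
    'orc.bloom.filter.fpp'='0.05'
);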

V. Parquet

Parquet is another high-performance columnar storage format. It works with a variety of computing frameworks and is supported by multiple query engines, including Hive, Impala, Drill, and more.

1. Parquet basic structure:

In a Parquet-format Hive table file, data is divided into row groups; each row group contains a column block for every column, and each column block is further divided into pages, as shown in the following figure:

When Parquet stores data, it records the data's metadata just as ORC does. Mirroring Parquet's file structure, this metadata is divided into multiple levels: file-level metadata, column block-level metadata, and page-level metadata.

The file-level metadata (fileMetadata) records the following:

  • Table structure information (Schema)

  • The number of records in this file;

  • The number of row groups owned by the file, and the total amount of data and records of each row group;

  • The file offset of the column block under each row group.

The metadata information of the column block is as follows:

  • The uncompressed and compressed data sizes of the column block and the compression codec used;

  • Offset of data pages;

  • The offset of the index page;

  • Number of data records in a column block.

The page-level metadata (page header) records the following:

  • Encoding information for the page;

  • The number of data records for the page.

Programs can use Parquet’s metadata to filter out most of the file data they don’t need to read, speeding up the program.

Like ORC's metadata, Parquet's metadata helps improve the running speed of programs, but ORC has made additional optimizations for reading data, so its queries consume fewer cluster resources than the Parquet format.

Parquet has better support for nested structures, while ORC's handling of multi-level nesting is more complicated and incurs a greater performance loss.
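Before moving on to the configuration items, here is a minimal sketch of creating and populating a Parquet table in Hive, reusing the rating columns from the earlier examples (the SNAPPY setting is illustrative):

create table if not exists parquet_table
(
    userid STRING,
    movieid STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties ('parquet.compression'='SNAPPY');

insert overwrite table parquet_table select * from textfile_table;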

2. Related configuration of Parquet:

Parameters can be adjusted according to the requirements of different scenarios to achieve program optimization.

  • parquet.block.size: the default value is 134217728 bytes (128MB), which is the size of a row group held in memory. A larger value improves the efficiency of reading Parquet files but requires more memory when writing.

  • parquet.page.size: the default value is 1048576 bytes (1MB), which is the size of each page. This refers to the compressed page size; pages are decompressed before being read. A page is the smallest unit of data that Parquet operates on, and a full page must be read each time it is accessed. If this value is set too small, it can cause performance problems during compression.

  • parquet.compression: the default value is UNCOMPRESSED, which is the compression format of the pages. The available compression formats are UNCOMPRESSED, SNAPPY, GZIP, and LZO.

  • parquet.enable.dictionary: the default value is true, indicating whether dictionary encoding is enabled.

  • parquet.dictionary.page.size: the default value is 1048576 bytes (1MB). When dictionary encoding is used, a dictionary page is created for each row group and column in Parquet. If there is a lot of duplicate data in the data pages, dictionary encoding can provide a good compression effect and reduce the memory footprint of each page.
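A minimal sketch of adjusting these parameters at the session level before writing data (the values are illustrative, mostly the defaults listed above; parquet.compression can also be set as a table property, as in the create statement earlier):

set parquet.block.size=134217728;            -- 128MB row groups
set parquet.page.size=1048576;               -- 1MB pages
set parquet.compression=SNAPPY;              -- page compression format
set parquet.enable.dictionary=true;          -- enable dictionary encoding
set parquet.dictionary.page.size=1048576;    -- 1MB dictionary pages

insert overwrite table parquet_table select * from textfile_table;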

3. Configure the Parquet table compression format when the Spark engine is used:

Spark supports Parquet natively and recommends it as its storage format (Parquet is Spark's default data source format).

The compression format of the Parquet table is configured in the following two situations:

For partitioned tables:

The data compression format of a Parquet table must be set through Parquet's own configuration item parquet.compression, for example: "parquet.compression"="snappy".

For non-partitioned tables:

The compression format must be set through the spark.sql.parquet.compression.codec configuration item. Setting the parquet.compression configuration item directly has no effect, because Spark reads the value of spark.sql.parquet.compression.codec instead.

When spark.sql.parquet.compression.codec is not set, its default value is snappy, and parquet.compression will read that default value.

Therefore, the spark.sql.parquet.compression.codec configuration item only applies to setting the Parquet compression format of non-partitioned tables.
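A minimal sketch of the two cases as they might be written in spark-sql (the table names are hypothetical and Hive support is assumed to be enabled):

-- Partitioned table: set the compression through the table's own Parquet property
create table if not exists logs_partitioned (msg string)
partitioned by (dt string)
stored as parquet
tblproperties ('parquet.compression'='snappy');

-- Non-partitioned table: the Spark configuration item takes effect instead
set spark.sql.parquet.compression.codec=gzip;
create table if not exists logs_flat (msg string) stored as parquet;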

4. Comparison of Parquet and ORC compression formats:
Table type | Default compression | Supported compression formats | Description
ORC | Zlib | None, Zlib, Snappy | ORC can be compressed with Zlib or Snappy; Snappy requires additional installation
Parquet | Uncompressed | Uncompressed, Snappy, Gzip, Lzo | Parquet achieves the highest compression ratio with Gzip; Lzo and Snappy are more efficient

ORC tables support None, Zlib, and Snappy compression, with Zlib as the default. However, none of these three compression formats supports splitting, so ORC is suited to scenarios where a single file is not particularly large. Zlib offers a high compression ratio but poor efficiency; Snappy is efficient but has a lower compression ratio.

Parquet tables support Uncompressed, Snappy, Gzip, and Lzo compression; the default is Uncompressed (no compression). Lzo compression supports splitting, so Lzo is chosen in scenarios where a table's individual files are large. Gzip has a high compression ratio but low efficiency; Snappy and Lzo are efficient but have lower compression ratios.