Row storage versus column storage

  • Today’s data processing can be roughly divided into two categories: On-Line Transaction Processing (OLTP) and On-Line Analytical Processing (OLAP). OLTP is the main workload of traditional relational databases and performs basic, day-to-day transactions such as adding, deleting, updating, and querying database records. OLAP is the main workload of distributed analytical databases; it does not have strict real-time requirements, but it processes very large amounts of data and is typically used for complex, dynamic reporting systems.

Therefore, OLTP generally uses row storage, because it has high real-time requirements and a large number of update operations, while OLAP generally uses column storage, because its real-time requirements are low and the priority is query performance.

Features of row storage

  • When querying a full row of data that meets the criteria, only one of its values needs to be located; the remaining values of that row are stored adjacently, so reading whole rows is fast.
  • Traditional relational databases, such as Oracle, DB2, MySQL, and SQL Server, use row-based storage. In a row-oriented database, the row is the basic logical storage unit, and the data of a row is stored contiguously on the storage medium.
  • Both TEXTFILE and SEQUENCEFILE are row-based storage formats.
  • This kind of storage is convenient for INSERT/UPDATE operations. The disadvantage is that a query involving only a few columns still has to read entire rows and cannot skip the unnecessary columns. With small data volumes this is usually not a problem, but with large data volumes it hurts performance. In addition, because the data types within a row differ from column to column, it is hard to achieve a high compression ratio, so space utilization is not high.

Column storage features

  • At query time only the columns involved are read, not all columns, so unnecessary columns can be skipped; when a query needs only a few fields, this greatly reduces the amount of data read. Because the data of each column is stored together and every value in a column has the same data type, column storage can use compression algorithms tailored to each column, achieving high compression ratios and saving not only storage space but also memory and CPU.

  • The downside is that INSERT/UPDATE operations are cumbersome, and column storage is not well suited to reading small numbers of complete rows.

  • Column-based storage is defined relative to row storage. Newer distributed databases such as HBase, HP Vertica, and EMC Greenplum use columnar storage. In a column-oriented database, the column is the basic logical storage unit, and the data of a column is stored contiguously on the storage medium.

To summarize the characteristics of column storage: because the data of each field is stored together, a query that needs only a few fields reads far less data; and because every value in a field has the same data type, better compression algorithms can be designed for each column. ORC and Parquet are both based on column storage.

To make this less abstract: if you have a table with a billion rows, then by the definition of column storage you store the billion values of one field first, and only then store the values of the next field.
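To make the contrast concrete, here is a small HiveQL sketch; the user_event table and its columns are hypothetical and only serve to illustrate which bytes each layout has to read:

-- Row storage (e.g. a TEXTFILE table): the whole row sits together on disk,
-- so fetching one complete record by key is cheap.
select * from user_event where id = 10086;

-- Column storage (e.g. an ORC table): only the city column is read for this
-- analytical query; the other columns of the billion-row table are skipped.
select city, count(1) as cnt from user_event group by city;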

Common data formats

Hive supports the following storage formats; the features of each are described below.

  1. Text File
  2. SequenceFile
  3. RCFile
  4. Avro Files
  5. ORC Files
  6. Parquet
  7. Custom INPUTFORMAT and OUTPUTFORMAT

Hive uses TextFile by default, which means that when you create a table without specifying a file format, TextFile is used. Hive also lets you configure the default file format:

<property>
  <name>hive.default.fileformat</name>
  <value>TextFile</value>
  <description>
    Expects one of [textfile, sequencefile, rcfile, orc, parquet].
    Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
  </description>
</property>
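The same property can also be changed for a single session, or overridden per table with STORED AS; a minimal sketch (t_orc_demo is just an example name):

-- Change the default file format for the current session only
SET hive.default.fileformat=ORC;

-- Or override it explicitly for one table, regardless of the default
create table t_orc_demo(id int, name string) stored as orc;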

TextFile

Storage mode: row storage

TextFile is the default storage format and does not compress data, resulting in high disk overhead and high data parsing overhead. It can be combined with Gzip or Bzip2 (Hive automatically detects and decompresses the files at query time), but a Gzip-compressed file does not support splitting, so Hive cannot split the data and process it in parallel.

In addition, during deserialization every field delimiter and line terminator has to be identified character by character, so the deserialization cost is dozens of times higher than with SequenceFile.
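As an illustration, a gzip-compressed log can be loaded into a TextFile table as-is; the .gz path below is hypothetical, and the caveat above applies: the file is decompressed transparently at query time but cannot be split across mappers:

create table log_text_gz(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as textfile;

-- The gzip file is copied to HDFS unchanged and decompressed on read,
-- but a single .gz file is handled by a single mapper because gzip is not splittable.
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/access.log.gz' OVERWRITE INTO TABLE log_text_gz;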

SequenceFile

SequenceFile is a binary file format provided by the Hadoop API. It is stored row by row and is easy to use, splittable, and compressible.

Compressing data files saves disk space, but some of Hadoop's native compressed file formats do not support splitting, which is one reason Hadoop provides the SequenceFile format. Files that support splitting can be processed in parallel by multiple mappers; most compressed files are not splittable because they can only be read from the beginning.

SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD is the default, but its compression ratio is low; BLOCK compression is recommended.

One advantage of SequenceFile is that the files are compatible with MapFile in the Hadoop API.

Note: when a table is created with this format and data is imported with LOAD, the data file is simply copied to HDFS without any processing. Data cannot be loaded into SequenceFile, RCFile, or ORC tables directly from local files; it must first be imported into a TextFile table and then written into the SequenceFile or RCFile table with an INSERT from that TextFile table.
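A sketch of that two-step load for SequenceFile, together with BLOCK compression; the table name log_seq and the compression settings are illustrative assumptions, and log_text is the TextFile staging table created in the test section below:

create table log_seq(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as sequencefile;

-- LOAD from a local text file would just copy the file, so stage through the
-- TextFile table and let INSERT rewrite the data in SequenceFile format.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;   -- NONE / RECORD / BLOCK; BLOCK is recommended
insert overwrite table log_seq select * from log_text;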

RCFile

Storage mode: data is divided into blocks by row, and each block is stored by column.

RCFile is short for Record Columnar File, the first columnar storage format in Hadoop. It offers good compression and fast query performance, but does not support schema evolution. It is really a hybrid of row and column storage.

First, it partitions the data into row groups, which keeps all the data of a row in the same block and avoids having to read multiple blocks to reconstruct one record. Second, storing each row group by column is good for data compression and fast column access, since unnecessary columns can be skipped.
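For completeness, an RCFile table is declared the same way and, like SequenceFile, is populated through INSERT rather than LOAD (a sketch, reusing the log_text staging table from the test section below):

create table log_rc(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as rcfile;

insert overwrite table log_rc select * from log_text;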

ORCFile

Storage mode: data is first divided into blocks by row, and within each block the data is stored by column (not pure column storage, but column storage over row segments).

The Optimized Row Columnar (ORC) file format is a columnar storage format in the Hadoop ecosystem. It originated in Apache Hive in early 2013 and is used to reduce Hadoop storage space and speed up Hive queries. Like Parquet, it is not a pure columnar format: the table is first split into row groups (stripes), and the data within each row group is stored by column.

ORC files are self-describing; their metadata is serialized with Protocol Buffers, and the data in the file is compressed as much as possible to reduce storage space. ORC is currently supported by query engines such as Spark SQL and Presto, but Impala does not support it and still uses Parquet as its primary columnar format. In 2015, ORC was promoted to a top-level Apache project.

ORC files offer fast compression and fast column access. ORC is an improved version of RCFile: it compresses better than RCFile, queries faster, and supports various complex data types such as Datetime, Decimal, and complex structs. The data is stored in binary form, so the files cannot be read directly. ORC files are also self-describing and carry a lot of metadata, which is likewise serialized with Protocol Buffers.

Note that ORC spends extra CPU on compression and decompression while reading and writing, although the extra CPU cost is quite low.

Format

ORC file: an ORC file consists of one or more stripes. Each stripe is typically the size of an HDFS block and contains multiple records; it corresponds to the row group concept in Parquet. Each stripe has three parts: Index Data, Row Data, and Stripe Footer.

Stripe: a group of rows forms a stripe. The file is read in units of stripes; a stripe is generally the size of an HDFS block and stores the index and data of each column.

File-level metadata: the PostScript that describes the file, the file meta-information (including statistics over the whole file), information about all stripes, and the file schema.

Stripe metadata: Stores the position of the stripe, statistics for each column in that stripe, and all stream types and positions.

Row group: the minimum unit of the index. A stripe contains multiple row groups of 10,000 rows each by default, and index information and statistics are kept for each column of every row group.

ORC files contain three levels of statistics: file level, stripe level, and row group level. These statistics can be used to decide, based on Search ARGuments (predicate pushdown conditions), whether certain data can be skipped, and different kinds of statistics are kept for different data types.

File level: statistics about each column over the whole file are recorded at the end of the ORC file. They are mainly used for query optimization and can also directly answer simple aggregate queries such as max, min, and sum.

Stripe level: ORC files keep statistics for every field at stripe level, and the ORC reader uses them to decide which stripes of a file a query actually needs to read. For example, if a stripe has max(a) = 10 and min(a) = 3, then for a predicate a > 10 or a < 3 none of the records in that stripe need to be read during query execution.

Row group level: to avoid reading unnecessary data even further, the index of each column is logically split into index groups of a given size (10,000 rows by default, configurable), and statistics are kept per group. The Hive query engine passes the constraints in the WHERE clause to the ORC reader, which uses these group-level statistics to filter out unnecessary data. Setting the group size lower stores more statistics, so users should choose a reasonable value based on the characteristics of their data.
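In Hive, whether these statistics are actually used to skip data is controlled by the ORC predicate-pushdown switch; a minimal sketch, using the log_orc table from the test section below (the query itself is only illustrative):

-- Push WHERE predicates down to the ORC reader as Search ARGuments
SET hive.optimize.index.filter=true;

-- Only stripes and row groups whose min/max statistics overlap [100, 200] are read
select count(1) from log_orc where id between 100 and 200;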

Data access

An ORC file is read from the end: the last 16KB are read first, so that as much of the PostScript and Footer as possible ends up in memory. The final byte of the file holds the length of the PostScript, which cannot exceed 256 bytes. The PostScript stores the metadata of the whole file: the compression format, the maximum length of each compressed block (the size of the memory allocated per read), the Footer length, and some version information. Between the PostScript and the Footer sit the statistics for the whole file, which include, for each column of each stripe, the number of values, the maximum, the minimum, the number of nulls, and so on.

The Footer information for the file is then read, which contains the length and offset of each stripe, the file’s schema information (storing the schema tree in an array according to the number in the schema), the statistics for the entire file, and the number of rows for each row group.

When processing a stripe, the reader first obtains from the file Footer the actual position and length of each stripe and each stripe's Footer (the metadata recording the lengths of the index and the data). The stripe is split into an index part and a data part. The stripe is divided into blocks by row group (the number of records per row group is stored in the file Footer), and each row group is stored by column. Every row group holds the data and index information of multiple streams, and the data of each stream is compressed with an algorithm chosen according to the column type. ORC has the following stream types:

  • PRESENT: holds one bit per value in the column, indicating whether the value is NULL; this makes it possible to store only the non-NULL values in the DATA stream.
  • DATA: the values of this column for the rows in the current stripe.
  • LENGTH: The LENGTH of each member. This is only available for string columns.
  • DICTIONARY_DATA: The contents of a dictionary after encoding string data.
  • SECONDARY: stores auxiliary information, such as the nanosecond part of timestamp values or the scale of Decimal values.
  • ROW_INDEX: Saves the statistics of each row group in the stripe and the start position of each row group.

After all metadata has been retrieved during the initialization phase, you can specify the columns to read through the includes array (a Boolean array; if it is not provided, all columns are read) and pass filter conditions through the SearchArgument parameter. The reader first reads the index information of each stripe according to the metadata, determines from the index statistics and the SearchArgument which row groups need to be read, and then determines from the includes array which columns to read from those row groups. After these two layers of filtering, the data that has to be read is just a number of segments within the stripe, and ORC merges nearby discrete segments to minimize the number of I/O operations. Finally, using the row-group positions stored in the index, the reader seeks to the first row group of the stripe that needs to be read.

When the ORC file format is used, each HDFS block can store one stripe of the ORC file, which requires the stripe size to be no larger than the HDFS block size; otherwise a stripe spans multiple HDFS blocks and reading it involves remote reads. If a stripe is configured to be stored on a single block but the current block does not have enough space left for the next stripe, the ORC writer splits the data and writes a smaller stripe into the remaining space so that the block is filled; the following stripe then starts on the next block.

Because ORC keeps fairly precise index information, reading can start from any row, and the fine-grained statistics allow the reader to skip entire row groups. ORC compresses data blocks and index information with ZLIB by default, so ORC files take up very little storage space.

Parquet

Parquet compresses well, has good query performance, and supports limited schema evolution, but writing is usually slower. The format is primarily used with Cloudera Impala. Parquet files are stored in binary form and therefore cannot be read directly; a file contains both the data and its metadata, so Parquet files are self-describing.

Overall, Parquet's design follows the record shredding and assembly algorithm for nested data structures from Dremel: columnar storage is achieved through efficient record shredding and assembly, and more targeted encoding and compression schemes are then applied to the column data to reduce storage cost and improve computing performance. To understand the logic of this algorithm, see the Dremel paper: Dremel: Interactive Analysis of Web-Scale Datasets.
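As a sketch of how Parquet is used from Hive, the codec is normally chosen through the parquet.compression table property (SNAPPY below is just an example, and the table reuses the log_text data from the test section below):

create table log_parquet_snappy(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
stored as parquet
tblproperties ("parquet.compression"="SNAPPY");

insert overwrite table log_parquet_snappy select * from ods.log_text;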

Test

Prepare test data

First we generate some test data. Here is the code that generates it:

import com.github.javafaker.Address;
import com.github.javafaker.Faker;
import com.github.javafaker.Name;
import com.github.javafaker.Number;
import com.github.javafaker.PhoneNumber;
import org.junit.Test;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ProduceTestData {

    // note: the pattern must be yyyy-MM-dd HH:mm:ss (MM = month, mm = minute)
    SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    @Test
    public void testRandomName() throws IOException {
        Faker faker = new Faker(Locale.CHINA);
        final Name name = faker.name();
        final Address address = faker.address();
        Number number = faker.number();
        PhoneNumber phoneNumber = faker.phoneNumber();

        BufferedWriter out = new BufferedWriter(new FileWriter("/Users/liuwenqiang/access.log"));
        int num = 0;
        while (num < 10000000) {
            int id = number.randomDigitNotZero();
            String userName = name.name();
            String time = simpleDateFormat.format(new Date(System.currentTimeMillis()));
            String city = address.city();
            String phonenum = phoneNumber.cellPhone();

            // build one tab-separated record: id, name, city, phone, time
            StringBuilder stringBuilder = new StringBuilder();
            stringBuilder.append(id);
            stringBuilder.append("\t");
            stringBuilder.append(userName);
            stringBuilder.append("\t");
            stringBuilder.append(city);
            stringBuilder.append("\t");
            stringBuilder.append(phonenum);
            stringBuilder.append("\t");
            stringBuilder.append(time);

            out.write(stringBuilder.toString());
            out.newLine();
            num++;   // advance the counter so the loop stops after 10,000,000 records
        }
        out.flush();
        out.close();
    }
}

Prepare three tables: log_text, log_orc, and log_parquet.

create table log_text(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as textfile;
LOAD DATA LOCAL INPATH '/Users/liuwenqiang/access.log' OVERWRITE INTO TABLE ods.log_text;
create table log_orc(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as orc;
insert overwrite table ods.log_orc select * from ods.log_text;
create table log_parquet(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as parquet;
insert overwrite table ods.log_parquet select * from ods.log_text;

All ORC file parameters are set in the TBLPROPERTIES clause of the Hive DDL statement:

Key                    Default   Notes
orc.compress           ZLIB      high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size      262144    number of bytes in each compression chunk
orc.row.index.stride   10000     number of rows between index entries (must be >= 1000)
orc.create.index       true      whether to create row indexes
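For example, to trade a little compression ratio for less CPU, orc.compress can be switched from the default ZLIB to SNAPPY when the table is created; a sketch mirroring the log_orc table from the test below:

create table log_orc_snappy(
     id int,
     name string,
     city string,
     phone string,
     acctime string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ("orc.compress"="SNAPPY");

insert overwrite table log_orc_snappy select * from ods.log_text;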

Storage space size

(The screenshots showing the on-disk size of the text, orc, and parquet tables are omitted here.)
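The on-disk sizes can be checked directly on HDFS; something like the following, run from the Hive CLI, would reproduce the comparison (the warehouse path assumes a default setup and may differ in your environment):

dfs -du -h /user/hive/warehouse/ods.db/log_text;
dfs -du -h /user/hive/warehouse/ods.db/log_orc;
dfs -du -h /user/hive/warehouse/ods.db/log_parquet;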

Test SQL execution efficiency

The test SQL:

select city, count(1) as cnt from log_text group by city order by cnt desc;
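The same statement is run against the other two tables, changing only the table name:

select city, count(1) as cnt from log_orc group by city order by cnt desc;
select city, count(1) as cnt from log_parquet group by city order by cnt desc;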

(The screenshots showing the query execution time on the text, orc, and parquet tables are omitted here.)

Conclusion

  1. This section describes the characteristics and application scenarios of row storage and column storage
  2. It introduces the common Hive storage formats; Parquet and ORC files are binary and cannot be read directly
  3. It briefly compares the storage usage and query performance of Text, ORC, and Parquet; because the test query is simple and the data set is not very large, the difference in query performance is not dramatic