
Notes

  • Hive supports the following storage formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
  • TEXTFILE is the default format: it is used when no storage format is specified at table creation. When data is loaded into a TEXTFILE table, the file is copied to HDFS as-is, without any processing.
  • SEQUENCEFILE, ORC, and PARQUET tables cannot load data directly from a local file. Load the data into a TEXTFILE table first, then use insert ... select to copy it into the SEQUENCEFILE, ORC, or PARQUET table.
  • TEXTFILE and SEQUENCEFILE are row-based storage formats.
  • ORC and PARQUET are column-based (columnar) storage formats.
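The two-step load described above can be sketched as follows (the table names, columns, and file path here are illustrative, not from the examples below):

```sql
-- 1. Stage the raw file in a TEXTFILE table; the file is copied to HDFS as-is.
create table staging_text (col1 string, col2 string)
row format delimited fields terminated by '\t'
stored as TEXTFILE;

load data local inpath '/tmp/data.txt' into table staging_text;

-- 2. Copy into the columnar table with insert ... select;
--    Hive rewrites the rows in the target format (here ORC).
create table target_orc (col1 string, col2 string)
stored as ORC;

insert into table target_orc select * from staging_text;
```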

Read more about SEQUENCEFILE on my blog – What is SEQUENCEFILE? How does it work?

For more on ORC, see my blog – What is an ORCFile?

For more on PARQUET, see my blog – What is PARQUET?

For more on AVRO, see my blog – What is Apache AVRO?

All file storage formats supported by Hive

| Storage format | Description |
| --- | --- |
| STORED AS TEXTFILE | Stored as plain text. TEXTFILE is the default file format unless the configuration parameter hive.default.fileformat is set differently. Use the DELIMITED clause to read delimited files. Escaping of delimiter characters can be enabled with the ESCAPED BY clause (e.g., ESCAPED BY '\'); escaping is required if the data contains the delimiter characters. A custom NULL format (default: '\N') can be specified with the NULL DEFINED AS clause. (Hive 4.0) All binary columns in the table are assumed to be base64-encoded; to read the raw data bytes, set TBLPROPERTIES ("hive.serialization.decode.binary.as.base64"="false"). |
| STORED AS SEQUENCEFILE | Stored as a compressed sequence file. |
| STORED AS ORC | Stored in ORC file format. Supports ACID transactions and the cost-based optimizer (CBO). Stores column-level metadata. |
| STORED AS PARQUET | Stored in the Parquet columnar storage format (Hive 0.13.0 and later). In Hive 0.10, 0.11, or 0.12, use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... instead. |
| STORED AS AVRO | Stored in Avro format (Hive 0.14.0 and later). |
| STORED AS RCFILE | Stored in record columnar (RCFile) format. |
| STORED AS JSONFILE | Stored as a JSON file (Hive 4.0.0 and later). |
| STORED BY | Stored by a non-native table handler. Creates or links to a non-native table, such as one backed by HBase, Druid, or Accumulo. |
| INPUTFORMAT and OUTPUTFORMAT | In file_format, specify the names of the corresponding InputFormat and OutputFormat classes as string literals, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, use INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat". |
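As an illustration of the last row, a table can be declared with an explicit InputFormat/OutputFormat pair, e.g. for LZO-compressed text (a sketch; the table name and column are illustrative, and the LZO classes must be available on Hive's classpath):

```sql
create table log_lzo (line string)
stored as
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
```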

File storage format comparison

1. Compression ratio test of stored files

1.1 Test Data

The test file, log.txt, is 18.1 MB in size.

1.2 TEXTFILE

  • Create a table to store data in a TEXTFILE format
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as TEXTFILE;
  • Load data into a table
load data local inpath '/home/hadoop/log.txt' into table log_text ;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+

1.3 PARQUET

  • Create a table and store the data in the PARQUET format
create table log_parquet  (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET;
  • Load data into a table
insert into table log_parquet select * from log_text;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+

1.4 ORC

  • Create tables and store data in ORC format
create table log_orc  (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC;
  • Load data into a table
insert into table log_orc select * from log_text ;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

1.5 Summary of compression ratio of stored files

ORC >  PARQUET >   TEXTFILE

2. Query speed test of stored files

2.1 TEXTFILE

select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)

2.2 PARQUET

select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)

2.3 ORC

select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)

2.4 Summary of query speed of stored files

ORC >  TEXTFILE > PARQUET

3. Combination of storage and compression

  • The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.

  • ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). ==ORC uses ZLIB compression by default==.

3.1 Creating an uncompressed ORC table

  • 1. Create an uncompressed ORC table
create table log_orc_none (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties("orc.compress"="NONE");
  • 2. Load data
insert into table log_orc_none select * from log_text ;
  • 3. Check the size of the table
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+

3.2 Creating a SNAPPY-compressed ORC table

  • 1. Create a SNAPPY-compressed ORC table
create table log_orc_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties("orc.compress"="SNAPPY");
  • 2. Load data
insert into table log_orc_snappy select * from log_text ;
  • 3. Check the size of the table
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+

3.3 Creating a ZLIB-compressed ORC table

  • If no compression format is specified, ZLIB is used by default.
  • Refer to the log_orc table created above.
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

Storage mode and compression summary

  • ORC's default compression, ZLIB, produces smaller files than Snappy compression (2.8 M vs. 3.8 M here).
  • In real projects, Hive tables are usually stored as ==ORC or PARQUET==.
  • Because Snappy compresses and decompresses very quickly, ==the compression codec is generally Snappy==.
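Putting the two recommendations together: an ORC + Snappy table looks like the log_orc_snappy example above. For PARQUET, the analogous table property is parquet.compression (a sketch under that assumption; the table name and columns are illustrative):

```sql
create table log_parquet_snappy (
  track_time string,
  url string,
  session_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET tblproperties("parquet.compression"="SNAPPY");
```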