
Notes

  • Hive supports the following storage formats: TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.
  • TEXTFILE is the default format: it is used when no storage format is specified at table creation. When data is loaded into a TEXTFILE table, the file is copied to HDFS as-is, without any processing.
  • SEQUENCEFILE, ORC, and PARQUET tables cannot load data directly from a local file. Load the data into a TEXTFILE table first, then use insert ... select to copy it into the SEQUENCEFILE, ORC, or PARQUET table.
  • TEXTFILE and SEQUENCEFILE are row-based storage formats.
  • ORC and PARQUET are column-based (columnar) storage formats.
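The two-step load described above can be sketched as follows (the table names, columns, and file path here are illustrative, not from the examples below):

```sql
-- 1. Stage the raw file in a TEXTFILE table; the file is copied to HDFS as-is.
create table staging_text (col1 string, col2 string)
row format delimited fields terminated by '\t'
stored as TEXTFILE;

load data local inpath '/tmp/data.txt' into table staging_text;

-- 2. Copy into the columnar table with insert ... select;
--    Hive rewrites the rows in the target format (here ORC).
create table target_orc (col1 string, col2 string)
stored as ORC;

insert into table target_orc select * from staging_text;
```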

Read more about SEQUENCEFILE on my blog – What is SEQUENCEFILE? How does it work?

For more on ORC, see my blog – What is an ORCFile?

For more on PARQUET, see my blog – What is PARQUET?

For more on AVRO, see my blog – What is Apache AVRO?

All file storage formats supported by Hive

| Storage format | Description |
| --- | --- |
| STORED AS TEXTFILE | Stored as plain text. TEXTFILE is the default file format unless the configuration parameter hive.default.fileformat is set differently. Use the DELIMITED clause to read delimited files. Escaping of delimiter characters can be enabled with the ESCAPED BY clause (e.g., ESCAPED BY '\'); escaping is required if the data contains the delimiter characters. A custom NULL format (default: '\N') can be specified with the NULL DEFINED AS clause. (Hive 4.0) All binary columns in the table are assumed to be base64-encoded; to read the raw data bytes, set TBLPROPERTIES ("hive.serialization.decode.binary.as.base64"="false"). |
| STORED AS SEQUENCEFILE | Stored as a compressed sequence file. |
| STORED AS ORC | Stored in ORC file format. Supports ACID transactions and the cost-based optimizer (CBO). Stores column-level metadata. |
| STORED AS PARQUET | Stored in the Parquet columnar storage format (Hive 0.13.0 and later). In Hive 0.10, 0.11, or 0.12, use ROW FORMAT SERDE ... STORED AS INPUTFORMAT ... OUTPUTFORMAT ... instead. |
| STORED AS AVRO | Stored in Avro format (Hive 0.14.0 and later). |
| STORED AS RCFILE | Stored in record columnar (RCFile) format. |
| STORED AS JSONFILE | Stored as a JSON file (Hive 4.0.0 and later). |
| STORED BY | Stored by a non-native table handler. Creates or links to a non-native table, such as one backed by HBase, Druid, or Accumulo. |
| INPUTFORMAT and OUTPUTFORMAT | In file_format, specify the names of the corresponding InputFormat and OutputFormat classes as string literals, e.g., 'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'. For LZO compression, use INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat". |
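As an illustration of the last row, a table can be declared with an explicit InputFormat/OutputFormat pair, e.g. for LZO-compressed text (a sketch; the table name and column are illustrative, and the LZO classes must be available on Hive's classpath):

```sql
create table log_lzo (line string)
stored as
  INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
  OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
```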

File storage format comparison

1. Compression ratio test of stored files

1.1 Test Data

The test file, log.txt, is 18.1 MB in size.

1.2 TEXTFILE

  • Create a table to store data in a TEXTFILE format
create table log_text (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as TEXTFILE;
  • Load data into a table
load data local inpath '/home/hadoop/log.txt' into table log_text ;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_text;
+------------------------------------------------+--+
|                   DFS Output                   |
+------------------------------------------------+--+
| 18.1 M  /user/hive/warehouse/log_text/log.txt  |
+------------------------------------------------+--+

1.3 PARQUET

  • Create a table and store the data in the PARQUET format
create table log_parquet  (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET;
  • Load data into a table
insert into table log_parquet select * from log_text;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_parquet;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 13.1 M  /user/hive/warehouse/log_parquet/000000_0  |
+----------------------------------------------------+--+

1.4 ORC

  • Create tables and store data in ORC format
create table log_orc  (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC;
  • Load data into a table
insert into table log_orc select * from log_text ;
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

1.5 Summary of compression ratio of stored files

ORC >  PARQUET >   TEXTFILE

2. Query speed test of stored files

2.1 TEXTFILE

select count(*) from log_text;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (16.99 seconds)

2.2 PARQUET

select count(*) from log_parquet;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (17.994 seconds)

2.3 ORC

select count(*) from log_orc;
+---------+--+
|   _c0   |
+---------+--+
| 100000  |
+---------+--+
1 row selected (15.943 seconds)

2.4 Summary of query speed of stored files

ORC >  TEXTFILE > PARQUET

3. Combination of storage and compression

  • The advantage of compression is that it minimizes the disk space required and reduces disk and network I/O.

  • ORC supports three compression options: ZLIB, SNAPPY, and NONE (no compression). ==ORC uses ZLIB compression by default==.

3.1 Creating an uncompressed ORC table

  • 1. Create an uncompressed ORC table
create table log_orc_none (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties("orc.compress"="NONE");
  • 2. Load data
insert into table log_orc_none select * from log_text ;
  • 3. Check the size of the table
dfs -du -h /user/hive/warehouse/log_orc_none;
+----------------------------------------------------+--+
|                     DFS Output                     |
+----------------------------------------------------+--+
| 7.7 M  /user/hive/warehouse/log_orc_none/000000_0  |
+----------------------------------------------------+--+

3.2 Creating a SNAPPY-compressed ORC table

  • 1. Create a SNAPPY-compressed ORC table
create table log_orc_snappy (
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as ORC tblproperties("orc.compress"="SNAPPY");
  • 2. Load data
insert into table log_orc_snappy select * from log_text ;
  • 3. Check the size of the table
dfs -du -h /user/hive/warehouse/log_orc_snappy;
+------------------------------------------------------+--+
|                      DFS Output                      |
+------------------------------------------------------+--+
| 3.8 M  /user/hive/warehouse/log_orc_snappy/000000_0  |
+------------------------------------------------------+--+

3.3 Creating a ZLIB-compressed ORC table

  • If no compression format is specified, ZLIB is used by default.
  • Refer to the log_orc table created above.
  • Check the data size of the table
dfs -du -h /user/hive/warehouse/log_orc;
+-----------------------------------------------+--+
|                  DFS Output                   |
+-----------------------------------------------+--+
| 2.8 M  /user/hive/warehouse/log_orc/000000_0  |
+-----------------------------------------------+--+

Storage mode and compression summary

  • ORC's default compression, ZLIB, produces smaller files than Snappy compression (2.8 M vs. 3.8 M here).
  • In real projects, Hive tables are usually stored as ==ORC or PARQUET==.
  • Because Snappy compresses and decompresses very quickly, ==the compression codec is generally Snappy==.
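Putting the two recommendations together: an ORC + Snappy table looks like the log_orc_snappy example above. For PARQUET, the analogous table property is parquet.compression (a sketch under that assumption; the table name and columns are illustrative):

```sql
create table log_parquet_snappy (
  track_time string,
  url string,
  session_id string
)
row format delimited fields terminated by '\t'
stored as PARQUET tblproperties("parquet.compression"="SNAPPY");
```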