Hive Internal tables and external tables

  • Create table

    • Create internal tables: move data to the path pointed to by the data warehouse;
    • Create an external table: Only record the path of the data, do not change the location of the data.
  • When the delete table

    • Metadata and data for internal tables are deleted together
    • External tables only delete metadata, not data. External tables are relatively more secure, data organization more flexible, and easy to share source data.
Managed table (external); managed table (external) 2. Internal table data is managed by Hive, and external table data is managed by HDFS. 3. The inner table data stored in the hive. Metastore. Warehouse. Dir "default: / user/hive/warehouse", the external table data storage location is decided by the users themselves. 4. If an internal table is deleted, metadata and storage data are deleted. If an external table is deleted, metadata is deleted, and files in the HDFS are not deleted. 5. Changes to internal tables are directly synchronized to metadata. Changes to the TABLE structure and partitions of external tables need to be modified [MSCK REPAIR TABLE table_name].Copy the code

practice

Upload the data source to HDFS

[hadoop@xinxingdata001 data]$ touch student.txt
[hadoop@xinxingdata001 data]$ vim student.txt 
Copy the code
1,xinxing,22,boy
2,laocao,24,boy
3,shiqin,19,girl
4,xiaoming,50,boy
5,xiaohong,49,girl
Copy the code
[hadoop@xinxingdata001 data]$ hdfs dfs -mkdir /data
[hadoop@xinxingdata001 data]$ hdfs dfs -put student.txt /data
[hadoop@xinxingdata001 data]$ hdfs dfs -ls /data

Found 1 items
-rw-r--r--   1 hadoop supergroup         87 2020-06-11 16:18 /data/student.txt

Copy the code

Build (internal) tables

create managed table tbl_internal (
id int,
name string,
age int,
sex string
)
comment 'This is an internal table'
row format delimited fields terminated by ', ';
Copy the code

managedIs the default parameter for creating a table. A common table is an internal table by default

1. View the table structure

hive (xinxing)> desc formatted tbl_internal;
Copy the code

2. Load the data source to Hive

hive (xinxing)> LOAD DATA INPATH '/data/student.txt' OVERWRITE INTO TABLE tbl_internal;
Copy the code

3. Check whether the data is successfully loaded

hive (xinxing)> select * from tbl_internal; OK tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex 1 xinxing 22 boy 2 laocao 24 boy 3 shiqin 19 girl 4 Xiaoming 50 boy 5 Xiaohong 49 girl Time taken: 0.205 seconds, Touchdown: 5 row(s)Copy the code

4. Open the HDFS web page and check the student. TXT path

  • Build a library in Hive if this parameter is not specifiedlocationThe default path is/user/hive/warehouse/
  • Originally stored to/data/student.txtThe source data in the path has been moved/user/hive/warehouse/xinxing.db/tbl_internalIn the

Build (external) tables

create external table tbl_external (
id int,
name string,
age int,
sex string
)
comment 'This is an external table'
row format delimited fields terminated by ', ';
Copy the code

1. View the table structure

hive (xinxing)> desc formatted tbl_external;
Copy the code

2. Load the data source to Hive

hive (xinxing)> LOAD DATA INPATH '/data/student.txt' OVERWRITE INTO TABLE tbl_external;
Copy the code

3. Check whether the data is successfully loaded

hive (xinxing)> select * from tbl_external; OK tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex 1 xinxing 22 boy 2 laocao 24 boy 3 shiqin 19 girl 4 Xiaoming 50 boy 5 Xiaohong 49 girl Time taken: 0.205 seconds, Touchdown: 5 row(s)Copy the code

Look at the differences between internal and external tables

1, delete two tables

hive (xinxing)> drop table tbl_internal;
OK
Time taken: 0.518 seconds

hive (xinxing)> drop table tbl_external;
OK
Time taken: 0.056 seconds
Copy the code

2. View the two tables in HDFS

  • Thus it can be seentbl_internalThis table no longer exists in HDFS,tbl_external/student.txtThere is still a

3. Create it againtbl_externalTable structure

create external table tbl_external (
id int,
name string,
age int,
sex string
)
comment 'This is an external table'
row format delimited fields terminated by ', ';

Copy the code

4, the querytbl_externaltable

hive (xinxing)> select * from tbl_external; OK tbl_internal.id tbl_internal.name tbl_internal.age tbl_internal.sex 1 xinxing 22 boy 2 laocao 24 boy 3 shiqin 19 girl 4 Xiaoming 50 boy 5 Xiaohong 49 girl Time taken: 0.205 seconds, Touchdown: 5 row(s)Copy the code

conclusion

1. When data is loaded to Hive, both internal and external table source data is moved toThe default path is /database/tbl_Under the library

2. If location is used to specify the source data path, the source data will not be moved

3, internal table delete, delete table structure and data

4. Delete the external table. Only the table structure is deleted, and the data still exists in HDFS.

The official website for the introduction of the inner (outer) table

A table created without the EXTERNAL clause is called a managed table because Hive manages its data. Managed and External Tables By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration. Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables. An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information. Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped. Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type. Statistics can be managed on internal and external tables and partitions for query optimization.

Website links: cwiki.apache.org/confluence/…