How to efficiently import and update data in batch data warehouse

Abstract:GAUSSDB (DWS) supports the Merge Into function, which can be used to update and insert large amounts of data simultaneously. It is a very important technology for data warehouse.

This article is shared from Huawei Cloud Community “how to efficiently import and update data in bulk”, the original author: Acydy.

preface

What if we have a table that we want both to update and to insert? You can use UPDATE and INSERT to accomplish your goals.

If you have a large amount of data and want to complete the task as soon as possible, is there any alternative? Don’t miss the Merge Into feature of GaussDB(DWS).

The MERGE INTO concept

Merge Into is a standard introduced in SQL 2003.

If a table T, as well as being updatable, is insertable-into, then rows can be inserted into it (subject to applicable Access Rules and Conformance Rules). The primary effect of an <insert statement> on T is to insert into T each of the zero or more rows contained in a specified table. The primary effect of a <merge statement> on T is to replace zero or more rows in T with specified rows and/or to insert into T zero or more specified rows, depending on the result of a <search condition> and on whether one or both of <merge when matched clause> and <merge when not matched clause> are specified.

A table can be either updated or inserted in a statement. Merge When Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched: merge When Not Matched Clause (what to do if the condition does not match)

SQL 2008 is extended to include NOT MATCHED and NOT MATCHED functions.

MERGE has been extended to support multiple MATCHED and NOT MATCHED clauses, each accompanied by a search condition, that gives much greater flexibility in the coding of complex MERGE statements to handle update conflicts.

The MERGE INTO command involves two tables. Target table: The table to be inserted or updated. Source table: The table used to match the target table, the source of the data for the target table.

The MERGE INTO statement matches the data in the target table and the source table against the association condition. UPDATE the target table if the association condition matches, INSERT is performed on the target table if the association condition cannot match.

Usage scenario: When a business needs to add a large amount of data from a table to an existing table, Merge INTO can be used to import the data efficiently and avoid multiple INSERT+UPDATE operations.

The MERGE INTO grammar

GAUSSDB (DWS) MERGE INTO

MERGE INTO table_name [ [ AS ] alias ] USING { { table_name | view_name } | subquery } [ [ AS ] alias ] ON ( condition )  [ WHEN MATCHED THEN UPDATE SET { column_name = { expression | DEFAULT } | ( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...] [ WHERE condition ] ] [ WHEN NOT MATCHED THEN INSERT { DEFAULT VALUES | [ ( column_name [, ...] ) ] VALUES ( { expression | DEFAULT } [, ...] ) [, ...] [ WHERE condition ] } ];

INTO specifies the target table.
USING to specify the source table. The source table can be a regular table or a subquery.
On association condition, which specifies the association condition of the target table and the source table.
When MATCHED: When MATCHED: When MATCHED: When MATCHED: When MATCHED: When MATCHED: When MATCHED: When MATCHED
When NOT MATCHED When NOT MATCHED When NOT MATCHED When NOT MATCHED When NOT MATCHED When NOT MATCHED When NOT MATCHED
- When MATCHED, WHEN NOT MATCHED
- When MATCHED, WHEN NOT MATCHED
- When MATCHED, WHEN NOT MATCHED

Practical application

First create the following tables to perform the MREGE INTO operation.

gaussdb=# CREATE TABLE dst (
  product_id INT,
  product_name VARCHAR(20),
  category VARCHAR(20),
  total INT
) DISTRIBUTE BY HASH(product_id);

gaussdb=# CREATE TABLE dst_data (
  product_id INT,
  product_name VARCHAR(20),
  category VARCHAR(20),
  total INT
) DISTRIBUTE BY HASH(product_id);

gaussdb=# CREATE TABLE src (
  product_id INT,
  product_name VARCHAR(20),
  category VARCHAR(20),
  total INT
) DISTRIBUTE BY HASH(product_id);

gaussdb=# INSERT INTO dst_data VALUES(1601,'lamaze','toys',100),(1600,'play gym','toys',100),(1502,'olympus','electrncs',100),(1501,'vivitar','electrnc',100),(1666,'harry potter','dvd',100);
gaussdb=# INSERT INTO src VALUES(1700,'wait interface','books',200),(1666,'harry potter','toys',200),(1601,'lamaze','toys',200),(1502,'olympus camera','electrncs',200);
gaussdb=# INSERT INTO dst SELECT * FROM dst_data;

When MATCHED and WHEN NOT MATCHED

View the plan to see how Merge Into is executed.

MERGE INTO (MERGE INTO) MERGE INTO (MERGE INTO) MERGE INTO

gaussdb=# EXPLAIN (COSTS off)
MERGE INTO dst x
USING src y
ON x.product_id = y.product_id
WHEN MATCHED THEN
  UPDATE SET product_name = y.product_name, category = y.category, total = y.total
WHEN NOT MATCHED THEN
  INSERT VALUES (y.product_id, y.product_name, y.category, y.total);
 
                    QUERY PLAN
--------------------------------------------------
  id |                operation
-----+--------------------------------------------
   1 | ->  Streaming (type: GATHER)
   2 |    ->  Merge on dst x
   3 |       ->  Streaming(type: REDISTRIBUTE)
   4 |          ->  Hash Left Join (5, 6)
   5 |             ->  Seq Scan on src y
   6 |             ->  Hash
   7 |                ->  Seq Scan on dst x

  Predicate Information (identified by plan id)
 ------------------------------------------------
   4 --Hash Left Join (5, 6)
         Hash Cond: (y.product_id = x.product_id)
(14 rows)

Why is this a Left Join?

Since the target table needs to be updated when it matches the source table, data is inserted into the target table when it does not match. That is, part of the source table data is used to update the target table, and part is used to insert into the target table. The semantics are similar to that of a LEFT JOIN.

 5 --Seq Scan on public.src y
         Output: y.product_id, y.product_name, y.category, y.total, y.ctid
         Distribute Key: y.product_id
   6 --Hash
         Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
   7 --Seq Scan on public.dst x
         Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
         Distribute Key: x.product_id

Execute Merge Into to see the results.

The two tables can be related when PRODUCT_ID is 1502,1601,1666, so the three records are updated. Mysql > SELECT * FROM SRC WHERE PRODUCT_ID = 1700; Others have not been modified.

gaussdb=# SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+--------------+-----------+------- 1501 | vivitar | electrnc |  100 1502 | olympus | electrncs | 100 1600 | play gym | toys | 100 1601 | lamaze | toys | 100 1666 | harry potter | dvd | 100 (5 rows) gaussdb=# SELECT * FROM src ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1502 | olympus camera | electrncs | 200 1601 | lamaze | toys | 200 1666 | harry potter | toys | 200 1700 | wait interface | books | 200 (4 rows)  gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total); MERGE 4 gaussdb=# SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc 1502 | | 100 -- unmodified push around camera | electrncs 1600 | | 200 - update play gym | toys 1601 | | 100 -- unmodified lamaze | toys | 200-1666 | update harry potter | toys | 200 -- updated 1700 | wait interface | books | 200 - insert (6 rows)

Check the number of UPDATE and INSERT

You can view the number of updates and inserts from Explain Performance or Explain Analyze. (Only the essential parts are shown here)

You can see in the Predicate Information section that there is one insert and three updates.

You can see the Information for each node in the DataNode Information section. Update datanode1 update 2 datanode2 update 1

gaussdb=# EXPLAIN PERFORMANCE MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total); Predicate Information (identified by plan id) ------------------------------------------------ 2 --Merge on public.dst x  Merge Inserted: 1 Merge Updated: 3 Datanode Information (identified by plan id) --------------------------------------------------------------------------------------- 2 --Merge on public.dst x datanode1 (Tuple Inserted 0, Tuple Updated 2) datanode2 (Tuple Inserted 1, Tuple Updated 1)

When NOT MATCHED.

When NOT MATCHED: When NOT MATCHED: When NOT MATCHED: When NOT MATCHED: When NOT MATCHED: When NOT MATCHED: When NOT MATCHED

gaussdb=# EXPLAIN (COSTS off) MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total; QUERY PLAN -------------------------------------------------- id | operation ----+----------------------------------- 1 | -> Streaming (type: GATHER) 2 | - > Merge on DST 3 x | - > Hash Join (4, 5) 4 | - > Seq Scan on DST x 5 | - > Hash 6 | - > Seq Scan on SRC y Predicate Information (identified by the plan id) -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 3 -- Hash Join Hash (4, 5) Cond: (x.product_id = y.product_id) (13 rows)

View the results after execution. Merge Into only operates on 3 pieces of data.

gaussdb=# truncate dst; gaussdb=# INSERT INTO dst SELECT * FROM dst_data; gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total; MERGE 3 gaussdb=# SELECT * FROM dst; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc 1502 | | 100 -- unmodified push around camera | electrncs 1600 | | 200 - update play gym | toys 1601 | | 100 -- unmodified lamaze | toys | 200-1666 | update harry potter | toys | 200 - update (5 rows)

When NOT MATCHED

Inserts only if there is a mismatch. No data has been updated in the results.

gaussdb=# EXPLAIN (COSTS off) MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total); QUERY PLAN -------------------------------------------------- id | operation ----+----------------------------------------- 1 | -> Streaming (type: GATHER) 2 | -> Merge on dst x 3 | -> Streaming(type: REDISTRIBUTE) 4 | -> Hash Left Join (5, 6) 5 | -> Seq Scan on src y 6 | -> Hash 7 | -> Seq Scan on dst x Predicate Information (identified by plan id) ------------------------------------------------ 4 --Hash Left Join (5, 6) Hash Cond: (y.product_id = x.product_id) (14 rows) gaussdb=# truncate dst; gaussdb=# INSERT INTO dst SELECT * FROM dst_data; gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN NOT MATCHED THEN INSERT VALUES (y.product_id,  y.product_name, y.category, y.total); MERGE 1 gaussdb=# SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc 1502 | | 100 -- unmodified push around | electrncs 1600 | | 100 -- unmodified play gym | toys 1601 | | 100 -- unmodified lamaze | toys | - 100 Unmodified 1666 | harry potter DVDS | 1700 | | 100 -- unmodified wait interface | books | 200 - insert (6 rows)

WHERE Filtering Conditions

The semantics are to determine whether the current row meets the filter condition before an update or insert is performed, and if not, no update or insert is performed. If you do not want to update the field, you need to specify the filter criteria.

The following example updates only the rows where product_name = ‘Olympus’ when two tables are correlative. Product_id of the source table when the two tables cannot be related! = 1700 will be inserted.

gaussdb=# truncate dst; gaussdb=# INSERT INTO dst SELECT * FROM dst_data; gaussdb=# MERGE INTO dst x USING src y ON x.product_id = y.product_id WHEN MATCHED THEN UPDATE SET product_name = y.product_name, category = y.category, total = y.total WHERE x.product_name = 'olympus' WHEN NOT MATCHED THEN INSERT VALUES (y.product_id, y.product_name, y.category, y.total) WHERE y.product_id ! = 1700; MERGE 1 gaussdb=# SELECT * FROM dst ORDER BY 1; SELECT * FROM dst ORDER BY 1; product_id | product_name | category | total ------------+----------------+-----------+------- 1501 | vivitar | electrnc  | 100 1502 | olympus camera | electrncs | 200 1600 | play gym | toys | 100 1601 | lamaze | toys | 100 1666 | harry potter | dvd | 100 (5 rows)

The subquery

Subqueries can be used in the Using section for more complex association operations.

The result of the aggregation operation on the source table is then matched with the target table

MERGE INTO dst x
USING (
  SELECT product_id, product_name, category, sum(total) AS total FROM src group by product_id, product_name, category
) y
ON x.product_id = y.product_id
WHEN MATCHED THEN
    UPDATE SET product_name = x.product_name, category = x.category, total = x.total
WHEN NOT MATCHED THEN
    INSERT VALUES (y.product_id, y.product_name, y.category, y.total + 200);

The result of the UNION of multiple tables is then matched with the target table

MERGE INTO dst x
USING (
  SELECT 1501 AS product_id, 'vivitar 35mm' AS product_name, 'electrncs' AS category, 100 AS total UNION ALL
  SELECT 1666 AS product_id, 'harry potter' AS product_name, 'dvd' AS category, 100 AS total
) y
ON x.product_id = y.product_id
WHEN MATCHED THEN
    UPDATE SET product_name = x.product_name, category = x.category, total = x.total
WHEN NOT MATCHED THEN
    INSERT VALUES (y.product_id, y.product_name, y.category, y.total + 200);

The stored procedure

gaussdb=# CREATE OR REPLACE PROCEDURE store_procedure1()
AS
BEGIN
  MERGE INTO dst x
  USING src y
  ON x.product_id = y.product_id
  WHEN MATCHED THEN
    UPDATE SET product_name = y.product_name, category = y.category, total = y.total;
END;
/

CREATE PROCEDURE
gaussdb=# CALL store_procedure1();

Merge INTO

MREGE INTO is converted INTO a LEFT JOIN or INNER JOIN to associate the target table with the source table. So how do you know if a row wants to be updated or inserted?

View the output of the operator by Explain Verbose. CTID columns are printed when both tables are scanned. So what does the CTID column do?

  5 --Seq Scan on public.src y
         Output: y.product_id, y.product_name, y.category, y.total, y.ctid
         Distribute Key: y.product_id
   6 --Hash
         Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
   7 --Seq Scan on public.dst x
         Output: x.product_id, x.product_name, x.category, x.total, x.ctid, x.xc_node_id
         Distribute Key: x.product_id

The CTID identifies the specific location of the row in storage, so that the data at this location can be updated. As an MPP distributed database, GAUSSDB (DWS) also needs to know the node information (xc_node_id). These two values are required for the UPDATE operation.

CTID is also useful here in MREGE INTO. When the target table matches and needs to be updated, the row’s CTID value is retained. If it doesn’t match, just insert it. If a CTID is not required, the CTID value is NULL. Depending on whether the CTID output from the LEFT JOIN is NULL, the final decision is whether the row should be updated or inserted.

After the two tables are joined, update or insert a row based on the CTID column output after the JOIN.

Matters needing attention

When using Merge Into, pay attention to whether the match criteria are appropriate. If you are not careful, it is easy for the data to be updated unexpectedly, and the whole table may be updated.

conclusion

GAUSSDB(DWS) provides the efficient data import function Merge Into, which is a very key function for the data warehouse. You can use MERGE INTO to update and insert a table at the same time, and you can complete the data import very quickly even when the data volume is very large.

Want to know more information about GUASSDB (DWS), welcome WeChat search “GAUSSDB DWS” pay attention to WeChat public number, and you share the latest and most complete PB series silo black technology, background can also obtain many learning materials Oh ~

Click on the attention, the first time to understand Huawei cloud fresh technology ~