1. The background

Zipper table is an important table data processing method. Data structure can be used in the algorithm, which is analogous to zipper table in the warehouse. It is designed to solve the SCD requirements in the warehouse establishment.

The common methods of handling SCD are as follows:

  • Retains the original value

  • Directly covered

  • Add a new property column

  • The snapshot table

  • Zipper table

This article mainly explains the zipper table to deal with the problem of SCD, its characteristics are summarized as follows, there are several scenarios, you can use the zipper table.

1. There is a large amount of table data, so using the full table will occupy a lot of storage

2. Table data will be modified. It is difficult to process repeated and modified data with an incremental table

3. There is a need to backtrack, to know the full data at a certain point in history

4. The data are modified, but the frequency and amount are not very large, such as only one part in a million

2. Zipper table processing theory

First of all, the zipper table is a full table and not a partitioned table. In order to achieve the effects described above, it is necessary to have an intermediate table as an intermediate springboard. The intermediate springboard table is a partitioned table with incremental data. Two fields that are not related to the original data need to be added to the zip table to identify the start time and end time of the data. In the example, the two dates are start_date and end_date respectively. The zip table can be processed in the following three ways: initialization, daily data update, and rollback.

2.1 Initializing and Adding Data

Its daily scrolling mode is shown below:

The initialization part is the start time of the full scale of the zipper, and also establishes the earliest time that can be rolled back during the rollback. The daily update logic is shown in the figure above. The new data is divided into two parts. To update the data, modify the corresponding start_date and end_date respectively.

2.1 Data Rollback

With the update logic above, let’s consider how to roll back the data, that is, to a point in history, which in the case of the zipper table is full, so only one rollback is required. The rollback strategy can be based on the start_date and end_date generated by the rollback point in time and data. The following diagram shows how to roll back the data:

end_date < rollback_dateThe data to be retained for processingEnd_date ≥ ROLLBACK_date ≥ STARt_dateSet up theend_date9999-12-31For the result of a rollback, generally to maintain data integrity, you can put the rolled back data in a new zipper temporary table.

3. Zipper table processing case

For the common layer DIM or dimension layer is a common scenario for the zipper table, here is an example to see how to add and roll back the zipper table.

The zipper table is used to realize the DIM layer merchant dimension table in the core transaction analysis, and the rollback of the zipper table is realized.

3.1 Creating tables and importing Data

The structure of the merchant dimension table is as follows:

-- Create business information table (delta table partition table)
drop table if exists ods.ods_trade_shops;
create table ods.ods_trade_shops(
  `shopid` int COMMENT 'shops ID',
  `userid` int COMMENT 'Shop Owner', 
  `areaid` int COMMENT 'area ID',
  `shopname` string COMMENT 'Shop Name',
  `shoplevel` int COMMENT 'Shop Level',
  `status` int COMMENT 'Store Status',
  `createtime` string COMMENT 'Creation Date',
  `modifytime` string COMMENT  'Modify date'
) COMMENT 'Merchant Information Sheet'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by ', ';

Create a business information dimension table
drop table if exists dim.dim_trade_shops;
create table dim.dim_trade_shops(
  `shopid` int COMMENT 'shops ID',
  `userid` int COMMENT 'Shop Owner', 
  `areaid` int COMMENT 'area ID',
  `shopname` string COMMENT 'Shop Name',
  `shoplevel` int COMMENT 'Shop Level',
  `status` int COMMENT 'Store Status',
  `createtime` string COMMENT 'Creation Date',
  `modifytime` string COMMENT  'Modify date',
  `startdate` string  COMMENT 'Effective Start Date',
  `enddate` string  COMMENT 'Expiration end date'
) COMMENT 'Merchant Information Sheet';

Copy the code

Import the following test data:

/root/data/shop- 2020.- 11- 20.dat
100050.1.100225,WSxxx Camp supermarket,1.1.2020- 06- 28.2020- 11- 20 13:22:22
100052.2.100236, fresh XXX flagship store,1.1.2020- 06- 28.2020- 11- 20 13:22:22
100053.3.100011, Huawei XXX Flagship store,1.1.2020- 06- 28.2020- 11- 20 13:22:22
100054.4.100159, Xiaomi XXX flagship store,1.1.2020- 06- 28.2020- 11- 20 13:22:22
100055.5.100211, Apple XXX Flagship store,1.1.2020- 06- 28.2020- 11- 20 13:22:22
 
 
/root/data/shop- 2020.- 11- 21.dat
100057.7.100311, three XXX mice snacks,1.1.2020- 06- 28.2020- 11- 21 13:22:22
100058.8.100329, Liangzi XXX Restaurant food,1.1.2020- 06- 28.2020- 11- 21 13:22:22
100054.4.100159, Xiaomi XXX flagship store,2.1.2020- 06- 28.2020- 11- 21 13:22:22
100055.5.100211, Apple XXX Flagship store,2.1.2020- 06- 28.2020- 11- 21 13:22:22
 
 
/root/data/shop- 2020.- 11- 22.dat
100059.9.100225, Leju XXX daily necessities,1.1.2020- 06- 28.2020- 11- 22 13:22:22
100060.10.100211, Tongren XXX Great Health,1.1.2020- 06- 28.2020- 11- 22 13:22:22
100052.2.100236, fresh XXX flagship store,1.2.2020- 06- 28.2020- 11- 22 13:22:22

load data local inpath '/root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition(dt='2020-11-20');
load data local inpath '/root/data/shop-2020-11-21.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-21');
load data local inpath '/root/data/shop-2020-11-22.dat' overwrite  into table ods.ods_trade_shops partition(dt='2020-11-22');

Copy the code

3.2 Zipper table initialization

Let’s assume that the data from the first day is all the historical data

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0.10)
           ELSE substr(createtime, 0.10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt ='2020-11-20';
Copy the code

3.3 Update the zipper table

For the increment table, the general logic is that the interception of CREATE_time or MODIfyTime is used as the day partition DT, modifyTime is greater than or equal to CREATE_time, here take the first two

INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0.10)
           ELSE substr(createtime, 0.10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '2020-11-21'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('2020-11-21'.- 1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='2020-11-21') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
Copy the code

The script for loading the zipper table is as follows:

dim_load_shops.sh

#!/bin/bash
 
source /etc/profile
if [ -n "$1" ]
then
  do_date=$1
else
  do_date=`date -d "-1 day" +%F`
fi
 
sql="
INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       CASE
           WHEN modifytime IS NOT NULL THEN substr(modifytime, 0, 10)
           ELSE substr(createtime, 0, 10)
       END AS startdate,
       '9999-12-31' AS enddate
FROM ods.ods_trade_shops
WHERE dt = '$do_date'
UNION ALL
SELECT b.shopid,
       b.userid,
       b.areaid,
       b.shopname,
       b.shoplevel,
       b.status,
       b.createtime,
       b.modifytime,
       b.startdate,
       CASE
           WHEN a.shopid IS NOT NULL
                AND b.enddate ='9999-12-31' THEN date_add('$do_date', -1)
           ELSE b.enddate
       END AS enddate
FROM
  (SELECT *
   FROM ods.ods_trade_shops
   WHERE dt='$do_date') a
RIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid;
"
 
hive -e "$sql"
Copy the code

To load data from 2020-12-22, run the following script: sh dim_load_shops. Sh 2020-12-22

3.4 Roll back the zipper table to a certain time point

Create a temporary table, tmp.shops_tmp, to hold the rolled back data

DROP TABLE IF EXISTS tmp.shops_tmp;
CREATE TABLE IF NOT EXISTS tmp.tmp_shops AS
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       enddate
FROM dim.dim_trade_shops
WHERE enddate < '2020-11-21'
UNION ALL
SELECT shopid,
       userid,
       areaid,
       shopname,
       shoplevel,
       status,
       createtime,
       modifytime,
       startdate,
       '9999-12-31' AS enddate
FROM dim.dim_trade_shops
WHERE startdate < = '2020-11-21'
  AND enddate > = '2020-11-21';


INSERT OVERWRITE TABLE dim.dim_trade_shops
SELECT *
FROM tmp.tmp_shops;
Copy the code

The rollback script is similar to the update script, as long as the SQL in it is updated, which will not be repeated here.