Hive Zipper Table (47)

Zipper watch first acquaintance

You need to view the status of nodes at a certain time in history and consider the storage space. The data may change, but for the most part it stays the same

In the process of data warehouse data model design, often the design to the table: there are some watch the amount of data is very big, such as a user list, about 1 billion records, 50 fields, this kind of table, even using ORC compression, single table storage will be more than 100 g, in HDFS using double backups or three words. Some fields in the table will be updated, such as user contact information, product description information, order status, and so on.

You need to view the historical snapshot information of a point in time or period. For example, you need to view the status of an order at a point in time.

The percentage and frequency of changes recorded in the table is not very large. For example, out of a total of 1 billion users, about 2 million users are added and changed every day. The percentage of changes is very small

If you keep a full copy of this table every day, a lot of unchanged information will be saved in each full copy, which is a great waste of storage

Below is a zipper table that stores the most basic information about the user and the life cycle of each record. We can use this chart to get the latest data for the current day and the previous historical data.

The registration date	The user id	Mobile phone number	t_start_date	t_end_date
2017-01-01	001	111111	2017-01-01	9999-12-31
2017-01-01	002	222222	2017-01-01	2017-01-01
2017-01-01	002	233333	2017-01-02	9999-12-31
2017-01-01	003	333333	2017-01-01	9999-12-31
2017-01-01	004	444444	2017-01-01	2017-01-01
2017-01-01	004	432432	2017-01-02	2017-01-02
2017-01-01	004	432432	2017-01-03	9999-12-31
2017-01-02	005	555555	2017-01-02	2017-01-02
2017-01-02	005	115115	2017-01-03	9999-12-31
2017-01-03	006	666666	2017-01-03	9999-12-31

Description:

t_start_dateRepresents the start time of the life cycle of the record,t_end_dateIndicates the end time of the life cycle of the record;
t_end_date= ‘9999-12-31’ indicates that the record is valid.
If all current valid records are queriedselect * from user where t_end_date = '9999-12-31'
If the query2017-01-01Is the historical snapshot ofselect * from user where t_start_date <= '2017-01-01' and end_date >= '2017-01-01', this statement will find the following records:

Use scenario of zipper table

In the process of data model design of data warehouse, the following table design is often encountered:

Some tables have a large amount of data, such as a user table, about 1 billion records, 50 fields, even with ORC compression, the storage of a single table will be more than 100 GB, in HDFS with double or triple backup is larger.
Some fields in the table will be updated, such as user contact information, product description information, order status, and so on.
You need to view the historical snapshot information of a point in time or period. For example, you need to view the status of an order at a point in time.
The percentage and frequency of changes recorded in the table is not very large. For example, out of a total of 1 billion users, about 2 million users are added and changed every day. The percentage of changes is very small.

For this kind of watch design? There are several options:

Solution 1: Keep only the latest copy every day. For example, we use Datax to extract the latest full copy of data to Hive every day.
Plan 2: Keep one full slice data every day.
Plan three: use zipper table.

Why use a zipper watch

Plan 1: Keep only the latest portion each day

This solution is easy to implement. Drop the previous day’s data every day, and start again with the latest one. Advantages are obvious, save space, some common use is also very convenient, do not have to select a table when adding a time partition of what. The disadvantages are also obvious. Without historical data, old accounts can only be retrieved by other means, such as drawing from a flow sheet.

Plan 2: Keep one full slice data every day

A full slice per day is a safer bet, and historical data are available. The disadvantage is that the storage space is too large and too large. If you keep a full copy of the table every day, a lot of unchanged information will be saved in the full copy every time, which is a great waste of storage. Of course, we can also make some trade-offs, such as only keeping data for the last month? But demand is shameless, and the life cycle of data is not something we can completely control.

Scheme three: zipper table

The zipper table basically takes into account our needs in use. First of all, it makes a spatial trade-off. Although it’s not as small as Plan 1, its daily increment may be only 1/1000 or even 1/10,000 of Plan 2. In fact, it can meet the requirements of plan 2, which can not only get the latest data, but also add filtering conditions and get historical data. So we are still very necessary to use the zipper watch.

Zipper watch design

Information changes in the user table in the Mysql relational database

The data in the 2017-01-01 table are:

The registration date	The user id	Mobile phone number
2017-01-01	001	111111
2017-01-01	002	222222
2017-01-01	003	333333
2017-01-01	004	444444

As shown in table 2017-01-02, the data of users 002 and 004 have been modified, and 005 is the new user:

The registration date	The user id	Mobile phone number	note
2017-01-01	001	111111	There is no
2017-01-01	002	233333	(from 222222 to 233333)
2017-01-01	003	333333	There is no
2017-01-01	004	432432	(from 444444 to 432432)
2017-01-02	005	555555	(Added in 2017-01-02)

As shown in table 2017-01-03, the data of users 004 and 005 have been modified, and 006 is the new user:

The registration date	The user id	Mobile phone number	note
2017-01-01	001	111111
2017-01-01	002	233333
2017-01-01	003	333333
2017-01-01	004	654321	(from 432432 to 654321)
2017-01-02	005	115115	(from 555555 to 115115)
2017-01-03	006	115115	(2017-01-03 added)

If you were to save the table as a historical zip table in a data warehouse, you would have the following table, which is the latest date (2017-01-03) :

The registration date	The user id	Mobile phone number	t_start_date	t_end_date
2017-01-01	001	111111	2017-01-01	9999-12-31
2017-01-01	002	222222	2017-01-01	2017-01-01
2017-01-01	002	233333	2017-01-02	9999-12-31
2017-01-01	003	333333	2017-01-01	9999-12-31
2017-01-01	004	444444	2017-01-01	2017-01-01
2017-01-01	004	432432	2017-01-02	2017-01-02
2017-01-01	004	432432	2017-01-03	9999-12-31
2017-01-02	005	555555	2017-01-02	2017-01-02
2017-01-02	005	115115	2017-01-03	9999-12-31
2017-01-03	006	666666	2017-01-03	9999-12-31

Description:

t_start_dateRepresents the start time of the life cycle of the record,t_end_dateIndicates the end time of the life cycle of the record;
t_end_date = '9999-12-31'Indicates that the record is currently in valid state;
If all current valid records are queriedselect * from user where t_end_date = '9999-12-31'
If the query2017-01-01Is the historical snapshot ofSelect * from user where t_start_date <= '2017-01-01' and end_date >= '2017-01-01'.

The realization and update of zipper table

Implement zipper tables in Hive

Need aODSLayer of user full scale.
Daily users are required to update the table, known as the delta table

In addition, we need to determine the time granularity of the zipper table. For example, the zipper table only takes one state every day. That is to say, if there are three state changes in a day, we only take the last state.

Get daily increments

We can synchronize today’s change data (add + modify) based on a timestamp, update time or create time, which is usually the actual update time

Table structure

User_update table for the ODS layer

CREATE EXTERNAL TABLE ods.user_update (
  user_num STRING COMMENT 'User number',
  mobile STRING COMMENT 'Mobile number',
  reg_date STRING COMMENT 'Registration Date'
COMMENT 'Daily User Information Update Table'
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/ods/user_update';)Copy the code

CREATE EXTERNAL TABLE ods.user_his (
  user_num STRING COMMENT 'User number',
  mobile STRING COMMENT 'Mobile number',
  reg_date STRING COMMENT 'User number',
  t_start_date ,
  t_end_date
COMMENT 'User Data Zipper Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/dws/user_his';)Copy the code

Update (zipper)

Suppose you now need to update the 2017-01-02 data

INSERT OVERWRITE TABLE ods.user_his
SELECT * FROM
(
    SELECT A.user_num,
           A.mobile,
           A.reg_date,
           A.t_start_time,
           CASE
                WHEN A.t_end_time = '9999-12-31' AND B.user_num IS NOT NULL THEN '2017-01-01'
                ELSE A.t_end_time
           END AS t_end_time
    FROM ods.user_his AS A
    LEFT JOIN ods.user_update AS B
    ON A.user_num = B.user_num
UNION ALL
    SELECT C.user_num,
           C.mobile,
           C.reg_date,
           '2017-01-02' AS t_start_time,
           '9999-12-31' AS t_end_time
    FROM ods.user_update AS C
) AS T
Copy the code

In this, some time is written. In fact, in the data warehouse, it mainly refers to the specific business time. For example, if the business time is PT, we can replace the relevant time

So one of the problems with this is that some people are using a UNION instead of a UNION ALL when they’re merging data and we can ALL see that they should be using a UNION ALL because they don’t have to be duplicated, although the UNION can be used because there’s no duplication of data, But there’s a performance cost and it’s not in keeping with our forbidden habits.

supplement

How to track records more

From a warehouse point of view we can do higher frequency zipper, such as hourly synchronization data zipper, but if we want to pursue complete state change, then this time-driven approach may not be so suitable
From the business point of view to record the overall state change, that is, from the business system design to capture the state change

Zipper meter and flow meter

A flow table stores a user’s change records. For example, in a flow table, every change record of a user is stored in the data of a day, but in a zipper table, there is only one record. This is a particle size problem that needs to be paid attention to when zipper table design. We could certainly set the granularity to be smaller, but as we said earlier, if you want complete state changes, counting compartments is not appropriate, and you may lose some state even if you zipper once a minute.

Zipper table performance optimization

Of course, the zipper table will also encounter performance problems. For example, we have stored five years of zipper data, so this table is bound to be relatively large, mainly with the following two performance problems

Querying performance Problems
Zipper (update) performance issues

But whether the query problem or update performance problems or the main cause of the table is too large, as long as we understand this problem we can very well solve, we can split the zipper table, for example, we can according to t_END_time is not ‘9999-12-31’ split into two parts, Select * from no_update table where t_end_time = ‘9999-12-31’; select * from no_update table where t_end_time = ‘9999-12-31’

INSERT OVERWRITE TABLE ods.user_no_update select * from ods.user_his where t_end_time! = '9999-12-31'; INSERT OVERWRITE TABLE ods.user_his select * from ods.user_his where t_end_time='9999-12-31' ;Copy the code