Brief introduction:This paper will introduce Hologres to realize ultra-high cardinality UV calculation based on RoaringBitmap

RoaringBitmap is a compressed bitmap index. The data compression and deduplication features of RoaringBitmap are very suitable for UV calculation with big data. Its main principles are as follows:

  • For 32 bits, RoaringBitmap will construct 2^16 buckets, corresponding to the high 16 bits of 32 bits. The lower 16 bits of the 32 bits are mapped to a bit in the corresponding bucket. The capacity of a single bucket is determined by the maximum value already in the bucket
  • Bitmap represents 32 bits in 1 bit, which can greatly compress the data size.
  • Bitmap operations provide a means of de-redoing.

Body idea (T+1) : Put the UID results aggregated from all the data of the previous day according to the largest query dimension into RoaringBitmap, and store the RoaringBitmap and query dimension in the aggregate result table (million entries per day). Later, when querying, the powerful inventory calculation of Hologres is used to directly query the aggregate result table according to the query dimension. After the OR operation is performed on the key RoaringBitmap field, and the cardinality is counted, the corresponding user number UV can be obtained, and the number of counts can be calculated and PV can be obtained, reaching the sub-second level query.

Only one finest-grained prepolymerization calculation is needed, and only one finest-grained prepolymerization result table is generated. Thanks to the real-time computing capability of Hologres, the number of times and space required for the predicted calculation in this scheme can achieve a lower overhead.

Hologres calculates details of UV and PV schemes

Fig. 1 Hologres calculates PV UV flow based on RoaringBitmap

1. Create related base tables

1) Before using RoaringBitmap, RoaringBitmap Extention needs to be created with the syntax as follows. Meanwhile, this function needs Hologres 0.10 version.

CREATE EXTENSION IF NOT EXISTS roaringbitmap;

Create table ods\_app as the detail source table, store a large amount of detail data (partition by day), its DDL is as follows:

BEGIN; CREATE TABLE IF NOT EXISTS public.ods_app ( uid text, country text, prov text, city text, channel text, operator text, brand text, ip text, click_time text, year text, month text, day text, ymd text NOT NULL ); CALL set_table_property('public.ods_app', 'bitmap_columns', 'country,prov,city,channel,operator,brand,ip,click_time, year, month, day, ymd'); -- Distribution_key is set based on demand, and based on the real-time query needs of the table, CALL set_table_property('public.ods_app', 'distribution_key', 'uid'); -- used as where filter condition, Clustering_key and event_time_column CALL set_table_property('public. Ods_app ', 'clustering_key', 'ymd'); CALL set_table_property('public.ods_app', 'event_time_column', 'ymd'); CALL set_table_property('public.ods_app', 'orientation', 'column'); COMMIT;

3) Create table UID \ _MAPPING as UID mapping table. UID mapping table is used to map UID to 32-bit int type.

The RoaringBitmap type requires that the user ID must be 32-bit int and as dense as possible (user IDs should be contiguous), and many user IDs in common business systems or buried sites are strings, so use the uid\_mapping type to build a mapping table. The mapping table utilizes Hologres SERIAL type (self-incrementing 32-bit int) to achieve automatic management and stable mapping of user mappings.

Note: This table can be saved as a row table or as a column table in the daily batch write scenario in this example. It doesn’t make much difference. If you need to do real-time data (such as with Flink), you need to be a row save table, in order to improve the QPS of Flink dimension table real-time JOIN.

BEGIN; CREATE TABLE public.uid_mapping ( uid text NOT NULL, uid_int32 serial, PRIMARY KEY (uid) ); Call set_table_property(' public-uid_mapping ', 'clustering_key', 'clustering_key', 'uid'); CALL set_table_property('public.uid_mapping', 'distribution_key', 'uid'); CALL set_table_property('public.uid_mapping', 'orientation', 'row'); COMMIT;

3) Create table DWS \ _APP base aggregate table to store the aggregated results on the base dimension

The basic dimension is the smallest dimension to calculate PV and UV after query. Country, Prov and City are taken as examples to build the aggregate table

begin; Create table dws_app(country text, prov text, city text, ymd text NOT NULL, uid32_bitmap roaringbitmap, -- pv count primary key(country, prov, city, ymd)-- query dimension and time as primary keys to prevent repeated insertion of data); CALL set_table_property('public.dws_app', 'orientation', 'column'); Call set_table_property('public. DWS_APP ', 'clustering_key', 'ymd'); CALL set_table_property('public.dws_app', 'event_time_column', 'ymd'); Set to group by CALL set_table_property('public.dws_app', 'distribution_key', 'country,prov,city'); end;

2. Update DWS table and ID \ _MAPPING table

Each day a new customer is found from the previous day’s UID (UID that is not in UID mapping table UID \_mapping) and inserted into the UID mapping table

SELECT user_ids AS (SELECT uid FROM ods_app WHERE ymd = '20210329' GROUP BY uid) AS ( SELECT user_ids.uid FROM user_ids LEFT JOIN uid_mapping ON (user_ids.uid = uid_mapping.uid) WHERE uid_mapping.uid IS NULL ) INSERT INTO uid_mapping SELECT new_ids.uid FROM new_ids ;

After updating the UID mapping table, insert the aggregate result table after the data is aggregated. The main steps are as follows:

  • Firstly, the previous day’s aggregation conditions and the corresponding uid\_int32 are obtained from the INNER JOIN UID mapping table.
  • Then do the aggregation operation according to the aggregation conditions and insert the RoaringBitmap aggregation result table as the aggregation result of the previous day;
  • Only need to aggregate once a day, store one piece of data, the number of data is the worst is equal to the amount of UV. Taking the case as an example, the increment of several hundred million in the detail table every day only needs to store millions of data in the aggregate result table every day.
WITH
    aggregation_src AS( SELECT country, prov, city, uid_int32 FROM ods_app INNER JOIN uid_mapping ON ods_app.uid = uid_mapping.uid WHERE ods_app.ymd = '20210329' )
INSERT INTO dws_app SELECT  country
        ,prov
        ,city
        ,'20210329'
        ,RB_BUILD_AGG(uid_int32)
        ,COUNT(1)
FROM    aggregation_src
GROUP BY country
         ,prov
         ,city
;

3.UV and PV query

When querying, perform aggregation calculation according to the query dimension from the summary table DWS \ _APP, query the bitmap cardinality, and obtain the number of users under the condition of GROUP BY

Set hg_experimental_enable_force_three_stage_agg= OFF -- You can select any combination of the base dimensions you like, Uv pv SELECT country,prov,city,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv,sum(1) AS pv FROM dws_app WHERE ymd = '20210329' GROUP BY country ,prov ,city; SELECT country,prov,RB_CARDINALITY(RB_OR_AGG(uid32_bitmap)) AS uv,sum(1) AS pv FROM dws_app WHERE ymd >= '20210301' and ymd <= '20210331' GROUP BY country ,prov; SELECT COUNTRY, PROV, CITY,COUNT(DISTINCT UID) AS UV,COUNT(1) AS PV FROM ODS_APP WHERE YMD = '20210329' GROUP BY country ,prov ,city; SELECT country ,prov ,COUNT(DISTINCT uid) AS uv ,COUNT(1) AS pv FROM ods_app WHERE ymd >= '20210301' and ymd <= '20210331' GROUP BY country ,prov;

4. Visualize

Calculate UV, PV and, in most cases, need to use BI tools to display in a more intuitive visual way. Since the need to use RB\ _Cardinality and RB\_OR\_AGG for aggregation calculation, the ability to use BI’s custom aggregation function is required. Common BI tools with this capability include Apache SuperSet and Tableau, and best practices for these BI tools are described below.

4.1 Use Apache SuperSet

Refer to the product manual for Apache SuperSet’s approach to Hologres. The DWS \ _APP table can be used as a DATASET directly in a Superset

And within the data set, create a separate Metrics called UV with the following expression:

RB_CARDINALITY(RB_OR_AGG(uid32_bitmap))

You are then ready to explore the data

You can also create Dashborad:

4.2 use Tableau

Please refer to the product manual for Tableau docking Hologres. The ability to implement custom functions directly using Tableau’s pass-through functions is described in Tableau’s manual. After Tableau docks the Hologres, you can create a computed field with the following expression

RAWSQLAGG_INT("RB_CARDINALITY(RB_OR_AGG(%1))", [Uid32 Bitmap])
You are then ready to explore the data

You can, of course, create Dashborad

Copyright Notice:The content of this article is contributed by Aliyun real-name registered users, and the copyright belongs to the original author. Aliyun developer community does not own the copyright and does not bear the corresponding legal liability. For specific rules, please refer to User Service Agreement of Alibaba Cloud Developer Community and Guidance on Intellectual Property Protection of Alibaba Cloud Developer Community. If you find any suspected plagiarism in the community, fill in the infringement complaint form to report, once verified, the community will immediately delete the suspected infringing content.