Introduction | Tencent internally needs to analyze the effect of massive amounts of game marketing data every day, and deduplicating activity participation has always been a difficult point. This article introduces the technical thinking and iterations behind the deduplication service of Yixing, Tencent's game marketing activity analysis system, and we look forward to discussing it with you. Author: Wang Fangxiao, Tencent operations development engineer.

I. Background

EAS (Yixing) is Tencent's internal system focused on analyzing game marketing activities. When analyzing the effect of marketing activities, one of the biggest problems EAS encounters is deduplicating the number of participants of each activity and generating the corresponding number package (the file of deduplicated participant IDs). The period of a single marketing activity is fixed, but the time windows of different activities rarely coincide exactly.

For example, if activity A runs from the 1st to the 10th and activity B runs from the 5th to the 15th, then to obtain the deduplicated number of participants for A and B, separate tasks must be launched to deduplicate A and B over their respective time intervals. Doing similar calculations for thousands of campaigns a day against massive logs is a challenge in terms of resource consumption.

In fact, the number of tasks to be calculated is far greater than that. Yixing also provides deduplicated data for non-activity links on game official websites, and even deduplicated data for each link under each promotion channel. At present, the number of such tasks is as high as 500,000+ (50W+) per day.

In summary, the main problem we face is how to handle a huge number of deduplication tasks with different cycles on top of massive data.

II. Original solutions

For this problem, Yixing tried many solutions before; here is a brief introduction.

1. Scheme based on TDW temporary tables

TDW is Tencent's universal one-stop big data platform with stable services and powerful functions. For these calculation tasks, Yixing first hoped to complete them with the help of TDW's computing power.

The idea is very simple: loop over each activity in pySQL and execute the corresponding HiveSQL to complete the T+1 calculation.
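As a rough illustration of that loop, the sketch below builds one HiveSQL statement per activity and submits it sequentially; the table name t_act_log, the fields act_id, uin and ds, and the execute_hive() helper are hypothetical stand-ins rather than the actual TDW setup.

```python
# A rough, illustrative sketch of the per-activity loop, assuming data lives in a
# Hive table t_act_log partitioned by ds, with fields act_id and uin; the
# execute_hive() callable that submits SQL to TDW is hypothetical.
from datetime import date, timedelta

def build_dedup_sql(act_id, start, end):
    """Build the HiveSQL that counts distinct participants of one activity."""
    return (
        "SELECT COUNT(DISTINCT uin) FROM t_act_log "
        f"WHERE act_id = '{act_id}' "
        f"AND ds BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'"
    )

def run_daily_dedup(activities, execute_hive):
    """activities: iterable of (act_id, start_date, end_date) tuples."""
    yesterday = date.today() - timedelta(days=1)  # T+1: data up to yesterday
    for act_id, start, end in activities:
        # Each activity scans its own time window sequentially, which is why
        # the same raw logs end up being read over and over.
        execute_hive(build_dedup_sql(act_id, start, min(end, yesterday)))
```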

However, the biggest disadvantage of this solution is that tasks are executed sequentially and a large number of logs are scanned repeatedly, which makes it inefficient. In addition, pulling the final deduplicated files from HDFS also takes a lot of time.

Although various optimizations were tried later (for example, deduplicating the raw logs into a temporary table in a unified daily batch and basing all calculations on that temporary table), the efficiency could not be improved much further, so the scheme was abandoned.

2. Scheme based on real-time computing + incremental file deduplication

In Yixing's data statistics, the Storm-based real-time computing tasks mainly provide counting data such as the real-time PV and the number of participations of each activity.

Because memory resources are limited, the industry also uses approximate deduplication algorithms (such as HyperLogLog) to compute an approximate deduplicated result directly in Storm, but these cannot give an exact result or the final number package file, so they did not meet our selection requirements.

However, memory resources are limited and clearly cannot hold such a large amount of number package data, so obtaining the final exact deduplication result purely in memory is basically unrealistic.

Although memory cannot hold the ID data for a whole activity period or even a single day, can it hold the ID data for 1 minute or 5 minutes?

Test calculations show that it is entirely feasible to cache the deduplication data of a 5-minute window in memory, which can reduce the raw logs by up to 90% in magnitude; a 1-minute cache can reduce the raw logs by up to 70%.

The main reason is that players' participation is mostly immediate behavior: after a player lands on a campaign page, they usually take part in the participable functions one after another in a continuous burst, and once finished they will not participate again for a long time. As a result, the log timestamps of the same player are highly contiguous, and deduplication within a unit time window reduces the magnitude considerably.

Based on this, Yixing currently relies mainly on Storm to perform a first pass of deduplication within a unit time window, in order to reduce the magnitude of the raw data.
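A minimal sketch of this first-pass, time-window deduplication idea follows; it is illustrative only, and the class name, the (act_id, uin) key, and the emit() callback are assumptions rather than the actual Storm topology code.

```python
# Minimal sketch of first-pass deduplication inside a fixed time window, in the
# spirit of the Storm stage described above; the class, the (act_id, uin) key and
# the emit() callback are illustrative assumptions, not the real topology code.
import time

class WindowDedup:
    def __init__(self, window_seconds=300, emit=print):
        self.window_seconds = window_seconds  # e.g. a 5-minute window
        self.emit = emit                      # downstream sink for deduped records
        self.seen = set()                     # (act_id, uin) pairs seen in this window
        self.window_start = time.time()

    def process(self, act_id, uin):
        """Forward each (activity, player) pair at most once per window."""
        self._maybe_rotate()
        key = (act_id, uin)
        if key not in self.seen:
            self.seen.add(key)
            self.emit(key)                    # only the first occurrence goes downstream

    def _maybe_rotate(self):
        # Clearing the set every window keeps memory bounded; duplicates that span
        # two windows are handled later by the incremental deduplication stage.
        if time.time() - self.window_start >= self.window_seconds:
            self.seen.clear()
            self.window_start = time.time()
```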

In addition to the repeated-scanning problem mentioned above, the original TDW-based deduplication scheme had another issue: for example, during the period of activity A (the 1st to the 10th), the calculation logic on each day is basically the same, and each day requires a full scan of the logs (or intermediate results) from the 1st up to the current date to complete the calculation.

Therefore, the team focused on how to deduplicate incrementally on top of the deduplication results already produced in the earlier days of an activity. The scheme finally selected was a file-based computing scheme: as illustrated in the accompanying figure, each day the activity rolls forward to generate the latest deduplicated number package file; on the next day, the same activity's newly deduplicated IDs are cross-checked against the cumulative number package file, duplicates are removed, and the file is updated; this repeats until the activity ends and the final number package file is obtained.
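The rolling merge just described could look roughly like the following sketch, assuming a simple one-ID-per-line number package file; the paths, file format, and function names are illustrative, not the production implementation.

```python
# Hedged sketch of the rolling merge: one cumulative number package file per
# activity, updated each day with the IDs that survived the window-level
# deduplication; paths and the one-ID-per-line format are assumptions.
import os

def incremental_dedup(total_pkg_path, daily_ids_path):
    """Merge one day's IDs into the activity's cumulative number package file
    and return the updated deduplicated count."""
    total = set()
    if os.path.exists(total_pkg_path):
        with open(total_pkg_path) as f:
            total.update(line.strip() for line in f if line.strip())

    with open(daily_ids_path) as f:
        new_ids = {line.strip() for line in f if line.strip()} - total

    # Append only the IDs not seen on previous days; the file rolls forward day
    # by day until the activity ends and becomes the final number package.
    with open(total_pkg_path, "a") as f:
        for uid in new_ids:
            f.write(uid + "\n")

    return len(total) + len(new_ids)
```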

3. Scheme based on real-time computing + LevelDB incremental deduplication

After the file-based scheme had run for a period of time, a big problem emerged: the number of files added each day was huge, in the hundreds of thousands.

Although the inodes of a single machine were not exhausted, the IO on such a large number of small files made incremental deduplication very inefficient, and in the end the service was forced to support only high-priority activities and large activities above a certain threshold.

After investigation by teammates, we finally set our eyes on LevelDB, Google's open-source persistent single-machine KV database, which offers high random-write and sequential read/write performance but only average random-read performance.

In other words, LevelDB is well suited to scenarios with few reads and many writes, which is exactly the access pattern of the number package deduplication service.

In addition, number package storage is itself a KV structure: the file name is the key and the file content is the value, which maps naturally onto the KV structure that LevelDB supports.

Using LevelDB, the exact deduplicated participant count of an activity can be obtained in milliseconds, and a number package file with tens of millions of IDs can be exported within 10 seconds. Compared with the traditional file operations, the overall efficiency of the number package deduplication service is greatly improved.
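To make the idea concrete, here is a hedged sketch of a LevelDB-backed deduplication store using the third-party plyvel binding; the "act:{act_id}:{uin}" key layout and the function names are assumptions for illustration, not the actual service code.

```python
# Hedged sketch of a LevelDB-backed deduplication store using the third-party
# plyvel binding; the "act:{act_id}:{uin}" key layout and the function names are
# assumptions for illustration, not the actual service code.
import plyvel

db = plyvel.DB("/data/dedup_leveldb", create_if_missing=True)

def add_if_new(act_id, uin):
    """Return True if this player is seen for the first time in this activity."""
    key = f"act:{act_id}:{uin}".encode()
    if db.get(key) is None:
        db.put(key, b"1")   # writes are cheap in LevelDB; reads are rare here
        return True
    return False

def export_package(act_id, out_path):
    """Dump the activity's deduplicated IDs into a number package file."""
    prefix = f"act:{act_id}:".encode()
    count = 0
    with open(out_path, "w") as f:
        for key, _ in db.iterator(prefix=prefix):
            f.write(key.decode().rsplit(":", 1)[-1] + "\n")
            count += 1
    return count
```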

III. ClickHouse-based solution

Although the LevelDB-based deduplication service meets the deduplication needs of most marketing activities well, it is still, in essence, similar to an OLAP system based on pre-computation: it is hard to extend and hard to trace back through the data. For example, the system only supports calculating the deduplicated count over the whole activity period; if you want to know the deduplicated count for some sub-interval within the activity, it cannot be done.

In addition, once dirty data gets into an activity, the only option is to delete the activity's entire KV structure and recompute it from scratch, which is very time-consuming. After investigation, the team turned its attention to MPP-based OLAP solutions.

In an MPP-based OLAP system, each node of the shared-nothing database cluster has its own independent disk storage and memory; business data is partitioned across the nodes according to the database model and application characteristics; the data nodes are connected to each other through a private or commodity network and cooperate on computation, providing database services as a whole.

Compared with OLAP systems based on pre-computation, its biggest advantage is flexibility and extensibility, while its biggest disadvantage is that response times are not guaranteed to be fast and may even require a long wait.

In the effect analysis of marketing activities, flexibility is often more important than efficiency, or rather, some efficiency can be sacrificed, so we chose an MPP-based OLAP system.

There are many good OLAP systems on the market, but they are either commercial and fee-based (Vertica) or based on the Hadoop ecosystem (Presto, Impala), and their overall architecture is heavy.

ClickHouse, by contrast, has its own file system, a very high compression ratio, can be deployed as a cluster even without ZooKeeper, and its performance can even beat commercial OLAP systems (see the official benchmark data): clickhouse.tech/benchmark/d… .

For example, the following ClickHouse SQL queries the exact deduplicated visitor count of a page URL on the LOL official website on September 6th:

select uniqExact(uvid) from tbUv where date = '2020-09-06' and url = 'http://lol.qq.com/main.shtml';

On a machine with 24 cores and 96 GB of memory, out of 100 million records, it takes less than 0.1 seconds to accurately deduplicate an activity with about 1 million (100W) participants, and less than 0.2 seconds to export its number package file.
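The flexibility gained can be sketched as follows with the third-party clickhouse-driver client: the same table now answers both whole-period and arbitrary sub-interval deduplication, and can dump the number package on demand. The act_id column and connection settings are assumptions; only date, url and uvid appear in the schema above.

```python
# Hedged sketch of flexible deduplication queries against ClickHouse, using the
# third-party clickhouse-driver client; the act_id column and connection
# settings are assumptions (only date, url and uvid appear in the article).
from clickhouse_driver import Client

client = Client(host="localhost")

def dedup_count(act_id, start, end):
    """Exact deduplicated participant count for any sub-interval of an activity."""
    rows = client.execute(
        "SELECT uniqExact(uvid) FROM tbUv "
        "WHERE act_id = %(act)s AND date BETWEEN %(s)s AND %(e)s",
        {"act": act_id, "s": start, "e": end},
    )
    return rows[0][0]

def export_package(act_id, start, end, out_path):
    """Export the deduplicated ID list (number package) for the same interval."""
    rows = client.execute(
        "SELECT DISTINCT uvid FROM tbUv "
        "WHERE act_id = %(act)s AND date BETWEEN %(s)s AND %(e)s",
        {"act": act_id, "s": start, "e": end},
    )
    with open(out_path, "w") as f:
        for (uvid,) in rows:
            f.write(str(uvid) + "\n")
    return len(rows)
```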


IV. Conclusion

The deduplication service problem has accompanied the whole development and operation cycle of the Yixing system, during which many attempts were made. Some temporary attempts are not listed in this article, but the starting point has always been the business demand itself, with the technical solution chosen according to the operation and maintenance environment at the time.

Instead of pursuing absolute high performance (which means high cost), we focus on what is most suitable: easy to extend, easy to migrate, and easy to recover from failures, which is conducive to the long-term stability of the service. Of course, as more tools emerge, we will continue to try them; after all, science and technology are the primary productive forces.

ClickHouse is currently used in Yixing and several other data systems, as well as in many personalized marketing analysis scenarios, with a total data volume of more than 500 billion records and still growing.

This article gives a brief introduction to ClickHouse through a historical review of Yixing's deduplication problem. For more details and practices on ClickHouse, please refer to the official documentation and other shared content, and you are welcome to discuss more detailed ClickHouse questions and application scenarios in the comments section.