How do you analyze billions of audience records in 30 minutes?

Smart Marketing Cloud (hereinafter SMC) is a digital marketing platform launched by TalkingData. Backed by data from TalkingData and its partners, SMC provides an integrated solution covering audience building, customer insight, synchronized ad delivery, and objective monitoring, helping enterprises build a complete closed loop for digital marketing.

I. Demand and difficulties of audience analysis

SMC serves advertisers and advertising agencies across a wide range of industries, giving them analysis of, insight into, and access to their target audiences. Because SMC brings together multi-source data, including first-party enterprise data, second-party media data, and TalkingData's own data, the data volume is very large. In addition, to build a comprehensive and in-depth portrait of each audience, TalkingData has established a labeling system with six categories and more than 800 labels based on demographic attributes and mobile behavior preferences, so the number of dimensions is also large. Together, these pose a huge challenge for data processing and analysis.

In practice, product performance is something enterprises care about a great deal. To improve SMC's performance and let users reach target-audience insights quickly and accurately, we optimized SMC's audience analysis in three areas: extending RoaringBitmap to long integers, caching Bitmaps in RocksDB, and random sampling.

II. Application of technical principles and schemes

In SMC, because of the huge data volume, we use RoaringBitmap to store all the audiences that advertisers build.

The native RoaringBitmap is indexed by the int type, whose maximum value is 2147483647. TalkingData covers about 8 billion devices, which is far beyond that range, so we extended RoaringBitmap to accept long integer values.

Taking the set(long) method as an example, the addressing method might look something like this:

public void set(long offset) {
    // Which underlying bitmap holds this offset
    int index = (int) (offset / max());
    // Position of the offset inside that bitmap
    int value = (int) (offset % max());
    bitmaps.get(index).set(value);
}
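The addressing idea can be tried out without any external library. The following is a minimal sketch, not SMC's actual implementation: the class name LongBitmap is hypothetical, java.util.BitSet stands in for RoaringBitmap, and the bucket width is assumed to be Integer.MAX_VALUE to mirror max() in the snippet above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: splits a long offset into (bucket index, position in bucket).
class LongBitmap {
    // Assumed bucket width, mirroring max() in the article's snippet.
    private static final long BUCKET_SIZE = Integer.MAX_VALUE;

    // Buckets are created lazily, so sparse data does not allocate empty bitmaps.
    private final Map<Integer, BitSet> buckets = new HashMap<>();

    public void set(long offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be non-negative");
        }
        int index = (int) (offset / BUCKET_SIZE);
        int value = (int) (offset % BUCKET_SIZE);
        buckets.computeIfAbsent(index, k -> new BitSet()).set(value);
    }

    public boolean get(long offset) {
        if (offset < 0) {
            return false;
        }
        BitSet bucket = buckets.get((int) (offset / BUCKET_SIZE));
        return bucket != null && bucket.get((int) (offset % BUCKET_SIZE));
    }

    // Total number of set bits across all buckets.
    public long cardinality() {
        long total = 0;
        for (BitSet bucket : buckets.values()) {
            total += bucket.cardinality();
        }
        return total;
    }
}
```

Note that BitSet is uncompressed, so this sketch is only suitable for checking the index arithmetic on small inputs; the compression that makes billions of IDs practical comes from RoaringBitmap itself.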

With RoaringBitmap extended this way, both storage and read speed improved. But this is just the beginning: multidimensional analysis and calculation over the audience data come next.

RocksDB acceleration

SMC audience analysis dimensions include demographic attributes, device attributes, business travel attributes, app behavior analysis, and more. Analyzing one ad audience package across these dimensions requires about 100,000 Bitmap intersection operations, and the system's CPU and I/O become the bottleneck. We therefore used RocksDB to cache Bitmaps and reduce I/O consumption.

RocksDB offers many flexible configuration options that allow it to be tuned for different production environments, including pure memory, Flash, hard disk, or HDFS. It also supports several compression algorithms and ships with a complete set of tools for production and debugging.

The advantages of RocksDB are as follows:

Designed for application servers that need to store terabytes of data to local Flash or RAM
Optimized for small and medium key values stored on high-speed devices — support for Flash or direct memory storage
Performance increases linearly with the number of CPUs and is friendly to multi-core systems

RocksDB supports the snappy, zlib, bzip2, lz4, and lz4_hc compression algorithms, and different algorithms can be configured for different levels of data. In general, 90% of the data is stored in the Lmax level. A typical installation might leave L0-L2 uncompressed, use Snappy for the middle levels, and use zlib for the Lmax level. With RocksDB, I/O performance improved significantly: tasks that used to take more than 3 hours now complete in 1.5 hours.
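The layered layout described above can be written down as a per-level list. This is a dependency-free sketch under stated assumptions: the class CompressionPlan and the enum Codec are hypothetical names, and the wiring into RocksDB's own per-level compression option is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lays out one codec per LSM level.
class CompressionPlan {
    enum Codec { NONE, SNAPPY, ZLIB }

    // L0-L2 uncompressed, middle levels Snappy, last level (Lmax) zlib.
    static List<Codec> perLevel(int numLevels) {
        if (numLevels < 1) {
            throw new IllegalArgumentException("numLevels must be >= 1");
        }
        List<Codec> plan = new ArrayList<>();
        for (int level = 0; level < numLevels; level++) {
            if (level <= 2) {
                plan.add(Codec.NONE);      // hot, short-lived data: skip compression
            } else if (level == numLevels - 1) {
                plan.add(Codec.ZLIB);      // bulk of the data: strongest compression
            } else {
                plan.add(Codec.SNAPPY);    // middle levels: cheap, fast compression
            }
        }
        return plan;
    }
}
```

For a store with 7 levels, perLevel(7) yields NONE for L0-L2, SNAPPY for L3-L5, and ZLIB for L6. The trade-off is that the upper levels are rewritten often, so CPU spent compressing them is mostly wasted, while the last level holds most of the bytes and benefits most from a stronger codec.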

But that was still too long, so we decided to sample the data to speed things up further.

Random sampling algorithm

Random sampling is one of the most commonly used algorithms. Its key feature is that it can infer the overall characteristics of the data as objectively as possible by extracting and computing on only a small sample.

We need to sample randomly while keeping the result ordered. Given n devices in total, m devices must be randomly selected, where m < n. The output is an ordered list of m random integers in the range [0, n-1], with no repetition allowed. From a probabilistic point of view, we want an ordered selection without repetition in which every possible selection is equally likely. Put simply, we randomly pick m of the n numbers and keep them in order.

Each of the n numbers is examined in turn and selected with probability m/n, where n is the count of numbers not yet examined and m is the count still to be selected. After each examination, n = n-1. If the number just examined was selected, m = m-1; otherwise, m remains unchanged.

Implementation method:

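public static Set<Long> random(long n, int m) {
    // TreeSet keeps the selected indices in ascending order
    Set<Long> set = new TreeSet<Long>();
    // Number of candidates not yet examined, including the current one
    long remaining = n;
    for (long i = 0; i < n; i++) {
        // Select i with probability m / remaining
        if (Math.random() * remaining < m) {
            set.add(i);
            m -= 1;
        }
        remaining -= 1;
    }
    return set;
}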
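For experimentation, the same selection-sampling idea can be written with a seeded generator so that runs are reproducible. This is a sketch under stated assumptions: the class name OrderedSampler, the long[] return type, and the Random parameter are illustrative choices, not part of the original method.

```java
import java.util.Random;

// Hypothetical variant of the ordered sampling routine with an injectable Random.
class OrderedSampler {
    // Returns m distinct values from [0, n-1] in ascending order.
    static long[] sample(long n, int m, Random rng) {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= n");
        }
        long[] picked = new long[m];
        int count = 0;
        // Stop early once m values are chosen; the rest cannot be selected anyway.
        for (long i = 0; i < n && count < m; i++) {
            long remaining = n - i;   // candidates not yet examined, including i
            int needed = m - count;   // values still to be selected
            // Select i with probability needed / remaining.
            if (rng.nextDouble() * remaining < needed) {
                picked[count++] = i;
            }
        }
        return picked;
    }
}
```

A call such as OrderedSampler.sample(1000L, 10, new Random(42L)) returns ten strictly increasing values below 1000. Because candidates are visited in ascending order, the output is already sorted and no extra sorting structure is needed; a primitive array also avoids boxing when m is large.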

We used the method above to randomly select a sample from the full set of devices, and processed the sampled data into a Bitmap for analysis. Let A be the resulting Bitmap for the audience X and M the Bitmap of all male devices. The male proportion P in the audience X is then calculated as:

P = |A ∩ M| / |A|

where |·| is the number of devices in a Bitmap.

The proportion obtained from random sampling still carries some bias. We compared 50 randomly constructed audiences and analyzed their gender ratios; the relative error did not exceed 8%, which is within the acceptable range.
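The proportion estimate and its error check can be sketched with small in-memory bitmaps. Several things here are assumptions rather than details from SMC: java.util.BitSet stands in for the real Bitmap type, the class and method names are hypothetical, the estimate is read as the share of sampled devices that also appear in the male Bitmap, and relative error is taken as the conventional |estimate - exact| / exact.

```java
import java.util.BitSet;

// Hypothetical helper for estimating a segment's share within a sampled audience.
class SampleProportion {
    // Share of devices in `sample` that also appear in `segment`.
    static double proportion(BitSet sample, BitSet segment) {
        int sampleSize = sample.cardinality();
        if (sampleSize == 0) {
            throw new IllegalArgumentException("sample is empty");
        }
        // Work on a copy so the caller's sample bitmap is left untouched.
        BitSet both = (BitSet) sample.clone();
        both.and(segment);
        return (double) both.cardinality() / sampleSize;
    }

    // Conventional relative error of an estimate against the exact value.
    static double relativeError(double estimate, double exact) {
        if (exact == 0.0) {
            throw new IllegalArgumentException("exact value must be non-zero");
        }
        return Math.abs(estimate - exact) / Math.abs(exact);
    }
}
```

With a sample containing devices 1 to 4 and a male Bitmap containing devices 2, 4, and 6, proportion returns 2 / 4. Comparing that estimate with the proportion computed on the full audience via relativeError gives the kind of figure quoted above.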

After sampling, the Bitmap data occupied significantly less RocksDB storage, and Bitmap calculations became significantly more efficient. An audience analysis task over billions of records can now be completed within 30 minutes.

Based on these optimizations, Smart Marketing Cloud can complete ad audience analysis quickly, so advertisers can understand the characteristics and distribution of their target audience in a timely manner throughout the advertising process and adjust their target audience groups accordingly.

Author: TalkingData Chen Hailong

