1

As the Xianyu (Idle Fish) business has grown, its user base has reached the hundreds of millions, and the user dimension now carries hundreds of data indicators. Quickly screening out the user groups that match operational expectations from hundreds of millions of records, so that refined crowd operations can be carried out, is a problem the technology team needs to solve. Many solutions in the industry take minutes or even hours to produce query results. This article presents an efficient data screening, statistics, and analysis method for this big-data scenario: from hundreds of millions of records, the required data can be filtered with any combination of query conditions and returned within milliseconds.

2

From a technical perspective, our business scenario has the following characteristics:

  • Queries must support arbitrary combinations of dimensions (AND/OR, possibly nested) with low latency;

  • The data scale is large, at least in the hundreds of millions of records, and must support continuous growth;

  • A single record has many dimensions, at least hundreds, and new dimensions are continuously added.

Taken together, this is a typical OLAP scenario.

OLTP and OLAP

Here is a brief comparison between OLTP and OLAP, using MySQL as a typical OLTP database:

  • Data volume bottleneck: MySQL works well with tables in the millions of rows; beyond that, query and write performance degrade significantly. Sharding (splitting databases and tables) is therefore generally used to keep each table's data in the millions of rows.

  • Query efficiency bottleneck: MySQL needs an index, or a composite index, for each common query condition. Queries on non-indexed fields require a full table scan, which causes significant performance degradation.

In summary, our application scenario is not well suited to a row-store database, so we turned our attention to column-store databases.
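
To make the indexing limitation concrete, here is a minimal sketch against a hypothetical MySQL user table (the table, columns, and index are illustrative, not from our actual system). A composite index only helps queries that filter on its leftmost prefix; a query that combines other dimensions falls back to a full table scan:

    -- composite index covering two commonly filtered dimensions
    CREATE INDEX idx_city_age ON user_profile (city, age);

    -- uses the index: filters on the leftmost prefix (city, age)
    SELECT COUNT(*) FROM user_profile WHERE city = 'hangzhou' AND age < 30;

    -- cannot use idx_city_age: these dimensions would each need their own index,
    -- otherwise MySQL scans the whole table
    SELECT COUNT(*) FROM user_profile WHERE last_login_days < 7 OR order_cnt_30d > 10;

With hundreds of dimensions that can be combined freely, pre-building an index for every combination is impractical.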

Row storage versus column storage

The following is a simple comparison of the characteristics of row storage and column storage:

  • Row storage keeps all fields of a record together. It is efficient for reading and writing whole records, which suits transactional workloads, but an analytical query that touches only a few columns must still read entire rows.

  • Column storage keeps the values of each column together. An analytical query reads only the columns it needs, same-typed column data compresses well, and scans and aggregations over large data sets are much faster, which suits OLAP workloads.

Introduction to the HybridDB for MySQL compute specification

For our scenario, the core capabilities of the HybridDB for MySQL compute specification are:

  • Smart composite indexing across any dimension (users do not need to build indexes themselves);

  • Millisecond-level responses for queries on tables with tens of billions of rows;

  • Compatibility with the MySQL BI ecosystem and complete SQL support;

  • Support for spatial search, full-text search, and complex data types (multi-valued columns, JSON).

So how does the HybridDB for MySQL compute specification achieve millisecond responses for arbitrary dimensional combination queries in big data scenarios?

  • First, HybridDB’s high-performance columnar storage engine pushes predicate evaluation down into the storage layer and uses a variety of statistics to skip data blocks quickly, achieving fast filtering;

  • Second, HybridDB’s intelligent indexing technology can automatically build full indexes on the large wide table in one step, and intelligently combines the column indexes of the various predicate conditions for filtering;

  • Third, the high-performance fused MPP+DAG computing engine balances the two modes of high concurrency and high throughput, performs high-performance pipelined vectorized computation, and cooperates closely with the storage layer to make computation even faster;

  • Fourth, HybridDB supports a variety of data modeling techniques such as star schema, snowflake schema, and aggregate sorting; data modeling appropriate to the business yields even better performance.

In general, the HybridDB for MySQL compute specification is a SQL-centric, multi-functional online real-time data warehouse that fits our business scenario very well, so we built the underlying engine of our crowd selection system on top of it.

3

After building the crowd selection engine, we focused on upgrading our message push system, an important landing point for refined crowd operations.

Introduction to Xianyu message push

Message PUSH is the fastest way for information to reach users. Xianyu's usual PUSH workflow is to compute the target crowd offline, prepare the corresponding PUSH copy, and then send the PUSH at a specified time the next day; these are typically periodic PUSH tasks. Temporary, urgent PUSH tasks that need to be sent immediately, however, require the intervention of BI colleagues, and each such task takes roughly half a day of BI development time on average, which makes the process quite cumbersome. This time we integrated the crowd selection system with the original PUSH system, greatly improving the efficiency of data preparation and delivery for such PUSHes and freeing up development resources.

System architecture

Offline data layer: user-dimension data is scattered across the offline tables of the various business systems. An offline T+1 scheduled task aggregates this data and imports it into the user wide table in the real-time computing layer.
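
As a rough illustration of the user wide table (the column names below are hypothetical, and the exact DDL and import mechanism in HybridDB may differ; this is a generic MySQL-style sketch), each user occupies one row and every screening dimension is a column, so new dimensions can be added by appending columns:

    -- hypothetical user wide table: one row per user, one column per dimension
    CREATE TABLE userbigtable (
        user_id         BIGINT NOT NULL,
        city            VARCHAR(32),
        gender          TINYINT,
        last_login_days INT,
        order_cnt_30d   INT,
        is_seller       TINYINT,
        -- ... hundreds of further dimension columns
        PRIMARY KEY (user_id)
    );

    -- T+1 import: refresh the snapshot with the latest offline aggregation
    REPLACE INTO userbigtable
    SELECT user_id, city, gender, last_login_days, order_cnt_30d, is_seller
    FROM offline_user_summary;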

Real-time computing layer: based on a crowd's screening conditions, it queries the user wide table for the number of matching users and the list of matching user IDs, and provides these as services to the application systems.

Crowd selection front-end system: provides a visual operation interface. Operations colleagues select screening conditions and save them as a crowd, for analysis or for sending PUSH. Each crowd is stored as a SQL statement, similar to:

    SELECT COUNT(*) FROM userbigtable
    WHERE column1 > 1 AND column2 IN ('a', 'b') AND (column31 = 1 OR column32 = 2)

The SQL can combine arbitrary fields with any nesting of AND/OR. Because a crowd is saved as SQL, whenever the data in the user table changes, the SQL can be re-executed at any time to obtain the latest set of users and refresh the crowd.
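
As a sketch of the two kinds of queries the real-time computing layer serves for a saved crowd (the WHERE clause and column names here are illustrative), the same stored condition is reused both for counting the crowd and for producing the user ID list consumed downstream:

    -- crowd size, shown in the selection interface
    SELECT COUNT(*) FROM userbigtable
    WHERE last_login_days < 7 AND (order_cnt_30d > 3 OR is_seller = 1);

    -- user ID list for the same crowd, consumed by the PUSH system
    SELECT user_id FROM userbigtable
    WHERE last_login_days < 7 AND (order_cnt_30d > 3 OR is_seller = 1);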

Xianyu PUSH system: it obtains the WHERE conditions of a crowd from the crowd selection front-end system, fetches the user list page by page from the real-time computing layer, and sends PUSH messages to those users. During implementation we focused on solving the performance problem of paged queries.

Paged query optimization: when the crowd is very large (tens of millions of users), query performance drops significantly as the page number grows. We therefore add a row number to the crowd data and export it to MySQL to improve performance. The exported table contains the following fields:

  • Batch number: a new batch number is generated each time the crowd is exported; it is timestamp-based and monotonically increasing.

  • Row number: increments from 1; within each batch the row numbers run from 1 to N.

We build a composite index on "crowd ID" + "batch number" + "row number", and replace offset-based paging with index range queries on the row number, so query efficiency is maintained even for very large page numbers.
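
Below is a minimal sketch of the exported MySQL table and the paged query (table and column names are illustrative, and the LIMIT/OFFSET form shown for contrast is our assumption about the naive approach):

    -- exported crowd table in MySQL, with the composite index used for paging
    CREATE TABLE crowd_export (
        crowd_id BIGINT NOT NULL,   -- crowd ID
        batch_no BIGINT NOT NULL,   -- timestamp-based batch number
        row_no   BIGINT NOT NULL,   -- 1..N within a batch
        user_id  BIGINT NOT NULL,
        KEY idx_crowd_batch_row (crowd_id, batch_no, row_no)
    );

    -- naive paging: the deeper the page, the more index entries are scanned and discarded
    SELECT user_id FROM crowd_export
    WHERE crowd_id = 123 AND batch_no = 20181111001
    ORDER BY row_no
    LIMIT 1000 OFFSET 5000000;

    -- index-range paging: any page is a direct range seek on the composite index
    SELECT user_id FROM crowd_export
    WHERE crowd_id = 123 AND batch_no = 20181111001
      AND row_no BETWEEN 5000001 AND 5001000
    ORDER BY row_no;

With the composite index, fetching a deep page costs roughly the same as fetching the first page.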

The extra cost this introduces is the data export itself, but thanks to HybridDB's powerful data export capability, exporting anywhere from tens of thousands to millions of rows takes only seconds to tens of seconds. After weighing the trade-offs, we adopted this scheme.

4

The crowd selection system provides strong underlying capability support for Xianyu's refined user operations. Crowd selection can also be applied to other business scenarios, such as selecting audiences for the home page focus banner and other scenarios that require tiered user operations, leaving plenty of room for optimizing Xianyu's business.

This article describes how to return results for combined queries over massive multi-dimensional data within seconds, a general technical scheme for OLAP scenarios. It also presents a case of upgrading an existing business system with this scheme, which achieved good business results and can serve as a reference for similar requirements or scenarios.

5

The user data in the crowd selection engine is currently imported on a T+1 basis. This is acceptable because crowd-related indicators do not change very quickly, and many of them (such as user tags) are themselves computed offline with T+1 jobs. Building on this, we have since constructed a more powerful item selection engine on top of HybridDB. Xianyu's item data changes much faster than its user data: on one hand, users can update their listings at any time; on the other, because Xianyu listings typically have an inventory of one (an item is removed as soon as it is sold), an item's status can change at any moment. The selection engine therefore needs to perceive these data changes as soon as possible and adjust delivery in real time. Based on HybridDB (storage) and a real-time computing engine, we built the more powerful "Mach" real-time selection system. A series of articles on "Mach" has already been published; interested readers can follow them. If you have any questions about the technical details in this article, please contact us. Thank you.

Resources: Introduction to HybridDB for MySQL: https://www.aliyun.com/product/petadata

